Pandas: DataFrame Operations, Merging, GroupBy, Performance

The Interview Question

ℹ️

Question: You have two DataFrames — orders with columns [order_id, customer_id, product_category, amount, order_date] and customers with columns [customer_id, name, email, signup_date, tier]. Write optimized Pandas code to:

Merge these DataFrames and find the top 5 customers by total spending per product category
Handle duplicate columns and missing values appropriately
Optimize the operation for a dataset with 100M+ rows

Detailed Answer

1. Understanding DataFrame Merging

Pandas provides multiple merge strategies that mirror SQL join operations. Understanding the nuances between them is critical for both correctness and performance.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create sample data
np.random.seed(42)
n_orders = 1000000

orders = pd.DataFrame({
    'order_id': range(1, n_orders + 1),
    'customer_id': np.random.randint(1, 50001, n_orders),
    'product_category': np.random.choice(
        ['Electronics', 'Clothing', 'Food', 'Books', 'Home'], 
        n_orders
    ),
    'amount': np.round(np.random.exponential(50, n_orders), 2),
    'order_date': pd.date_range('2023-01-01', periods=n_orders, freq='T')
})

customers = pd.DataFrame({
    'customer_id': range(1, 50001),
    'name': [f'Customer_{i}' for i in range(1, 50001)],
    'email': [f'customer_{i}@email.com' for i in range(1, 50001)],
    'signup_date': pd.date_range('2020-01-01', periods=50000, freq='T'),
    'tier': np.random.choice(['Gold', 'Silver', 'Bronze', 'Platinum'], 50000)
})

💡

Pro Tip: Always verify column data types before merging. Mismatched dtypes cause silent failures or performance degradation.

2. Merge Strategies and When to Use Each

# Left merge - preserves all orders (primary use case here)
merged_df = pd.merge(
    orders, 
    customers, 
    on='customer_id', 
    how='left',
    validate='many_to_one'  # Ensures customer_id is unique in customers
)

# Handle duplicates explicitly
merged_df = merged_df.drop_duplicates(subset=['order_id'])

# Handle missing values from failed joins
merged_df['tier'] = merged_df['tier'].fillna('Unknown')

Merge Types Comparison:

Merge Type	Use Case	Row Count
`left`	Keep all records from left DataFrame	Same as left
`right`	Keep all records from right DataFrame	Same as right
`inner`	Only matching records from both	Intersection
`outer`	All records from both DataFrames	Union

3. GroupBy with Multi-Aggregation

# Step 1: Basic GroupBy
category_spending = merged_df.groupby(['product_category', 'customer_id'])['amount'].agg(['sum', 'count', 'mean'])

# Step 2: Reset index for easier manipulation
category_spending = category_spending.reset_index()

# Step 3: Rename columns for clarity
category_spending.columns = ['product_category', 'customer_id', 'total_spent', 'order_count', 'avg_order_value']

# Step 4: Sort and get top 5 per category
top_customers = (
    category_spending
    .sort_values(['product_category', 'total_spent'], ascending=[True, False])
    .groupby('product_category')
    .head(5)
)

4. Performance Optimization for 100M+ Rows

When dealing with massive datasets, standard Pandas operations become memory-intensive. Here are optimization strategies:

# Strategy 1: Optimize data types
def optimize_dtypes(df):
    """Reduce memory usage by downcasting numeric types"""
    for col in df.select_dtypes(include=['int']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:  # Low cardinality
            df[col] = df[col].astype('category')
    return df

orders = optimize_dtypes(orders)
customers = optimize_dtypes(customers)

# Strategy 2: Use chunked processing
def process_in_chunks(orders, customers, chunk_size=1000000):
    results = []
    for chunk in pd.read_csv('large_orders.csv', chunksize=chunk_size):
        merged_chunk = pd.merge(chunk, customers, on='customer_id', how='left')
        chunk_agg = merged_chunk.groupby(['product_category', 'customer_id'])['amount'].sum().reset_index()
        results.append(chunk_agg)
    return pd.concat(results).groupby(['product_category', 'customer_id'])['amount'].sum().reset_index()

# Strategy 3: Use efficient merge with sort parameter
merged_efficient = pd.merge(
    orders, 
    customers, 
    on='customer_id', 
    how='left',
    sort=False  # Skip sorting for faster merge
)

⚠️

Memory Warning: A DataFrame with 100M rows and 10 columns can easily consume 8-15GB of RAM. Always check memory usage with df.info(memory_usage='deep') before operations.

5. Advanced GroupBy Patterns

# Rolling calculations within groups
merged_df['rolling_avg'] = (
    merged_df
    .sort_values('order_date')
    .groupby('customer_id')['amount']
    .transform(lambda x: x.rolling(7).mean())
)

# Cumulative operations
merged_df['cumulative_spend'] = (
    merged_df
    .sort_values('order_date')
    .groupby('customer_id')['amount']
    .cumsum()
)

# Window functions equivalent
merged_df['rank_in_category'] = (
    merged_df
    .groupby('product_category')['amount']
    .rank(method='dense', ascending=False)
)

6. Real-World Application: Customer Segmentation

# RFM Analysis using Pandas
def rfm_analysis(df):
    """Recency, Frequency, Monetary analysis"""
    snapshot_date = df['order_date'].max() + pd.Timedelta(days=1)
    
    rfm = df.groupby('customer_id').agg({
        'order_date': lambda x: (snapshot_date - x.max()).days,  # Recency
        'order_id': 'count',  # Frequency
        'amount': 'sum'  # Monetary
    })
    
    rfm.columns = ['recency', 'frequency', 'monetary']
    
    # Score each metric (1-5)
    rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1])
    rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
    rfm['m_score'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5])
    
    rfm['rfm_score'] = rfm['r_score'].astype(int) + rfm['f_score'].astype(int) + rfm['m_score'].astype(int)
    
    return rfm

rfm_results = rfm_analysis(merged_df)

7. Common Follow-Up Questions

Follow-up 1: How would you handle duplicate customer_ids in the customers table?

# Deduplicate customers, keeping the most recent record
customers_deduped = (
    customers
    .sort_values('signup_date', ascending=False)
    .drop_duplicates(subset='customer_id', keep='first')
)

# Alternative: Aggregate to keep all information
customers_agg = customers.groupby('customer_id').agg({
    'name': 'first',
    'email': 'first',
    'signup_date': 'min',
    'tier': lambda x: x.mode()[0]  # Most frequent tier
}).reset_index()

Follow-up 2: How would you optimize memory usage?

# Check current memory usage
print(f"Orders memory: {orders.memory_usage(deep=True).sum() / 1e9:.2f} GB")
print(f"Customers memory: {customers.memory_usage(deep=True).sum() / 1e9:.2f} GB")

# Optimization strategies
def reduce_memory(df):
    """Reduce DataFrame memory by 50-70%"""
    start_mem = df.memory_usage(deep=True).sum() / 1e9
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
    
    end_mem = df.memory_usage(deep=True).sum() / 1e9
    print(f'Memory reduced from {start_mem:.2f} GB to {end_mem:.2f} GB '
          f'({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)')
    
    return df

Company-Specific Tips

ℹ️

Amazon Tips:

Amazon heavily tests on apply() vs vectorized operations — always prefer vectorized
Know pd.cut() and pd.qcut() for binning data
Understand merge() parameter validate for data integrity checks
Practice handling NaN vs None vs NaT differences

Google Tips:

Google often asks about time complexity of Pandas operations
Be prepared to discuss when to use NumPy over Pandas
Know the difference between inplace=True and assignment
Understand how Pandas handles index alignment

Pandas: DataFrame Operations, Merging, GroupBy, Performance

Pandas: DataFrame Operations, Merging, GroupBy, Performance

The Interview Question

Detailed Answer

1. Understanding DataFrame Merging

2. Merge Strategies and When to Use Each

3. GroupBy with Multi-Aggregation

4. Performance Optimization for 100M+ Rows

5. Advanced GroupBy Patterns

6. Real-World Application: Customer Segmentation

7. Common Follow-Up Questions

Company-Specific Tips

Quiz Section

Related Topics