πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Pandas: DataFrame Operations, Merging, GroupBy, Performance

Data Science Interview PremiumPython & Pandas⭐ Premium

Advertisement

AMAZON & GOOGLE INTERVIEW QUESTION

Pandas: DataFrame Operations, Merging, GroupBy, Performance

Data Structures & Manipulation

The Interview Question

ℹ️

Question: You have two DataFrames β€” orders with columns [order_id, customer_id, product_category, amount, order_date] and customers with columns [customer_id, name, email, signup_date, tier]. Write optimized Pandas code to:

  1. Merge these DataFrames and find the top 5 customers by total spending per product category
  2. Handle duplicate columns and missing values appropriately
  3. Optimize the operation for a dataset with 100M+ rows

Detailed Answer

1. Understanding DataFrame Merging

Pandas provides multiple merge strategies that mirror SQL join operations. Understanding the nuances between them is critical for both correctness and performance.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create sample data
np.random.seed(42)
n_orders = 1000000

orders = pd.DataFrame({
    'order_id': range(1, n_orders + 1),
    'customer_id': np.random.randint(1, 50001, n_orders),
    'product_category': np.random.choice(
        ['Electronics', 'Clothing', 'Food', 'Books', 'Home'], 
        n_orders
    ),
    'amount': np.round(np.random.exponential(50, n_orders), 2),
    'order_date': pd.date_range('2023-01-01', periods=n_orders, freq='T')
})

customers = pd.DataFrame({
    'customer_id': range(1, 50001),
    'name': [f'Customer_{i}' for i in range(1, 50001)],
    'email': [f'customer_{i}@email.com' for i in range(1, 50001)],
    'signup_date': pd.date_range('2020-01-01', periods=50000, freq='T'),
    'tier': np.random.choice(['Gold', 'Silver', 'Bronze', 'Platinum'], 50000)
})

πŸ’‘

Pro Tip: Always verify column data types before merging. Mismatched dtypes cause silent failures or performance degradation.

2. Merge Strategies and When to Use Each

# Left merge - preserves all orders (primary use case here)
merged_df = pd.merge(
    orders, 
    customers, 
    on='customer_id', 
    how='left',
    validate='many_to_one'  # Ensures customer_id is unique in customers
)

# Handle duplicates explicitly
merged_df = merged_df.drop_duplicates(subset=['order_id'])

# Handle missing values from failed joins
merged_df['tier'] = merged_df['tier'].fillna('Unknown')

Merge Types Comparison:

Merge TypeUse CaseRow Count
leftKeep all records from left DataFrameSame as left
rightKeep all records from right DataFrameSame as right
innerOnly matching records from bothIntersection
outerAll records from both DataFramesUnion

3. GroupBy with Multi-Aggregation

# Step 1: Basic GroupBy
category_spending = merged_df.groupby(['product_category', 'customer_id'])['amount'].agg(['sum', 'count', 'mean'])

# Step 2: Reset index for easier manipulation
category_spending = category_spending.reset_index()

# Step 3: Rename columns for clarity
category_spending.columns = ['product_category', 'customer_id', 'total_spent', 'order_count', 'avg_order_value']

# Step 4: Sort and get top 5 per category
top_customers = (
    category_spending
    .sort_values(['product_category', 'total_spent'], ascending=[True, False])
    .groupby('product_category')
    .head(5)
)

4. Performance Optimization for 100M+ Rows

When dealing with massive datasets, standard Pandas operations become memory-intensive. Here are optimization strategies:

# Strategy 1: Optimize data types
def optimize_dtypes(df):
    """Reduce memory usage by downcasting numeric types"""
    for col in df.select_dtypes(include=['int']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:  # Low cardinality
            df[col] = df[col].astype('category')
    return df

orders = optimize_dtypes(orders)
customers = optimize_dtypes(customers)

# Strategy 2: Use chunked processing
def process_in_chunks(orders, customers, chunk_size=1000000):
    results = []
    for chunk in pd.read_csv('large_orders.csv', chunksize=chunk_size):
        merged_chunk = pd.merge(chunk, customers, on='customer_id', how='left')
        chunk_agg = merged_chunk.groupby(['product_category', 'customer_id'])['amount'].sum().reset_index()
        results.append(chunk_agg)
    return pd.concat(results).groupby(['product_category', 'customer_id'])['amount'].sum().reset_index()

# Strategy 3: Use efficient merge with sort parameter
merged_efficient = pd.merge(
    orders, 
    customers, 
    on='customer_id', 
    how='left',
    sort=False  # Skip sorting for faster merge
)

⚠️

Memory Warning: A DataFrame with 100M rows and 10 columns can easily consume 8-15GB of RAM. Always check memory usage with df.info(memory_usage='deep') before operations.

5. Advanced GroupBy Patterns

# Rolling calculations within groups
merged_df['rolling_avg'] = (
    merged_df
    .sort_values('order_date')
    .groupby('customer_id')['amount']
    .transform(lambda x: x.rolling(7).mean())
)

# Cumulative operations
merged_df['cumulative_spend'] = (
    merged_df
    .sort_values('order_date')
    .groupby('customer_id')['amount']
    .cumsum()
)

# Window functions equivalent
merged_df['rank_in_category'] = (
    merged_df
    .groupby('product_category')['amount']
    .rank(method='dense', ascending=False)
)

6. Real-World Application: Customer Segmentation

# RFM Analysis using Pandas
def rfm_analysis(df):
    """Recency, Frequency, Monetary analysis"""
    snapshot_date = df['order_date'].max() + pd.Timedelta(days=1)
    
    rfm = df.groupby('customer_id').agg({
        'order_date': lambda x: (snapshot_date - x.max()).days,  # Recency
        'order_id': 'count',  # Frequency
        'amount': 'sum'  # Monetary
    })
    
    rfm.columns = ['recency', 'frequency', 'monetary']
    
    # Score each metric (1-5)
    rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1])
    rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
    rfm['m_score'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5])
    
    rfm['rfm_score'] = rfm['r_score'].astype(int) + rfm['f_score'].astype(int) + rfm['m_score'].astype(int)
    
    return rfm

rfm_results = rfm_analysis(merged_df)

7. Common Follow-Up Questions

Follow-up 1: How would you handle duplicate customer_ids in the customers table?

# Deduplicate customers, keeping the most recent record
customers_deduped = (
    customers
    .sort_values('signup_date', ascending=False)
    .drop_duplicates(subset='customer_id', keep='first')
)

# Alternative: Aggregate to keep all information
customers_agg = customers.groupby('customer_id').agg({
    'name': 'first',
    'email': 'first',
    'signup_date': 'min',
    'tier': lambda x: x.mode()[0]  # Most frequent tier
}).reset_index()

Follow-up 2: How would you optimize memory usage?

# Check current memory usage
print(f"Orders memory: {orders.memory_usage(deep=True).sum() / 1e9:.2f} GB")
print(f"Customers memory: {customers.memory_usage(deep=True).sum() / 1e9:.2f} GB")

# Optimization strategies
def reduce_memory(df):
    """Reduce DataFrame memory by 50-70%"""
    start_mem = df.memory_usage(deep=True).sum() / 1e9
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
    
    end_mem = df.memory_usage(deep=True).sum() / 1e9
    print(f'Memory reduced from {start_mem:.2f} GB to {end_mem:.2f} GB '
          f'({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)')
    
    return df

Company-Specific Tips

ℹ️

Amazon Tips:

  • Amazon heavily tests on apply() vs vectorized operations β€” always prefer vectorized
  • Know pd.cut() and pd.qcut() for binning data
  • Understand merge() parameter validate for data integrity checks
  • Practice handling NaN vs None vs NaT differences

Google Tips:

  • Google often asks about time complexity of Pandas operations
  • Be prepared to discuss when to use NumPy over Pandas
  • Know the difference between inplace=True and assignment
  • Understand how Pandas handles index alignment

Quiz Section


Related Topics

Advertisement