The Interview Question
βΉοΈ
Question: You have two DataFrames β orders with columns [order_id, customer_id, product_category, amount, order_date] and customers with columns [customer_id, name, email, signup_date, tier]. Write optimized Pandas code to:
- Merge these DataFrames and find the top 5 customers by total spending per product category
- Handle duplicate columns and missing values appropriately
- Optimize the operation for a dataset with 100M+ rows
Detailed Answer
1. Understanding DataFrame Merging
Pandas provides multiple merge strategies that mirror SQL join operations. Understanding the nuances between them is critical for both correctness and performance.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create sample data
np.random.seed(42)
n_orders = 1000000
orders = pd.DataFrame({
'order_id': range(1, n_orders + 1),
'customer_id': np.random.randint(1, 50001, n_orders),
'product_category': np.random.choice(
['Electronics', 'Clothing', 'Food', 'Books', 'Home'],
n_orders
),
'amount': np.round(np.random.exponential(50, n_orders), 2),
'order_date': pd.date_range('2023-01-01', periods=n_orders, freq='T')
})
customers = pd.DataFrame({
'customer_id': range(1, 50001),
'name': [f'Customer_{i}' for i in range(1, 50001)],
'email': [f'customer_{i}@email.com' for i in range(1, 50001)],
'signup_date': pd.date_range('2020-01-01', periods=50000, freq='T'),
'tier': np.random.choice(['Gold', 'Silver', 'Bronze', 'Platinum'], 50000)
})
π‘
Pro Tip: Always verify column data types before merging. Mismatched dtypes cause silent failures or performance degradation.
2. Merge Strategies and When to Use Each
# Left merge - preserves all orders (primary use case here)
merged_df = pd.merge(
orders,
customers,
on='customer_id',
how='left',
validate='many_to_one' # Ensures customer_id is unique in customers
)
# Handle duplicates explicitly
merged_df = merged_df.drop_duplicates(subset=['order_id'])
# Handle missing values from failed joins
merged_df['tier'] = merged_df['tier'].fillna('Unknown')
Merge Types Comparison:
| Merge Type | Use Case | Row Count |
|---|---|---|
left | Keep all records from left DataFrame | Same as left |
right | Keep all records from right DataFrame | Same as right |
inner | Only matching records from both | Intersection |
outer | All records from both DataFrames | Union |
3. GroupBy with Multi-Aggregation
# Step 1: Basic GroupBy
category_spending = merged_df.groupby(['product_category', 'customer_id'])['amount'].agg(['sum', 'count', 'mean'])
# Step 2: Reset index for easier manipulation
category_spending = category_spending.reset_index()
# Step 3: Rename columns for clarity
category_spending.columns = ['product_category', 'customer_id', 'total_spent', 'order_count', 'avg_order_value']
# Step 4: Sort and get top 5 per category
top_customers = (
category_spending
.sort_values(['product_category', 'total_spent'], ascending=[True, False])
.groupby('product_category')
.head(5)
)
4. Performance Optimization for 100M+ Rows
When dealing with massive datasets, standard Pandas operations become memory-intensive. Here are optimization strategies:
# Strategy 1: Optimize data types
def optimize_dtypes(df):
"""Reduce memory usage by downcasting numeric types"""
for col in df.select_dtypes(include=['int']).columns:
df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes(include=['float']).columns:
df[col] = pd.to_numeric(df[col], downcast='float')
for col in df.select_dtypes(include=['object']).columns:
if df[col].nunique() / len(df) < 0.5: # Low cardinality
df[col] = df[col].astype('category')
return df
orders = optimize_dtypes(orders)
customers = optimize_dtypes(customers)
# Strategy 2: Use chunked processing
def process_in_chunks(orders, customers, chunk_size=1000000):
results = []
for chunk in pd.read_csv('large_orders.csv', chunksize=chunk_size):
merged_chunk = pd.merge(chunk, customers, on='customer_id', how='left')
chunk_agg = merged_chunk.groupby(['product_category', 'customer_id'])['amount'].sum().reset_index()
results.append(chunk_agg)
return pd.concat(results).groupby(['product_category', 'customer_id'])['amount'].sum().reset_index()
# Strategy 3: Use efficient merge with sort parameter
merged_efficient = pd.merge(
orders,
customers,
on='customer_id',
how='left',
sort=False # Skip sorting for faster merge
)
β οΈ
Memory Warning: A DataFrame with 100M rows and 10 columns can easily consume 8-15GB of RAM. Always check memory usage with df.info(memory_usage='deep') before operations.
5. Advanced GroupBy Patterns
# Rolling calculations within groups
merged_df['rolling_avg'] = (
merged_df
.sort_values('order_date')
.groupby('customer_id')['amount']
.transform(lambda x: x.rolling(7).mean())
)
# Cumulative operations
merged_df['cumulative_spend'] = (
merged_df
.sort_values('order_date')
.groupby('customer_id')['amount']
.cumsum()
)
# Window functions equivalent
merged_df['rank_in_category'] = (
merged_df
.groupby('product_category')['amount']
.rank(method='dense', ascending=False)
)
6. Real-World Application: Customer Segmentation
# RFM Analysis using Pandas
def rfm_analysis(df):
"""Recency, Frequency, Monetary analysis"""
snapshot_date = df['order_date'].max() + pd.Timedelta(days=1)
rfm = df.groupby('customer_id').agg({
'order_date': lambda x: (snapshot_date - x.max()).days, # Recency
'order_id': 'count', # Frequency
'amount': 'sum' # Monetary
})
rfm.columns = ['recency', 'frequency', 'monetary']
# Score each metric (1-5)
rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1])
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['m_score'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5])
rfm['rfm_score'] = rfm['r_score'].astype(int) + rfm['f_score'].astype(int) + rfm['m_score'].astype(int)
return rfm
rfm_results = rfm_analysis(merged_df)
7. Common Follow-Up Questions
Follow-up 1: How would you handle duplicate customer_ids in the customers table?
# Deduplicate customers, keeping the most recent record
customers_deduped = (
customers
.sort_values('signup_date', ascending=False)
.drop_duplicates(subset='customer_id', keep='first')
)
# Alternative: Aggregate to keep all information
customers_agg = customers.groupby('customer_id').agg({
'name': 'first',
'email': 'first',
'signup_date': 'min',
'tier': lambda x: x.mode()[0] # Most frequent tier
}).reset_index()
Follow-up 2: How would you optimize memory usage?
# Check current memory usage
print(f"Orders memory: {orders.memory_usage(deep=True).sum() / 1e9:.2f} GB")
print(f"Customers memory: {customers.memory_usage(deep=True).sum() / 1e9:.2f} GB")
# Optimization strategies
def reduce_memory(df):
"""Reduce DataFrame memory by 50-70%"""
start_mem = df.memory_usage(deep=True).sum() / 1e9
for col in df.columns:
col_type = df[col].dtype
if col_type != object:
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
else:
if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
end_mem = df.memory_usage(deep=True).sum() / 1e9
print(f'Memory reduced from {start_mem:.2f} GB to {end_mem:.2f} GB '
f'({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)')
return df
Company-Specific Tips
βΉοΈ
Amazon Tips:
- Amazon heavily tests on
apply()vs vectorized operations β always prefer vectorized - Know
pd.cut()andpd.qcut()for binning data - Understand
merge()parametervalidatefor data integrity checks - Practice handling
NaNvsNonevsNaTdifferences
Google Tips:
- Google often asks about time complexity of Pandas operations
- Be prepared to discuss when to use NumPy over Pandas
- Know the difference between
inplace=Trueand assignment - Understand how Pandas handles index alignment
Quiz Section
Related Topics
- SQL Window Functions β Pandas GroupBy equivalent in SQL
- Feature Engineering β Creating features using Pandas
- Data Cleaning β Handling missing values and data quality
- Performance Optimization β Scaling Pandas with Dask and Spark