The Interview Question
βΉοΈ
Question: Given a table user_actions with columns [user_id, action_type, timestamp, device_type] and a table subscriptions with columns [user_id, plan, start_date, end_date, monthly_fee]:
- Find the 7-day rolling average of active users per device type
- Identify users who upgraded their subscription within 30 days of signup
- Calculate monthly revenue retention rate by plan tier
- Optimize all queries for tables with 500M+ rows
Detailed Answer
1. Window Functions Deep Dive
Window functions perform calculations across a set of rows related to the current row without collapsing them. Unlike GROUP BY, they preserve all rows.
-- Basic window function syntax
SELECT
user_id,
action_type,
timestamp,
-- Running total of actions
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_actions,
-- Rank within each device type
ROW_NUMBER() OVER (
PARTITION BY device_type
ORDER BY timestamp
) as rank_in_device
FROM user_actions;
Window Function Types:
| Function | Purpose | Example Use Case |
|---|---|---|
ROW_NUMBER() | Sequential numbering | Deduplication, pagination |
RANK() | Ranking with gaps | Top-N per category |
DENSE_RANK() | Ranking without gaps | Leaderboard ranking |
LAG/LEAD() | Access previous/next rows | Growth calculations |
NTILE() | Divide into buckets | Percentile analysis |
SUM/AVG/COUNT() OVER() | Aggregate with window | Rolling averages |
2. 7-Day Rolling Average Calculation
-- CTE for daily active users
WITH daily_active_users AS (
SELECT
DATE(timestamp) as action_date,
device_type,
COUNT(DISTINCT user_id) as daily_active
FROM user_actions
WHERE timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY DATE(timestamp), device_type
),
-- Window function for 7-day rolling average
rolling_avg AS (
SELECT
action_date,
device_type,
daily_active,
AVG(daily_active) OVER (
PARTITION BY device_type
ORDER BY action_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7day_avg,
-- Also calculate 30-day for comparison
AVG(daily_active) OVER (
PARTITION BY device_type
ORDER BY action_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as rolling_30day_avg
FROM daily_active_users
)
SELECT
action_date,
device_type,
daily_active,
ROUND(rolling_7day_avg, 2) as rolling_7day_avg,
ROUND(rolling_30day_avg, 2) as rolling_30day_avg,
-- Calculate trend
ROUND(
(rolling_7day_avg - rolling_30day_avg) / rolling_30day_avg * 100,
2
) as trend_pct
FROM rolling_avg
ORDER BY action_date DESC, device_type;
π‘
Pro Tip: The ROWS BETWEEN clause defines the window frame. UNBOUNDED PRECEDING means from the first row, CURRENT ROW means up to and including the current row.
3. Identifying Subscription Upgrades
-- Method 1: Self join to find upgrades within 30 days
WITH user_upgrades AS (
SELECT
s1.user_id,
s1.plan as original_plan,
s1.start_date as original_start,
s2.plan as new_plan,
s2.start_date as upgrade_date,
DATEDIFF(s2.start_date, s1.start_date) as days_between
FROM subscriptions s1
INNER JOIN subscriptions s2
ON s1.user_id = s2.user_id
AND s2.start_date > s1.start_date
AND DATEDIFF(s2.start_date, s1.start_date) <= 30
WHERE s1.start_date = (
SELECT MIN(start_date)
FROM subscriptions
WHERE user_id = s1.user_id
)
)
SELECT
user_id,
original_plan,
new_plan,
days_between,
CASE
WHEN new_plan = 'Premium' AND original_plan = 'Basic' THEN 'Premium Upgrade'
WHEN new_plan = 'Premium' AND original_plan = 'Pro' THEN 'Premium Upgrade'
WHEN new_plan = 'Pro' AND original_plan = 'Basic' THEN 'Pro Upgrade'
ELSE 'Other Upgrade'
END as upgrade_type
FROM user_upgrades;
-- Method 2: Using LAG() window function (more efficient)
WITH subscription_changes AS (
SELECT
user_id,
plan,
start_date,
LAG(plan) OVER (PARTITION BY user_id ORDER BY start_date) as prev_plan,
LAG(start_date) OVER (PARTITION BY user_id ORDER BY start_date) as prev_start_date
FROM subscriptions
)
SELECT
user_id,
prev_plan as original_plan,
plan as new_plan,
DATEDIFF(start_date, prev_start_date) as days_between
FROM subscription_changes
WHERE prev_plan IS NOT NULL
AND plan != prev_plan
AND DATEDIFF(start_date, prev_start_date) <= 30
AND prev_plan IN ('Basic', 'Pro')
AND plan = 'Premium';
4. Monthly Revenue Retention Rate
-- Calculate MRR (Monthly Recurring Revenue) retention
WITH monthly_revenue AS (
SELECT
DATE_FORMAT(start_date, '%Y-%m') as cohort_month,
plan,
SUM(monthly_fee) as mrr
FROM subscriptions
WHERE start_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH)
GROUP BY DATE_FORMAT(start_date, '%Y-%m'), plan
),
retention AS (
SELECT
mr1.cohort_month,
mr1.plan,
mr1.mrr as cohort_mrr,
mr2.mrr as retained_mrr,
-- Calculate retention rate
CASE
WHEN mr1.mrr > 0
THEN ROUND(mr2.mrr / mr1.mrr * 100, 2)
ELSE 0
END as retention_rate_pct
FROM monthly_revenue mr1
LEFT JOIN monthly_revenue mr2
ON mr1.plan = mr2.plan
AND mr2.cohort_month = DATE_FORMAT(
DATE_ADD(
STR_TO_DATE(CONCAT(mr1.cohort_month, '-01'), '%Y-%m-%d'),
INTERVAL 1 MONTH
),
'%Y-%m'
)
)
SELECT
cohort_month,
plan,
cohort_mrr,
COALESCE(retained_mrr, 0) as retained_mrr,
COALESCE(retention_rate_pct, 0) as retention_rate_pct
FROM retention
ORDER BY cohort_month DESC, plan;
5. Query Optimization for Large Tables
-- Partition pruning strategy
-- Assuming user_actions is partitioned by timestamp
SELECT
DATE(timestamp) as action_date,
COUNT(DISTINCT user_id) as daily_active
FROM user_actions
WHERE timestamp >= '2024-01-01' -- Partition pruning
AND timestamp < '2024-02-01'
GROUP BY DATE(timestamp);
-- Composite index strategy
-- Create covering index for common queries
CREATE INDEX idx_user_actions_composite
ON user_actions (user_id, action_type, timestamp);
-- Avoid SELECT * in production
-- Bad: SELECT * FROM user_actions WHERE user_id = 123;
-- Good:
SELECT user_id, action_type, timestamp
FROM user_actions
WHERE user_id = 123;
-- Use EXPLAIN to analyze query plans
EXPLAIN ANALYZE
SELECT
DATE(timestamp) as action_date,
COUNT(DISTINCT user_id) as daily_active
FROM user_actions
WHERE timestamp >= '2024-01-01'
GROUP BY DATE(timestamp);
β οΈ
Performance Warning: Window functions with ORDER BY can be expensive on large datasets. Consider materializing intermediate results when running complex analytical queries.
6. Advanced CTE Patterns
-- Recursive CTE for hierarchical data
WITH RECURSIVE category_hierarchy AS (
-- Base case: top-level categories
SELECT
category_id,
category_name,
parent_id,
1 as level
FROM categories
WHERE parent_id IS NULL
UNION ALL
-- Recursive case: child categories
SELECT
c.category_id,
c.category_name,
c.parent_id,
ch.level + 1
FROM categories c
INNER JOIN category_hierarchy ch ON c.parent_id = ch.category_id
)
SELECT * FROM category_hierarchy
ORDER BY level, category_name;
-- Multiple CTEs for complex analysis
WITH
-- CTE 1: Daily metrics
daily_metrics AS (
SELECT
DATE(timestamp) as metric_date,
COUNT(DISTINCT user_id) as dau,
COUNT(*) as total_actions,
AVG(CASE WHEN action_type = 'purchase' THEN 1 ELSE 0 END) as conversion_rate
FROM user_actions
WHERE timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY DATE(timestamp)
),
-- CTE 2: Weekly aggregates
weekly_metrics AS (
SELECT
DATE_FORMAT(metric_date, '%Y-%u') as year_week,
AVG(dau) as avg_dau,
SUM(total_actions) as total_weekly_actions,
AVG(conversion_rate) as avg_conversion
FROM daily_metrics
GROUP BY DATE_FORMAT(metric_date, '%Y-%u')
),
-- CTE 3: Month-over-month comparison
mom_comparison AS (
SELECT
*,
LAG(avg_dau, 4) OVER (ORDER BY year_week) as prev_month_avg_dau,
ROUND(
(avg_dau - LAG(avg_dau, 4) OVER (ORDER BY year_week)) /
LAG(avg_dau, 4) OVER (ORDER BY year_week) * 100,
2
) as dau_growth_pct
FROM weekly_metrics
)
SELECT * FROM mom_comparison
ORDER BY year_week DESC;
7. Real-World Application: User Engagement Funnel
-- Complete funnel analysis using window functions
WITH user_events AS (
SELECT
user_id,
timestamp,
action_type,
-- Create funnel stage order
CASE action_type
WHEN 'page_view' THEN 1
WHEN 'sign_up' THEN 2
WHEN 'add_to_cart' THEN 3
WHEN 'checkout' THEN 4
WHEN 'purchase' THEN 5
END as funnel_stage,
-- Track if user reached each stage
MAX(CASE WHEN action_type = 'page_view' THEN 1 ELSE 0 END) OVER (
PARTITION BY user_id
) as reached_view,
MAX(CASE WHEN action_type = 'sign_up' THEN 1 ELSE 0 END) OVER (
PARTITION BY user_id
) as reached_signup,
MAX(CASE WHEN action_type = 'add_to_cart' THEN 1 ELSE 0 END) OVER (
PARTITION BY user_id
) as reached_cart,
MAX(CASE WHEN action_type = 'checkout' THEN 1 ELSE 0 END) OVER (
PARTITION BY user_id
) as reached_checkout,
MAX(CASE WHEN action_type = 'purchase' THEN 1 ELSE 0 END) OVER (
PARTITION BY user_id
) as reached_purchase
FROM user_events
)
SELECT
'Page View' as stage,
COUNT(DISTINCT CASE WHEN reached_view = 1 THEN user_id END) as users,
100.0 as conversion_pct
FROM user_events
UNION ALL
SELECT
'Sign Up' as stage,
COUNT(DISTINCT CASE WHEN reached_signup = 1 THEN user_id END) as users,
ROUND(
COUNT(DISTINCT CASE WHEN reached_signup = 1 THEN user_id END) * 100.0 /
NULLIF(COUNT(DISTINCT CASE WHEN reached_view = 1 THEN user_id END), 0),
2
) as conversion_pct
FROM user_events
-- ... continue for each funnel stage
ORDER BY
CASE stage
WHEN 'Page View' THEN 1
WHEN 'Sign Up' THEN 2
WHEN 'Add to Cart' THEN 3
WHEN 'Checkout' THEN 4
WHEN 'Purchase' THEN 5
END;
8. Common Follow-Up Questions
Follow-up 1: How would you optimize a query that's running slowly?
-- Step 1: Analyze the query plan
EXPLAIN ANALYZE your_slow_query;
-- Step 2: Check for missing indexes
SELECT
indexname,
indexdef
FROM pg_indexes
WHERE tablename = 'your_table';
-- Step 3: Add appropriate indexes
CREATE INDEX CONCURRENTLY idx_your_table_column
ON your_table (frequently_queried_column);
-- Step 4: Consider materialized views for complex aggregations
CREATE MATERIALIZED VIEW mv_daily_metrics AS
SELECT ... -- complex query
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_metrics;
Follow-up 2: When would you use a CTE vs a subquery?
| Use Case | CTE | Subquery |
|---|---|---|
| Readability | Better for complex queries | OK for simple ones |
| Reusability | Can reference multiple times | Defined inline |
| Performance | May or may not materialize | Usually optimized by planner |
| Recursive queries | Required | Not supported |
| Debugging | Easier to test pieces | Harder to isolate |
Company-Specific Tips
βΉοΈ
Meta Tips:
- Meta loves asking about window functions for ranking and running totals
- Know the difference between
RANK(),DENSE_RANK(), andROW_NUMBER() - Practice writing efficient queries for 1B+ row tables
- Be comfortable with HiveQL syntax differences
Amazon Tips:
- Amazon tests heavily on JOIN optimization
- Understand covering indexes and index-only scans
- Practice with PostgreSQL and MySQL syntax differences
- Know when to use UNION vs UNION ALL
Quiz Section
Related Topics
- Pandas GroupBy Operations β SQL GROUP BY equivalent in Python
- Window Functions in Practice β Rolling averages and time series
- Query Optimization β Scaling SQL with distributed systems
- Data Modeling β Schema design for performance