πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

SQL: Window Functions, CTEs, Joins, Optimization

Data Science Interview PremiumSQL⭐ Premium

Advertisement

META & AMAZON INTERVIEW QUESTION

SQL: Window Functions, CTEs, Joins, Optimization

Database Querying & Analytics

The Interview Question

ℹ️

Question: Given a table user_actions with columns [user_id, action_type, timestamp, device_type] and a table subscriptions with columns [user_id, plan, start_date, end_date, monthly_fee]:

  1. Find the 7-day rolling average of active users per device type
  2. Identify users who upgraded their subscription within 30 days of signup
  3. Calculate monthly revenue retention rate by plan tier
  4. Optimize all queries for tables with 500M+ rows

Detailed Answer

1. Window Functions Deep Dive

Window functions perform calculations across a set of rows related to the current row without collapsing them. Unlike GROUP BY, they preserve all rows.

-- Basic window function syntax
SELECT 
    user_id,
    action_type,
    timestamp,
    -- Running total of actions
    COUNT(*) OVER (
        PARTITION BY user_id 
        ORDER BY timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_actions,
    -- Rank within each device type
    ROW_NUMBER() OVER (
        PARTITION BY device_type 
        ORDER BY timestamp
    ) as rank_in_device
FROM user_actions;

Window Function Types:

FunctionPurposeExample Use Case
ROW_NUMBER()Sequential numberingDeduplication, pagination
RANK()Ranking with gapsTop-N per category
DENSE_RANK()Ranking without gapsLeaderboard ranking
LAG/LEAD()Access previous/next rowsGrowth calculations
NTILE()Divide into bucketsPercentile analysis
SUM/AVG/COUNT() OVER()Aggregate with windowRolling averages

2. 7-Day Rolling Average Calculation

-- CTE for daily active users
WITH daily_active_users AS (
    SELECT 
        DATE(timestamp) as action_date,
        device_type,
        COUNT(DISTINCT user_id) as daily_active
    FROM user_actions
    WHERE timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY DATE(timestamp), device_type
),
-- Window function for 7-day rolling average
rolling_avg AS (
    SELECT 
        action_date,
        device_type,
        daily_active,
        AVG(daily_active) OVER (
            PARTITION BY device_type 
            ORDER BY action_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as rolling_7day_avg,
        -- Also calculate 30-day for comparison
        AVG(daily_active) OVER (
            PARTITION BY device_type 
            ORDER BY action_date
            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
        ) as rolling_30day_avg
    FROM daily_active_users
)
SELECT 
    action_date,
    device_type,
    daily_active,
    ROUND(rolling_7day_avg, 2) as rolling_7day_avg,
    ROUND(rolling_30day_avg, 2) as rolling_30day_avg,
    -- Calculate trend
    ROUND(
        (rolling_7day_avg - rolling_30day_avg) / rolling_30day_avg * 100, 
        2
    ) as trend_pct
FROM rolling_avg
ORDER BY action_date DESC, device_type;

πŸ’‘

Pro Tip: The ROWS BETWEEN clause defines the window frame. UNBOUNDED PRECEDING means from the first row, CURRENT ROW means up to and including the current row.

3. Identifying Subscription Upgrades

-- Method 1: Self join to find upgrades within 30 days
WITH user_upgrades AS (
    SELECT 
        s1.user_id,
        s1.plan as original_plan,
        s1.start_date as original_start,
        s2.plan as new_plan,
        s2.start_date as upgrade_date,
        DATEDIFF(s2.start_date, s1.start_date) as days_between
    FROM subscriptions s1
    INNER JOIN subscriptions s2 
        ON s1.user_id = s2.user_id
        AND s2.start_date > s1.start_date
        AND DATEDIFF(s2.start_date, s1.start_date) <= 30
    WHERE s1.start_date = (
        SELECT MIN(start_date) 
        FROM subscriptions 
        WHERE user_id = s1.user_id
    )
)
SELECT 
    user_id,
    original_plan,
    new_plan,
    days_between,
    CASE 
        WHEN new_plan = 'Premium' AND original_plan = 'Basic' THEN 'Premium Upgrade'
        WHEN new_plan = 'Premium' AND original_plan = 'Pro' THEN 'Premium Upgrade'
        WHEN new_plan = 'Pro' AND original_plan = 'Basic' THEN 'Pro Upgrade'
        ELSE 'Other Upgrade'
    END as upgrade_type
FROM user_upgrades;

-- Method 2: Using LAG() window function (more efficient)
WITH subscription_changes AS (
    SELECT 
        user_id,
        plan,
        start_date,
        LAG(plan) OVER (PARTITION BY user_id ORDER BY start_date) as prev_plan,
        LAG(start_date) OVER (PARTITION BY user_id ORDER BY start_date) as prev_start_date
    FROM subscriptions
)
SELECT 
    user_id,
    prev_plan as original_plan,
    plan as new_plan,
    DATEDIFF(start_date, prev_start_date) as days_between
FROM subscription_changes
WHERE prev_plan IS NOT NULL
    AND plan != prev_plan
    AND DATEDIFF(start_date, prev_start_date) <= 30
    AND prev_plan IN ('Basic', 'Pro')
    AND plan = 'Premium';

4. Monthly Revenue Retention Rate

-- Calculate MRR (Monthly Recurring Revenue) retention
WITH monthly_revenue AS (
    SELECT 
        DATE_FORMAT(start_date, '%Y-%m') as cohort_month,
        plan,
        SUM(monthly_fee) as mrr
    FROM subscriptions
    WHERE start_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH)
    GROUP BY DATE_FORMAT(start_date, '%Y-%m'), plan
),
retention AS (
    SELECT 
        mr1.cohort_month,
        mr1.plan,
        mr1.mrr as cohort_mrr,
        mr2.mrr as retained_mrr,
        -- Calculate retention rate
        CASE 
            WHEN mr1.mrr > 0 
            THEN ROUND(mr2.mrr / mr1.mrr * 100, 2)
            ELSE 0 
        END as retention_rate_pct
    FROM monthly_revenue mr1
    LEFT JOIN monthly_revenue mr2 
        ON mr1.plan = mr2.plan
        AND mr2.cohort_month = DATE_FORMAT(
            DATE_ADD(
                STR_TO_DATE(CONCAT(mr1.cohort_month, '-01'), '%Y-%m-%d'),
                INTERVAL 1 MONTH
            ), 
            '%Y-%m'
        )
)
SELECT 
    cohort_month,
    plan,
    cohort_mrr,
    COALESCE(retained_mrr, 0) as retained_mrr,
    COALESCE(retention_rate_pct, 0) as retention_rate_pct
FROM retention
ORDER BY cohort_month DESC, plan;

5. Query Optimization for Large Tables

-- Partition pruning strategy
-- Assuming user_actions is partitioned by timestamp
SELECT 
    DATE(timestamp) as action_date,
    COUNT(DISTINCT user_id) as daily_active
FROM user_actions
WHERE timestamp >= '2024-01-01'  -- Partition pruning
    AND timestamp < '2024-02-01'
GROUP BY DATE(timestamp);

-- Composite index strategy
-- Create covering index for common queries
CREATE INDEX idx_user_actions_composite 
ON user_actions (user_id, action_type, timestamp);

-- Avoid SELECT * in production
-- Bad: SELECT * FROM user_actions WHERE user_id = 123;
-- Good: 
SELECT user_id, action_type, timestamp 
FROM user_actions 
WHERE user_id = 123;

-- Use EXPLAIN to analyze query plans
EXPLAIN ANALYZE
SELECT 
    DATE(timestamp) as action_date,
    COUNT(DISTINCT user_id) as daily_active
FROM user_actions
WHERE timestamp >= '2024-01-01'
GROUP BY DATE(timestamp);

⚠️

Performance Warning: Window functions with ORDER BY can be expensive on large datasets. Consider materializing intermediate results when running complex analytical queries.

6. Advanced CTE Patterns

-- Recursive CTE for hierarchical data
WITH RECURSIVE category_hierarchy AS (
    -- Base case: top-level categories
    SELECT 
        category_id,
        category_name,
        parent_id,
        1 as level
    FROM categories
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- Recursive case: child categories
    SELECT 
        c.category_id,
        c.category_name,
        c.parent_id,
        ch.level + 1
    FROM categories c
    INNER JOIN category_hierarchy ch ON c.parent_id = ch.category_id
)
SELECT * FROM category_hierarchy
ORDER BY level, category_name;

-- Multiple CTEs for complex analysis
WITH 
-- CTE 1: Daily metrics
daily_metrics AS (
    SELECT 
        DATE(timestamp) as metric_date,
        COUNT(DISTINCT user_id) as dau,
        COUNT(*) as total_actions,
        AVG(CASE WHEN action_type = 'purchase' THEN 1 ELSE 0 END) as conversion_rate
    FROM user_actions
    WHERE timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY DATE(timestamp)
),
-- CTE 2: Weekly aggregates
weekly_metrics AS (
    SELECT 
        DATE_FORMAT(metric_date, '%Y-%u') as year_week,
        AVG(dau) as avg_dau,
        SUM(total_actions) as total_weekly_actions,
        AVG(conversion_rate) as avg_conversion
    FROM daily_metrics
    GROUP BY DATE_FORMAT(metric_date, '%Y-%u')
),
-- CTE 3: Month-over-month comparison
mom_comparison AS (
    SELECT 
        *,
        LAG(avg_dau, 4) OVER (ORDER BY year_week) as prev_month_avg_dau,
        ROUND(
            (avg_dau - LAG(avg_dau, 4) OVER (ORDER BY year_week)) / 
            LAG(avg_dau, 4) OVER (ORDER BY year_week) * 100, 
            2
        ) as dau_growth_pct
    FROM weekly_metrics
)
SELECT * FROM mom_comparison
ORDER BY year_week DESC;

7. Real-World Application: User Engagement Funnel

-- Complete funnel analysis using window functions
WITH user_events AS (
    SELECT 
        user_id,
        timestamp,
        action_type,
        -- Create funnel stage order
        CASE action_type
            WHEN 'page_view' THEN 1
            WHEN 'sign_up' THEN 2
            WHEN 'add_to_cart' THEN 3
            WHEN 'checkout' THEN 4
            WHEN 'purchase' THEN 5
        END as funnel_stage,
        -- Track if user reached each stage
        MAX(CASE WHEN action_type = 'page_view' THEN 1 ELSE 0 END) OVER (
            PARTITION BY user_id
        ) as reached_view,
        MAX(CASE WHEN action_type = 'sign_up' THEN 1 ELSE 0 END) OVER (
            PARTITION BY user_id
        ) as reached_signup,
        MAX(CASE WHEN action_type = 'add_to_cart' THEN 1 ELSE 0 END) OVER (
            PARTITION BY user_id
        ) as reached_cart,
        MAX(CASE WHEN action_type = 'checkout' THEN 1 ELSE 0 END) OVER (
            PARTITION BY user_id
        ) as reached_checkout,
        MAX(CASE WHEN action_type = 'purchase' THEN 1 ELSE 0 END) OVER (
            PARTITION BY user_id
        ) as reached_purchase
    FROM user_events
)
SELECT 
    'Page View' as stage,
    COUNT(DISTINCT CASE WHEN reached_view = 1 THEN user_id END) as users,
    100.0 as conversion_pct
FROM user_events

UNION ALL

SELECT 
    'Sign Up' as stage,
    COUNT(DISTINCT CASE WHEN reached_signup = 1 THEN user_id END) as users,
    ROUND(
        COUNT(DISTINCT CASE WHEN reached_signup = 1 THEN user_id END) * 100.0 / 
        NULLIF(COUNT(DISTINCT CASE WHEN reached_view = 1 THEN user_id END), 0),
        2
    ) as conversion_pct
FROM user_events

-- ... continue for each funnel stage
ORDER BY 
    CASE stage
        WHEN 'Page View' THEN 1
        WHEN 'Sign Up' THEN 2
        WHEN 'Add to Cart' THEN 3
        WHEN 'Checkout' THEN 4
        WHEN 'Purchase' THEN 5
    END;

8. Common Follow-Up Questions

Follow-up 1: How would you optimize a query that's running slowly?

-- Step 1: Analyze the query plan
EXPLAIN ANALYZE your_slow_query;

-- Step 2: Check for missing indexes
SELECT 
    indexname,
    indexdef
FROM pg_indexes
WHERE tablename = 'your_table';

-- Step 3: Add appropriate indexes
CREATE INDEX CONCURRENTLY idx_your_table_column 
ON your_table (frequently_queried_column);

-- Step 4: Consider materialized views for complex aggregations
CREATE MATERIALIZED VIEW mv_daily_metrics AS
SELECT ... -- complex query
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_metrics;

Follow-up 2: When would you use a CTE vs a subquery?

Use CaseCTESubquery
ReadabilityBetter for complex queriesOK for simple ones
ReusabilityCan reference multiple timesDefined inline
PerformanceMay or may not materializeUsually optimized by planner
Recursive queriesRequiredNot supported
DebuggingEasier to test piecesHarder to isolate

Company-Specific Tips

ℹ️

Meta Tips:

  • Meta loves asking about window functions for ranking and running totals
  • Know the difference between RANK(), DENSE_RANK(), and ROW_NUMBER()
  • Practice writing efficient queries for 1B+ row tables
  • Be comfortable with HiveQL syntax differences

Amazon Tips:

  • Amazon tests heavily on JOIN optimization
  • Understand covering indexes and index-only scans
  • Practice with PostgreSQL and MySQL syntax differences
  • Know when to use UNION vs UNION ALL

Quiz Section


Related Topics

Advertisement