πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

SQL Live Coding: Complex Queries, Optimization, Real-time Problem

Data Scientist Role InterviewSQL Live Coding & Database Optimization⭐ Premium

Advertisement

πŸ’»

Asked at Meta & Amazon

SQL Live Coding

Complex Queries, Optimization, Real-time Problem

The Interview Question

"Given a table of user logins, write a query to find the second most active user by login count in each country."

SQL live coding tests whether you can think and write SQL under pressure β€” a critical skill for data scientists who spend significant time querying databases.


Why Companies Ask This

ℹ️

Meta and Amazon use SQL as the primary tool for data analysis. They need to verify you can write correct, efficient, and readable SQL β€” not just in theory, but in real-time with an interviewer watching.

Interviewers evaluate:

  1. SQL Proficiency β€” Can you write complex queries correctly?
  2. Problem Decomposition β€” Can you break a problem into logical steps?
  3. Optimization Thinking β€” Do you consider performance?
  4. Communication β€” Can you explain your approach as you code?
  5. Edge Case Awareness β€” Do you handle NULLs, ties, and edge cases?

Common SQL Problem Categories

1. Window Functions (Most Common)

2. Joins and Subqueries

3. Aggregation and Grouping

4. String and Date Manipulation

5. Complex Business Logic


Example Problem: Second Most Active User per Country

Step 1: Clarify the Problem

"Before I start, let me clarify: What should we do if there's a tie for second place? Should we return all tied users, or just one? Also, are we looking at total login count across all time, or a specific time period?"

Step 2: Think Through the Approach

-- Approach:
-- 1. Count logins per user per country
-- 2. Rank users within each country by login count
-- 3. Filter for rank = 2

-- Key decision: Use DENSE_RANK() to handle ties properly
-- DENSE_RANK: 1, 2, 2, 3 (no gaps for ties)
-- RANK: 1, 2, 2, 4 (gaps after ties)
-- ROW_NUMBER: 1, 2, 3, 4 (arbitrary tie-breaking)

Step 3: Write the Query

WITH user_login_counts AS (
    SELECT 
        country,
        user_id,
        COUNT(*) AS login_count
    FROM user_logins
    GROUP BY country, user_id
),

ranked_users AS (
    SELECT 
        country,
        user_id,
        login_count,
        DENSE_RANK() OVER (
            PARTITION BY country 
            ORDER BY login_count DESC
        ) AS rank_in_country
    FROM user_login_counts
)

SELECT 
    country,
    user_id,
    login_count
FROM ranked_users
WHERE rank_in_country = 2
ORDER BY country, login_count DESC;

Step 4: Explain Your Thinking

"I used a CTE to first aggregate login counts per user per country, then applied DENSE_RANK to rank users within each country. I chose DENSE_RANK over RANK or ROW_NUMBER because:

  • If two users tie for second, DENSE_RANK returns both (which is typically what the business wants)
  • RANK would skip to rank 4 after a tie at rank 2
  • ROW_NUMBER would arbitrarily pick one user"

Advanced SQL Patterns

Pattern 1: Running Totals and Moving Averages

-- Calculate 7-day moving average of daily active users
WITH daily_metrics AS (
    SELECT 
        date,
        COUNT(DISTINCT user_id) AS dau
    FROM user_activity
    GROUP BY date
)

SELECT 
    date,
    dau,
    AVG(dau) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d,
    SUM(dau) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_users
FROM daily_metrics
ORDER BY date;

Pattern 2: Sessionization

-- Group consecutive events into sessions (30-min gap = new session)
WITH events_with_lag AS (
    SELECT 
        user_id,
        event_time,
        LAG(event_time) OVER (
            PARTITION BY user_id 
            ORDER BY event_time
        ) AS prev_event_time
    FROM user_events
),

session_boundaries AS (
    SELECT 
        user_id,
        event_time,
        CASE 
            WHEN prev_event_time IS NULL 
                 OR event_time - prev_event_time > INTERVAL '30 minutes'
            THEN 1
            ELSE 0
        END AS new_session
    FROM events_with_lag
),

sessions AS (
    SELECT 
        user_id,
        event_time,
        SUM(new_session) OVER (
            PARTITION BY user_id 
            ORDER BY event_time
        ) AS session_id
    FROM session_boundaries
)

SELECT 
    user_id,
    session_id,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end,
    COUNT(*) AS events_in_session,
    EXTRACT(EPOCH FROM (
        MAX(event_time) - MIN(event_time)
    )) / 60 AS session_duration_minutes
FROM sessions
GROUP BY user_id, session_id;

Pattern 3: Cohort Analysis

-- Monthly retention cohort analysis
WITH user_first_activity AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
    FROM user_activity
    GROUP BY user_id
),

monthly_activity AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', activity_date) AS activity_month
    FROM user_activity
    GROUP BY user_id, DATE_TRUNC('month', activity_date)
),

cohort_activity AS (
    SELECT 
        ufa.cohort_month,
        EXTRACT(MONTH FROM AGE(
            ma.activity_month, ufa.cohort_month
        )) AS months_since_signup,
        COUNT(DISTINCT ma.user_id) AS active_users
    FROM user_first_activity ufa
    JOIN monthly_activity ma ON ufa.user_id = ma.user_id
    GROUP BY ufa.cohort_month, months_since_signup
)

SELECT 
    cohort_month,
    MAX(CASE WHEN months_since_signup = 0 THEN active_users END) AS month_0,
    MAX(CASE WHEN months_since_signup = 1 THEN active_users END) AS month_1,
    MAX(CASE WHEN months_since_signup = 2 THEN active_users END) AS month_2,
    MAX(CASE WHEN months_since_signup = 3 THEN active_users END) AS month_3,
    ROUND(
        MAX(CASE WHEN months_since_signup = 1 THEN active_users END)::FLOAT /
        MAX(CASE WHEN months_since_signup = 0 THEN active_users END),
        3
    ) AS retention_rate_m1
FROM cohort_activity
GROUP BY cohort_month
ORDER BY cohort_month;

Pattern 4: Gap and Islands (Consecutive Days)

-- Find consecutive days where a user was active
WITH daily_active AS (
    SELECT DISTINCT 
        user_id, 
        activity_date
    FROM user_activity
),

dated_groups AS (
    SELECT 
        user_id,
        activity_date,
        activity_date - INTERVAL '1 day' * (
            ROW_NUMBER() OVER (
                PARTITION BY user_id 
                ORDER BY activity_date
            )
        ) AS group_date
    FROM daily_active
)

SELECT 
    user_id,
    MIN(activity_date) AS streak_start,
    MAX(activity_date) AS streak_end,
    COUNT(*) AS streak_length
FROM dated_groups
GROUP BY user_id, group_date
HAVING COUNT(*) >= 7  -- At least 7 consecutive days
ORDER BY streak_length DESC;

Performance Optimization Tips

πŸ’‘

Meta and Amazon care about query performance because their tables have billions of rows. A slow query isn't just inconvenient β€” it's expensive.

1. Filter Early

-- BAD: Aggregates all data, then filters
SELECT user_id, COUNT(*)
FROM events
GROUP BY user_id
HAVING COUNT(*) > 100;

-- GOOD: Filters partition first, then aggregates
SELECT user_id, COUNT(*)
FROM events
WHERE date >= '2026-01-01'  -- Filter partition first
GROUP BY user_id
HAVING COUNT(*) > 100;

2. Use Approximate Functions for Large Data

-- For billion-row tables, exact counts are expensive
-- Use HyperLogLog for approximate distinct counts
SELECT 
    date,
    approx_count_distinct(user_id) AS approx_dau
FROM events
GROUP BY date;

-- Use T-Digest for approximate percentiles
SELECT 
    date,
    approx_percentile(duration_ms, 0.95) AS p95_duration
FROM events
GROUP BY date;

3. Avoid Cartesian Products

-- BAD: Implicit cross join
SELECT *
FROM orders, customers;

-- GOOD: Explicit join with condition
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id;

4. Use CTEs for Readability (But Watch Performance)

-- CTEs are readable but may not be optimized in all databases
-- In PostgreSQL, CTEs are optimization fences (materialized)
-- In some databases, CTEs are inlined

-- If performance matters, consider:
-- 1. Temporary tables for large intermediate results
-- 2. Inline subqueries for simple cases
-- 3. CTEs for readability when performance isn't critical

Common Mistakes in SQL Interviews

1. Not Handling NULLs

-- NULLs break aggregations and comparisons
-- Always consider: COUNT(*) vs COUNT(column), NULL comparisons

-- BAD: This returns NULL, not 0
SELECT user_id, COUNT(DISTINCT order_id)
FROM orders
GROUP BY user_id;

-- GOOD: COALESCE to handle NULLs
SELECT 
    user_id, 
    COALESCE(COUNT(DISTINCT order_id), 0) AS order_count
FROM orders
GROUP BY user_id;

2. Wrong Join Type

-- INNER JOIN excludes users with no orders
-- LEFT JOIN includes all users

-- BAD: Only returns users who have orders
SELECT u.user_id, COUNT(o.order_id)
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id;

-- GOOD: Returns all users, even those with no orders
SELECT 
    u.user_id, 
    COALESCE(COUNT(o.order_id), 0) AS order_count
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id;

3. Not Considering Data Types

-- Integer division truncates
SELECT 5 / 2;  -- Returns 2, not 2.5

-- GOOD: Cast to float
SELECT 5.0 / 2;  -- Returns 2.5
SELECT CAST(5 AS FLOAT) / 2;  -- Returns 2.5

Interview Strategy

Communication Tips

  1. Read the problem aloud β€” Show you understand it
  2. Ask clarifying questions β€” Shows thoroughness
  3. Explain your approach before coding β€” Shows structured thinking
  4. Talk through your code as you write it β€” Shows communication skills
  5. Test your query mentally β€” Shows attention to detail
  6. Discuss optimizations β€” Shows performance awareness

Time Management

  • Minutes 1-2: Understand the problem
  • Minutes 3-4: Design your approach
  • Minutes 5-12: Write the query
  • Minutes 13-15: Review and optimize

Quiz: Test Your Understanding


Related Topics

Advertisement