Data Engineering Interview: Comprehensive Preparation Guide
Data engineering interviews test your ability to design, build, and maintain data systems at scale.
Interview Structure
What Interviews Assess:
- SQL proficiency
- System design thinking
- Data modeling skills
- Real-world problem-solving ability
Interview Stages:
| Stage | Format | Duration | Focus |
|---|---|---|---|
| Phone Screen | Behavioral + Basic Technical | 30-45 min | Communication, basic skills |
| Technical Screen | SQL/Python Coding | 45-60 min | Coding proficiency |
| System Design | Whiteboard/Document | 45-60 min | Architecture thinking |
| Behavioral | STAR Format | 30-45 min | Culture fit, leadership |
| Final Round | Mixed | 30-60 min | Cross-functional fit |
Key Insight: Data engineering interviews assess SQL proficiency, system design thinking, data modeling skills, and real-world problem-solving ability.
SQL Interview Questions
Question 1: Top-N per Group
Window functions perform calculations across a set of rows related to the current row. They don't collapse rows like GROUP BY β they add computed columns while preserving the original row structure.
-- Question: Find the top 3 customers by revenue in each region
-- Approach 1: ROW_NUMBER()
SELECT *
FROM (
SELECT
region,
customer_name,
total_revenue,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rank
FROM customer_revenue
) ranked
WHERE rank <= 3;
-- Approach 2: DENSE_RANK() (handles ties)
SELECT *
FROM (
SELECT
region,
customer_name,
total_revenue,
DENSE_RANK() OVER (PARTITION BY region ORDER BY total_revenue DESC) AS rank
FROM customer_revenue
) ranked
WHERE rank <= 3;
-- Key Difference:
-- ROW_NUMBER: 1, 2, 3, 4 (no ties)
-- RANK: 1, 2, 2, 4 (ties share rank, skip)
-- DENSE_RANK: 1, 2, 2, 3 (ties share rank, no skip)
Question 2: Running Total and Moving Average
-- Question: Calculate running total and 7-day moving average
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7day,
COUNT(*) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS window_count
FROM daily_revenue
ORDER BY order_date;
-- Lag/Lead for period-over-period comparison
SELECT
order_date,
daily_revenue,
LAG(daily_revenue, 1) OVER (ORDER BY order_date) AS prev_day_revenue,
daily_revenue - LAG(daily_revenue, 1) OVER (ORDER BY order_date) AS day_over_day_change,
ROUND(
(daily_revenue - LAG(daily_revenue, 1) OVER (ORDER BY order_date)) * 100.0 /
NULLIF(LAG(daily_revenue, 1) OVER (ORDER BY order_date), 0),
2
) AS pct_change
FROM daily_revenue
ORDER BY order_date;
Question 3: Gaps and Islands
-- Question: Find consecutive days where a customer made purchases
WITH daily_purchases AS (
SELECT DISTINCT
customer_id,
DATE(purchase_timestamp) AS purchase_date
FROM purchases
),
numbered AS (
SELECT
customer_id,
purchase_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY purchase_date) AS row_num
FROM daily_purchases
),
islands AS (
SELECT
customer_id,
purchase_date,
DATEADD(day, -row_num, purchase_date) AS island_group
FROM numbered
)
SELECT
customer_id,
MIN(purchase_date) AS streak_start,
MAX(purchase_date) AS streak_end,
DATEDIFF(day, MIN(purchase_date), MAX(purchase_date)) + 1 AS streak_length
FROM islands
GROUP BY customer_id, island_group
HAVING DATEDIFF(day, MIN(purchase_date), MAX(purchase_date)) + 1 >= 3
ORDER BY streak_length DESC;
Question 4: Median and Percentiles
-- Question: Calculate median order value
-- Method 1: PERCENTILE_CONT (Snowflake, BigQuery)
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) AS median_amount,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS p25_amount,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS p75_amount,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_amount) AS p99_amount
FROM fact_orders;
-- Method 2: Manual median calculation
WITH ordered AS (
SELECT
total_amount,
ROW_NUMBER() OVER (ORDER BY total_amount) AS row_num,
COUNT(*) OVER () AS total_rows
FROM fact_orders
)
SELECT AVG(total_amount) AS median_amount
FROM ordered
WHERE row_num IN (FLOOR((total_rows + 1) / 2), CEIL((total_rows + 1) / 2));
System Design Questions
Question: Design a Real-Time Analytics Pipeline
System design questions assess your ability to design scalable, reliable data systems. The framework: (1) Clarify requirements, (2) Estimate scale, (3) Design high-level architecture, (4) Deep-dive components, (5) Discuss trade-offs.
REQUIREMENTS CLARIFICATION:
- Sources: Web events (100M/day), DB CDC (10M/day), API (1M/day)
- Latency: Real-time dashboard (< 1 min), batch reports (daily)
- Users: 100 analysts, 10 data scientists
- Storage: 500 TB over 3 years
- Budget: $50K/month
<PipelineDiagram
title="Real-time ML Feature Pipeline"
stages={[
{ label: "Sources", sublabel: "CDC, Events", icon: "1", color: "#2563eb", bg: "#dbeafe" },
{ label: "Kafka", sublabel: "Buffer & Stream", icon: "2", color: "#7c3aed", bg: "#ede9fe" },
{ label: "Flink", sublabel: "Process & Window", icon: "3", color: "#d97706", bg: "#fef3c7" },
{ label: "Delta Lake", sublabel: "Storage", icon: "4", color: "#059669", bg: "#d1fae5" },
{ label: "Feature Store", sublabel: "Online / Transform", icon: "5", color: "#dc2626", bg: "#fee2e2" },
{ label: "Model Serving", sublabel: "API / BI Tools", icon: "6", color: "#7c3aed", bg: "#ede9fe" }
]}
width={750}
height={160}
/>
DEEP-DIVE COMPONENTS:
1. Ingestion: Kafka with schema registry, 3-day retention
2. Processing: Flink for windowed aggregations, exactly-once
3. Storage: Delta Lake on S3, partitioned by date
4. Serving: dbt for batch, Feature Store for real-time
5. Monitoring: Prometheus + Grafana for pipeline health
Question: Design a Data Warehouse for E-Commerce
REQUIREMENTS:
- 10M orders/month, 100M events/day
- 50 analysts, 5 data scientists
- Sub-second dashboard queries
- 3-year historical retention
STAR SCHEMA DESIGN:
+------------------+ +------------------+ +------------------+
| dim_customer | | dim_product | | dim_date |
| - customer_key | | - product_key | | - date_key |
| - customer_id | | - product_id | | - full_date |
| - name | | - name | | - day_of_week |
| - segment | | - category | | - month |
| - lifetime_value | | - brand | | - quarter |
| (SCD Type 2) | | - unit_cost | | - year |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+----------+-------------+------------------------+
|
+----------v---------+
| fact_orders |
| - order_key |
| - customer_key |
| - product_key |
| - date_key |
| - quantity |
| - net_amount |
| - profit |
+--------------------+
PERFORMANCE OPTIMIZATION:
- Partition fact_orders by date_key
- Cluster by customer_key, product_key
- Materialized views for common aggregations
- Columnar storage (Parquet/Delta Lake)
Behavioral Questions (STAR Format)
Question: Tell me about a time you improved pipeline performance.
STAR = Situation, Task, Action, Result. A structured way to answer behavioral interview questions by describing a specific situation, your task, the actions you took, and the measurable results.
SITUATION:
Our nightly ETL pipeline was taking 6 hours to complete, causing
SLA violations and delaying morning reports.
TASK:
As the lead data engineer, I was responsible for reducing pipeline
duration to under 2 hours while maintaining data quality.
ACTION:
1. Profiled the pipeline to identify bottlenecks
- Found: 80% of time spent in 3 large JOIN operations
2. Implemented incremental processing for fact tables
- Changed from full rebuild to MERGE-based incremental
3. Added partition pruning on date columns
- Reduced data scanned by 95%
4. Parallelized independent stages using Airflow TaskGroups
- Extract stage ran in parallel instead of sequential
5. Right-sized warehouse from Large to Medium
- Queue times were < 1 second, indicating over-provisioning
RESULT:
- Pipeline duration: 6 hours -> 45 minutes (87% reduction)
- Cost: $500/month -> $150/month (70% reduction)
- SLA compliance: 70% -> 100%
- No data quality incidents during the transition
Key Concepts Summary
| Topic | Common Questions | Key Concepts |
|---|---|---|
| SQL | Window functions, CTEs, joins | ROW_NUMBER, LAG, pivots |
| Data Modeling | Star vs snowflake, SCD | Grain, conformed dimensions |
| System Design | Pipeline architecture | Latency, throughput, fault tolerance |
| Big Data | Spark, Kafka, Flink | Distributed computing, partitioning |
| Cloud | Snowflake, BigQuery, Redshift | Compute-storage separation |
| Orchestration | Airflow, Dagster | DAGs, scheduling, retries |
| Testing | Data quality, pipeline testing | Assertions, monitoring |
| Behavioral | Leadership, conflict, impact | STAR framework |
Performance Metrics
| Interview Area | Prep Time | Success Rate | Priority |
|---|---|---|---|
| SQL Coding | 20-40 hours | 70-80% with practice | High |
| System Design | 30-50 hours | 60-70% with practice | High |
| Behavioral | 10-20 hours | 80-90% with practice | Medium |
| Data Modeling | 15-25 hours | 70-80% with practice | High |
| Cloud Platforms | 10-20 hours | 60-70% with practice | Medium |
| Coding (Python) | 20-40 hours | 60-70% with practice | Medium |
Common Mistakes in Data Engineering Interviews
| Mistake | Why It Happens | How to Avoid |
|---|---|---|
| Jumping to solution | Eager to impress | Spend 2-3 minutes understanding the problem first |
| Ignoring scale | Focused on correctness | Always ask about data volume, latency, and user count |
| Forgetting trade-offs | Single-track thinking | Present multiple approaches and discuss pros/cons |
| Poor communication | Technical focus | Explain your thought process out loud |
| Not asking questions | Assuming understanding | Clarify requirements before designing |
| Skipping edge cases | Optimistic thinking | Discuss failure modes, null handling, and boundaries |
| Over-engineering | Impressing with complexity | Start simple, then optimize |
| Under-documenting | Code-focused | Explain why, not just what |
Technical Deep-Dive Topics
The CAP theorem states that a distributed data store can provide at most two of three guarantees: Consistency (all reads return the most recent write), Availability (every request receives a response), and Partition Tolerance (system continues despite network partitions). In practice, distributed systems must handle partitions, choosing between consistency and availability.
- CP Systems (Consistent + Partition-tolerant): ZooKeeper, etcd, HBase
- Sacrifice availability during partitions
- AP Systems (Available + Partition-tolerant): Cassandra, DynamoDB, CouchDB
- Sacrifice consistency during partitions (eventual consistency)
- CA Systems (Consistent + Available): Single-node PostgreSQL
- No partition tolerance (not distributed)
- PACELC: Even when no partition, choose between Latency and Consistency
Eventual consistency guarantees that if no new updates are made, all replicas will eventually converge to the same value. The convergence time depends on the system's replication lag, typically measured in milliseconds to seconds.
-- Window functions interview patterns
-- Pattern 1: Top-N per group
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY value DESC) AS rn
FROM table
) WHERE rn <= N;
-- Pattern 2: Running totals
SELECT *, SUM(value) OVER (ORDER BY date_col ROWS UNBOUNDED PRECEDING) AS running_total
FROM table;
-- Pattern 3: Year-over-year comparison
SELECT *,
value - LAG(value, 12) OVER (ORDER BY month) AS yoy_change,
ROUND((value - LAG(value, 12) OVER (ORDER BY month)) * 100.0 /
NULLIF(LAG(value, 12) OVER (ORDER BY month), 0), 2) AS yoy_pct
FROM monthly_metrics;
-- Pattern 4: Percentile calculations
SELECT *,
PERCENT_RANK() OVER (ORDER BY value) AS percentile,
NTILE(4) OVER (ORDER BY value) AS quartile
FROM table;
10 Best Practices
- Practice SQL daily β solve 2-3 problems on LeetCode/HackerRank per day
- Study system design patterns β know batch vs streaming, push vs pull
- Master window functions β they appear in 80% of SQL interviews
- Prepare STAR stories β have 5-7 stories ready covering different competencies
- Know your resume β be able to explain every project in detail
- Ask clarifying questions β system design requires requirements first
- Draw diagrams β visualize architecture before explaining components
- Discuss trade-offs β show you understand pros/cons of different approaches
- Practice aloud β mock interviews reveal communication gaps
- Research the company β understand their tech stack and data challenges
- SQL proficiency (window functions, CTEs, optimization) is tested in every interview
- System design requires structured thinking: requirements -> architecture -> trade-offs
- STAR framework provides a clear structure for behavioral answers
- Practice under timed conditions to build speed and confidence
- Understanding trade-offs (cost vs performance, consistency vs availability) demonstrates senior thinking
See Also
- Data Warehouse Concepts β Star/snowflake schema for modeling questions
- Performance Optimization β Query tuning for optimization questions
- Real-Time Analytics β Streaming system design patterns
- Data Mesh Architecture β Architecture discussion topics
- Portfolio Projects β Building interview-ready portfolio
- Capstone: End-to-End β Comprehensive project for portfolio