Why SQL Matters for Data Engineers
SQL is the lingua franca of data. Whether you're building ETL pipelines, writing transformations, debugging data quality issues, or optimizing queries β SQL is your primary tool. Data engineers who master SQL are significantly more productive and effective.
DfSQL Execution Order
SQL queries are processed in a specific order that differs from how they are written:
FROM β WHERE β GROUP BY β HAVING β SELECT β ORDER BY β LIMIT
Understanding this order is crucial for writing correct and efficient queries.
Basic SELECT Queries
Retrieving Data
-- Select specific columns
SELECT first_name, last_name, email
FROM customers;
-- Select all columns
SELECT *
FROM orders;
-- Select with aliases
SELECT
first_name AS "First Name",
last_name AS "Last Name",
email AS "Email Address"
FROM customers;
-- Select distinct values
SELECT DISTINCT country
FROM customers;
-- Select with LIMIT
SELECT *
FROM orders
ORDER BY order_date DESC
LIMIT 100;
Filtering with WHERE
-- Comparison operators
SELECT *
FROM orders
WHERE status = 'completed'
AND amount > 100.00;
-- IN operator
SELECT *
FROM products
WHERE category_id IN (1, 3, 5);
-- BETWEEN operator
SELECT *
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
-- LIKE operator
SELECT *
FROM customers
WHERE email LIKE '%@gmail.com';
-- IS NULL check
SELECT *
FROM customers
WHERE phone_number IS NULL;
-- NOT operators
SELECT *
FROM orders
WHERE status != 'cancelled'
AND amount NOT BETWEEN 0 AND 10;
Sorting and Limiting
-- ORDER BY
SELECT *
FROM orders
ORDER BY order_date DESC, amount ASC;
-- LIMIT and OFFSET (pagination)
SELECT *
FROM orders
ORDER BY order_id
LIMIT 20 OFFSET 40; -- Page 3 (records 41-60)
-- TOP N per group (using window function)
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY rating DESC) as rank
FROM products
) ranked
WHERE rank <= 3;
Pagination Formula
For paginated queries, calculate OFFSET as:
OFFSET = (Page_Number - 1) Γ Page_Size
Example: For page 3 with 20 records per page:
OFFSET = (3 - 1) Γ 20 = 40
Aggregate Functions
-- Basic aggregations
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
MIN(amount) AS min_order,
MAX(amount) AS max_order
FROM orders;
-- GROUP BY
SELECT
status,
COUNT(*) AS order_count,
AVG(amount) AS avg_amount
FROM orders
GROUP BY status;
-- GROUP BY with HAVING (filter after aggregation)
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 5
AND SUM(amount) > 1000;
-- Multiple grouping columns
SELECT
country,
city,
COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;
Aggregate Functions
Common aggregate functions and their mathematical definitions:
- COUNT(column):
Count = Ξ£(1 for each non-null value) - SUM(column):
Sum = Ξ£(value_i for all i) - AVG(column):
Average = (Ξ£(value_i)) / Count - MIN(column):
Min = min(value_i for all i) - MAX(column):
Max = max(value_i for all i)
JOIN Operations
JOIN Types Visualized
JOIN Performance Comparison
| JOIN Type | Performance | Use Case | Index Recommendation |
|---|---|---|---|
| INNER JOIN | Excellent | Most common, filtered results | Index both join columns |
| LEFT JOIN | Good | Keep all rows from left table | Index right table join column |
| RIGHT JOIN | Good | Keep all rows from right table | Index left table join column |
| FULL OUTER JOIN | Moderate | All rows from both tables | Index both join columns |
| CROSS JOIN | Poor | Cartesian product (avoid large sets) | Rarely indexed |
| SELF JOIN | Variable | Compare rows within same table | Index the join column |
Practical JOIN Examples
-- INNER JOIN: Orders with customer information
SELECT
o.order_id,
o.order_date,
o.amount,
c.first_name,
c.last_name,
c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- LEFT JOIN: All customers, even those without orders
SELECT
c.customer_id,
c.first_name,
c.last_name,
COUNT(o.order_id) AS total_orders,
COALESCE(SUM(o.amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;
-- Multiple JOINs
SELECT
o.order_id,
c.first_name AS customer_name,
p.product_name,
oi.quantity,
oi.unit_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;
-- Self JOIN: Find customers in the same city
SELECT
c1.first_name AS customer_1,
c2.first_name AS customer_2,
c1.city
FROM customers c1
INNER JOIN customers c2 ON c1.city = c2.city
WHERE c1.customer_id < c2.customer_id;
JOIN Performance Tip
For optimal JOIN performance:
- Always index join columns
- Filter data before joining when possible
- Use INNER JOIN when you only need matching rows
- Avoid joining large tables without filters
Subqueries
-- Scalar subquery: Get orders above average
SELECT *
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);
-- IN subquery: Customers who placed orders
SELECT *
FROM customers
WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders);
-- Correlated subquery: Latest order per customer
SELECT *
FROM orders o1
WHERE order_date = (
SELECT MAX(order_date)
FROM orders o2
WHERE o2.customer_id = o1.customer_id
);
-- EXISTS subquery: Customers with high-value orders
SELECT *
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 1000
);
-- Subquery in FROM clause
SELECT
customer_id,
total_spent,
RANK() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM (
SELECT
customer_id,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
) customer_totals;
Subquery vs JOIN Performance
# Example: Analyzing query performance
import pandas as pd
from typing import Dict
def analyze_query_performance(
query_type: str, # 'subquery' or 'join'
table_sizes: Dict[str, int], # Table sizes in rows
index_exists: bool = False # Whether indexes exist
) -> Dict:
"""Estimate relative query performance."""
# Base cost factors
subquery_base_cost = 1.0
join_base_cost = 0.8
# Size impact
size_factor = sum(table_sizes.values()) / 1000000 # Normalize to millions
# Index impact
index_factor = 0.3 if index_exists else 1.0
if query_type == 'subquery':
# Correlated subqueries scale with outer table size
estimated_cost = subquery_base_cost * size_factor * size_factor * index_factor
else:
# JOINs scale more linearly
estimated_cost = join_base_cost * size_factor * index_factor
return {
'query_type': query_type,
'estimated_cost': round(estimated_cost, 2),
'recommendation': 'Use JOIN for large datasets' if size_factor > 10 else 'Either approach is fine'
}
# Example usage
result = analyze_query_performance(
query_type='subquery',
table_sizes={'orders': 1000000, 'customers': 100000},
index_exists=True
)
print(f"Estimated cost: {result['estimated_cost']}")
print(f"Recommendation: {result['recommendation']}")
Window Functions
Window functions perform calculations across a set of rows related to the current row β without collapsing them like GROUP BY.
-- ROW_NUMBER: Unique sequential numbering
SELECT
order_id,
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS order_sequence
FROM orders;
-- RANK and DENSE_RANK: Ranking with ties
SELECT
product_name,
category,
total_sales,
RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS dense_rank
FROM product_sales;
-- Running total (cumulative sum)
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue
FROM daily_sales;
-- Moving average (window frame)
SELECT
order_date,
daily_revenue,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_avg
FROM daily_sales;
-- LAG and LEAD: Access previous/next rows
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;
Window Function Reference
| Function | Description | Use Case | Performance Impact |
|---|---|---|---|
| ROW_NUMBER() | Sequential number, no ties | Deduplication, pagination | Low |
| RANK() | Rank with gaps for ties | Leaderboards | Low |
| DENSE_RANK() | Rank without gaps | Continuous ranking | Low |
| NTILE(n) | Divide into n buckets | Percentiles, quartiles | Low |
| LAG(col, n) | Value from n rows before | Time series comparison | Low |
| LEAD(col, n) | Value from n rows after | Time series comparison | Low |
| SUM() OVER | Cumulative sum | Running totals | Medium |
| AVG() OVER | Moving average | Trend analysis | Medium |
Window Function Framing
Window frame specifications define which rows are included in the calculation:
ROWS BETWEEN <start> AND <end>
Where:
- UNBOUNDED PRECEDING: Start from first row
- n PRECEDING: Start from n rows before current
- CURRENT ROW: Include only current row
- n FOLLOWING: End at n rows after current
- UNBOUNDED FOLLOWING: End at last row
Example: 7-day moving average
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
Common Table Expressions (CTEs)
CTEs make complex queries readable and reusable.
-- Basic CTE
WITH customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT
c.first_name,
c.last_name,
co.order_count,
co.total_spent
FROM customers c
INNER JOIN customer_orders co ON c.customer_id = co.customer_id
WHERE co.order_count >= 5;
-- Multiple CTEs
WITH
monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
),
monthly_avg AS (
SELECT
AVG(revenue) AS avg_monthly_revenue
FROM monthly_sales
)
SELECT
ms.month,
ms.revenue,
ma.avg_monthly_revenue,
ms.revenue - ma.avg_monthly_revenue AS variance
FROM monthly_sales ms
CROSS JOIN monthly_avg ma;
-- CTE for data quality checks
WITH order_validation AS (
SELECT
order_id,
CASE
WHEN amount <= 0 THEN 'INVALID_AMOUNT'
WHEN order_date > CURRENT_DATE THEN 'FUTURE_DATE'
WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER'
ELSE 'VALID'
END AS validation_status
FROM orders
)
SELECT
validation_status,
COUNT(*) AS record_count
FROM order_validation
GROUP BY validation_status;
CTE vs Subquery Performance
| Aspect | CTE | Subquery |
|---|---|---|
| Readability | Excellent | Poor for complex queries |
| Reusability | Can reference multiple times | Single use |
| Performance | Same (optimizer treats similarly) | Same |
| Materialization | Not materialized (PostgreSQL) | Not materialized |
| Recursion | Supports recursive CTEs | Does not support |
Indexes and Performance Basics
What is an Index?
An index is a data structure that speeds up data retrieval at the cost of additional storage and write overhead.
-- Create index on frequently queried columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
-- Composite index (multiple columns)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);
-- Partial index (PostgreSQL)
CREATE INDEX idx_orders_pending ON orders(order_date)
WHERE status = 'pending';
-- Check if index is being used
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;
Index Types and Use Cases
| Index Type | Best For | Example |
|---|---|---|
| B-tree | Equality and range queries | WHERE id = 123, WHERE date > '2024-01-01' |
| Hash | Equality queries only | WHERE hash_col = 'abc' |
| GIN | Full-text search, arrays | WHERE tags @> '{python}' |
| GiST | Geometric, full-text | WHERE location @> point(1,1) |
| BRIN | Large, naturally ordered tables | WHERE timestamp > '2024-01-01' |
Index Selection Rule of Thumb
Create indexes on columns that:
- Appear in WHERE clauses frequently
- Are used in JOIN conditions
- Are used in ORDER BY or GROUP BY
- Have high cardinality (many unique values)
Avoid indexing:
- Small tables (full scan is faster)
- Columns with low cardinality (e.g., boolean)
- Columns that are frequently updated
EXPLAIN ANALYZE
-- Before optimization
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
-- Output might show:
-- Seq Scan on orders (cost=0.00..1234.00 rows=50000)
-- Filter: (order_date >= '2024-01-01')
-- Rows Removed by Filter: 950000
-- Hash Join (cost=234.56..1567.89 rows=50000)
-- -> Seq Scan on customers (cost=0.00..234.56 rows=10000)
EXPLAIN Output Interpretation
-- Example EXPLAIN ANALYZE output
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
/*
Index Scan using idx_orders_customer_id on orders (cost=0.43..8.45 rows=5 width=120)
(actual time=0.025..0.026 rows=5 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=4
Planning Time: 0.078 ms
Execution Time: 0.045 ms
*/
-- Key metrics to analyze:
-- 1. cost: Estimated cost (first number is startup, second is total)
-- 2. rows: Estimated vs actual rows
-- 3. actual time: Actual execution time
-- 4. loops: Number of times the node was executed
-- 5. Buffers: I/O operations (shared = in cache, hit = read from cache)
SQL Best Practices for Data Engineers
Naming Conventions
-- Use snake_case for identifiers
CREATE TABLE order_items ( -- Good
order_item_id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Use descriptive names
total_revenue -- Good
revenue -- Acceptable
rev -- Bad (ambiguous)
Writing Readable SQL
-- Bad: All on one line
SELECT o.order_id,c.first_name,c.last_name,p.product_name,oi.quantity,oi.unit_price FROM orders o INNER JOIN customers c ON o.customer_id=c.customer_id INNER JOIN order_items oi ON o.order_id=oi.order_id INNER JOIN products p ON oi.product_id=p.product_id WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31' AND o.status='completed';
-- Good: Formatted and readable
SELECT
o.order_id,
c.first_name,
c.last_name,
p.product_name,
oi.quantity,
oi.unit_price
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
INNER JOIN order_items oi
ON o.order_id = oi.order_id
INNER JOIN products p
ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
AND o.status = 'completed';
Common Performance Issues
| Issue | Symptom | Solution |
|---|---|---|
| Sequential scan | Full table scan on large tables | Add appropriate index |
| Missing join index | Slow JOINs on large tables | Index foreign keys |
| SELECT * | Fetching unnecessary columns | Select only needed columns |
| N+1 queries | Many small queries in loops | Use JOINs or batch queries |
| Unparameterized queries | Plan cache pollution | Use parameterized queries |
SQL Performance Formula
Query performance can be estimated by:
Execution_Time = (Rows_Scanned Γ Row_Size Γ I/O_Factor) + (Rows_Returned Γ Processing_Factor)
Where:
- Rows_Scanned: Number of rows examined
- Row_Size: Average size of each row
- I/O_Factor: Time to read from disk/cache
- Rows_Returned: Number of rows returned
- Processing_Factor: CPU time per row
To optimize: minimize Rows_Scanned (indexes) and Rows_Returned (filters).
Best Practices Checklist
# Example: SQL best practices checklist
from dataclasses import dataclass
from typing import List, Dict
from enum import Enum
class PracticeCategory(Enum):
PERFORMANCE = "performance"
READABILITY = "readability"
MAINTAINABILITY = "maintainability"
SECURITY = "security"
@dataclass
class BestPractice:
"""SQL best practice item."""
category: PracticeCategory
practice: str
description: str
check_query: str # Query to check if practice is followed
# Define best practices
best_practices = [
BestPractice(
category=PracticeCategory.PERFORMANCE,
practice="Avoid SELECT *",
description="Only select columns you need",
check_query="""
SELECT * FROM information_schema.columns
WHERE table_name = 'your_table'
AND column_name NOT IN ('needed_col1', 'needed_col2')
"""
),
BestPractice(
category=PracticeCategory.PERFORMANCE,
practice="Use appropriate indexes",
description="Index columns used in WHERE, JOIN, ORDER BY",
check_query="""
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'your_table'
"""
),
BestPractice(
category=PracticeCategory.READABILITY,
practice="Use consistent formatting",
description="Use consistent indentation and casing",
check_query="-- Manual review required"
),
BestPractice(
category=PracticeCategory.SECURITY,
practice="Use parameterized queries",
description="Never concatenate user input into queries",
check_query="-- Code review required"
),
]
# Generate checklist
def generate_checklist() -> str:
"""Generate a SQL best practices checklist."""
checklist = []
for practice in best_practices:
checklist.append(f"[ ] {practice.practice}")
checklist.append(f" Category: {practice.category.value}")
checklist.append(f" Description: {practice.description}")
checklist.append("")
return "\n".join(checklist)
print(generate_checklist())
Key Takeaways
- SQL is essential for every data engineer β it's the primary language for data manipulation
- Master the basics first: SELECT, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT
- Window functions are powerful β learn ROW_NUMBER, RANK, LAG/LEAD, and cumulative aggregates
- CTEs improve readability β use them to break complex queries into logical steps
- Indexes are critical β understand when and how to index for performance
- Always use EXPLAIN ANALYZE β understand your query execution plans
- Write readable SQL β format consistently, use aliases, comment complex logic
Practice Exercises
-
Basic queries: Write a query to find the top 10 customers by total order amount in the last 30 days.
-
Joins: Write a query showing all products that have never been ordered, along with the number of times each product has been ordered.
-
Window functions: Write a query that calculates the running total of revenue by day, and the percentage change from the previous day.
-
CTEs: Using CTEs, write a query that identifies customers whose average order value is above the company-wide average.
-
Performance: Take a slow query from your work, run EXPLAIN ANALYZE, identify the bottleneck, and optimize it with indexes or query restructuring.
See Also
- Advanced SQL β Advanced SQL techniques
- Data Lifecycle β Understanding the data lifecycle
- What is Data Engineering β Introduction to data engineering
- Python for Data Engineers β Python libraries and patterns
- Databases Fundamentals β Relational vs NoSQL databases