SQL Fundamentals for Data Engineers

Why SQL Matters for Data Engineers

SQL is the lingua franca of data. Whether you're building ETL pipelines, writing transformations, debugging data quality issues, or optimizing queries — SQL is your primary tool. Data engineers who master SQL are significantly more productive and effective.

DfSQL Execution Order

SQL queries are processed in a specific order that differs from how they are written:

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

Understanding this order is crucial for writing correct and efficient queries.

Basic SELECT Queries

Retrieving Data

-- Select specific columns
SELECT first_name, last_name, email
FROM customers;

-- Select all columns
SELECT *
FROM orders;

-- Select with aliases
SELECT 
    first_name AS "First Name",
    last_name AS "Last Name",
    email AS "Email Address"
FROM customers;

-- Select distinct values
SELECT DISTINCT country
FROM customers;

-- Select with LIMIT
SELECT *
FROM orders
ORDER BY order_date DESC
LIMIT 100;

Filtering with WHERE

-- Comparison operators
SELECT *
FROM orders
WHERE status = 'completed'
  AND amount > 100.00;

-- IN operator
SELECT *
FROM products
WHERE category_id IN (1, 3, 5);

-- BETWEEN operator
SELECT *
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

-- LIKE operator
SELECT *
FROM customers
WHERE email LIKE '%@gmail.com';

-- IS NULL check
SELECT *
FROM customers
WHERE phone_number IS NULL;

-- NOT operators
SELECT *
FROM orders
WHERE status != 'cancelled'
  AND amount NOT BETWEEN 0 AND 10;

Sorting and Limiting

-- ORDER BY
SELECT *
FROM orders
ORDER BY order_date DESC, amount ASC;

-- LIMIT and OFFSET (pagination)
SELECT *
FROM orders
ORDER BY order_id
LIMIT 20 OFFSET 40;  -- Page 3 (records 41-60)

-- TOP N per group (using window function)
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY rating DESC) as rank
    FROM products
) ranked
WHERE rank <= 3;

Pagination Formula

For paginated queries, calculate OFFSET as:

OFFSET = (Page_Number - 1) × Page_Size

Example: For page 3 with 20 records per page: OFFSET = (3 - 1) × 20 = 40

Aggregate Functions

-- Basic aggregations
SELECT 
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value,
    MIN(amount) AS min_order,
    MAX(amount) AS max_order
FROM orders;

-- GROUP BY
SELECT 
    status,
    COUNT(*) AS order_count,
    AVG(amount) AS avg_amount
FROM orders
GROUP BY status;

-- GROUP BY with HAVING (filter after aggregation)
SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 5
   AND SUM(amount) > 1000;

-- Multiple grouping columns
SELECT 
    country,
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;

Aggregate Functions

Common aggregate functions and their mathematical definitions:

COUNT(column): Count = Σ(1 for each non-null value)
SUM(column): Sum = Σ(value_i for all i)
AVG(column): Average = (Σ(value_i)) / Count
MIN(column): Min = min(value_i for all i)
MAX(column): Max = max(value_i for all i)

JOIN Operations

JOIN Types Visualized

JOIN Performance Comparison

JOIN Type	Performance	Use Case	Index Recommendation
INNER JOIN	Excellent	Most common, filtered results	Index both join columns
LEFT JOIN	Good	Keep all rows from left table	Index right table join column
RIGHT JOIN	Good	Keep all rows from right table	Index left table join column
FULL OUTER JOIN	Moderate	All rows from both tables	Index both join columns
CROSS JOIN	Poor	Cartesian product (avoid large sets)	Rarely indexed
SELF JOIN	Variable	Compare rows within same table	Index the join column

Practical JOIN Examples

-- INNER JOIN: Orders with customer information
SELECT 
    o.order_id,
    o.order_date,
    o.amount,
    c.first_name,
    c.last_name,
    c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- LEFT JOIN: All customers, even those without orders
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.order_id) AS total_orders,
    COALESCE(SUM(o.amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

-- Multiple JOINs
SELECT 
    o.order_id,
    c.first_name AS customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;

-- Self JOIN: Find customers in the same city
SELECT 
    c1.first_name AS customer_1,
    c2.first_name AS customer_2,
    c1.city
FROM customers c1
INNER JOIN customers c2 ON c1.city = c2.city
WHERE c1.customer_id < c2.customer_id;

JOIN Performance Tip

For optimal JOIN performance:

Always index join columns
Filter data before joining when possible
Use INNER JOIN when you only need matching rows
Avoid joining large tables without filters

Subqueries

-- Scalar subquery: Get orders above average
SELECT *
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);

-- IN subquery: Customers who placed orders
SELECT *
FROM customers
WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders);

-- Correlated subquery: Latest order per customer
SELECT *
FROM orders o1
WHERE order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);

-- EXISTS subquery: Customers with high-value orders
SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 1000
);

-- Subquery in FROM clause
SELECT 
    customer_id,
    total_spent,
    RANK() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM (
    SELECT 
        customer_id,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
) customer_totals;

Subquery vs JOIN Performance

# Example: Analyzing query performance
import pandas as pd
from typing import Dict

def analyze_query_performance(
    query_type: str,           # 'subquery' or 'join'
    table_sizes: Dict[str, int],  # Table sizes in rows
    index_exists: bool = False    # Whether indexes exist
) -> Dict:
    """Estimate relative query performance."""
    
    # Base cost factors
    subquery_base_cost = 1.0
    join_base_cost = 0.8
    
    # Size impact
    size_factor = sum(table_sizes.values()) / 1000000  # Normalize to millions
    
    # Index impact
    index_factor = 0.3 if index_exists else 1.0
    
    if query_type == 'subquery':
        # Correlated subqueries scale with outer table size
        estimated_cost = subquery_base_cost * size_factor * size_factor * index_factor
    else:
        # JOINs scale more linearly
        estimated_cost = join_base_cost * size_factor * index_factor
    
    return {
        'query_type': query_type,
        'estimated_cost': round(estimated_cost, 2),
        'recommendation': 'Use JOIN for large datasets' if size_factor > 10 else 'Either approach is fine'
    }

# Example usage
result = analyze_query_performance(
    query_type='subquery',
    table_sizes={'orders': 1000000, 'customers': 100000},
    index_exists=True
)
print(f"Estimated cost: {result['estimated_cost']}")
print(f"Recommendation: {result['recommendation']}")

Window Functions

Window functions perform calculations across a set of rows related to the current row — without collapsing them like GROUP BY.

-- ROW_NUMBER: Unique sequential numbering
SELECT 
    order_id,
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY order_date DESC
    ) AS order_sequence
FROM orders;

-- RANK and DENSE_RANK: Ranking with ties
SELECT 
    product_name,
    category,
    total_sales,
    RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS dense_rank
FROM product_sales;

-- Running total (cumulative sum)
SELECT 
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_sales;

-- Moving average (window frame)
SELECT 
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM daily_sales;

-- LAG and LEAD: Access previous/next rows
SELECT 
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
    LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;

Window Function Reference

Function	Description	Use Case	Performance Impact
ROW_NUMBER()	Sequential number, no ties	Deduplication, pagination	Low
RANK()	Rank with gaps for ties	Leaderboards	Low
DENSE_RANK()	Rank without gaps	Continuous ranking	Low
NTILE(n)	Divide into n buckets	Percentiles, quartiles	Low
LAG(col, n)	Value from n rows before	Time series comparison	Low
LEAD(col, n)	Value from n rows after	Time series comparison	Low
SUM() OVER	Cumulative sum	Running totals	Medium
AVG() OVER	Moving average	Trend analysis	Medium

Window Function Framing

Window frame specifications define which rows are included in the calculation:

ROWS BETWEEN <start> AND <end>

Where:

UNBOUNDED PRECEDING: Start from first row
n PRECEDING: Start from n rows before current
CURRENT ROW: Include only current row
n FOLLOWING: End at n rows after current
UNBOUNDED FOLLOWING: End at last row

Example: 7-day moving average ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

Common Table Expressions (CTEs)

CTEs make complex queries readable and reusable.

-- Basic CTE
WITH customer_orders AS (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.first_name,
    c.last_name,
    co.order_count,
    co.total_spent
FROM customers c
INNER JOIN customer_orders co ON c.customer_id = co.customer_id
WHERE co.order_count >= 5;

-- Multiple CTEs
WITH 
monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
monthly_avg AS (
    SELECT 
        AVG(revenue) AS avg_monthly_revenue
    FROM monthly_sales
)
SELECT 
    ms.month,
    ms.revenue,
    ma.avg_monthly_revenue,
    ms.revenue - ma.avg_monthly_revenue AS variance
FROM monthly_sales ms
CROSS JOIN monthly_avg ma;

-- CTE for data quality checks
WITH order_validation AS (
    SELECT 
        order_id,
        CASE 
            WHEN amount <= 0 THEN 'INVALID_AMOUNT'
            WHEN order_date > CURRENT_DATE THEN 'FUTURE_DATE'
            WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER'
            ELSE 'VALID'
        END AS validation_status
    FROM orders
)
SELECT 
    validation_status,
    COUNT(*) AS record_count
FROM order_validation
GROUP BY validation_status;

CTE vs Subquery Performance

Aspect	CTE	Subquery
Readability	Excellent	Poor for complex queries
Reusability	Can reference multiple times	Single use
Performance	Same (optimizer treats similarly)	Same
Materialization	Not materialized (PostgreSQL)	Not materialized
Recursion	Supports recursive CTEs	Does not support

Indexes and Performance Basics

What is an Index?

An index is a data structure that speeds up data retrieval at the cost of additional storage and write overhead.

-- Create index on frequently queried columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);

-- Composite index (multiple columns)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);

-- Partial index (PostgreSQL)
CREATE INDEX idx_orders_pending ON orders(order_date) 
WHERE status = 'pending';

-- Check if index is being used
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;

Index Types and Use Cases

Index Type	Best For	Example
B-tree	Equality and range queries	`WHERE id = 123`, `WHERE date > '2024-01-01'`
Hash	Equality queries only	`WHERE hash_col = 'abc'`
GIN	Full-text search, arrays	`WHERE tags @> '{python}'`
GiST	Geometric, full-text	`WHERE location @> point(1,1)`
BRIN	Large, naturally ordered tables	`WHERE timestamp > '2024-01-01'`

Index Selection Rule of Thumb

Create indexes on columns that:

Appear in WHERE clauses frequently
Are used in JOIN conditions
Are used in ORDER BY or GROUP BY
Have high cardinality (many unique values)

Avoid indexing:

Small tables (full scan is faster)
Columns with low cardinality (e.g., boolean)
Columns that are frequently updated

EXPLAIN ANALYZE

-- Before optimization
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';

-- Output might show:
-- Seq Scan on orders (cost=0.00..1234.00 rows=50000)
--   Filter: (order_date >= '2024-01-01')
--   Rows Removed by Filter: 950000
-- Hash Join (cost=234.56..1567.89 rows=50000)
--   -> Seq Scan on customers (cost=0.00..234.56 rows=10000)

EXPLAIN Output Interpretation

-- Example EXPLAIN ANALYZE output
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;

/*
Index Scan using idx_orders_customer_id on orders  (cost=0.43..8.45 rows=5 width=120)
  (actual time=0.025..0.026 rows=5 loops=1)
  Index Cond: (customer_id = 123)
  Buffers: shared hit=4
  Planning Time: 0.078 ms
  Execution Time: 0.045 ms
*/

-- Key metrics to analyze:
-- 1. cost: Estimated cost (first number is startup, second is total)
-- 2. rows: Estimated vs actual rows
-- 3. actual time: Actual execution time
-- 4. loops: Number of times the node was executed
-- 5. Buffers: I/O operations (shared = in cache, hit = read from cache)

SQL Best Practices for Data Engineers

Naming Conventions

-- Use snake_case for identifiers
CREATE TABLE order_items (  -- Good
    order_item_id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Use descriptive names
total_revenue  -- Good
revenue        -- Acceptable
rev            -- Bad (ambiguous)

Writing Readable SQL

-- Bad: All on one line
SELECT o.order_id,c.first_name,c.last_name,p.product_name,oi.quantity,oi.unit_price FROM orders o INNER JOIN customers c ON o.customer_id=c.customer_id INNER JOIN order_items oi ON o.order_id=oi.order_id INNER JOIN products p ON oi.product_id=p.product_id WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31' AND o.status='completed';

-- Good: Formatted and readable
SELECT 
    o.order_id,
    c.first_name,
    c.last_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
INNER JOIN customers c 
    ON o.customer_id = c.customer_id
INNER JOIN order_items oi 
    ON o.order_id = oi.order_id
INNER JOIN products p 
    ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
  AND o.status = 'completed';

Common Performance Issues

Issue	Symptom	Solution
Sequential scan	Full table scan on large tables	Add appropriate index
Missing join index	Slow JOINs on large tables	Index foreign keys
SELECT *	Fetching unnecessary columns	Select only needed columns
N+1 queries	Many small queries in loops	Use JOINs or batch queries
Unparameterized queries	Plan cache pollution	Use parameterized queries

SQL Performance Formula

Query performance can be estimated by:

Execution_Time = (Rows_Scanned × Row_Size × I/O_Factor) + (Rows_Returned × Processing_Factor)

Where:

Rows_Scanned: Number of rows examined
Row_Size: Average size of each row
I/O_Factor: Time to read from disk/cache
Rows_Returned: Number of rows returned
Processing_Factor: CPU time per row

To optimize: minimize Rows_Scanned (indexes) and Rows_Returned (filters).

Best Practices Checklist

# Example: SQL best practices checklist
from dataclasses import dataclass
from typing import List, Dict
from enum import Enum

class PracticeCategory(Enum):
    PERFORMANCE = "performance"
    READABILITY = "readability"
    MAINTAINABILITY = "maintainability"
    SECURITY = "security"

@dataclass
class BestPractice:
    """SQL best practice item."""
    category: PracticeCategory
    practice: str
    description: str
    check_query: str  # Query to check if practice is followed

# Define best practices
best_practices = [
    BestPractice(
        category=PracticeCategory.PERFORMANCE,
        practice="Avoid SELECT *",
        description="Only select columns you need",
        check_query="""
        SELECT * FROM information_schema.columns 
        WHERE table_name = 'your_table' 
        AND column_name NOT IN ('needed_col1', 'needed_col2')
        """
    ),
    BestPractice(
        category=PracticeCategory.PERFORMANCE,
        practice="Use appropriate indexes",
        description="Index columns used in WHERE, JOIN, ORDER BY",
        check_query="""
        SELECT indexname, indexdef 
        FROM pg_indexes 
        WHERE tablename = 'your_table'
        """
    ),
    BestPractice(
        category=PracticeCategory.READABILITY,
        practice="Use consistent formatting",
        description="Use consistent indentation and casing",
        check_query="-- Manual review required"
    ),
    BestPractice(
        category=PracticeCategory.SECURITY,
        practice="Use parameterized queries",
        description="Never concatenate user input into queries",
        check_query="-- Code review required"
    ),
]

# Generate checklist
def generate_checklist() -> str:
    """Generate a SQL best practices checklist."""
    checklist = []
    for practice in best_practices:
        checklist.append(f"[ ] {practice.practice}")
        checklist.append(f"    Category: {practice.category.value}")
        checklist.append(f"    Description: {practice.description}")
        checklist.append("")
    
    return "\n".join(checklist)

print(generate_checklist())

Key Takeaways

SQL is essential for every data engineer — it's the primary language for data manipulation
Master the basics first: SELECT, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT
Window functions are powerful — learn ROW_NUMBER, RANK, LAG/LEAD, and cumulative aggregates
CTEs improve readability — use them to break complex queries into logical steps
Indexes are critical — understand when and how to index for performance
Always use EXPLAIN ANALYZE — understand your query execution plans
Write readable SQL — format consistently, use aliases, comment complex logic

Practice Exercises

Basic queries: Write a query to find the top 10 customers by total order amount in the last 30 days.
Joins: Write a query showing all products that have never been ordered, along with the number of times each product has been ordered.
Window functions: Write a query that calculates the running total of revenue by day, and the percentage change from the previous day.
CTEs: Using CTEs, write a query that identifies customers whose average order value is above the company-wide average.
Performance: Take a slow query from your work, run EXPLAIN ANALYZE, identify the bottleneck, and optimize it with indexes or query restructuring.

SQL Fundamentals for Data Engineers

Why SQL Matters for Data Engineers

DfSQL Execution Order

Basic SELECT Queries

Retrieving Data

Filtering with WHERE

Sorting and Limiting

Aggregate Functions

Aggregate Functions

JOIN Operations

JOIN Types Visualized

JOIN Performance Comparison

Practical JOIN Examples

Subqueries

Subquery vs JOIN Performance

Window Functions

Window Function Reference

Window Function Framing

Common Table Expressions (CTEs)

CTE vs Subquery Performance

Indexes and Performance Basics

What is an Index?

Index Types and Use Cases

EXPLAIN ANALYZE

EXPLAIN Output Interpretation

SQL Best Practices for Data Engineers

Naming Conventions

Writing Readable SQL

Common Performance Issues

SQL Performance Formula

Best Practices Checklist

Key Takeaways

Practice Exercises

See Also

Premium Content

Need Expert Data Engineering Help?