πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

SQL Fundamentals for Data Engineers

Data Engineering FoundationsSQL Basics🟒 Free Lesson

Advertisement

Why SQL Matters for Data Engineers

SQL is the lingua franca of data. Whether you're building ETL pipelines, writing transformations, debugging data quality issues, or optimizing queries β€” SQL is your primary tool. Data engineers who master SQL are significantly more productive and effective.

SQL Query Execution Order1. FROMIdentify tables2. WHEREFilter rows3. GROUP BYAggregate rows4. HAVINGFilter groups5. SELECTChoose columns6. ORDER BYSort results7. LIMITRestrict rowsExecution Order: FROM/JOIN β†’ WHERE β†’ GROUP BY β†’ HAVING β†’ SELECT β†’ ORDER BY β†’ LIMIT

DfSQL Execution Order

SQL queries are processed in a specific order that differs from how they are written:

FROM β†’ WHERE β†’ GROUP BY β†’ HAVING β†’ SELECT β†’ ORDER BY β†’ LIMIT

Understanding this order is crucial for writing correct and efficient queries.

Basic SELECT Queries

Retrieving Data

-- Select specific columns
SELECT first_name, last_name, email
FROM customers;

-- Select all columns
SELECT *
FROM orders;

-- Select with aliases
SELECT 
    first_name AS "First Name",
    last_name AS "Last Name",
    email AS "Email Address"
FROM customers;

-- Select distinct values
SELECT DISTINCT country
FROM customers;

-- Select with LIMIT
SELECT *
FROM orders
ORDER BY order_date DESC
LIMIT 100;

Filtering with WHERE

-- Comparison operators
SELECT *
FROM orders
WHERE status = 'completed'
  AND amount > 100.00;

-- IN operator
SELECT *
FROM products
WHERE category_id IN (1, 3, 5);

-- BETWEEN operator
SELECT *
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

-- LIKE operator
SELECT *
FROM customers
WHERE email LIKE '%@gmail.com';

-- IS NULL check
SELECT *
FROM customers
WHERE phone_number IS NULL;

-- NOT operators
SELECT *
FROM orders
WHERE status != 'cancelled'
  AND amount NOT BETWEEN 0 AND 10;

Sorting and Limiting

-- ORDER BY
SELECT *
FROM orders
ORDER BY order_date DESC, amount ASC;

-- LIMIT and OFFSET (pagination)
SELECT *
FROM orders
ORDER BY order_id
LIMIT 20 OFFSET 40;  -- Page 3 (records 41-60)

-- TOP N per group (using window function)
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY rating DESC) as rank
    FROM products
) ranked
WHERE rank <= 3;

Pagination Formula

For paginated queries, calculate OFFSET as:

OFFSET = (Page_Number - 1) Γ— Page_Size

Example: For page 3 with 20 records per page: OFFSET = (3 - 1) Γ— 20 = 40

Aggregate Functions

-- Basic aggregations
SELECT 
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value,
    MIN(amount) AS min_order,
    MAX(amount) AS max_order
FROM orders;

-- GROUP BY
SELECT 
    status,
    COUNT(*) AS order_count,
    AVG(amount) AS avg_amount
FROM orders
GROUP BY status;

-- GROUP BY with HAVING (filter after aggregation)
SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 5
   AND SUM(amount) > 1000;

-- Multiple grouping columns
SELECT 
    country,
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;

Aggregate Functions

Common aggregate functions and their mathematical definitions:

  • COUNT(column): Count = Ξ£(1 for each non-null value)
  • SUM(column): Sum = Ξ£(value_i for all i)
  • AVG(column): Average = (Ξ£(value_i)) / Count
  • MIN(column): Min = min(value_i for all i)
  • MAX(column): Max = max(value_i for all i)

JOIN Operations

JOIN Types Visualized

JOIN Types Venn Diagram ComparisonTable ATable BINNERINNER JOINTable ATable BLEFTLEFT JOINTable ATable BRIGHTRIGHT JOIN

JOIN Performance Comparison

JOIN TypePerformanceUse CaseIndex Recommendation
INNER JOINExcellentMost common, filtered resultsIndex both join columns
LEFT JOINGoodKeep all rows from left tableIndex right table join column
RIGHT JOINGoodKeep all rows from right tableIndex left table join column
FULL OUTER JOINModerateAll rows from both tablesIndex both join columns
CROSS JOINPoorCartesian product (avoid large sets)Rarely indexed
SELF JOINVariableCompare rows within same tableIndex the join column

Practical JOIN Examples

-- INNER JOIN: Orders with customer information
SELECT 
    o.order_id,
    o.order_date,
    o.amount,
    c.first_name,
    c.last_name,
    c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- LEFT JOIN: All customers, even those without orders
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.order_id) AS total_orders,
    COALESCE(SUM(o.amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

-- Multiple JOINs
SELECT 
    o.order_id,
    c.first_name AS customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;

-- Self JOIN: Find customers in the same city
SELECT 
    c1.first_name AS customer_1,
    c2.first_name AS customer_2,
    c1.city
FROM customers c1
INNER JOIN customers c2 ON c1.city = c2.city
WHERE c1.customer_id < c2.customer_id;

JOIN Performance Tip

For optimal JOIN performance:

  1. Always index join columns
  2. Filter data before joining when possible
  3. Use INNER JOIN when you only need matching rows
  4. Avoid joining large tables without filters

Subqueries

-- Scalar subquery: Get orders above average
SELECT *
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);

-- IN subquery: Customers who placed orders
SELECT *
FROM customers
WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders);

-- Correlated subquery: Latest order per customer
SELECT *
FROM orders o1
WHERE order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);

-- EXISTS subquery: Customers with high-value orders
SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 1000
);

-- Subquery in FROM clause
SELECT 
    customer_id,
    total_spent,
    RANK() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM (
    SELECT 
        customer_id,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
) customer_totals;

Subquery vs JOIN Performance

# Example: Analyzing query performance
import pandas as pd
from typing import Dict

def analyze_query_performance(
    query_type: str,           # 'subquery' or 'join'
    table_sizes: Dict[str, int],  # Table sizes in rows
    index_exists: bool = False    # Whether indexes exist
) -> Dict:
    """Estimate relative query performance."""
    
    # Base cost factors
    subquery_base_cost = 1.0
    join_base_cost = 0.8
    
    # Size impact
    size_factor = sum(table_sizes.values()) / 1000000  # Normalize to millions
    
    # Index impact
    index_factor = 0.3 if index_exists else 1.0
    
    if query_type == 'subquery':
        # Correlated subqueries scale with outer table size
        estimated_cost = subquery_base_cost * size_factor * size_factor * index_factor
    else:
        # JOINs scale more linearly
        estimated_cost = join_base_cost * size_factor * index_factor
    
    return {
        'query_type': query_type,
        'estimated_cost': round(estimated_cost, 2),
        'recommendation': 'Use JOIN for large datasets' if size_factor > 10 else 'Either approach is fine'
    }

# Example usage
result = analyze_query_performance(
    query_type='subquery',
    table_sizes={'orders': 1000000, 'customers': 100000},
    index_exists=True
)
print(f"Estimated cost: {result['estimated_cost']}")
print(f"Recommendation: {result['recommendation']}")

Window Functions

Window functions perform calculations across a set of rows related to the current row β€” without collapsing them like GROUP BY.

-- ROW_NUMBER: Unique sequential numbering
SELECT 
    order_id,
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY order_date DESC
    ) AS order_sequence
FROM orders;

-- RANK and DENSE_RANK: Ranking with ties
SELECT 
    product_name,
    category,
    total_sales,
    RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS dense_rank
FROM product_sales;

-- Running total (cumulative sum)
SELECT 
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_sales;

-- Moving average (window frame)
SELECT 
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM daily_sales;

-- LAG and LEAD: Access previous/next rows
SELECT 
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
    LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;

Window Function Reference

FunctionDescriptionUse CasePerformance Impact
ROW_NUMBER()Sequential number, no tiesDeduplication, paginationLow
RANK()Rank with gaps for tiesLeaderboardsLow
DENSE_RANK()Rank without gapsContinuous rankingLow
NTILE(n)Divide into n bucketsPercentiles, quartilesLow
LAG(col, n)Value from n rows beforeTime series comparisonLow
LEAD(col, n)Value from n rows afterTime series comparisonLow
SUM() OVERCumulative sumRunning totalsMedium
AVG() OVERMoving averageTrend analysisMedium

Window Function Framing

Window frame specifications define which rows are included in the calculation:

ROWS BETWEEN <start> AND <end>

Where:

  • UNBOUNDED PRECEDING: Start from first row
  • n PRECEDING: Start from n rows before current
  • CURRENT ROW: Include only current row
  • n FOLLOWING: End at n rows after current
  • UNBOUNDED FOLLOWING: End at last row

Example: 7-day moving average ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

Common Table Expressions (CTEs)

CTEs make complex queries readable and reusable.

-- Basic CTE
WITH customer_orders AS (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.first_name,
    c.last_name,
    co.order_count,
    co.total_spent
FROM customers c
INNER JOIN customer_orders co ON c.customer_id = co.customer_id
WHERE co.order_count >= 5;

-- Multiple CTEs
WITH 
monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
monthly_avg AS (
    SELECT 
        AVG(revenue) AS avg_monthly_revenue
    FROM monthly_sales
)
SELECT 
    ms.month,
    ms.revenue,
    ma.avg_monthly_revenue,
    ms.revenue - ma.avg_monthly_revenue AS variance
FROM monthly_sales ms
CROSS JOIN monthly_avg ma;

-- CTE for data quality checks
WITH order_validation AS (
    SELECT 
        order_id,
        CASE 
            WHEN amount <= 0 THEN 'INVALID_AMOUNT'
            WHEN order_date > CURRENT_DATE THEN 'FUTURE_DATE'
            WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER'
            ELSE 'VALID'
        END AS validation_status
    FROM orders
)
SELECT 
    validation_status,
    COUNT(*) AS record_count
FROM order_validation
GROUP BY validation_status;

CTE vs Subquery Performance

AspectCTESubquery
ReadabilityExcellentPoor for complex queries
ReusabilityCan reference multiple timesSingle use
PerformanceSame (optimizer treats similarly)Same
MaterializationNot materialized (PostgreSQL)Not materialized
RecursionSupports recursive CTEsDoes not support

Indexes and Performance Basics

What is an Index?

An index is a data structure that speeds up data retrieval at the cost of additional storage and write overhead.

-- Create index on frequently queried columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);

-- Composite index (multiple columns)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);

-- Partial index (PostgreSQL)
CREATE INDEX idx_orders_pending ON orders(order_date) 
WHERE status = 'pending';

-- Check if index is being used
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;

Index Types and Use Cases

Index TypeBest ForExample
B-treeEquality and range queriesWHERE id = 123, WHERE date > '2024-01-01'
HashEquality queries onlyWHERE hash_col = 'abc'
GINFull-text search, arraysWHERE tags @> '{python}'
GiSTGeometric, full-textWHERE location @> point(1,1)
BRINLarge, naturally ordered tablesWHERE timestamp > '2024-01-01'

Index Selection Rule of Thumb

Create indexes on columns that:

  1. Appear in WHERE clauses frequently
  2. Are used in JOIN conditions
  3. Are used in ORDER BY or GROUP BY
  4. Have high cardinality (many unique values)

Avoid indexing:

  1. Small tables (full scan is faster)
  2. Columns with low cardinality (e.g., boolean)
  3. Columns that are frequently updated

EXPLAIN ANALYZE

-- Before optimization
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';

-- Output might show:
-- Seq Scan on orders (cost=0.00..1234.00 rows=50000)
--   Filter: (order_date >= '2024-01-01')
--   Rows Removed by Filter: 950000
-- Hash Join (cost=234.56..1567.89 rows=50000)
--   -> Seq Scan on customers (cost=0.00..234.56 rows=10000)

EXPLAIN Output Interpretation

-- Example EXPLAIN ANALYZE output
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;

/*
Index Scan using idx_orders_customer_id on orders  (cost=0.43..8.45 rows=5 width=120)
  (actual time=0.025..0.026 rows=5 loops=1)
  Index Cond: (customer_id = 123)
  Buffers: shared hit=4
  Planning Time: 0.078 ms
  Execution Time: 0.045 ms
*/

-- Key metrics to analyze:
-- 1. cost: Estimated cost (first number is startup, second is total)
-- 2. rows: Estimated vs actual rows
-- 3. actual time: Actual execution time
-- 4. loops: Number of times the node was executed
-- 5. Buffers: I/O operations (shared = in cache, hit = read from cache)

SQL Best Practices for Data Engineers

Naming Conventions

-- Use snake_case for identifiers
CREATE TABLE order_items (  -- Good
    order_item_id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Use descriptive names
total_revenue  -- Good
revenue        -- Acceptable
rev            -- Bad (ambiguous)

Writing Readable SQL

-- Bad: All on one line
SELECT o.order_id,c.first_name,c.last_name,p.product_name,oi.quantity,oi.unit_price FROM orders o INNER JOIN customers c ON o.customer_id=c.customer_id INNER JOIN order_items oi ON o.order_id=oi.order_id INNER JOIN products p ON oi.product_id=p.product_id WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31' AND o.status='completed';

-- Good: Formatted and readable
SELECT 
    o.order_id,
    c.first_name,
    c.last_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
INNER JOIN customers c 
    ON o.customer_id = c.customer_id
INNER JOIN order_items oi 
    ON o.order_id = oi.order_id
INNER JOIN products p 
    ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
  AND o.status = 'completed';

Common Performance Issues

IssueSymptomSolution
Sequential scanFull table scan on large tablesAdd appropriate index
Missing join indexSlow JOINs on large tablesIndex foreign keys
SELECT *Fetching unnecessary columnsSelect only needed columns
N+1 queriesMany small queries in loopsUse JOINs or batch queries
Unparameterized queriesPlan cache pollutionUse parameterized queries

SQL Performance Formula

Query performance can be estimated by:

Execution_Time = (Rows_Scanned Γ— Row_Size Γ— I/O_Factor) + (Rows_Returned Γ— Processing_Factor)

Where:

  • Rows_Scanned: Number of rows examined
  • Row_Size: Average size of each row
  • I/O_Factor: Time to read from disk/cache
  • Rows_Returned: Number of rows returned
  • Processing_Factor: CPU time per row

To optimize: minimize Rows_Scanned (indexes) and Rows_Returned (filters).

Best Practices Checklist

# Example: SQL best practices checklist
from dataclasses import dataclass
from typing import List, Dict
from enum import Enum

class PracticeCategory(Enum):
    PERFORMANCE = "performance"
    READABILITY = "readability"
    MAINTAINABILITY = "maintainability"
    SECURITY = "security"

@dataclass
class BestPractice:
    """SQL best practice item."""
    category: PracticeCategory
    practice: str
    description: str
    check_query: str  # Query to check if practice is followed

# Define best practices
best_practices = [
    BestPractice(
        category=PracticeCategory.PERFORMANCE,
        practice="Avoid SELECT *",
        description="Only select columns you need",
        check_query="""
        SELECT * FROM information_schema.columns 
        WHERE table_name = 'your_table' 
        AND column_name NOT IN ('needed_col1', 'needed_col2')
        """
    ),
    BestPractice(
        category=PracticeCategory.PERFORMANCE,
        practice="Use appropriate indexes",
        description="Index columns used in WHERE, JOIN, ORDER BY",
        check_query="""
        SELECT indexname, indexdef 
        FROM pg_indexes 
        WHERE tablename = 'your_table'
        """
    ),
    BestPractice(
        category=PracticeCategory.READABILITY,
        practice="Use consistent formatting",
        description="Use consistent indentation and casing",
        check_query="-- Manual review required"
    ),
    BestPractice(
        category=PracticeCategory.SECURITY,
        practice="Use parameterized queries",
        description="Never concatenate user input into queries",
        check_query="-- Code review required"
    ),
]

# Generate checklist
def generate_checklist() -> str:
    """Generate a SQL best practices checklist."""
    checklist = []
    for practice in best_practices:
        checklist.append(f"[ ] {practice.practice}")
        checklist.append(f"    Category: {practice.category.value}")
        checklist.append(f"    Description: {practice.description}")
        checklist.append("")
    
    return "\n".join(checklist)

print(generate_checklist())

Key Takeaways

  1. SQL is essential for every data engineer β€” it's the primary language for data manipulation
  2. Master the basics first: SELECT, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT
  3. Window functions are powerful β€” learn ROW_NUMBER, RANK, LAG/LEAD, and cumulative aggregates
  4. CTEs improve readability β€” use them to break complex queries into logical steps
  5. Indexes are critical β€” understand when and how to index for performance
  6. Always use EXPLAIN ANALYZE β€” understand your query execution plans
  7. Write readable SQL β€” format consistently, use aliases, comment complex logic

Practice Exercises

  1. Basic queries: Write a query to find the top 10 customers by total order amount in the last 30 days.

  2. Joins: Write a query showing all products that have never been ordered, along with the number of times each product has been ordered.

  3. Window functions: Write a query that calculates the running total of revenue by day, and the percentage change from the previous day.

  4. CTEs: Using CTEs, write a query that identifies customers whose average order value is above the company-wide average.

  5. Performance: Take a slow query from your work, run EXPLAIN ANALYZE, identify the bottleneck, and optimize it with indexes or query restructuring.

See Also

⭐

Premium Content

SQL Fundamentals for Data Engineers

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement