Amazon Redshift Fundamentals: MPP Data Warehouse

Redshift: Massively Parallel Processing Data Warehouse

Amazon Redshift is a fully managed MPP data warehouse built on columnar storage and parallel query execution.

Why Redshift Matters

Key Benefits:

Cost-effective — $0.25/hour for ra3.4xlarge
AWS ecosystem — tight integration with S3, Glue, and AWS services
Leader-node architecture — distributes queries across compute nodes for parallel execution
Natural choice — for AWS-centric data platforms

Key Insight: Redshift provides the most cost-effective cloud data warehouse at scale.

Architecture Overview

Redshift MPP Architecture

Distribution Key Comparison

Node Types and Distribution

Redshift distributes data and query execution across compute nodes using a leader node that coordinates and a set of compute nodes that process data in parallel. Each compute node contains multiple slices that handle a portion of the data.

Redshift Distribution Model

Nodes: N = number of compute nodes in cluster
Slices per Node: S = 2 (DC2), 4 (RA3), 8 (RA3 XL+)
Total Slices: T = N × S
Data per Slice = Total_Table_Size / T
Distribution Key determines which slice stores each row: slice = HASH(distribution_key) mod T
Even Distribution: Each row goes to round-robin slice (no skew)
All Distribution: Full copy on every node (small dimensions only)

-- Create cluster via AWS CLI
-- aws redshift create-cluster \
--   --cluster-identifier my-cluster \
--   --node-type ra3.4xlarge \
--   --number-of-nodes 4 \
--   --master-username admin \
--   --master-user-password xxx \
--   --db-name analytics

-- Create table with distribution key and sort key
CREATE TABLE fact_orders (
    order_id        BIGINT NOT NULL,
    customer_key    INT NOT NULL,
    product_key     INT NOT NULL,
    order_date      DATE NOT NULL,
    quantity        INT,
    unit_price      DECIMAL(12,2),
    net_amount      DECIMAL(14,2),
    order_status    VARCHAR(20),
    region          VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY(customer_key)           -- Distribute by customer
COMPOUND SORTKEY(order_date, order_status);  -- Sort within slices

-- Dimension table with ALL distribution (small table)
CREATE TABLE dim_date (
    date_key        INT NOT NULL,
    full_date       DATE,
    day_of_week     SMALLINT,
    month_name      VARCHAR(10),
    quarter         SMALLINT,
    year            INT
)
DISTSTYLE ALL
SORTKEY(full_date);

-- Auto distribution (Redshift Spectrum managed)
CREATE TABLE external_orders (
    order_id        BIGINT,
    customer_id     INT,
    order_date      DATE,
    amount          DECIMAL(12,2)
)
STYLE AUTO  -- Redshift automatically selects distribution
EXTERNAL DISTRIBUTION;

-- COPY from S3
COPY fact_orders
FROM 's3://data-lake/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET
REGION 'us-east-1';

-- COPY with options
COPY fact_orders
FROM 's3://data-lake/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
EMPTYASNULL
BLANKSASNULL
MAXERROR 100
COMPUPDATE OFF
STATUPDATE ON;

Distribution Styles

Distribution Style determines how rows are distributed across compute nodes. The choice directly impacts query performance by minimizing data movement (shuffling) during JOINs.

Dist Style	Behavior	Best For	Drawback
KEY	HASH on specified column	Large fact tables	Skew risk
EVEN	Round-robin distribution	No clear join pattern	Poor join performance
ALL	Full copy on every node	Small dimension tables	Storage overhead
AUTO	Redshift-manages distribution	Evolving workloads	Less control
COMPOUND	Composite distribution	Multi-tenant analytics	Complex tuning

Distribution Skew Analysis

Ideal: Each slice has equal data volume -> V_ideal = Total_Size / T
Skew Factor: SF = V_max / V_ideal where V_max is the largest slice
Skew Threshold: SF < 1.1 = good, SF > 1.5 = problematic, SF > 3.0 = critical
Remediation: Change distribution key to higher cardinality column

-- Analyze distribution skew
SELECT
    stv_partitions.node,
    stv_partitions.slice,
    SUM(stv_partitions.rows) AS row_count,
    SUM(stv_partitions.bytes) / 1024 / 1024 AS mb_used
FROM stv_partitions
WHERE tbl = (SELECT oid FROM pg_class WHERE relname = 'fact_orders')
GROUP BY stv_partitions.node, stv_partitions.slice
ORDER BY node, slice;

-- Identify skew with svv_table_info
SELECT
    tbl,
    tbl_rows,
    skew_sortkey1,
    skew_rows,
    CASE
        WHEN skew_rows > 10 THEN 'HIGH SKEW - CHANGE DISTKEY'
        WHEN skew_rows > 5 THEN 'MODERATE SKEW'
        ELSE 'ACCEPTABLE'
    END AS skew_assessment
FROM svv_table_info
WHERE schema = 'public';

-- Compare distribution effectiveness
EXPLAIN
SELECT
    c.customer_name,
    COUNT(*) AS order_count,
    SUM(f.net_amount) AS total_spent
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.customer_name;

Sort Keys

Sort Keys define the physical sort order of data within each slice. Compound sort keys sort by the first key, then second, etc. Interleaved sort keys give equal weight to all key columns, improving multi-column filter performance.

-- Compound sort key (best for single-column filters)
CREATE TABLE fact_clickstream (
    click_id        BIGINT,
    session_id      BIGINT,
    event_timestamp TIMESTAMP,
    page_url        VARCHAR(500),
    user_agent      VARCHAR(500)
)
DISTSTYLE KEY
DISTKEY(session_id)
COMPOUND SORTKEY(event_timestamp);

-- Interleaved sort key (best for multi-column filters)
CREATE TABLE fact_transactions (
    transaction_id  BIGINT,
    account_id      INT,
    transaction_date DATE,
    merchant_id     INT,
    category        VARCHAR(50),
    amount          DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY(account_id)
INTERLEAVED SORTKEY(transaction_date, merchant_id, category);

-- Analyze sort key effectiveness
SELECT
    tbl,
    skew_sortkey1,
    CASE
        WHEN skew_sortkey1 < 1.5 THEN 'EFFECTIVE'
        WHEN skew_sortkey1 < 3.0 THEN 'MODERATE'
        ELSE 'INEFFECTIVE'
    END AS sort_effectiveness
FROM svv_table_info;

-- Vacuum to re-sort after bulk loads
VACUUM SORT ONLY fact_orders;
VACUUM FULL fact_orders;
VACUUM DELETE ONLY fact_orders;

Redshift Spectrum

Redshift Spectrum extends Redshift queries to data stored in S3 without loading it into Redshift. It uses an external catalog (Glue/Hive) to query Parquet, ORC, JSON, and CSV files directly from S3.

Spectrum Cost Model

Cost: $5/TB scanned (no compute cost — Spectrum runs on AWS-managed infrastructure)
vs. Loading Data: Spectrum is cheaper for infrequently queried data (<1 query/month)
vs. Loading Data: Loading into Redshift is cheaper for frequently queried data (>4 queries/month)
Break-Even Point: Cost_Spectrum × Queries = Cost_Load + Storage

-- Create external schema pointing to Glue catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
REGION 'us-east-1';

-- Query external tables (Parquet in S3)
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id    BIGINT,
    event_type  VARCHAR(50),
    user_id     BIGINT,
    event_ts    TIMESTAMP,
    properties  STRING  -- JSON string
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://data-lake/events/'
TABLE PROPERTIES ('parquet.compression' = 'SNAPPY');

-- Query combining internal and external data
SELECT
    c.customer_name,
    COUNT(e.event_id) AS event_count,
    SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
FROM spectrum_schema.events e
JOIN dim_customer c ON e.user_id = c.customer_id
WHERE e.event_date = '2025-01-15'
GROUP BY c.customer_name;

-- Federated query to Aurora PostgreSQL
CREATE EXTERNAL SCHEMA aurora_schema
FROM POSTGRES
DATABASE 'app_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/FederatedRole'
URI 'my-aurora-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:5432'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-aurora';

SELECT * FROM aurora_schema.users WHERE created_at > '2025-01-01';

Key Concepts Summary

Concept	Description	Benefit	When to Use
Leader Node	Query coordination	Parallel optimization	All queries
Compute Node	Data processing	Parallel execution	All queries
Slice	Unit of parallelism	Fine-grained distribution	Data distribution
Distribution Key	Row placement	Minimize data shuffle	JOINs
Sort Key	Data ordering	Block skipping	Filter/group queries
COPY	Bulk load from S3	Parallel load	ETL pipelines
UNLOAD	Export to S3	Parallel export	Data extraction
WLM	Workload Management	Queue isolation	Concurrent users
Materialized Views	Pre-computed results	Query acceleration	Dashboards
Redshift ML	In-database ML	No data movement	Predictive analytics
Aqua	Advanced caching	10x faster scans	Frequent full scans
RA3 Nodes	Managed storage	Auto-scaling storage	Growing datasets

Performance Metrics

Metric	DC2.8xlarge	RA3.4xlarge	RA3.16xlarge	Serverless
vCPU	32	12	48	128
Memory (GB)	244	96	384	384
Storage	2.56 TB SSD	32 TB managed	128 TB managed	Per-query
Price/Hour	$4.80 \|$ 3.26	$13.04 \|$ 0.375/slot-hr
Max Rows	~5 billion	~20 billion	~80 billion	Unlimited
Query Latency (1TB)	60-180 sec	90-240 sec	30-90 sec	30-120 sec

10 Best Practices

Choose distribution keys wisely — pick columns frequently used in JOINs with high cardinality
Use COMPOUND sort keys for single-column filters, INTERLEAVED for multi-column
Run VACUUM regularly after bulk loads to maintain sort order and reclaim space
Use COPY over INSERT for bulk loads — COPY uses parallel loading from S3
Enable automatic statistics (STATUPDATE ON in COPY) to keep query planner informed
Use WLM queues to separate ETL workloads from ad-hoc queries
Leverage Redshift Spectrum for cold data in S3 instead of loading everything into Redshift
Monitor skew with svv_table_info — high skew kills query performance
Use materialized views for repeated aggregation patterns
Consider Serverless for variable workloads to avoid cluster management

Redshift's MPP architecture distributes data across nodes and slices for parallel processing
Distribution keys determine data placement — wrong choice causes shuffle and skew
Sort keys enable block skipping — compound for single-column, interleaved for multi-column
Redshift Spectrum extends queries to S3 without loading — cost-effective for cold data
Proper DISTKEY + SORTKEY can improve query performance by 10-100x