Redshift: Massively Parallel Processing Data Warehouse
Amazon Redshift is a fully managed MPP data warehouse built on columnar storage and parallel query execution.
Why Redshift Matters
Key Benefits:
- Cost-effective β $0.25/hour for ra3.4xlarge
- AWS ecosystem β tight integration with S3, Glue, and AWS services
- Leader-node architecture β distributes queries across compute nodes for parallel execution
- Natural choice β for AWS-centric data platforms
Key Insight: Redshift provides the most cost-effective cloud data warehouse at scale.
Architecture Overview
Redshift MPP Architecture
Distribution Key Comparison
Node Types and Distribution
Redshift distributes data and query execution across compute nodes using a leader node that coordinates and a set of compute nodes that process data in parallel. Each compute node contains multiple slices that handle a portion of the data.
Redshift Distribution Model
- Nodes: N = number of compute nodes in cluster
- Slices per Node: S = 2 (DC2), 4 (RA3), 8 (RA3 XL+)
- Total Slices: T = N Γ S
- Data per Slice = Total_Table_Size / T
- Distribution Key determines which slice stores each row: slice = HASH(distribution_key) mod T
- Even Distribution: Each row goes to round-robin slice (no skew)
- All Distribution: Full copy on every node (small dimensions only)
-- Create cluster via AWS CLI
-- aws redshift create-cluster \
-- --cluster-identifier my-cluster \
-- --node-type ra3.4xlarge \
-- --number-of-nodes 4 \
-- --master-username admin \
-- --master-user-password xxx \
-- --db-name analytics
-- Create table with distribution key and sort key
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL,
customer_key INT NOT NULL,
product_key INT NOT NULL,
order_date DATE NOT NULL,
quantity INT,
unit_price DECIMAL(12,2),
net_amount DECIMAL(14,2),
order_status VARCHAR(20),
region VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY(customer_key) -- Distribute by customer
COMPOUND SORTKEY(order_date, order_status); -- Sort within slices
-- Dimension table with ALL distribution (small table)
CREATE TABLE dim_date (
date_key INT NOT NULL,
full_date DATE,
day_of_week SMALLINT,
month_name VARCHAR(10),
quarter SMALLINT,
year INT
)
DISTSTYLE ALL
SORTKEY(full_date);
-- Auto distribution (Redshift Spectrum managed)
CREATE TABLE external_orders (
order_id BIGINT,
customer_id INT,
order_date DATE,
amount DECIMAL(12,2)
)
STYLE AUTO -- Redshift automatically selects distribution
EXTERNAL DISTRIBUTION;
-- COPY from S3
COPY fact_orders
FROM 's3://data-lake/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET
REGION 'us-east-1';
-- COPY with options
COPY fact_orders
FROM 's3://data-lake/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
EMPTYASNULL
BLANKSASNULL
MAXERROR 100
COMPUPDATE OFF
STATUPDATE ON;
Distribution Styles
Distribution Style determines how rows are distributed across compute nodes. The choice directly impacts query performance by minimizing data movement (shuffling) during JOINs.
| Dist Style | Behavior | Best For | Drawback |
|---|---|---|---|
| KEY | HASH on specified column | Large fact tables | Skew risk |
| EVEN | Round-robin distribution | No clear join pattern | Poor join performance |
| ALL | Full copy on every node | Small dimension tables | Storage overhead |
| AUTO | Redshift-manages distribution | Evolving workloads | Less control |
| COMPOUND | Composite distribution | Multi-tenant analytics | Complex tuning |
Distribution Skew Analysis
- Ideal: Each slice has equal data volume -> V_ideal = Total_Size / T
- Skew Factor: SF = V_max / V_ideal where V_max is the largest slice
- Skew Threshold: SF < 1.1 = good, SF > 1.5 = problematic, SF > 3.0 = critical
- Remediation: Change distribution key to higher cardinality column
-- Analyze distribution skew
SELECT
stv_partitions.node,
stv_partitions.slice,
SUM(stv_partitions.rows) AS row_count,
SUM(stv_partitions.bytes) / 1024 / 1024 AS mb_used
FROM stv_partitions
WHERE tbl = (SELECT oid FROM pg_class WHERE relname = 'fact_orders')
GROUP BY stv_partitions.node, stv_partitions.slice
ORDER BY node, slice;
-- Identify skew with svv_table_info
SELECT
tbl,
tbl_rows,
skew_sortkey1,
skew_rows,
CASE
WHEN skew_rows > 10 THEN 'HIGH SKEW - CHANGE DISTKEY'
WHEN skew_rows > 5 THEN 'MODERATE SKEW'
ELSE 'ACCEPTABLE'
END AS skew_assessment
FROM svv_table_info
WHERE schema = 'public';
-- Compare distribution effectiveness
EXPLAIN
SELECT
c.customer_name,
COUNT(*) AS order_count,
SUM(f.net_amount) AS total_spent
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.customer_name;
Sort Keys
Sort Keys define the physical sort order of data within each slice. Compound sort keys sort by the first key, then second, etc. Interleaved sort keys give equal weight to all key columns, improving multi-column filter performance.
-- Compound sort key (best for single-column filters)
CREATE TABLE fact_clickstream (
click_id BIGINT,
session_id BIGINT,
event_timestamp TIMESTAMP,
page_url VARCHAR(500),
user_agent VARCHAR(500)
)
DISTSTYLE KEY
DISTKEY(session_id)
COMPOUND SORTKEY(event_timestamp);
-- Interleaved sort key (best for multi-column filters)
CREATE TABLE fact_transactions (
transaction_id BIGINT,
account_id INT,
transaction_date DATE,
merchant_id INT,
category VARCHAR(50),
amount DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY(account_id)
INTERLEAVED SORTKEY(transaction_date, merchant_id, category);
-- Analyze sort key effectiveness
SELECT
tbl,
skew_sortkey1,
CASE
WHEN skew_sortkey1 < 1.5 THEN 'EFFECTIVE'
WHEN skew_sortkey1 < 3.0 THEN 'MODERATE'
ELSE 'INEFFECTIVE'
END AS sort_effectiveness
FROM svv_table_info;
-- Vacuum to re-sort after bulk loads
VACUUM SORT ONLY fact_orders;
VACUUM FULL fact_orders;
VACUUM DELETE ONLY fact_orders;
Redshift Spectrum
Redshift Spectrum extends Redshift queries to data stored in S3 without loading it into Redshift. It uses an external catalog (Glue/Hive) to query Parquet, ORC, JSON, and CSV files directly from S3.
Spectrum Cost Model
- Cost: $5/TB scanned (no compute cost β Spectrum runs on AWS-managed infrastructure)
- vs. Loading Data: Spectrum is cheaper for infrequently queried data (<1 query/month)
- vs. Loading Data: Loading into Redshift is cheaper for frequently queried data (>4 queries/month)
- Break-Even Point: Cost_Spectrum Γ Queries = Cost_Load + Storage
-- Create external schema pointing to Glue catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
REGION 'us-east-1';
-- Query external tables (Parquet in S3)
CREATE EXTERNAL TABLE spectrum_schema.events (
event_id BIGINT,
event_type VARCHAR(50),
user_id BIGINT,
event_ts TIMESTAMP,
properties STRING -- JSON string
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://data-lake/events/'
TABLE PROPERTIES ('parquet.compression' = 'SNAPPY');
-- Query combining internal and external data
SELECT
c.customer_name,
COUNT(e.event_id) AS event_count,
SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
FROM spectrum_schema.events e
JOIN dim_customer c ON e.user_id = c.customer_id
WHERE e.event_date = '2025-01-15'
GROUP BY c.customer_name;
-- Federated query to Aurora PostgreSQL
CREATE EXTERNAL SCHEMA aurora_schema
FROM POSTGRES
DATABASE 'app_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/FederatedRole'
URI 'my-aurora-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:5432'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-aurora';
SELECT * FROM aurora_schema.users WHERE created_at > '2025-01-01';
Key Concepts Summary
| Concept | Description | Benefit | When to Use |
|---|---|---|---|
| Leader Node | Query coordination | Parallel optimization | All queries |
| Compute Node | Data processing | Parallel execution | All queries |
| Slice | Unit of parallelism | Fine-grained distribution | Data distribution |
| Distribution Key | Row placement | Minimize data shuffle | JOINs |
| Sort Key | Data ordering | Block skipping | Filter/group queries |
| COPY | Bulk load from S3 | Parallel load | ETL pipelines |
| UNLOAD | Export to S3 | Parallel export | Data extraction |
| WLM | Workload Management | Queue isolation | Concurrent users |
| Materialized Views | Pre-computed results | Query acceleration | Dashboards |
| Redshift ML | In-database ML | No data movement | Predictive analytics |
| Aqua | Advanced caching | 10x faster scans | Frequent full scans |
| RA3 Nodes | Managed storage | Auto-scaling storage | Growing datasets |
Performance Metrics
| Metric | DC2.8xlarge | RA3.4xlarge | RA3.16xlarge | Serverless |
|---|---|---|---|---|
| vCPU | 32 | 12 | 48 | 128 |
| Memory (GB) | 244 | 96 | 384 | 384 |
| Storage | 2.56 TB SSD | 32 TB managed | 128 TB managed | Per-query |
| Price/Hour | 3.26 | 0.375/slot-hr | ||
| Max Rows | ~5 billion | ~20 billion | ~80 billion | Unlimited |
| Query Latency (1TB) | 60-180 sec | 90-240 sec | 30-90 sec | 30-120 sec |
10 Best Practices
- Choose distribution keys wisely β pick columns frequently used in JOINs with high cardinality
- Use COMPOUND sort keys for single-column filters, INTERLEAVED for multi-column
- Run VACUUM regularly after bulk loads to maintain sort order and reclaim space
- Use COPY over INSERT for bulk loads β COPY uses parallel loading from S3
- Enable automatic statistics (
STATUPDATE ONin COPY) to keep query planner informed - Use WLM queues to separate ETL workloads from ad-hoc queries
- Leverage Redshift Spectrum for cold data in S3 instead of loading everything into Redshift
- Monitor skew with
svv_table_infoβ high skew kills query performance - Use materialized views for repeated aggregation patterns
- Consider Serverless for variable workloads to avoid cluster management
- Redshift's MPP architecture distributes data across nodes and slices for parallel processing
- Distribution keys determine data placement β wrong choice causes shuffle and skew
- Sort keys enable block skipping β compound for single-column, interleaved for multi-column
- Redshift Spectrum extends queries to S3 without loading β cost-effective for cold data
- Proper DISTKEY + SORTKEY can improve query performance by 10-100x
See Also
- Snowflake Fundamentals β Cloud warehouse with compute-storage separation
- BigQuery Fundamentals β Google's serverless alternative
- Data Warehouse Concepts β Star/snowflake schema and dimensional modeling
- Delta Lake & Iceberg β Open table formats for S3 data lakes
- Partitioning & Indexing β Range, hash, and list partitioning strategies
- Cost Optimization β Right-sizing clusters and managing Redshift costs