πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Amazon Redshift Fundamentals: MPP Data Warehouse

Module 3: Data Warehouses & StorageCloud Data Platforms🟒 Free Lesson

Advertisement

Redshift: Massively Parallel Processing Data Warehouse

Amazon Redshift is a fully managed MPP data warehouse built on columnar storage and parallel query execution.

Why Redshift Matters


Key Benefits:

  • Cost-effective β€” $0.25/hour for ra3.4xlarge
  • AWS ecosystem β€” tight integration with S3, Glue, and AWS services
  • Leader-node architecture β€” distributes queries across compute nodes for parallel execution
  • Natural choice β€” for AWS-centric data platforms

Key Insight: Redshift provides the most cost-effective cloud data warehouse at scale.


Architecture Overview

Redshift MPP Architecture

Redshift MPP (Massively Parallel Processing)Leader NodeQuery Parse | Optimize | CoordinateCompute Node 1Slice A | Slice BCompute Node 2Slice A | Slice BCompute Node 3Slice A | Slice BCompute Node NSlice A | Slice BInter-node Communication: Distributed Hash Join | Broadcast | Shuffle

Distribution Key Comparison

Redshift Distribution StylesKEY DistributionEven data distributionCollocated joins on keyBest for: large fact tablesSkew risk if key is hotALL DistributionFull copy on every nodeNo network shuffle for joinsBest for: small dimension tablesStorage: N * size (N = nodes)EVEN DistributionRound-robin assignmentEven if no good join keyBest for: staging tablesJoins require broadcast

Node Types and Distribution

Redshift distributes data and query execution across compute nodes using a leader node that coordinates and a set of compute nodes that process data in parallel. Each compute node contains multiple slices that handle a portion of the data.

Redshift Distribution Model

  • Nodes: N = number of compute nodes in cluster
  • Slices per Node: S = 2 (DC2), 4 (RA3), 8 (RA3 XL+)
  • Total Slices: T = N Γ— S
  • Data per Slice = Total_Table_Size / T
  • Distribution Key determines which slice stores each row: slice = HASH(distribution_key) mod T
  • Even Distribution: Each row goes to round-robin slice (no skew)
  • All Distribution: Full copy on every node (small dimensions only)
-- Create cluster via AWS CLI
-- aws redshift create-cluster \
--   --cluster-identifier my-cluster \
--   --node-type ra3.4xlarge \
--   --number-of-nodes 4 \
--   --master-username admin \
--   --master-user-password xxx \
--   --db-name analytics

-- Create table with distribution key and sort key
CREATE TABLE fact_orders (
    order_id        BIGINT NOT NULL,
    customer_key    INT NOT NULL,
    product_key     INT NOT NULL,
    order_date      DATE NOT NULL,
    quantity        INT,
    unit_price      DECIMAL(12,2),
    net_amount      DECIMAL(14,2),
    order_status    VARCHAR(20),
    region          VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY(customer_key)           -- Distribute by customer
COMPOUND SORTKEY(order_date, order_status);  -- Sort within slices

-- Dimension table with ALL distribution (small table)
CREATE TABLE dim_date (
    date_key        INT NOT NULL,
    full_date       DATE,
    day_of_week     SMALLINT,
    month_name      VARCHAR(10),
    quarter         SMALLINT,
    year            INT
)
DISTSTYLE ALL
SORTKEY(full_date);

-- Auto distribution (Redshift Spectrum managed)
CREATE TABLE external_orders (
    order_id        BIGINT,
    customer_id     INT,
    order_date      DATE,
    amount          DECIMAL(12,2)
)
STYLE AUTO  -- Redshift automatically selects distribution
EXTERNAL DISTRIBUTION;

-- COPY from S3
COPY fact_orders
FROM 's3://data-lake/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET
REGION 'us-east-1';

-- COPY with options
COPY fact_orders
FROM 's3://data-lake/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
EMPTYASNULL
BLANKSASNULL
MAXERROR 100
COMPUPDATE OFF
STATUPDATE ON;

Distribution Styles

Distribution Style determines how rows are distributed across compute nodes. The choice directly impacts query performance by minimizing data movement (shuffling) during JOINs.

Dist StyleBehaviorBest ForDrawback
KEYHASH on specified columnLarge fact tablesSkew risk
EVENRound-robin distributionNo clear join patternPoor join performance
ALLFull copy on every nodeSmall dimension tablesStorage overhead
AUTORedshift-manages distributionEvolving workloadsLess control
COMPOUNDComposite distributionMulti-tenant analyticsComplex tuning

Distribution Skew Analysis

  • Ideal: Each slice has equal data volume -> V_ideal = Total_Size / T
  • Skew Factor: SF = V_max / V_ideal where V_max is the largest slice
  • Skew Threshold: SF < 1.1 = good, SF > 1.5 = problematic, SF > 3.0 = critical
  • Remediation: Change distribution key to higher cardinality column
-- Analyze distribution skew
SELECT
    stv_partitions.node,
    stv_partitions.slice,
    SUM(stv_partitions.rows) AS row_count,
    SUM(stv_partitions.bytes) / 1024 / 1024 AS mb_used
FROM stv_partitions
WHERE tbl = (SELECT oid FROM pg_class WHERE relname = 'fact_orders')
GROUP BY stv_partitions.node, stv_partitions.slice
ORDER BY node, slice;

-- Identify skew with svv_table_info
SELECT
    tbl,
    tbl_rows,
    skew_sortkey1,
    skew_rows,
    CASE
        WHEN skew_rows > 10 THEN 'HIGH SKEW - CHANGE DISTKEY'
        WHEN skew_rows > 5 THEN 'MODERATE SKEW'
        ELSE 'ACCEPTABLE'
    END AS skew_assessment
FROM svv_table_info
WHERE schema = 'public';

-- Compare distribution effectiveness
EXPLAIN
SELECT
    c.customer_name,
    COUNT(*) AS order_count,
    SUM(f.net_amount) AS total_spent
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.customer_name;

Sort Keys

Sort Keys define the physical sort order of data within each slice. Compound sort keys sort by the first key, then second, etc. Interleaved sort keys give equal weight to all key columns, improving multi-column filter performance.

-- Compound sort key (best for single-column filters)
CREATE TABLE fact_clickstream (
    click_id        BIGINT,
    session_id      BIGINT,
    event_timestamp TIMESTAMP,
    page_url        VARCHAR(500),
    user_agent      VARCHAR(500)
)
DISTSTYLE KEY
DISTKEY(session_id)
COMPOUND SORTKEY(event_timestamp);

-- Interleaved sort key (best for multi-column filters)
CREATE TABLE fact_transactions (
    transaction_id  BIGINT,
    account_id      INT,
    transaction_date DATE,
    merchant_id     INT,
    category        VARCHAR(50),
    amount          DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY(account_id)
INTERLEAVED SORTKEY(transaction_date, merchant_id, category);

-- Analyze sort key effectiveness
SELECT
    tbl,
    skew_sortkey1,
    CASE
        WHEN skew_sortkey1 < 1.5 THEN 'EFFECTIVE'
        WHEN skew_sortkey1 < 3.0 THEN 'MODERATE'
        ELSE 'INEFFECTIVE'
    END AS sort_effectiveness
FROM svv_table_info;

-- Vacuum to re-sort after bulk loads
VACUUM SORT ONLY fact_orders;
VACUUM FULL fact_orders;
VACUUM DELETE ONLY fact_orders;

Redshift Spectrum

Redshift Spectrum extends Redshift queries to data stored in S3 without loading it into Redshift. It uses an external catalog (Glue/Hive) to query Parquet, ORC, JSON, and CSV files directly from S3.

Spectrum Cost Model

  • Cost: $5/TB scanned (no compute cost β€” Spectrum runs on AWS-managed infrastructure)
  • vs. Loading Data: Spectrum is cheaper for infrequently queried data (<1 query/month)
  • vs. Loading Data: Loading into Redshift is cheaper for frequently queried data (>4 queries/month)
  • Break-Even Point: Cost_Spectrum Γ— Queries = Cost_Load + Storage
-- Create external schema pointing to Glue catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'analytics_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
REGION 'us-east-1';

-- Query external tables (Parquet in S3)
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id    BIGINT,
    event_type  VARCHAR(50),
    user_id     BIGINT,
    event_ts    TIMESTAMP,
    properties  STRING  -- JSON string
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://data-lake/events/'
TABLE PROPERTIES ('parquet.compression' = 'SNAPPY');

-- Query combining internal and external data
SELECT
    c.customer_name,
    COUNT(e.event_id) AS event_count,
    SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
FROM spectrum_schema.events e
JOIN dim_customer c ON e.user_id = c.customer_id
WHERE e.event_date = '2025-01-15'
GROUP BY c.customer_name;

-- Federated query to Aurora PostgreSQL
CREATE EXTERNAL SCHEMA aurora_schema
FROM POSTGRES
DATABASE 'app_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/FederatedRole'
URI 'my-aurora-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:5432'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-aurora';

SELECT * FROM aurora_schema.users WHERE created_at > '2025-01-01';

Key Concepts Summary

ConceptDescriptionBenefitWhen to Use
Leader NodeQuery coordinationParallel optimizationAll queries
Compute NodeData processingParallel executionAll queries
SliceUnit of parallelismFine-grained distributionData distribution
Distribution KeyRow placementMinimize data shuffleJOINs
Sort KeyData orderingBlock skippingFilter/group queries
COPYBulk load from S3Parallel loadETL pipelines
UNLOADExport to S3Parallel exportData extraction
WLMWorkload ManagementQueue isolationConcurrent users
Materialized ViewsPre-computed resultsQuery accelerationDashboards
Redshift MLIn-database MLNo data movementPredictive analytics
AquaAdvanced caching10x faster scansFrequent full scans
RA3 NodesManaged storageAuto-scaling storageGrowing datasets

Performance Metrics

MetricDC2.8xlargeRA3.4xlargeRA3.16xlargeServerless
vCPU321248128
Memory (GB)24496384384
Storage2.56 TB SSD32 TB managed128 TB managedPer-query
Price/Hour4.80∣4.80 |3.2613.04∣13.04 |0.375/slot-hr
Max Rows~5 billion~20 billion~80 billionUnlimited
Query Latency (1TB)60-180 sec90-240 sec30-90 sec30-120 sec

10 Best Practices

  1. Choose distribution keys wisely β€” pick columns frequently used in JOINs with high cardinality
  2. Use COMPOUND sort keys for single-column filters, INTERLEAVED for multi-column
  3. Run VACUUM regularly after bulk loads to maintain sort order and reclaim space
  4. Use COPY over INSERT for bulk loads β€” COPY uses parallel loading from S3
  5. Enable automatic statistics (STATUPDATE ON in COPY) to keep query planner informed
  6. Use WLM queues to separate ETL workloads from ad-hoc queries
  7. Leverage Redshift Spectrum for cold data in S3 instead of loading everything into Redshift
  8. Monitor skew with svv_table_info β€” high skew kills query performance
  9. Use materialized views for repeated aggregation patterns
  10. Consider Serverless for variable workloads to avoid cluster management

  • Redshift's MPP architecture distributes data across nodes and slices for parallel processing
  • Distribution keys determine data placement β€” wrong choice causes shuffle and skew
  • Sort keys enable block skipping β€” compound for single-column, interleaved for multi-column
  • Redshift Spectrum extends queries to S3 without loading β€” cost-effective for cold data
  • Proper DISTKEY + SORTKEY can improve query performance by 10-100x

See Also

⭐

Premium Content

Amazon Redshift Fundamentals: MPP Data Warehouse

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement