📊 Amazon Redshift

Master Redshift architecture, distribution styles, sort keys, Spectrum, Serverless, and concurrency scaling.

Module: AWS Data Engineering • Topic 9 of 65

Redshift Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AMAZON REDSHIFT ARCHITECTURE                              │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    REDSHIFT CLUSTER                                   │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  LEADER NODE                                                   │  │    │
│  │  │  • Query parsing and optimization                              │  │    │
│  │  │  • SQL compilation                                             │  │    │
│  │  │  • Result aggregation                                          │  │    │
│  │  │  • Client connections                                          │  │    │
│  │  │  Node type: same as compute nodes (dc2.8xlarge)               │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                              │                                     │    │
│  │              ┌───────────────┼───────────────┐                     │    │
│  │              ▼               ▼               ▼                     │    │
│  │  ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │    │
│  │  │  COMPUTE NODE 1   │ │  COMPUTE NODE 2   │ │  COMPUTE NODE N   │ │    │
│  │  │                   │ │                   │ │                   │ │    │
│  │  │  ┌─────────────┐  │ │  ┌─────────────┐  │ │  ┌─────────────┐  │ │    │
│  │  │  │  Slices     │  │ │  │  Slices     │  │ │  │  Slices     │  │ │    │
│  │  │  │(16 per node)│  │ │  │(16 per node)│  │ │  │(16 per node)│  │ │    │
│  │  │  └─────────────┘  │ │  └─────────────┘  │ │  └─────────────┘  │ │    │
│  │  │  Instance:        │ │  Instance:        │ │  Instance:        │ │    │
│  │  │  dc2.8xlarge      │ │  dc2.8xlarge      │ │  dc2.8xlarge      │ │    │
│  │  │  (32 vCPU, 244 GB)│ │  (32 vCPU, 244 GB)│ │  (32 vCPU, 244 GB)│ │    │
│  │  └───────────────────┘ └───────────────────┘ └───────────────────┘ │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  DISTRIBUTION ARCHITECTURE                                          │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Table: sales (DISTSTYLE KEY, DISTKEY customer_id)            │  │    │
│  │  │                                                               │  │    │
│  │  │  Compute Node 1          Compute Node 2          Compute N    │  │    │
│  │  │  ┌─────────────────┐    ┌─────────────────┐    ┌───────────┐ │  │    │
│  │  │  │ Slice 0         │    │ Slice 0         │    │ Slice 0   │ │  │    │
│  │  │  │ customer_id:    │    │ customer_id:    │    │ customer  │ │  │    │
│  │  │  │ 1000-1999       │    │ 3000-3999       │    │ 5000-5999 │ │  │    │
│  │  │  ├─────────────────┤    ├─────────────────┤    ├───────────┤ │  │    │
│  │  │  │ Slice 1         │    │ Slice 1         │    │ Slice 1   │ │  │    │
│  │  │  │ customer_id:    │    │ customer_id:    │    │ customer  │ │  │    │
│  │  │  │ 2000-2999       │    │ 4000-4999       │    │ 6000-6999 │ │  │    │
│  │  │  └─────────────────┘    └─────────────────┘    └───────────┘ │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
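The id ranges in the distribution diagram are a simplification. With DISTSTYLE KEY, Redshift hashes the DISTKEY value, so consecutive ids scatter across slices while equal ids always land on the same slice. The following is a minimal Python sketch of that idea, assuming `zlib.crc32` as a stand-in for Redshift's internal hash function (which is not public) and a hypothetical helper named `slice_for`.

```python
import zlib
from collections import Counter

def slice_for(key: int, num_slices: int) -> int:
    """Map a distribution-key value to a slice.

    crc32 is only a stand-in here; Redshift's real hash function is internal.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_slices

NUM_SLICES = 6  # e.g. 3 compute nodes x 2 slices, as drawn in the diagram

# Rows from two tables that share customer_id as their DISTKEY.
sales_rows = [(cid, f"sale-{i}") for i, cid in enumerate([1001, 1002, 1001, 2500, 1002])]
customer_rows = [(1001, "Ada"), (1002, "Grace"), (2500, "Edsger")]

sales_placement = {cid: slice_for(cid, NUM_SLICES) for cid, _ in sales_rows}
customer_placement = {cid: slice_for(cid, NUM_SLICES) for cid, _ in customer_rows}

# Equal keys hash to the same slice in both tables, so a join on customer_id
# can run slice-locally without moving rows between nodes.
colocated = all(sales_placement[cid] == customer_placement[cid] for cid in sales_placement)

# How 10,000 consecutive ids spread over the slices.
spread = Counter(slice_for(cid, NUM_SLICES) for cid in range(1000, 11000))

print("join is co-located:", colocated)
print("rows per slice:", dict(sorted(spread.items())))
```

The point of the sketch is the co-location property, not the exact placement: any deterministic hash gives the same guarantee that matching join keys meet on one slice.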

Distribution Styles

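Style	Description	Use Case	Performance
KEY	Rows hashed on the DISTKEY column	Large fact tables joined on that column	Co-located joins; risk of skew on low-cardinality keys
EVEN	Round-robin across slices	Tables with no dominant join column, such as staging tables	No skew, but joins usually redistribute rows
ALL	Full copy on every node	Small, rarely updated dimension tables	Joins need no redistribution; loads are slower and storage is multiplied
AUTO	Redshift chooses and adjusts the style	Most tables	Adaptive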

Distribution Strategy

-- KEY distribution for large fact tables
CREATE TABLE sales (
    sale_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    amount DECIMAL(10,2),
    sale_date DATE
)
DISTSTYLE KEY
DISTKEY(customer_id)
SORTKEY(sale_date);

-- ALL distribution for small dimension tables
CREATE TABLE products (
    product_id INT,
    product_name VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10,2)
)
DISTSTYLE ALL
SORTKEY(category);

-- EVEN distribution for staging tables
CREATE TABLE staging_sales (
    sale_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    sale_date DATE
)
DISTSTYLE EVEN
SORTKEY(sale_date);

-- AUTO distribution (recommended)
CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date TIMESTAMP,
    total DECIMAL(10,2)
)
DISTSTYLE AUTO;

ℹ️

Pro Tip: Use DISTSTYLE AUTO for most tables. Redshift will automatically choose the best distribution style based on table size and usage patterns.
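If you do choose a DISTKEY by hand instead of relying on AUTO, it helps to check how evenly the candidate column would spread rows. The sketch below is an illustration only: `skew_ratio` is a hypothetical helper, `zlib.crc32` stands in for Redshift's internal hash, and the data is synthetic. The ratio it prints is analogous to the `skew_rows` column of `svv_table_info` (rows on the fullest slice divided by rows on the emptiest slice).

```python
import random
import zlib
from collections import Counter

def skew_ratio(keys, num_slices: int) -> float:
    """Rows on the fullest slice divided by rows on the emptiest slice."""
    counts = Counter(zlib.crc32(str(k).encode("utf-8")) % num_slices for k in keys)
    per_slice = [counts.get(s, 0) for s in range(num_slices)]
    # Treat an empty slice as 1 row so the ratio stays finite.
    return max(per_slice) / max(min(per_slice), 1)

random.seed(7)
ROWS = 100_000
NUM_SLICES = 8

# High-cardinality candidate: many distinct customer ids.
customer_ids = [random.randrange(1, 50_000) for _ in range(ROWS)]
# Low-cardinality candidate: three country codes, heavily weighted towards one.
country_codes = [random.choice(["US", "US", "US", "US", "DE", "IN"]) for _ in range(ROWS)]

customer_skew = skew_ratio(customer_ids, NUM_SLICES)
country_skew = skew_ratio(country_codes, NUM_SLICES)

print("customer_id skew:", round(customer_skew, 2))
print("country skew:    ", round(country_skew, 2))
```

A column with only a handful of distinct values leaves most slices empty, which is why low-cardinality columns make poor distribution keys even when they appear in joins.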

Sort Keys

-- Compound Sort Key (recommended)
CREATE TABLE events (
    event_id BIGINT,
    event_type VARCHAR(50),
    user_id BIGINT,
    event_date TIMESTAMP
)
COMPOUND SORTKEY(event_date, event_type);

-- Interleaved Sort Key
CREATE TABLE logs (
    log_id BIGINT,
    user_id BIGINT,
    event_date TIMESTAMP,
    action VARCHAR(50)
)
INTERLEAVED SORTKEY(user_id, event_date, action);

Sort Key Selection Guide

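Key Type	Best For	Trade-off
Compound	Range queries and filters on the leading sort column(s)	Little benefit when queries filter only on later columns
Interleaved	Filters that use any of several columns with equal weight	Slower loads, needs VACUUM REINDEX, poor fit for steadily increasing columns such as timestamps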
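Sort keys pay off because Redshift keeps min/max metadata (zone maps) for each storage block and skips blocks whose range cannot match a filter. The sketch below is a toy model of that pruning, with hypothetical helper names (`build_zone_maps`, `blocks_to_scan`) and an arbitrary block size of 1,000 values; it is not how Redshift stores data internally.

```python
import random

def build_zone_maps(values, block_size):
    """Return a (min, max) pair for each consecutive block of values."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]

def blocks_to_scan(zone_maps, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for bmin, bmax in zone_maps if bmax >= lo and bmin <= hi)

random.seed(42)
# Day-of-year for 36,500 events, in arrival (unsorted) order.
days = [random.randrange(0, 365) for _ in range(36_500)]

unsorted_maps = build_zone_maps(days, block_size=1_000)
sorted_maps = build_zone_maps(sorted(days), block_size=1_000)

# Equivalent of: WHERE event_day BETWEEN 14 AND 20
unsorted_scanned = blocks_to_scan(unsorted_maps, 14, 20)
sorted_scanned = blocks_to_scan(sorted_maps, 14, 20)

print("blocks scanned without sort key:", unsorted_scanned, "of", len(unsorted_maps))
print("blocks scanned with sort key:   ", sorted_scanned, "of", len(sorted_maps))
```

In the unsorted layout almost every block contains some matching day, so nothing can be skipped; once the data is sorted, the matching rows sit in a few adjacent blocks.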

Redshift Spectrum

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    REDSHIFT SPECTRUM ARCHITECTURE                             │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  REDSHIFT CLUSTER                                                    │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Leader Node                                                  │  │    │
│  │  │  • Query planning                                             │  │    │
│  │  │  • Spectrum requests                                          │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  SPECTRUM LAYER (Serverless)                                         │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Spectrum Nodes                                               │  │    │
│  │  │  • Auto-scaling (up to 100 nodes per query)                   │  │    │
│  │  │  • Data filtering at source                                    │  │    │
│  │  │  • Columnar processing                                         │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  DATA LAKE (S3)                                                      │    │
│  │                                                                     │    │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐    │    │
│  │  │  Parquet        │  │  ORC            │  │  JSON/CSV       │    │    │
│  │  │  (Recommended)  │  │  (Columnar)     │  │  (Row-based)    │    │    │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

External Table Creation

-- Create external schema for Spectrum
CREATE EXTERNAL SCHEMA spectrum_data
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Create external table
CREATE EXTERNAL TABLE spectrum_data.sales (
    sale_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    amount DECIMAL(10,2),
    sale_date DATE
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://data-lake-processed/silver/sales/'
TABLE PROPERTIES ('parquet.compression'='SNAPPY');

-- Query external table
SELECT
    customer_id,
    SUM(amount) as total_amount,
    COUNT(*) as transaction_count
FROM spectrum_data.sales
WHERE year = 2024 AND month = 1
GROUP BY customer_id
ORDER BY total_amount DESC;

-- Query mixing internal and external tables
SELECT
    c.customer_name,
    SUM(s.amount) as total_sales
FROM dev.customers c
JOIN spectrum_data.sales s ON c.customer_id = s.customer_id
WHERE s.year = 2024
GROUP BY c.customer_name;

ℹ️

Spectrum Pricing: about $5 per TB of data scanned (the rate varies by Region). Partition the data and store it in a columnar format such as Parquet so that each query scans fewer bytes.
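To see why partitioning and columnar formats matter for the bill, here is a rough cost estimate in Python. It is a sketch under stated assumptions: the $5 per TB rate is the figure quoted above, a terabyte is taken as 1024^4 bytes, a 10 MB minimum charge per query is assumed, and the table sizes and the 4x Parquet size reduction are invented for illustration.

```python
TB = 1024 ** 4                          # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0                  # rate quoted in this guide; varies by Region
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2    # assumption: 10 MB minimum charge per query

def spectrum_cost_usd(bytes_scanned: int) -> float:
    """Estimated Spectrum charge for one query that scans bytes_scanned bytes."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / TB * PRICE_PER_TB_USD

# Hypothetical table: 3 years of daily partitions, 2 GiB of CSV per day.
csv_bytes_per_day = 2 * 1024 ** 3
days_total = 3 * 365

# No partition filter: every file is scanned.
full_scan = spectrum_cost_usd(csv_bytes_per_day * days_total)
# WHERE year = ... AND month = ...: only 31 daily partitions are scanned.
one_month = spectrum_cost_usd(csv_bytes_per_day * 31)
# Same filter on Parquet, assuming files are 4x smaller and the query
# reads 2 of 10 equally sized columns.
one_month_parquet = spectrum_cost_usd(int(csv_bytes_per_day * 31 / 4 * 2 / 10))

for label, cost in [
    ("full scan (CSV)", full_scan),
    ("one month (CSV)", one_month),
    ("one month (Parquet, 2 columns)", one_month_parquet),
]:
    print(f"{label}: ${cost:.4f}")
```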

Redshift Serverless

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    REDSHIFT SERVERLESS ARCHITECTURE                           │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  WORKGROUP CONFIGURATION                                       │  │    │
│  │  │                                                               │  │    │
│  │  │  • Base Capacity: 8 - 512 RPUs (default 128)                 │  │    │
│  │  │  • Max Capacity: 512 RPUs                                     │  │    │
│  │  │  • Timeout: 300 - 600 seconds                                 │  │    │
│  │  │  • VPC: Configured                                            │  │    │
│  │  │  • Encryption: KMS managed                                    │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                              │                                     │    │
│  │                              ▼                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  AUTO-SCALING                                                  │  │    │
│  │  │                                                               │  │    │
│  │  │  Query Load ──► 128 RPUs ──► 256 RPUs ──► 512 RPUs          │  │    │
│  │  │                      │              │              │           │  │    │
│  │  │                      ▼              ▼              ▼           │  │    │
│  │  │  Auto-pause ──► 30 sec ──► 1 min ──► 5 min ──► Pause        │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  COST MODEL                                                    │  │    │
│  │  │                                                               │  │    │
│  │  │  RPU-hour = (RPU count × hours used)                         │  │    │
│  │  │  Storage: $0.024/GB/month (managed storage)                   │  │    │
│  │  │  Data transfer: Standard AWS rates                            │  │    │
│  │  │                                                               │  │    │
│  │  │  Example: 128 RPUs for 1 hour = $48.00                       │  │    │
│  │  │           256 RPUs for 1 hour = $96.00                       │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
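The RPU-hour formula in the cost model can be checked with a few lines of Python. This sketch uses the $0.375 per RPU-hour rate listed in the cost table of this guide (rates vary by Region) and assumes per-second billing with a 60-second minimum; `serverless_compute_cost` is a hypothetical helper, not an AWS API.

```python
RPU_HOUR_USD = 0.375        # rate from the cost table in this guide; varies by Region
MIN_BILLED_SECONDS = 60     # assumption: per-second billing with a 60-second minimum

def serverless_compute_cost(rpus: int, seconds_active: float) -> float:
    """Compute charge in USD for running at `rpus` for `seconds_active` seconds."""
    billed_seconds = max(seconds_active, MIN_BILLED_SECONDS)
    return rpus * (billed_seconds / 3600) * RPU_HOUR_USD

one_hour_128 = serverless_compute_cost(128, 3600)
one_hour_256 = serverless_compute_cost(256, 3600)
short_query = serverless_compute_cost(128, 15)   # billed as 60 seconds

print(f"128 RPUs for 1 hour: ${one_hour_128:.2f}")
print(f"256 RPUs for 1 hour: ${one_hour_256:.2f}")
print(f"128 RPUs for 15 s:   ${short_query:.2f}")
```

Storage and data transfer are billed separately and are not included in this estimate.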

Serverless Configuration

import os

import boto3

redshift = boto3.client('redshift-serverless')

# Create the namespace first: a workgroup must reference an existing namespace
namespace = redshift.create_namespace(
    namespaceName='analytics-namespace',
    adminUsername='admin',
    # Read the password from the environment rather than hard-coding it
    # (REDSHIFT_ADMIN_PASSWORD is an example variable name)
    adminUserPassword=os.environ['REDSHIFT_ADMIN_PASSWORD'],
    dbName='analytics_db',
    iamRoles=['arn:aws:iam::123456789012:role/RedshiftServerlessRole'],
    logExports=['useractivitylog', 'userlog', 'connectionlog'],
    # Tags are passed as a list of key/value pairs
    tags=[
        {'key': 'Environment', 'value': 'production'}
    ]
)

# Create the workgroup (compute) and attach it to the namespace
workgroup = redshift.create_workgroup(
    workgroupName='analytics-workgroup',
    namespaceName='analytics-namespace',
    baseCapacity=128,  # in RPUs
    enhancedVpcRouting=True,
    securityGroupIds=['sg-12345678'],
    subnetIds=['subnet-12345678', 'subnet-87654321'],
    publiclyAccessible=False,
    configParameters=[
        {
            'parameterKey': 'enable_user_activity_logging',
            'parameterValue': 'true'
        },
        {
            # Query monitoring limit, in seconds
            'parameterKey': 'max_query_execution_time',
            'parameterValue': '3600'
        }
    ],
    tags=[
        {'key': 'Environment', 'value': 'production'},
        {'key': 'Team', 'value': 'analytics'}
    ]
)

print(f"Namespace: {namespace['namespace']['namespaceName']}")
print(f"Workgroup: {workgroup['workgroup']['workgroupName']}")
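A new workgroup is not usable immediately, so scripts usually wait until its status becomes AVAILABLE before connecting. The sketch below polls `get_workgroup` for that status. `wait_for_workgroup` and `FakeClient` are hypothetical names; the fake client only imitates the response shape of `boto3.client('redshift-serverless')` so the logic can be tried without an AWS account.

```python
import time

def wait_for_workgroup(client, name, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll get_workgroup until the status is AVAILABLE; raise on timeout."""
    waited = 0
    while waited <= timeout_s:
        status = client.get_workgroup(workgroupName=name)["workgroup"]["status"]
        if status == "AVAILABLE":
            return status
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"workgroup {name!r} not AVAILABLE after {timeout_s}s")

class FakeClient:
    """Stand-in for the redshift-serverless client, for a dry run without AWS."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_workgroup(self, workgroupName):
        return {"workgroup": {"workgroupName": workgroupName,
                              "status": next(self._statuses)}}

# Dry run: the workgroup reports CREATING twice, then AVAILABLE.
fake = FakeClient(["CREATING", "CREATING", "AVAILABLE"])
result = wait_for_workgroup(fake, "analytics-workgroup", sleep=lambda seconds: None)
print(result)
```

With a real client, the call would be `wait_for_workgroup(redshift, 'analytics-workgroup')`, leaving the default `sleep` in place.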

Concurrency Scaling

-- Concurrency scaling is not enabled with SQL. It is configured per WLM queue
-- in the cluster's parameter group: set the queue's concurrency scaling mode
-- to "auto" and cap the number of transient clusters with the
-- max_concurrency_scaling_clusters parameter.

-- Monitor concurrency scaling usage
SELECT start_time, end_time, queries, usage_in_seconds
FROM svcs_concurrency_scaling_usage
WHERE start_time > DATEADD(hour, -24, GETDATE())
ORDER BY start_time DESC;
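On a provisioned cluster, concurrency scaling is switched on per WLM queue through the cluster parameter group rather than through SQL statements. The sketch below builds the two parameters involved. The function name `build_concurrency_scaling_params` is hypothetical, and the JSON keys reflect the WLM configuration format as I understand it, so verify them against the current AWS documentation before applying. The returned list has the shape expected by the `Parameters` argument of boto3's `modify_cluster_parameter_group`.

```python
import json

def build_concurrency_scaling_params(max_clusters: int = 1):
    """Build parameter-group entries that enable concurrency scaling.

    max_clusters caps how many transient scaling clusters may be added.
    """
    if max_clusters < 0:
        raise ValueError("max_clusters must be >= 0")

    # One automatic WLM queue with concurrency scaling set to "auto".
    wlm_config = [
        {
            "auto_wlm": True,
            "user_group": [],
            "query_group": [],
            "concurrency_scaling": "auto",
        }
    ]
    return [
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        },
        {
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": str(max_clusters),
        },
    ]

params = build_concurrency_scaling_params(max_clusters=3)
for p in params:
    print(p["ParameterName"], "=", p["ParameterValue"])
```

These entries would then be passed as `Parameters=params`, together with the name of your own parameter group, to `boto3.client('redshift').modify_cluster_parameter_group`.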

Redshift Best Practices

ℹ️

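Pro Tip: Use the COPY command instead of INSERT for bulk loading. COPY reads files in parallel across slices, is typically several times faster (5-10x is a common rule of thumb), and can apply compression encodings automatically when loading an empty table.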

Loading Data

-- COPY from S3
COPY sales
FROM 's3://data-lake-processed/silver/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;

-- COPY from S3 with options
COPY sales
FROM 's3://data-lake-processed/silver/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-east-1'
COMPUPDATE OFF
STATUPDATE OFF;

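-- Unload to S3 (partition columns must appear in the SELECT list)
UNLOAD ('SELECT *, EXTRACT(month FROM sale_date)::INT AS sale_month, EXTRACT(day FROM sale_date)::INT AS sale_day FROM sales WHERE sale_date BETWEEN ''2024-01-01'' AND ''2024-12-31''')
TO 's3://data-export/sales_2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (sale_month, sale_day);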
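COPY loads fastest when the input is split into several files so that every slice has work to do. A commonly cited guideline is a file count that is a multiple of the number of slices, with files of roughly equal size. The sketch below plans such a split; `plan_copy_files` and the 256 MiB target size are assumptions for illustration, not AWS-defined values.

```python
import math

def plan_copy_files(total_bytes: int, num_slices: int,
                    target_file_bytes: int = 256 * 1024 ** 2):
    """Return (file_count, bytes_per_file), with file_count a multiple of num_slices."""
    if total_bytes <= 0 or num_slices <= 0 or target_file_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # Files needed to stay near the target size.
    wanted = max(1, math.ceil(total_bytes / target_file_bytes))
    # Round up to the next multiple of the slice count.
    file_count = math.ceil(wanted / num_slices) * num_slices
    return file_count, math.ceil(total_bytes / file_count)

# Hypothetical load: 50 GiB of delimited data into a cluster with 16 slices.
count, size = plan_copy_files(total_bytes=50 * 1024 ** 3, num_slices=16)
print(f"{count} files of about {size / 1024 ** 2:.0f} MiB each")
```

The slice count for a real cluster can be read from the `stv_slices` system table.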

Performance Monitoring

-- Query performance
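SELECT
    q.query,
    q.pid,
    q.starttime,
    q.endtime,
    DATEDIFF(seconds, q.starttime, q.endtime) AS duration,
    m.scan_row_count,
    m.return_row_count,
    m.query_cpu_time,
    m.query_blocks_read
FROM stl_query q
-- Row counts, CPU time, and block reads live in the metrics summary view
LEFT JOIN svl_query_metrics_summary m ON m.query = q.query
WHERE q.starttime > DATEADD(hour, -24, GETDATE())
ORDER BY duration DESC;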

-- Table statistics
SELECT
    "schema",
    "table",
    size,            -- in 1 MB blocks
    skew_sortkey1,
    skew_rows
FROM svv_table_info
WHERE "schema" = 'public'
ORDER BY size DESC;

-- Query plan
EXPLAIN
SELECT * FROM sales
WHERE sale_date = '2024-01-15'
AND amount > 100;
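When reading EXPLAIN output for joins, the distribution labels (DS_DIST_NONE, DS_BCAST_INNER, and so on) show whether rows have to move between nodes. The sketch below scans plan text for those labels and flags the ones that usually indicate a distribution-key mismatch. The helper names are hypothetical and the sample plans are hand-written, abbreviated illustrations rather than real Redshift output.

```python
import re

# Labels that mean a whole table, or both tables, are moved at query time.
COSTLY_LABELS = {"DS_BCAST_INNER", "DS_DIST_ALL_INNER", "DS_DIST_BOTH"}

def redistribution_labels(plan_text: str):
    """Return every DS_* distribution label found in the plan text, in order."""
    return re.findall(r"\bDS_[A-Z_]+\b", plan_text)

def costly_steps(plan_text: str):
    """Return only the labels that signal expensive data movement."""
    return [label for label in redistribution_labels(plan_text) if label in COSTLY_LABELS]

# Hand-written, abbreviated plans for illustration only.
broadcast_plan = """
XN Hash Join DS_BCAST_INNER
  ->  XN Seq Scan on sales
  ->  XN Hash
        ->  XN Seq Scan on customers
"""
colocated_plan = """
XN Hash Join DS_DIST_NONE
  ->  XN Seq Scan on sales
  ->  XN Hash
        ->  XN Seq Scan on customers
"""

print("broadcast plan:", costly_steps(broadcast_plan))
print("co-located plan:", costly_steps(colocated_plan))
```

A DS_DIST_NONE join means both tables are already distributed on the join column, which is the outcome a well-chosen DISTKEY aims for.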

Interview Questions & Answers

Q1: What is the difference between DISTSTYLE KEY and DISTSTYLE ALL?

Answer:

KEY: Distributes rows across slices based on a hash of the DISTKEY column. Best for large tables that are frequently joined on that column.
ALL: Copies the entire table to every node. Best for small, rarely updated dimension tables (<2GB).

Use KEY for large fact tables, ALL for small dimension tables.

Q2: How does Redshift Spectrum differ from Redshift?

Answer:

Redshift: Queries data stored in cluster-attached storage
Spectrum: Queries data directly in S3 without loading

Spectrum is serverless and scales independently. Use it for data lake queries without ETL.

Q3: When should you use Redshift Serverless vs. Provisioned?

Answer:

Serverless: Variable workloads, development, unpredictable queries
Provisioned: Steady-state production, predictable costs, high concurrency

Serverless is more flexible but can be expensive for continuous workloads.
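A quick break-even calculation makes the "expensive for continuous workloads" point concrete. The sketch uses the rates from the cost table in this guide ($0.25 per dc2.large node-hour and $0.375 per RPU-hour) and a hypothetical helper, `break_even_hours_per_day`. It compares price only: a given number of dc2.large nodes and a given RPU capacity do not deliver the same performance, so treat the result as a rough guide.

```python
NODE_HOUR_USD = 0.25    # dc2.large on-demand rate from this guide's cost table
RPU_HOUR_USD = 0.375    # Serverless rate from this guide's cost table

def break_even_hours_per_day(nodes: int, rpus: int) -> float:
    """Active Serverless hours per day at which both options cost the same.

    A provisioned cluster is billed for all 24 hours; Serverless only while active.
    """
    provisioned_per_day = nodes * NODE_HOUR_USD * 24
    serverless_per_hour = rpus * RPU_HOUR_USD
    return provisioned_per_day / serverless_per_hour

# Hypothetical comparison: a 4-node dc2.large cluster vs. an 8-RPU workgroup.
hours = break_even_hours_per_day(nodes=4, rpus=8)
print(f"Serverless is cheaper below {hours:.1f} active hours per day")
```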

Q4: What is the COPY command and why is it recommended?

Answer: COPY is Redshift's bulk loading command. Benefits:

Typically several times faster than row-by-row INSERT (5-10x is a common rule of thumb)
Automatic compression encoding when loading an empty table
Parallel loading from multiple files
Error handling options
Supports Parquet, ORC, JSON, CSV

Q5: How do you optimize Redshift query performance?

Answer:

Sort Keys: Use compound sort keys for range queries
Distribution Keys: Use KEY for join columns
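Compression: Use column compression encodings (ENCODE AUTO is the default)
Vacuum: Reclaim space from deleted rows and re-sort data
Analyze: Update table statistics for the query planner
Result Caching: Enabled by default; identical repeated queries are served from the cache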
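The VACUUM and ANALYZE items above can be driven by the `unsorted` and `stats_off` percentages that `svv_table_info` reports for each table. The sketch below turns those two numbers into maintenance statements. `maintenance_statements` is a hypothetical helper and the 10 percent thresholds are arbitrary starting points, not AWS recommendations.

```python
def maintenance_statements(table, unsorted_pct, stats_off_pct,
                           unsorted_limit=10.0, stats_off_limit=10.0):
    """Return the VACUUM/ANALYZE statements a table appears to need.

    unsorted_pct and stats_off_pct correspond to the `unsorted` and `stats_off`
    columns of svv_table_info; either may be None when no value is reported.
    """
    statements = []
    if unsorted_pct is not None and unsorted_pct > unsorted_limit:
        statements.append(f"VACUUM {table};")
    if stats_off_pct is not None and stats_off_pct > stats_off_limit:
        statements.append(f"ANALYZE {table};")
    return statements

# Hypothetical rows as they might come back from svv_table_info.
tables = [
    ("sales", 35.0, 2.0),
    ("events", None, 50.0),
    ("products", 1.0, 1.0),
]
for name, unsorted_pct, stats_off_pct in tables:
    print(name, "->", maintenance_statements(name, unsorted_pct, stats_off_pct))
```

Redshift also runs automatic vacuum and analyze in the background, so a check like this is mainly useful after large loads or deletes.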

Cost Considerations

Component	Cost (approximate, varies by Region)	Optimization
Provisioned	$0.25/hr per node (dc2.large)	Reserved nodes
Serverless	$0.375 per RPU-hour	Auto-pause when idle
Spectrum	$5 per TB scanned	Partition data
Managed Storage	$0.024/GB/month	Use S3 for cold data
Data Transfer	$0.09/GB outbound	Use VPC endpoints

⚠️

Cost Warning: Redshift Serverless costs can spike with complex queries. Set base capacity appropriately and monitor RPUs used per query.

Summary

Amazon Redshift is AWS's managed, columnar, massively parallel data warehouse. Key takeaways:

Architecture: Leader node + Compute nodes with slices
Distribution: KEY (facts), ALL (dimensions), EVEN (staging)
Sort Keys: Compound for range queries, Interleaved for multi-column
Spectrum: Query S3 data lake directly
Serverless: Auto-scaling, pay-per-use
Best Practices: COPY for loading, VACUUM for maintenance, ANALYZE for statistics
