Redshift Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AMAZON REDSHIFT ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REDSHIFT CLUSTER β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β LEADER NODE β β β
β β β β’ Query parsing and optimization β β β
β β β β’ SQL compilation β β β
β β β β’ Result aggregation β β β
β β β β’ Client connections β β β
β β β Instance: dc2.large (2 vCPU, 15 GB) β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β
β β βββββββββββββββββΌββββββββββββββββ β β
β β βΌ βΌ βΌ β β
β β βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ β β
β β β COMPUTE NODE 1 β β COMPUTE NODE 2 β β COMPUTE NODE N β β β
β β β β β β β β β β
β β β βββββββββββββββ β β βββββββββββββββ β β βββββββββββββββ β β β
β β β β Slices β β β β Slices β β β β Slices β β β β
β β β β (4 per node)β β β β (4 per node)β β β β (4 per node)β β β
β β β βββββββββββββββ β β βββββββββββββββ β β βββββββββββββββ β β β
β β β Instance: β β Instance: β β Instance: β β β
β β β dc2.8xlarge β β dc2.8xlarge β β dc2.8xlarge β β β
β β β (32 vCPU, 244 GB)β β (32 vCPU, 244 GB)β β (32 vCPU, 244 GB)β β β
β β βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DISTRIBUTION ARCHITECTURE β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Table: sales (DISTSTYLE KEY, DISTKEY customer_id) β β β
β β β β β β
β β β Compute Node 1 Compute Node 2 Compute N β β β
β β β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββ β β β
β β β β Slice 0 β β Slice 0 β β Slice 0 β β β β
β β β β customer_id: β β customer_id: β β customer β β β β
β β β β 1000-1999 β β 3000-3999 β β 5000-5999 β β β β
β β β βββββββββββββββββββ€ βββββββββββββββββββ€ βββββββββββββ€ β β β
β β β β Slice 1 β β Slice 1 β β Slice 1 β β β β
β β β β customer_id: β β customer_id: β β customer β β β β
β β β β 2000-2999 β β 4000-4999 β β 6000-6999 β β β β
β β β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Distribution Styles
| Style | Description | Use Case | Performance |
|---|---|---|---|
| KEY | Hash on column | Large fact tables | Best for joins |
| EVEN | Round-robin | Even distribution | Good for distribution |
| ALL | Copy to all nodes | Small dimension tables | Fastest scans |
| AUTO | Redshift decides | Most tables | Adaptive |
Distribution Strategy
-- KEY distribution for large fact tables
CREATE TABLE sales (
sale_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
amount DECIMAL(10,2),
sale_date DATE
)
DISTSTYLE KEY
DISTKEY(customer_id)
SORTKEY(sale_date);
-- ALL distribution for small dimension tables
CREATE TABLE products (
product_id INT,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2)
)
DISTSTYLE ALL
SORTKEY(category);
-- EVEN distribution for staging tables
CREATE TABLE staging_sales (
sale_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2)
)
DISTSTYLE EVEN
SORTKEY(sale_date);
-- AUTO distribution (recommended)
CREATE TABLE orders (
order_id BIGINT,
customer_id BIGINT,
order_date TIMESTAMP,
total DECIMAL(10,2)
)
DISTSTYLE AUTO;
βΉοΈ
Pro Tip: Use DISTSTYLE AUTO for most tables. Redshift will automatically choose the best distribution style based on table size and usage patterns.
Sort Keys
-- Compound Sort Key (recommended)
CREATE TABLE events (
event_id BIGINT,
event_type VARCHAR(50),
user_id BIGINT,
event_date TIMESTAMP
)
COMPOUND SORTKEY(event_date, event_type);
-- Interleaved Sort Key
CREATE TABLE logs (
log_id BIGINT,
user_id BIGINT,
event_date TIMESTAMP,
action VARCHAR(50)
)
INTERLEAVED SORTKEY(user_id, event_date, action);
Sort Key Selection Guide
| Key Type | Best For | Trade-off |
|---|---|---|
| Compound | Range queries, filtering by first column | Better for ordered scans |
| Interleaved | Multiple columns used equally | More complex maintenance |
Redshift Spectrum
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REDSHIFT SPECTRUM ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REDSHIFT CLUSTER β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Leader Node β β β
β β β β’ Query planning β β β
β β β β’ Spectra requests β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SPECTRUM LAYER (Serverless) β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Spectrum Nodes β β β
β β β β’ Auto-scaling (up to 100 nodes per query) β β β
β β β β’ Data filtering at source β β β
β β β β’ Columnar processing β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DATA LAKE (S3) β β
β β β β
β β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β β
β β β Parquet β β ORC β β JSON/CSV β β β
β β β (Recommended) β β (Columnar) β β (Row-based) β β β
β β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
External Table Creation
-- Create external schema for Spectrum
CREATE EXTERNAL SCHEMA spectrum_data
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Create external table
CREATE EXTERNAL TABLE spectrum_data.sales (
sale_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
amount DECIMAL(10,2),
sale_date DATE
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://data-lake-processed/silver/sales/'
TABLE PROPERTIES ('parquet.compression'='SNAPPY');
-- Query external table
SELECT
customer_id,
SUM(amount) as total_amount,
COUNT(*) as transaction_count
FROM spectrum_data.sales
WHERE year = 2024 AND month = 1
GROUP BY customer_id
ORDER BY total_amount DESC;
-- Query mixing internal and external tables
SELECT
c.customer_name,
SUM(s.amount) as total_sales
FROM dev.customers c
JOIN spectrum_data.sales s ON c.customer_id = s.customer_id
WHERE s.year = 2024
GROUP BY c.customer_name;
βΉοΈ
Spectrum Pricing: $5 per TB scanned. Use partitioning and columnar formats to minimize data scanned.
Redshift Serverless
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REDSHIFT SERVERLESS ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β WORKGROUP CONFIGURATION β β β
β β β β β β
β β β β’ Base Capacity: 128 - 512 RPUs β β β
β β β β’ Max Capacity: 512 RPUs β β β
β β β β’ Timeout: 300 - 600 seconds β β β
β β β β’ VPC: Configured β β β
β β β β’ Encryption: KMS managed β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β
β β βΌ β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β AUTO-SCALING β β β
β β β β β β
β β β Query Load βββΊ 128 RPUs βββΊ 256 RPUs βββΊ 512 RPUs β β β
β β β β β β β β β
β β β βΌ βΌ βΌ β β β
β β β Auto-pause βββΊ 30 sec βββΊ 1 min βββΊ 5 min βββΊ Pause β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β COST MODEL β β β
β β β β β β
β β β RPU-hour = (RPU count Γ hours used) β β β
β β β Storage: $0.024/GB/month (managed storage) β β β
β β β Data transfer: Standard AWS rates β β β
β β β β β β
β β β Example: 128 RPUs for 1 hour = $12.80 β β β
β β β 256 RPUs for 1 hour = $25.60 β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Serverless Configuration
import boto3
redshift = boto3.client('redshift-serverless')
# Create workgroup
workgroup = redshift.create_workgroup(
workgroupName='analytics-workgroup',
baseCapacity=128,
enhancedVpcRouting=True,
securityGroupIds=['sg-12345678'],
subnetIds=['subnet-12345678', 'subnet-87654321'],
publiclyAccessible=False,
configParameters=[
{
'parameterKey': 'enable_user_activity_logging',
'parameterValue': 'true'
},
{
'parameterKey': 'max_query_execution_time',
'parameterValue': '3600'
}
],
tags={
'Environment': 'production',
'Team': 'analytics'
}
)
# Create namespace
namespace = redshift.create_namespace(
namespaceName='analytics-namespace',
adminUsername='admin',
adminUserPassword='SecurePassword123!',
dbName='analytics_db',
iamRoles=['arn:aws:iam::123456789012:role/RedshiftServerlessRole'],
logExports=['useractivitylog', 'userlog', 'connectionlog'],
tags={
'Environment': 'production'
}
)
print(f"Workgroup: {workgroup['workgroup']['workgroupName']}")
print(f"Namespace: {namespace['namespace']['namespaceName']}")
Concurrency Scaling
-- Enable concurrency scaling for a cluster
ALTER TABLE sales SET AUTOMATICALLY CREATE SORT KEY ON;
-- Configure concurrency scaling
CREATE WIDGET SCALING CONCURRENTLY FOR TABLE sales
SCALING TYPE AUTO
QUEUE waitForScaling
CONCURRENCY 5;
-- Monitor concurrency scaling
SELECT * FROM stl_concurrency_scaling
WHERE start_time > DATEADD(hour, -24, GETDATE())
ORDER BY start_time DESC;
Redshift Best Practices
βΉοΈ
Pro Tip: Use COPY command instead of INSERT for bulk loading. It's 5-10x faster and automatically handles compression.
Loading Data
-- COPY from S3
COPY sales
FROM 's3://data-lake-processed/silver/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;
-- COPY from S3 with options
COPY sales
FROM 's3://data-lake-processed/silver/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-east-1'
COMPUPDATE OFF
STATUPDATE OFF;
-- Unload to S3
UNLOAD ('SELECT * FROM sales WHERE year = 2024')
TO 's3://data-export/sales_2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (month, day);
Performance Monitoring
-- Query performance
SELECT
query,
pid,
starttime,
endtime,
datediff(seconds, starttime, endtime) as duration,
rows_scanned,
rows_returned,
cpu_time,
blocks_read
FROM stl_query
WHERE starttime > DATEADD(hour, -24, GETDATE())
ORDER BY duration DESC;
-- Table statistics
SELECT
schemaname,
tablename,
size,
skew_sortkey1,
skew_rows
FROM svv_table_info
WHERE schema = 'public'
ORDER BY size DESC;
-- Query plan
EXPLAIN
SELECT * FROM sales
WHERE sale_date = '2024-01-15'
AND amount > 100;
Interview Questions & Answers
Q1: What is the difference between DISTSTYLE KEY and DISTSTYLE ALL?
Answer:
- KEY: Distributes data across nodes based on hash of DISTKEY column. Best for large tables joined frequently.
- ALL: Copies entire table to every node. Best for small dimension tables (<2GB).
Use KEY for fact tables, ALL for dimension tables.
Q2: How does Redshift Spectrum differ from Redshift?
Answer:
- Redshift: Queries data stored in cluster-attached storage
- Spectrum: Queries data directly in S3 without loading
Spectrum is serverless and scales independently. Use it for data lake queries without ETL.
Q3: When should you use Redshift Serverless vs. Provisioned?
Answer:
- Serverless: Variable workloads, development, unpredictable queries
- Provisioned: Steady-state production, predictable costs, high concurrency
Serverless is more flexible but can be expensive for continuous workloads.
Q4: What is the COPY command and why is it recommended?
Answer: COPY is Redshift's bulk loading command. Benefits:
- 5-10x faster than INSERT
- Automatic compression detection
- Parallel loading from multiple files
- Error handling options
- Supports Parquet, ORC, JSON, CSV
Q5: How do you optimize Redshift query performance?
Answer:
- Sort Keys: Use compound sort keys for range queries
- Distribution Keys: Use KEY for join columns
- Compression: Use columnar compression (Automatic)
- Vacuum: Reclaim space and resort
- Analyze: Update statistics
- Result Caching: Enable for repeated queries
Cost Considerations
| Component | Cost | Optimization |
|---|---|---|
| Provisioned | $0.25/hr per node (dc2.large) | Reserved instances |
| Serverless | $0.375 per RPU-hour | Auto-pause when idle |
| Spectrum | $5 per TB scanned | Partition data |
| Managed Storage | $0.024/GB/month | Use S3 for cold data |
| Data Transfer | $0.09/GB outbound | Use VPC endpoints |
β οΈ
Cost Warning: Redshift Serverless costs can spike with complex queries. Set base capacity appropriately and monitor RPUs used per query.
Summary
Amazon Redshift is the leading cloud data warehouse. Key takeaways:
- Architecture: Leader node + Compute nodes with slices
- Distribution: KEY (facts), ALL (dimensions), EVEN (staging)
- Sort Keys: Compound for range queries, Interleaved for multi-column
- Spectrum: Query S3 data lake directly
- Serverless: Auto-scaling, pay-per-use
- Best Practices: COPY for loading, VACUUM for maintenance, ANALYZE for statistics