πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Redshift Spectrum Deep Dive

AWS Data EngineeringExternal Tables & Data Lake Queries⭐ Premium

Advertisement

πŸ” Redshift Spectrum

Master Redshift Spectrum for querying S3 data lakes directly from Redshift.

Module: AWS Data Engineering β€’ Topic 37 of 65 β€’ Premium Content

Spectrum Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    REDSHIFT SPECTRUM ARCHITECTURE                             β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  REDSHIFT CLUSTER                                                    β”‚    β”‚
β”‚  β”‚  Leader Node: Query planning β†’ Spectrum requests                   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                β”‚                                           β”‚
β”‚                                β–Ό                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  SPECTRUM LAYER (Serverless, auto-scaling to 100 nodes)             β”‚    β”‚
β”‚  β”‚  β€’ Columnar processing                                               β”‚    β”‚
β”‚  β”‚  β€’ Predicate pushdown to S3                                          β”‚    β”‚
β”‚  β”‚  β€’ Partition pruning                                                 β”‚    β”‚
β”‚  β”‚  β€’ $5 per TB scanned                                                β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                β”‚                                           β”‚
β”‚                                β–Ό                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  S3 DATA LAKE                                                        β”‚    β”‚
β”‚  β”‚  Parquet/ORC (Recommended) | JSON/CSV (Supported)                   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

External Table Creation

-- Create external schema
CREATE EXTERNAL SCHEMA spectrum_data
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';

-- Create partitioned external table
CREATE EXTERNAL TABLE spectrum_data.sales (
    sale_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    product_category VARCHAR(50)
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://data-lake-processed/silver/sales/'
TABLE PROPERTIES ('parquet.compression'='SNAPPY');

-- Add partitions
ALTER TABLE spectrum_data.sales ADD PARTITION (year=2024, month=1, day=15)
LOCATION 's3://data-lake-processed/silver/sales/year=2024/month=1/day=15/';

-- Query with partition pruning
SELECT customer_id, SUM(amount) as total
FROM spectrum_data.sales
WHERE year = 2024 AND month = 1 AND day = 15
GROUP BY customer_id;

-- Join internal and external tables
SELECT c.name, SUM(s.amount) as total_sales
FROM dev.customers c
JOIN spectrum_data.sales s ON c.id = s.customer_id
WHERE s.year = 2024
GROUP BY c.name;

Interview Q&A

Q1: How much does Spectrum cost?

Answer: $5 per TB scanned. Minimize cost by using columnar formats (Parquet), partitioning, and predicate pushdown.

Q2: Can Spectrum join internal Redshift tables with external tables?

Answer: Yes. Spectrum federates queries between Redshift-internal tables and S3-external tables seamlessly.

Q3: What file formats does Spectrum support?

Answer: Parquet, ORC (best for columnar), Avro, JSON, CSV, Sequence, and Text.

Summary

  • Architecture: Redshift β†’ Spectrum Layer β†’ S3 Data Lake
  • Cost: $5 per TB scanned, optimize with partitioning
  • Formats: Parquet/ORC recommended for columnar access
  • Federation: Query internal Redshift + external S3 tables
  • Performance: Partition pruning, predicate pushdown, auto-scaling

Advertisement