ELT Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ELT PIPELINE: S3 β Redshift Spectrum β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DATA SOURCES β β
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β S3 Data β β External β β On-Prem β β SaaS β β β
β β β Lake β β Tables β β DB β β APIs β β β
β β ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ β β
β βββββββββΌβββββββββββββββΌβββββββββββββββΌβββββββββββββββΌββββββββββββββββ β
β βΌ βΌ βΌ βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REDSHIFT SPECTRUM (Query S3 directly) β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β External Schema β External Tables β Query in Redshift β β β
β β β β β β
β β β Benefits: β β β
β β β β’ No data movement required β β β
β β β β’ Query petabytes in S3 β β β
β β β β’ Auto-scaling compute (100 nodes per query) β β β
β β β β’ Pay per TB scanned ($5/TB) β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββΌββββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β External Tablesβ β Materialized β β Regular Tables β β
β β (S3 Direct) β β Views β β (Loaded) β β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Loading Strategies
Strategy 1: COPY Command (Bulk Load)
-- Bulk load from S3
COPY sales_fact
FROM 's3://data-lake-processed/silver/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET
COMPUPDATE OFF
STATUPDATE OFF;
Strategy 2: Spectrum for Ad-hoc
-- Query S3 directly without loading
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
SELECT customer_id, SUM(amount)
FROM spectrum_schema.sales
WHERE year = 2024 AND month = 1
GROUP BY customer_id;
Strategy 3: Staging + Upsert
-- Stage data
CREATE TEMPORARY TABLE staging_sales (LIKE sales_fact);
COPY staging_sales FROM 's3://staging/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;
-- Upsert (merge)
BEGIN TRANSACTION;
DELETE FROM sales_fact
USING staging_sales
ON sales_fact.sale_id = staging_sales.sale_id;
INSERT INTO sales_fact
SELECT * FROM staging_sales;
DROP TABLE staging_sales;
COMMIT;
Interview Q&A
Q1: When should you use Spectrum vs. COPY?
Answer: Spectrum for ad-hoc queries on rarely accessed data. COPY for frequently queried data that benefits from local storage performance.
Q2: How do you optimize Spectrum queries?
Answer: Partition data, use columnar formats (Parquet/ORC), push predicates to S3, use V3 manifest files.
Q3: What is the difference between external tables and materialized views?
Answer: External tables query S3 directly (always fresh). Materialized views pre-compute results (faster but need refresh).
Summary
- ELT: Load raw data, transform in warehouse
- Spectrum: Query S3 directly, pay per TB scanned
- COPY: Bulk load for frequently queried data
- Partitioning: Essential for Spectrum performance and cost
- Merge/Upsert: Use transactions for incremental updates