dbt with Redshift
Redshift Architecture
Sort Key Strategy
Distribution Strategy
Detailed Explanation
What is Redshift + dbt?
Amazon Redshift is a fully managed, petabyte-scale data warehouse that provides powerful analytics capabilities.
Core Features
| Feature | Description |
|---|
| Columnar Storage | Optimized for analytics queries |
| MPP | Distribute queries across nodes |
| Distribution Keys | Optimize data distribution for joins |
| Sort Keys | Optimize data ordering for queries |
| Compression | Automatic data compression |
Distribution Styles
| Style | Behavior | Best For |
|---|
| EVEN | Round-robin distribution | Default / no join optimization |
| KEY | Hash-based on a column | Large fact tables with joins |
| ALL | Copy to every node | Small dimension tables |
Sort Key Types
- Compound β Multi-column sort with prefix matching
- Interleaved β Equal-weight multi-column sort
- Auto β System-managed sort key selection
Key Takeaway: Choose distribution keys based on your most frequent JOIN patterns. Use ALL distribution for small dimension tables to avoid shuffling.
Code Examples
Redshift Profile Configuration
# profiles.yml
my_profile:
target: dev
outputs:
dev:
type: redshift
host: my-cluster.xxxxxxxxxxxx.us-west-2.redshift.amazonaws.com
port: 5439
user: my_user
password: "{{ env_var('REDSHIFT_PASSWORD') }}"
dbname: analytics_dev
schema: dbt_dev
threads: 4
connect_timeout: 10
ra3_node: true
prod:
type: redshift
host: my-prod-cluster.xxxxxxxxxxxx.us-west-2.redshift.amazonaws.com
port: 5439
user: service_account
password: "{{ env_var('REDSHIFT_PASSWORD') }}"
dbname: analytics_prod
schema: public
threads: 8
connect_timeout: 30
ra3_node: true
Distribution Key Configuration
# models/marts/fct_orders.yml
version: 2
models:
- name: fct_orders
description: "Fact table for orders"
config:
materialized: incremental
unique_key: order_id
incremental_strategy: merge
dist_style: key
dist_key: customer_id
sort_type: compound
sort_key:
- order_date
- customer_id
post_hook:
- "vacuum sortkey order by order_date"
- "analyze"
Sort Key Configuration
-- models/marts/fct_events.sql
{{
config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge',
dist_style='key',
dist_key='user_id',
sort_type='compound',
sort_key=['event_date', 'user_id', 'event_type']
)
}}
with events as (
select * from {{ ref('stg_events') }}
),
final as (
select
event_id,
user_id,
event_type,
event_timestamp,
cast(event_timestamp as date) as event_date,
event_properties,
current_timestamp() as updated_at
from events
)
select * from final
{% if is_incremental() %}
where event_date >= date_sub(
(select max(event_date) from {{ this }}),
interval 7 day
)
{% endif %}
ALL Distribution for Dimension Tables
-- models/marts/dim_customers.sql
{{
config(
materialized='table',
dist_style='all',
sort_type='compound',
sort_key=['customer_id']
)
}}
with customers as (
select * from {{ ref('stg_customers') }}
),
final as (
select
customer_id,
customer_name,
email,
segment,
current_timestamp() as updated_at
from customers
)
select * from final
Redshift-Specific Macros
-- macros/redshift/vacuum_table.sql
{% macro vacuum_table(table_name, sort_key=None) %}
{% if sort_key %}
vacuum sortkey order by {{ sort_key }} on {{ table_name }};
{% else %}
vacuum on {{ table_name }};
{% endif %}
{% endmacro %}
-- macros/redshift/analyze_table.sql
{% macro analyze_table(table_name) %}
analyze {{ table_name }};
{% endmacro %}
-- macros/redshift/grant_permissions.sql
{% macro grant_permissions(table_name, role='analytics_reader') %}
grant select on {{ table_name }} to role {{ role }};
{% endmacro %}
Workload Management Configuration
-- macros/redshift/create_wlm_queue.sql
{% macro create_wlm_queue(queue_name, memory_percent=40, concurrency=5) %}
-- Note: WLM configuration is typically done via AWS Console
-- This macro provides a template for documentation
/*
WLM Queue Configuration:
Queue: {{ queue_name }}
Memory: {{ memory_percent }}%
Concurrency: {{ concurrency }}
Query Timeout: 600 seconds
Use Cases:
- dbt runs: Use dedicated queue
- Ad-hoc queries: Use separate queue
- ETL jobs: Use high-memory queue
*/
{% endmacro %}
Optimized Redshift Model
-- models/marts/fct_orders_optimized.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
dist_style='key',
dist_key='customer_id',
sort_type='compound',
sort_key=['order_date', 'customer_id'],
post_hook=[
"{{ vacuum_table(this, 'order_date') }}",
"{{ analyze_table(this) }}"
]
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
order_id,
customer_id,
order_date,
order_status,
amount,
current_timestamp() as updated_at
from orders
)
select * from final
{% if is_incremental() %}
where order_date >= date_sub(
(select max(order_date) from {{ this }}),
interval 7 day
)
{% endif %}
Performance Metrics
| Metric | Description | Target |
|---|
| Query Time | Average query execution | <10s |
| Vacuum Time | Time to vacuum tables | <5min |
| Analyze Time | Time to analyze tables | <1min |
| Storage Usage | Percentage of disk used | <70% |
| Query Queue Time | Time waiting in queue | <5s |
Best Practices
- Choose distribution keys wisely - Match join patterns
- Use compound sort keys - For prefix-matching queries
- Distribute dimension tables as ALL - For small tables
- Use vacuum - Reclaim space after deletes
- Analyze statistics - Keep query planner informed
- Monitor query performance - Use STL tables
- Use result caching - Cache frequent queries
- Implement workload management - Prioritize queries
See Also
- Performance Tuning β General optimization strategies
- dbt BigQuery β BigQuery partitioning and clustering
- dbt Snowflake β Snowflake warehouse optimization
- dbt Best Practices β Project structure and patterns