Snowflake: The Elastic Data Cloud
Snowflake is a cloud-native data platform that separates compute from storage, enabling independent scaling of each resource layer.
Why Snowflake Matters
Traditional Warehouses:
- Require vertical scaling (bigger machines)
- Capacity planning is complex
- Limited elasticity
Snowflake's Innovation:
- Horizontal elastic scaling through virtual warehouses
- Pause, resume, and resize in seconds
- Near-infinite elasticity
- Eliminates capacity planning
Key Insight: Snowflake fundamentally changes how data teams manage analytical workloads by introducing horizontal elastic scaling through virtual warehouses that can be paused, resumed, and resized in seconds.
Architecture Overview
Snowflake 3-Layer Architecture
Compute-Storage Separation
Compute-Storage Separation
Snowflake decouples query processing (virtual warehouses) from persistent storage (micro-partitions). Each layer scales independently: warehouses are ephemeral compute clusters, while storage persists encrypted data in cloud object storage.
Snowflake Resource Model
- Storage Cost = Data_Volume_GB Γ 40/TB/month (Snowpipe)
- Compute Cost = Ξ£ (warehouse_size_credits Γ hours_used Γ credit_price)
- Credit Price = $2-4/credit depending on edition (Standard, Enterprise, Business Critical)
- Effective $/TB queried = Compute_Cost / (Data_Volume_TB Γ Queries_per_Hour)
- Storage-to-Compute Ratio = Monthly_Storage_Cost / Monthly_Compute_Cost (typical: 0.1-0.3)
-- Create a virtual warehouse with auto-suspend
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'medium'
AUTO_SUSPEND = 300 -- Suspend after 5 minutes of inactivity
AUTO_RESUME = TRUE -- Resume automatically on query
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4 -- Multi-cluster scaling
SCALING_POLICY = 'ECONOMY' -- Cost-optimized scaling
INITIALLY_SUSPENDED = FALSE;
-- Create warehouse for ad-hoc queries (smaller, faster)
CREATE WAREHOUSE adhoc_wh
WAREHOUSE_SIZE = 'x-small'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 1;
-- Monitor warehouse utilization
SELECT
warehouse_name,
warehouse_size,
SUM(credits_used) AS total_credits,
SUM(query_count) AS total_queries,
ROUND(SUM(credits_used) / NULLIF(SUM(query_count), 0), 4) AS credits_per_query,
AVG(avg_queue_time_ms) AS avg_queue_ms,
AVG(avg_execution_time_ms) AS avg_exec_ms
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY warehouse_name, warehouse_size
ORDER BY total_credits DESC;
Micro-Partitions and Pruning
Snowflake stores data in micro-partitions: compressed, columnar chunks of 50-500 MB uncompressed. Each micro-partition stores metadata including min/max values per column, enabling partition pruning β skipping irrelevant micro-partitions entirely during query execution.
Partition Pruning Efficiency
- Let M = total micro-partitions, m = micro-partitions accessed after pruning
- Pruning Ratio = (M - m) / M Γ 100%
- I/O Reduction = Pruning_Ratio Γ Data_Per_MP_GB
- Query Speedup β 1 / (1 - Pruning_Ratio) assuming I/O-bound
- Example: 1000 MPs, prune to 50 -> 95% pruning -> 20x speedup
-- Query pruning metadata
SELECT
partition_id,
row_count,
row_start_byte,
row_end_byte,
column_name,
min_value,
max_value
FROM TABLE(GET_METADATA_FOR_POSITIONAL_BLOCK('fact_sales'))
LIMIT 50;
-- Use clustering key for better pruning
CREATE TABLE fact_sales (
sale_key BIGINT,
customer_key INT,
product_key INT,
sale_date DATE,
net_amount DECIMAL(14,2)
)
CLUSTER BY (sale_date); -- Micro-partitions sorted by sale_date
-- Check clustering depth and overlap
SELECT
system$clustering_depth('fact_sales', '(sale_date)'),
system$clustering_ratio('fact_sales', '(sale_date)'),
system$clustering_information('fact_sales', '(sale_date)');
Time Travel and Zero-Copy Cloning
Time Travel enables querying historical data at any point within a retention period (1-90 days on Enterprise). Snowflake maintains a metadata layer that tracks changes, allowing point-in-time queries and data recovery without restoring backups.
-- Query data as it was 2 hours ago
SELECT * FROM fact_sales AT (OFFSET => -7200);
-- Query data at a specific timestamp
SELECT * FROM fact_sales AT (TIMESTAMP => '2025-01-15 10:30:00'::TIMESTAMP_TZ);
-- Query data before a specific statement
SELECT * FROM fact_sales BEFORE (STATEMENT => 'abc123');
-- Zero-Copy Clone (instant, metadata-only)
CREATE TABLE fact_sales_staging CLONE fact_sales;
-- Clone with data freshness control
CREATE TABLE fact_sales_jan CLONE fact_sales
AT (TIMESTAMP => '2025-01-31 23:59:59'::TIMESTAMP_TZ);
-- Drop and restore using Time Travel
DROP TABLE fact_sales;
UNDROP TABLE fact_sales;
-- Cross-region clone
CREATE TABLE fact_sales_dr CLONE snowflake_db.fact_sales
AT (TIMESTAMP => '2025-01-15 00:00:00'::TIMESTAMP_TZ);
Data Sharing and Snowflake Marketplace
Snowflake Secure Data Sharing enables zero-copy, real-time data sharing between Snowflake accounts without data movement. Data remains in the provider's account while consumers query it directly via their virtual warehouse.
-- Create a share
CREATE SHARE company_analytics_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE company_analytics_share;
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE company_analytics_share;
GRANT SELECT ON TABLE analytics_db.public.fact_daily_metrics TO SHARE company_analytics_share;
-- Add consumer account
ALTER SHARE company_analytics_share ADD ACCOUNTS = consumer_account_id;
-- Create reader account for external sharing
CREATE ACCOUNT reader_acct
ADMIN_NAME = 'admin'
ADMIN_PASSWORD = 'secure_password'
EMAIL = 'reader@company.com'
TYPE = READER;
-- List shares available to this account
SHOW SHARES;
SELECT * FROM INFORMATION_SCHEMA.SHARES;
Multi-Cluster Auto-Scaling
Multi-Cluster Scaling
- Cluster Count: C β [C_min, C_max]
- Concurrency Level: T = C Γ T_per_cluster (typically 8-16 concurrent queries per cluster)
- Scaling Trigger: Queue depth > T_per_cluster for > 60 seconds -> add cluster
- Scale-Down Trigger: Utilization < 50% for > 300 seconds -> remove cluster
- Elasticity Score = Actual_Concurrency / Max_Possible_Concurrency
-- Monitor multi-cluster scaling events
SELECT
warehouse_name,
start_time,
end_time,
cluster_number,
credits_used,
queries_executed
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE warehouse_name = 'analytics_wh'
AND start_date >= DATEADD(day, -7, CURRENT_DATE())
ORDER BY start_time DESC;
-- Query to check if scaling is needed
SELECT
warehouse_name,
AVG(available) AS avg_available,
AVG(queued) AS avg_queued,
CASE
WHEN AVG(queued) > 8 THEN 'SCALE UP NEEDED'
WHEN AVG(available) > 12 THEN 'SCALE DOWN OPPORTUNITY'
ELSE 'ADEQUATE'
END AS scaling_recommendation
FROM TABLE(INFORMATION_SCHEMA.WAREHOUSE_LOAD_HISTORY(
TIME_RANGE_START => DATEADD(day, -7, CURRENT_DATE()),
TIME_RANGE_END => CURRENT_DATE()
))
GROUP BY warehouse_name;
Snowflake Editions and Pricing
Snowflake offers multiple editions with different feature sets and pricing tiers. Standard provides basic functionality, Enterprise adds Time Travel and materialized views, Business Critical adds HIPAA compliance and AWS PrivateLink, and Virtual Private Snowflake provides dedicated infrastructure.
Snowflake Pricing Model
- Standard Edition: $2/credit, 1-day Time Travel, basic security
- Enterprise Edition: $3/credit, 90-day Time Travel, materialized views, search optimization
- Business Critical: $4/credit, HIPAA, PCI DSS, AWS PrivateLink, Tri-Secret Secure
- Virtual Private Snowflake: Custom pricing, dedicated infrastructure
- Storage: 40/TB/month (Snowpipe Continuous Ingestion)
- Cost Optimization: Annual commitment -> 10-30% discount vs pay-as-you-go
-- Monitor credit consumption by warehouse
SELECT
warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 3 AS estimated_cost_usd,
SUM(query_count) AS total_queries,
ROUND(SUM(credits_used) * 3.0 / NULLIF(SUM(query_count), 0), 4) AS cost_per_query
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- Monitor storage consumption
SELECT
table_schema,
SUM(active_bytes) / 1024 / 1024 / 1024 AS active_storage_gb,
SUM(time_travel_bytes) / 1024 / 1024 / 1024 AS time_travel_gb,
SUM(fail_safe_bytes) / 1024 / 1024 / 1024 AS fail_safe_gb
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
WHERE deleted = 0
GROUP BY table_schema
ORDER BY active_storage_gb DESC;
-- Cost optimization: Identify expensive queries
SELECT
query_id,
query_text,
user_name,
warehouse_name,
credits_used,
ROUND(credits_used * 3, 4) AS cost_usd,
bytes_scanned / 1024 / 1024 / 1024 AS gb_scanned,
execution_time_ms
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_DATE())
AND credits_used > 1
ORDER BY credits_used DESC
LIMIT 20;
Key Concepts Summary
| Concept | Description | Benefit | Default Setting |
|---|---|---|---|
| Virtual Warehouse | Ephemeral compute cluster | Elastic scaling | X-Small, auto-suspend 300s |
| Micro-Partition | 50-500MB compressed columnar chunk | Automatic pruning | 50-500 MB uncompressed |
| Time Travel | Query historical data | Recovery, audit | 1 day (Standard), 90 days (Enterprise) |
| Zero-Copy Clone | Instant metadata-only copy | Dev/test, branching | No additional storage |
| Multi-Cluster | Auto-scaling compute clusters | High concurrency | 1-10 clusters |
| Data Sharing | Zero-copy cross-account sharing | Collaboration | No data movement |
| Result Caching | Automatic query result cache | Instant repeat queries | Enabled by default |
| Automatic Clustering | Self-managing cluster maintenance | Optimal pruning | Manual or auto |
| Snowpipe | Continuous file ingestion | Near-real-time loading | Auto-ingest enabled |
| Streams | Change data capture | Incremental processing | Append-only by default |
| Tasks | Scheduled SQL execution | ETL orchestration | Serverless option |
| Materialized Views | Pre-computed query results | Dashboard performance | Manual refresh |
Performance Metrics
| Warehouse Size | Credits/Hour | Queries/Hour (est.) | Max Data Scanned | Cost/Query (est.) |
|---|---|---|---|---|
| X-Small | 1 | 100-500 | 100 GB | $0.004-0.02 |
| Small | 2 | 200-1,000 | 500 GB | $0.004-0.01 |
| Medium | 4 | 400-2,000 | 1 TB | $0.004-0.01 |
| Large | 8 | 600-3,000 | 2 TB | $0.005-0.013 |
| X-Large | 16 | 800-4,000 | 4 TB | $0.008-0.02 |
| 2X-Large | 32 | 1,000-5,000 | 8 TB | $0.013-0.032 |
| 3X-Large | 64 | 1,200-6,000 | 16 TB | $0.021-0.053 |
| 4X-Large | 128 | 1,500-8,000 | 32 TB | $0.043-0.085 |
10 Best Practices
- Right-size virtual warehouses β start small, monitor queue times, scale only when needed
- Use AUTO_SUSPEND aggressively β set 60-300 seconds to avoid paying for idle compute
- Implement clustering keys on frequently filtered columns to maximize micro-partition pruning
- Leverage result caching β identical queries within 24 hours return cached results instantly
- Use Zero-Copy Clones for dev/test instead of full data copies
- Enable Time Travel retention at the Enterprise level for compliance and debugging
- Monitor warehouse utilization via ACCOUNT_USAGE views and adjust sizes weekly
- Use Snowpipe for continuous ingestion rather than bulk COPY for near-real-time data
- Separate warehouses by workload β ETL, ad-hoc, and BI should use dedicated warehouses
- Set up alerts for queue depth > 10 and credit consumption anomalies
- Compute-storage separation enables independent scaling and cost optimization
- Micro-partitions with automatic pruning reduce I/O by 80-95% for filtered queries
- Time Travel provides point-in-time recovery without backup restoration overhead
- Multi-cluster warehouses handle concurrency spikes automatically
- Proper warehouse sizing and auto-suspend are the primary cost levers
See Also
- Snowflake Advanced β Streams, Tasks, Dynamic Tables, and Snowpipe
- Data Warehouse Concepts β Star schema, snowflake schema, and dimensional modeling
- BigQuery Fundamentals β Google's serverless data warehouse alternative
- Redshift Fundamentals β Amazon's MPP data warehouse
- dbt Fundamentals β SQL-first transformation layer
- Cost Optimization β Managing data platform expenses across warehouses