πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Snowflake Data Lakehouse Architecture

🟒 Free Lesson

Advertisement

Snowflake Data Lakehouse Architecture

The Data Lakehouse architecture represents a paradigm shift in data platform design, unifying the best aspects of data lakes and data warehouses into a single, cohesive platform.

Architecture Diagram 2: Data Lakehouse Unified Architecture

Data Lakehouse Unified ArchitectureIngestionBatch & streamingCDC, files, APIsCloud StorageS3 / ADLS / GCSParquet, ORC, JSONSnowflake ComputeSQL, SnowparkML, Cortex AIAnalytics & MLBI dashboardsData science, ML modelsGoverned by Catalog | ACID Transactions | Time Travel | Schema Enforcement

What is a Data Lakehouse?

A Data Lakehouse combines the low-cost, flexible storage of data lakes with the performance, reliability, and ACID transactions of data warehouses.

Key Characteristics

FeatureData LakeData WarehouseData Lakehouse
Storage CostLowHighLow
Schema EnforcementNoneStrictFlexible
ACID TransactionsNoYesYes
Data FormatsProprietaryProprietaryOpen (Parquet, ORC)
ML SupportExcellentLimitedExcellent
BI SupportLimitedExcellentExcellent

What is a Data Lakehouse?

  • Combines data lake flexibility with data warehouse reliability
  • Open formats (Parquet, ORC) enable multi-engine access
  • ACID transactions ensure data integrity
  • Cost-effective storage with warehouse-grade performance

Snowflake's Lakehouse Implementation

External Tables

  • Query data stored externally without copying
  • Supports Parquet, JSON, ORC formats
  • Auto-refresh for up-to-date queries

Iceberg Tables

  • Open table format with ACID transactions
  • Schema evolution without table recreation
  • Time travel on Iceberg snapshots

Hybrid Tables

  • OLTP + OLAP on the same table
  • Primary key enforcement
  • ACID transactions for single-row operations

Medallion Architecture

LayerPurposeData State
BronzeRaw data ingestionRaw, unvalidated
SilverCleaned and validatedStandardized, deduplicated
GoldBusiness-ready aggregatesAggregated, enriched
  • Progressive data refinement β€” each layer adds quality and logic
  • ACID transactions ensure integrity across layers

Cost Optimization

StrategyImplementationBenefit
Auto-scalingMulti-cluster warehousesPay only for active compute
Storage tieringTime-based partitioningMove old data to cheaper storage
Query optimizationPartition pruningReduce data scanned
Materialized viewsPre-aggregate common queriesFaster response times

Key Takeaway: The Lakehouse pattern is ideal for organizations needing both lake flexibility and warehouse reliability. Snowflake's compute-storage separation makes this seamless.

Snowflake's Lakehouse Implementation

Snowflake implements the Lakehouse pattern through several key features:

External Tables

External tables allow Snowflake to query data stored externally without copying it:

-- Create external table from S3
CREATE OR REPLACE EXTERNAL TABLE my_external_table
  WITH LOCATION = @my_s3_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = true;

-- Query external data directly
SELECT * FROM my_external_table
WHERE date >= '2024-01-01';

Iceberg Tables (Snowflake Native Iceberg)

Snowflake supports Apache Iceberg for open table formats:

-- Create Iceberg table
CREATE OR REPLACE EXTERNAL TABLE my_iceberg_table
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_volume'
  TABLE_FORMAT = ICEBERG
  DATA_FILES_FORMAT = (TYPE = PARQUET);

-- Time travel on Iceberg tables
SELECT * FROM my_iceberg_table
  AT (OFFSET => -3600);  -- 1 hour ago

Hybrid Tables

Hybrid tables combine transactional and analytical workloads:

-- Create hybrid table for OLTP-like operations
CREATE OR REPLACE HYBRID TABLE orders (
  order_id INTEGER PRIMARY KEY,
  customer_id INTEGER,
  order_date TIMESTAMP_NTZ,
  total_amount DECIMAL(10,2),
  status VARCHAR(20)
);

-- ACID transactions on hybrid tables
BEGIN TRANSACTION;
  INSERT INTO orders VALUES (1, 100, CURRENT_TIMESTAMP(), 99.99, 'PENDING');
  UPDATE orders SET status = 'COMPLETED' WHERE order_id = 1;
COMMIT;

Data Lakehouse Patterns in Snowflake

Medallion Architecture

The Medallion (Multi-Hop) architecture organizes data into bronze, silver, and gold layers:

-- Bronze Layer: Raw data ingestion
CREATE OR REPLACE TABLE bronze.raw_orders (
  raw_data VARIANT
);

-- Silver Layer: Cleaned and validated data
CREATE OR REPLACE TABLE silver.orders AS
  SELECT
    raw_data:order_id::INTEGER AS order_id,
    raw_data:customer_id::INTEGER AS customer_id,
    raw_data:order_date::DATE AS order_date,
    raw_data:amount::DECIMAL(10,2) AS amount,
    CURRENT_TIMESTAMP() AS ingested_at
  FROM bronze.raw_orders;

-- Gold Layer: Business-ready aggregates
CREATE OR REPLACE TABLE gold.daily_sales_summary AS
  SELECT
    order_date,
    COUNT(*) AS total_orders,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
  FROM silver.orders
  GROUP BY order_date;

Data Sharing Across Domains

-- Share data across business domains
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA gold TO SHARE sales_share;
GRANT SELECT ON gold.daily_sales_summary TO SHARE sales_share;

-- Consumers can query shared data
CREATE DATABASE sales_from_share FROM SHARE sales_share;
SELECT * FROM sales_from_share.gold.daily_sales_summary;

Performance Optimization for Lakehouse

Partitioning Strategies

-- Partition external data by date
CREATE OR REPLACE EXTERNAL TABLE daily_events
  PARTITION BY (event_date)
  WITH LOCATION = @events_stage
  FILE_FORMAT = (TYPE = PARQUET);

-- Query optimization through partition pruning
SELECT * FROM daily_events
  WHERE event_date = '2024-01-15';

Clustering for Large Tables

-- Cluster large fact tables
CREATE OR REPLACE TABLE large_fact_table CLUSTER BY (event_date, region)
  AS SELECT * FROM staging_events;

-- Manual re-clustering
ALTER TABLE large_fact_table RECLUSTER;

The Lakehouse pattern is ideal for organizations that need both the flexibility of a data lake and the reliability of a data warehouse. Snowflake's architecture makes this particularly seamless because the compute and storage layers are independent.

Cost Optimization in Lakehouse

StrategyImplementationBenefit
Auto-scalingMulti-cluster warehousesPay only for active compute
Storage tieringTime-based partitioningMove old data to cheaper storage
Query optimizationPartition pruningReduce data scanned
Materialized viewsPre-aggregate common queriesFaster response times

Key Takeaways:

  • Data Lakehouse combines lake flexibility with warehouse reliability
  • Snowflake supports Iceberg, Hybrid Tables, and External Tables
  • Medallion Architecture (Bronze/Silver/Gold) organizes data progressively
  • ACID transactions ensure data integrity in the lakehouse
  • Cost optimization through partitioning, clustering, and auto-scaling
⭐

Premium Content

Snowflake Data Lakehouse Architecture

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Snowflake Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement