Iceberg Tables: External Volumes, Hybrid Tables & Open Formats

Architecture Diagram 1: Iceberg Table Architecture

Architecture Diagram 2: Hybrid Tables Architecture

Architecture Diagram 3: Data Lakehouse Architecture

Architecture Diagram 4: Iceberg Table Metadata Layers

Apache Iceberg Tables are open table format standards that provide ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes. Snowflake supports Iceberg as an external table format, enabling open data lake architectures while leveraging Snowflake's query engine.

The Iceberg metadata layer is a catalog abstraction that tracks table state via metadata files and manifest files. It enables atomic operations (commit, rollback) and snapshot isolation without requiring Spark or Hive — Snowflake reads Iceberg metadata directly.

Iceberg vs Snowflake Native Tradeoff

total\_cost = storage\_cost + compute\_cost + metadata\_overhead

Iceberg Partition Evolution

Hidden partitioning: Partitioning defined in metadata, not in query
Partition evolution: Change partitioning without rewriting data
Time-based partitioning: Partition by day/month using partition transforms
Equality transforms: Partition by column ranges (e.g., 100-value buckets)
Migration cost: One-time metadata rewrite; data blocks unchanged

Use Iceberg for: multi-engine data lakes, vendor independence, compliance-driven data retention, and gradual migration from legacy systems. Use Snowflake native for: performance-critical analytics, managed infrastructure, and simplified operations.

Open format: Apache Iceberg tables work across Spark, Trino, DuckDB, Snowflake
ACID transactions: Atomic commits with snapshot isolation on data lakes
Schema evolution: Add/rename columns without table recreation
Time travel: Query historical snapshots for auditing and reproducibility
Snowflake integration: Native Iceberg catalog support (2024+)

Detailed Explanation

What are Iceberg Tables?

Open Apache Iceberg table format providing ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes — integrated natively in Snowflake.

External Volumes and Storage

Define cloud storage locations (S3, Azure Blob, GCS) with IAM credentials
Multiple Iceberg tables can share an external volume
Data remains in your account; Snowflake manages compute and metadata

Hybrid Tables: OLAP + OLTP

Feature	Regular Table	Hybrid Table
Single-row INSERT	Medium (full scan)	Fast (PK lookup)
Single-row UPDATE	Slow (full scan)	Fast (PK lookup)
Primary Key	No	Yes
Analytics	Fast	Fast

Combines columnar storage with transactional optimizations
Primary key enforcement for fast single-row operations

Open Table Formats Comparison

Format	Schema Evolution	Time Travel	Best For
Iceberg	Full (add/rename/drop)	Snapshots + timestamps	Multi-engine, schema evolution
Delta Lake	Add columns only	Delta log versions	Spark ecosystem
Hudi	Add columns only	Commit timeline	Incremental processing

Data Lakehouse Architecture

Combines data lake flexibility with warehouse reliability
Open formats enable multi-engine access (Snowflake, Spark, Trino)
Data layers: raw → processed → analytics
Governance: schema registry, lineage, RBAC, quality

Key Takeaway: Use Iceberg for multi-engine access and vendor independence. Use Snowflake native for performance-critical analytics.

Key Concepts Table

Feature	Iceberg Tables	Hybrid Tables	Regular Tables
Storage Location	External volume	Snowflake-managed	Snowflake-managed
ACID Transactions	Yes	Yes	Yes
Schema Evolution	Yes	Yes	Limited
Time Travel	Yes	Yes	Yes
Primary Key	No	Yes	No
Single-row DML	No	Yes	Limited
External Access	Yes	No	No

Open Format	ACID	Schema Evolution	Time Travel	Best For
Iceberg	Yes	Yes	Yes	Multi-engine, schema evolution
Delta Lake	Yes	Yes	Yes	Spark ecosystem, schema enforcement
Hudi	Yes	Yes	Yes	Incremental processing, real-time

Metric	Iceberg	Hybrid	Regular
Single-row INSERT	Slow	Fast	Medium
Bulk INSERT	Fast	Fast	Fast
Single-row UPDATE	Slow	Fast	Slow
Analytics Query	Fast	Fast	Fast
Storage Cost	External (lower)	Internal	Internal

Code Examples

External Volumes and Iceberg Table Setup

-- Example 1: Create external volume
-- External volumes map to cloud storage locations with IAM credentials
-- Parameters:
--   STORAGE_PROVIDER: 'S3', 'AZURE', or 'GCS'
--   STORAGE_AWS_ROLE_ARN: IAM role ARN for S3 access
--   STORAGE_BASE_LOCATIONS: Array of S3 paths (one per Snowflake account)
--   ENCRYPTION_TYPE: Server-side encryption type
CREATE EXTERNAL VOLUME my_iceberg_volume
    STORAGE_PROVIDER = 'S3'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeAccessRole'
    STORAGE_BASE_LOCATIONS = ('s3://my-bucket/iceberg-data')
    ENCRYPTION_TYPE = 'AES_SSE_S3'   -- Server-side encryption
    COMMENT = 'External volume for Iceberg tables';

-- Verify external volume
DESC EXTERNAL VOLUME my_iceberg_volume;
SHOW EXTERNAL VOLUMES;

-- Example 2: Create Iceberg table
-- Parameters:
--   CATALOG: 'snowflake' (native Iceberg catalog)
--   EXTERNAL_VOLUME: References the external volume created above
--   BASE_LOCATION: Subpath within the volume for this table's data
--   STORAGE_SERIALIZATION_FORMAT: 'PARQUET' (default) or 'ORC'
--   STORAGE_COMPRESSION_TYPE: 'ZSTD' (best compression) or 'SNAPPY' (fast)
CREATE ICEBERG TABLE sales_iceberg (
    order_id INTEGER,
    customer_id INTEGER,
    order_date DATE,
    amount DECIMAL(10,2),
    region VARCHAR(10)
)
CATALOG = 'snowflake'                          -- Snowflake-managed Iceberg catalog
EXTERNAL_VOLUME = 'my_iceberg_volume'          -- Cloud storage volume
BASE_LOCATION = 'sales'                        -- Subpath for this table
STORAGE_SERIALIZATION_FORMAT = 'PARQUET'       -- Data file format
STORAGE_COMPRESSION_TYPE = 'ZSTD'              -- Compression algorithm
COMMENT = 'Sales data in Iceberg format';

Data Operations on Iceberg Tables

-- Example 3: Insert data into Iceberg table
-- Iceberg tables support standard DML: INSERT, UPDATE, DELETE, MERGE
INSERT INTO sales_iceberg 
SELECT * FROM raw_sales_data;

-- Example 4: Query Iceberg table with partition pruning
-- Iceberg metadata enables automatic partition pruning
SELECT 
    region,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM sales_iceberg
WHERE order_date >= '2024-01-01'   -- Partition pruning on date column
GROUP BY region;

-- Example 5: UPDATE operations on Iceberg
-- Iceberg supports row-level updates (unlike external tables)
UPDATE sales_iceberg 
SET amount = amount * 1.10    -- 10% price increase
WHERE region = 'US' AND order_date >= '2024-06-01';

-- Example 6: DELETE operations on Iceberg
DELETE FROM sales_iceberg 
WHERE order_date < '2023-01-01';   -- Remove old data

-- Example 7: MERGE (upsert) operations on Iceberg
MERGE INTO sales_iceberg t
USING staging_sales s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET 
        amount = s.amount,
        region = s.region,
        updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, order_date, amount, region)
    VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.region);

Time Travel and Schema Evolution

-- Example 8: Query with time travel
-- Iceberg supports time travel via snapshot IDs or timestamps
-- AT (TIMESTAMP): Point-in-time query
-- AT (OFFSET): N seconds ago
-- AT (STATEMENT): Before/after a specific query
SELECT * FROM sales_iceberg
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_TZ)
WHERE region = 'US';

-- Example 9: Time travel with snapshot offset
-- Query state 1 hour ago
SELECT * FROM sales_iceberg
AT (OFFSET => -3600)    -- -3600 seconds = 1 hour ago
WHERE region = 'US';

-- Example 10: Schema evolution
-- Add new column (existing data gets NULL for new column)
ALTER TABLE sales_iceberg ADD COLUMN country VARCHAR(10);

-- Rename column (metadata-only operation, no data rewrite)
ALTER TABLE sales_iceberg RENAME COLUMN region TO sales_region;

-- Drop column (metadata-only operation)
ALTER TABLE sales_iceberg DROP COLUMN deprecated_column;

-- Change column data type (if compatible)
ALTER TABLE sales_iceberg ALTER COLUMN amount SET DATA TYPE DECIMAL(12,2);

-- Example 11: Create Iceberg table from existing data (CTAS)
CREATE OR REPLACE ICEBERG TABLE sales_iceberg_v2
CATALOG = 'snowflake'
EXTERNAL_VOLUME = 'my_iceberg_volume'
BASE_LOCATION = 'sales_v2'
STORAGE_SERIALIZATION_FORMAT = 'PARQUET'
AS SELECT 
    order_id,
    customer_id,
    order_date,
    amount,
    region,
    CURRENT_TIMESTAMP() AS created_at
FROM raw_sales_data
WHERE order_date >= '2024-01-01';

Hybrid Tables

-- Example 12: Create hybrid table
-- Hybrid tables support both OLAP (analytics) and OLTP (transactional) workloads
-- PRIMARY KEY: Enforces uniqueness and enables fast single-row lookups
-- CLUSTER BY: Optimizes data layout for frequent query patterns
CREATE HYBRID TABLE orders_hybrid (
    order_id INTEGER PRIMARY KEY,            -- Unique constraint enforced
    customer_id INTEGER NOT NULL,            -- Not null constraint
    order_date DATE NOT NULL,
    amount DECIMAL(10,2),
    status VARCHAR(20) DEFAULT 'PENDING',   -- Default value for new rows
    created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
CLUSTER BY (order_date, status)             -- Cluster key for query optimization
COMMENT = 'Hybrid table for order management';

-- Example 13: Single-row operations on hybrid table
-- Insert: Fast primary key lookup, ACID transaction
INSERT INTO orders_hybrid (order_id, customer_id, order_date, amount, status)
VALUES (1, 101, '2024-01-15', 99.99, 'PENDING');

-- Update: Primary key lookup (no full table scan)
UPDATE orders_hybrid 
SET status = 'SHIPPED', updated_at = CURRENT_TIMESTAMP()
WHERE order_id = 1;

-- Delete: Primary key lookup
DELETE FROM orders_hybrid WHERE order_id = 1;

-- Example 14: Upsert operation on hybrid table
MERGE INTO orders_hybrid t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET 
        amount = s.amount,
        status = s.status,
        updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, order_date, amount, status)
    VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.status);

-- Example 15: Bulk operations on hybrid table
-- Hybrid tables support efficient bulk operations alongside single-row DML
INSERT INTO orders_hybrid
SELECT * FROM staging_orders_bulk;

-- Bulk update
UPDATE orders_hybrid 
SET status = 'ARCHIVED'
WHERE order_date < '2023-01-01';

Monitoring and Metadata

-- Example 16: Monitor Iceberg table metadata
SELECT *
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';

-- Example 17: Query Iceberg table metadata
-- View snapshots, partitions, and data files
SELECT * FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES()) 
WHERE table_name = 'SALES_ICEBERG';

-- Example 18: Monitor hybrid table performance
-- Check primary key constraint violations
SELECT * FROM TABLE(INFORMATION_SCHEMA.TABLE_CONSTRAINTS())
WHERE table_name = 'ORDERS_HYBRID' 
AND constraint_type = 'PRIMARY KEY';

-- Example 19: View Iceberg table storage statistics
SELECT 
    table_name,
    storage_location,
    total_rows,
    total_bytes
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';

-- Example 20: Compare Iceberg vs regular table performance
EXPLAIN SELECT * FROM sales_iceberg WHERE order_date = '2024-01-15';
EXPLAIN SELECT * FROM regular_sales WHERE order_date = '2024-01-15';

Performance Metrics

Metric	Target	Warning	Critical
Iceberg Query Latency	< 30s	30-120s	> 120s
Hybrid Single-row Ops	< 10ms	10-50ms	> 50ms
Metadata Read Time	< 1s	1-5s	> 5s
Data File Scan Rate	> 100 MB/s	50-100 MB/s	< 50 MB/s

Best Practices

Choose appropriate table format: Use Iceberg for multi-engine access and schema evolution, Hybrid for mixed OLAP/OLTP workloads.
Optimize external volumes: Place external volumes in the same region as your Snowflake account to minimize data transfer costs.
Manage Iceberg snapshots: Configure snapshot retention to balance time travel needs with storage costs.
Use hybrid tables wisely: Reserve hybrid tables for workloads requiring single-row transactions. Use regular tables for pure analytics.
Implement data partitioning: Use Iceberg's hidden partitioning to optimize query performance without exposing partition columns.
Monitor storage costs: Track external storage usage and implement lifecycle policies for old data files.
Test multi-engine access: Validate that open table formats work correctly across different compute engines.
Implement data quality: Use Iceberg's schema enforcement and constraints to maintain data quality.
Plan for scale: Design table structures to handle expected data growth without performance degradation.
Document data lineage: Track data provenance and transformations for compliance and debugging.

Additional Theory: Iceberg vs Delta Lake vs Hudi

Feature	Iceberg	Delta Lake	Hudi
ACID Transactions	Yes	Yes	Yes
Schema Evolution	Full (add/rename/drop)	Add columns only	Add columns only
Time Travel	Snapshots + timestamps	Delta log versions	Commit timeline
Partition Evolution	Yes (hidden partitioning)	No (requires rewrite)	Yes (explicit)
Engine Support	Spark, Trino, Snowflake, Flink	Spark (primary), Trino	Spark (primary)
Metadata Format	Avro + JSON	JSON commit log	Avro + log blocks
Upsert Performance	Good	Good	Excellent (record-level)
Best For	Multi-engine, schema evolution	Spark ecosystem	Real-time ingestion

Key Iceberg advantages:

Hidden partitioning eliminates partition columns from queries
Partition evolution changes partitioning without data rewrite
Snapshot isolation enables concurrent readers and writers
Open format ensures no vendor lock-in

Additional Theory: Hybrid Table Performance Characteristics

Operation	Regular Table	Hybrid Table	Improvement
Single-row INSERT	Medium (full scan)	Fast (PK lookup)	10-100×
Bulk INSERT	Fast	Fast	Same
Single-row UPDATE	Slow (full scan)	Fast (PK lookup)	10-100×
Bulk UPDATE	Fast	Fast	Same
Single-row DELETE	Slow (full scan)	Fast (PK lookup)	10-100×
Complex analytics	Fast	Fast	Same
JOIN operations	Fast	Fast	Same

Iceberg Tables: External Volumes, Hybrid Tables & Open Formats

Iceberg Tables: External Volumes, Hybrid Tables & Open Formats

Architecture Diagram 1: Iceberg Table Architecture

Architecture Diagram 2: Hybrid Tables Architecture

Architecture Diagram 3: Data Lakehouse Architecture

Architecture Diagram 4: Iceberg Table Metadata Layers

Iceberg Partition Evolution

Detailed Explanation

What are Iceberg Tables?

External Volumes and Storage

Hybrid Tables: OLAP + OLTP

Open Table Formats Comparison

Data Lakehouse Architecture

Key Concepts Table

Code Examples

External Volumes and Iceberg Table Setup

Data Operations on Iceberg Tables

Time Travel and Schema Evolution

Hybrid Tables

Monitoring and Metadata

Performance Metrics

Best Practices

Additional Theory: Iceberg vs Delta Lake vs Hudi

Additional Theory: Hybrid Table Performance Characteristics

See Also

Premium Content

Need Expert Snowflake Help?