πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Iceberg Tables: External Volumes, Hybrid Tables & Open Formats

🟒 Free Lesson

Advertisement

Iceberg Tables: External Volumes, Hybrid Tables & Open Formats

Architecture Diagram 1: Iceberg Table Architecture

Architecture Diagram 2: Hybrid Tables Architecture

Architecture Diagram 3: Data Lakehouse Architecture

Architecture Diagram 4: Iceberg Table Metadata Layers

Iceberg Table Metadata Layer OrganizationMetadata FileTable schemaPartition specManifest ListSnapshot pointersPartition rangesManifest FilesData file listingColumn statsData FilesParquet/ORCActual dataSnapshot isolation | ACID transactions | Schema evolution | Partition evolution without data rewrite

Apache Iceberg Tables are open table format standards that provide ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes. Snowflake supports Iceberg as an external table format, enabling open data lake architectures while leveraging Snowflake's query engine.

The Iceberg metadata layer is a catalog abstraction that tracks table state via metadata files and manifest files. It enables atomic operations (commit, rollback) and snapshot isolation without requiring Spark or Hive β€” Snowflake reads Iceberg metadata directly.

Iceberg vs Snowflake Native Tradeoff
total_cost=storage_cost+compute_cost+metadata_overheadtotal\_cost = storage\_cost + compute\_cost + metadata\_overhead

Iceberg Partition Evolution

  • Hidden partitioning: Partitioning defined in metadata, not in query
  • Partition evolution: Change partitioning without rewriting data
  • Time-based partitioning: Partition by day/month using partition transforms
  • Equality transforms: Partition by column ranges (e.g., 100-value buckets)
  • Migration cost: One-time metadata rewrite; data blocks unchanged

Use Iceberg for: multi-engine data lakes, vendor independence, compliance-driven data retention, and gradual migration from legacy systems. Use Snowflake native for: performance-critical analytics, managed infrastructure, and simplified operations.

  • Open format: Apache Iceberg tables work across Spark, Trino, DuckDB, Snowflake
  • ACID transactions: Atomic commits with snapshot isolation on data lakes
  • Schema evolution: Add/rename columns without table recreation
  • Time travel: Query historical snapshots for auditing and reproducibility
  • Snowflake integration: Native Iceberg catalog support (2024+)


Detailed Explanation

What are Iceberg Tables?

Open Apache Iceberg table format providing ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes β€” integrated natively in Snowflake.


External Volumes and Storage

  • Define cloud storage locations (S3, Azure Blob, GCS) with IAM credentials
  • Multiple Iceberg tables can share an external volume
  • Data remains in your account; Snowflake manages compute and metadata

Hybrid Tables: OLAP + OLTP

FeatureRegular TableHybrid Table
Single-row INSERTMedium (full scan)Fast (PK lookup)
Single-row UPDATESlow (full scan)Fast (PK lookup)
Primary KeyNoYes
AnalyticsFastFast
  • Combines columnar storage with transactional optimizations
  • Primary key enforcement for fast single-row operations

Open Table Formats Comparison

FormatSchema EvolutionTime TravelBest For
IcebergFull (add/rename/drop)Snapshots + timestampsMulti-engine, schema evolution
Delta LakeAdd columns onlyDelta log versionsSpark ecosystem
HudiAdd columns onlyCommit timelineIncremental processing

Data Lakehouse Architecture

  • Combines data lake flexibility with warehouse reliability
  • Open formats enable multi-engine access (Snowflake, Spark, Trino)
  • Data layers: raw β†’ processed β†’ analytics
  • Governance: schema registry, lineage, RBAC, quality

Key Takeaway: Use Iceberg for multi-engine access and vendor independence. Use Snowflake native for performance-critical analytics.

Key Concepts Table

FeatureIceberg TablesHybrid TablesRegular Tables
Storage LocationExternal volumeSnowflake-managedSnowflake-managed
ACID TransactionsYesYesYes
Schema EvolutionYesYesLimited
Time TravelYesYesYes
Primary KeyNoYesNo
Single-row DMLNoYesLimited
External AccessYesNoNo
Open FormatACIDSchema EvolutionTime TravelBest For
IcebergYesYesYesMulti-engine, schema evolution
Delta LakeYesYesYesSpark ecosystem, schema enforcement
HudiYesYesYesIncremental processing, real-time
MetricIcebergHybridRegular
Single-row INSERTSlowFastMedium
Bulk INSERTFastFastFast
Single-row UPDATESlowFastSlow
Analytics QueryFastFastFast
Storage CostExternal (lower)InternalInternal

Code Examples

External Volumes and Iceberg Table Setup

-- Example 1: Create external volume
-- External volumes map to cloud storage locations with IAM credentials
-- Parameters:
--   STORAGE_PROVIDER: 'S3', 'AZURE', or 'GCS'
--   STORAGE_AWS_ROLE_ARN: IAM role ARN for S3 access
--   STORAGE_BASE_LOCATIONS: Array of S3 paths (one per Snowflake account)
--   ENCRYPTION_TYPE: Server-side encryption type
CREATE EXTERNAL VOLUME my_iceberg_volume
    STORAGE_PROVIDER = 'S3'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeAccessRole'
    STORAGE_BASE_LOCATIONS = ('s3://my-bucket/iceberg-data')
    ENCRYPTION_TYPE = 'AES_SSE_S3'   -- Server-side encryption
    COMMENT = 'External volume for Iceberg tables';

-- Verify external volume
DESC EXTERNAL VOLUME my_iceberg_volume;
SHOW EXTERNAL VOLUMES;

-- Example 2: Create Iceberg table
-- Parameters:
--   CATALOG: 'snowflake' (native Iceberg catalog)
--   EXTERNAL_VOLUME: References the external volume created above
--   BASE_LOCATION: Subpath within the volume for this table's data
--   STORAGE_SERIALIZATION_FORMAT: 'PARQUET' (default) or 'ORC'
--   STORAGE_COMPRESSION_TYPE: 'ZSTD' (best compression) or 'SNAPPY' (fast)
CREATE ICEBERG TABLE sales_iceberg (
    order_id INTEGER,
    customer_id INTEGER,
    order_date DATE,
    amount DECIMAL(10,2),
    region VARCHAR(10)
)
CATALOG = 'snowflake'                          -- Snowflake-managed Iceberg catalog
EXTERNAL_VOLUME = 'my_iceberg_volume'          -- Cloud storage volume
BASE_LOCATION = 'sales'                        -- Subpath for this table
STORAGE_SERIALIZATION_FORMAT = 'PARQUET'       -- Data file format
STORAGE_COMPRESSION_TYPE = 'ZSTD'              -- Compression algorithm
COMMENT = 'Sales data in Iceberg format';

Data Operations on Iceberg Tables

-- Example 3: Insert data into Iceberg table
-- Iceberg tables support standard DML: INSERT, UPDATE, DELETE, MERGE
INSERT INTO sales_iceberg 
SELECT * FROM raw_sales_data;

-- Example 4: Query Iceberg table with partition pruning
-- Iceberg metadata enables automatic partition pruning
SELECT 
    region,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM sales_iceberg
WHERE order_date >= '2024-01-01'   -- Partition pruning on date column
GROUP BY region;

-- Example 5: UPDATE operations on Iceberg
-- Iceberg supports row-level updates (unlike external tables)
UPDATE sales_iceberg 
SET amount = amount * 1.10    -- 10% price increase
WHERE region = 'US' AND order_date >= '2024-06-01';

-- Example 6: DELETE operations on Iceberg
DELETE FROM sales_iceberg 
WHERE order_date < '2023-01-01';   -- Remove old data

-- Example 7: MERGE (upsert) operations on Iceberg
MERGE INTO sales_iceberg t
USING staging_sales s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET 
        amount = s.amount,
        region = s.region,
        updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, order_date, amount, region)
    VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.region);

Time Travel and Schema Evolution

-- Example 8: Query with time travel
-- Iceberg supports time travel via snapshot IDs or timestamps
-- AT (TIMESTAMP): Point-in-time query
-- AT (OFFSET): N seconds ago
-- AT (STATEMENT): Before/after a specific query
SELECT * FROM sales_iceberg
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_TZ)
WHERE region = 'US';

-- Example 9: Time travel with snapshot offset
-- Query state 1 hour ago
SELECT * FROM sales_iceberg
AT (OFFSET => -3600)    -- -3600 seconds = 1 hour ago
WHERE region = 'US';

-- Example 10: Schema evolution
-- Add new column (existing data gets NULL for new column)
ALTER TABLE sales_iceberg ADD COLUMN country VARCHAR(10);

-- Rename column (metadata-only operation, no data rewrite)
ALTER TABLE sales_iceberg RENAME COLUMN region TO sales_region;

-- Drop column (metadata-only operation)
ALTER TABLE sales_iceberg DROP COLUMN deprecated_column;

-- Change column data type (if compatible)
ALTER TABLE sales_iceberg ALTER COLUMN amount SET DATA TYPE DECIMAL(12,2);

-- Example 11: Create Iceberg table from existing data (CTAS)
CREATE OR REPLACE ICEBERG TABLE sales_iceberg_v2
CATALOG = 'snowflake'
EXTERNAL_VOLUME = 'my_iceberg_volume'
BASE_LOCATION = 'sales_v2'
STORAGE_SERIALIZATION_FORMAT = 'PARQUET'
AS SELECT 
    order_id,
    customer_id,
    order_date,
    amount,
    region,
    CURRENT_TIMESTAMP() AS created_at
FROM raw_sales_data
WHERE order_date >= '2024-01-01';

Hybrid Tables

-- Example 12: Create hybrid table
-- Hybrid tables support both OLAP (analytics) and OLTP (transactional) workloads
-- PRIMARY KEY: Enforces uniqueness and enables fast single-row lookups
-- CLUSTER BY: Optimizes data layout for frequent query patterns
CREATE HYBRID TABLE orders_hybrid (
    order_id INTEGER PRIMARY KEY,            -- Unique constraint enforced
    customer_id INTEGER NOT NULL,            -- Not null constraint
    order_date DATE NOT NULL,
    amount DECIMAL(10,2),
    status VARCHAR(20) DEFAULT 'PENDING',   -- Default value for new rows
    created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
CLUSTER BY (order_date, status)             -- Cluster key for query optimization
COMMENT = 'Hybrid table for order management';

-- Example 13: Single-row operations on hybrid table
-- Insert: Fast primary key lookup, ACID transaction
INSERT INTO orders_hybrid (order_id, customer_id, order_date, amount, status)
VALUES (1, 101, '2024-01-15', 99.99, 'PENDING');

-- Update: Primary key lookup (no full table scan)
UPDATE orders_hybrid 
SET status = 'SHIPPED', updated_at = CURRENT_TIMESTAMP()
WHERE order_id = 1;

-- Delete: Primary key lookup
DELETE FROM orders_hybrid WHERE order_id = 1;

-- Example 14: Upsert operation on hybrid table
MERGE INTO orders_hybrid t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET 
        amount = s.amount,
        status = s.status,
        updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, order_date, amount, status)
    VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.status);

-- Example 15: Bulk operations on hybrid table
-- Hybrid tables support efficient bulk operations alongside single-row DML
INSERT INTO orders_hybrid
SELECT * FROM staging_orders_bulk;

-- Bulk update
UPDATE orders_hybrid 
SET status = 'ARCHIVED'
WHERE order_date < '2023-01-01';

Monitoring and Metadata

-- Example 16: Monitor Iceberg table metadata
SELECT *
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';

-- Example 17: Query Iceberg table metadata
-- View snapshots, partitions, and data files
SELECT * FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES()) 
WHERE table_name = 'SALES_ICEBERG';

-- Example 18: Monitor hybrid table performance
-- Check primary key constraint violations
SELECT * FROM TABLE(INFORMATION_SCHEMA.TABLE_CONSTRAINTS())
WHERE table_name = 'ORDERS_HYBRID' 
AND constraint_type = 'PRIMARY KEY';

-- Example 19: View Iceberg table storage statistics
SELECT 
    table_name,
    storage_location,
    total_rows,
    total_bytes
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';

-- Example 20: Compare Iceberg vs regular table performance
EXPLAIN SELECT * FROM sales_iceberg WHERE order_date = '2024-01-15';
EXPLAIN SELECT * FROM regular_sales WHERE order_date = '2024-01-15';

Performance Metrics

MetricTargetWarningCritical
Iceberg Query Latency< 30s30-120s> 120s
Hybrid Single-row Ops< 10ms10-50ms> 50ms
Metadata Read Time< 1s1-5s> 5s
Data File Scan Rate> 100 MB/s50-100 MB/s< 50 MB/s

Best Practices

  1. Choose appropriate table format: Use Iceberg for multi-engine access and schema evolution, Hybrid for mixed OLAP/OLTP workloads.

  2. Optimize external volumes: Place external volumes in the same region as your Snowflake account to minimize data transfer costs.

  3. Manage Iceberg snapshots: Configure snapshot retention to balance time travel needs with storage costs.

  4. Use hybrid tables wisely: Reserve hybrid tables for workloads requiring single-row transactions. Use regular tables for pure analytics.

  5. Implement data partitioning: Use Iceberg's hidden partitioning to optimize query performance without exposing partition columns.

  6. Monitor storage costs: Track external storage usage and implement lifecycle policies for old data files.

  7. Test multi-engine access: Validate that open table formats work correctly across different compute engines.

  8. Implement data quality: Use Iceberg's schema enforcement and constraints to maintain data quality.

  9. Plan for scale: Design table structures to handle expected data growth without performance degradation.

  10. Document data lineage: Track data provenance and transformations for compliance and debugging.


Additional Theory: Iceberg vs Delta Lake vs Hudi

FeatureIcebergDelta LakeHudi
ACID TransactionsYesYesYes
Schema EvolutionFull (add/rename/drop)Add columns onlyAdd columns only
Time TravelSnapshots + timestampsDelta log versionsCommit timeline
Partition EvolutionYes (hidden partitioning)No (requires rewrite)Yes (explicit)
Engine SupportSpark, Trino, Snowflake, FlinkSpark (primary), TrinoSpark (primary)
Metadata FormatAvro + JSONJSON commit logAvro + log blocks
Upsert PerformanceGoodGoodExcellent (record-level)
Best ForMulti-engine, schema evolutionSpark ecosystemReal-time ingestion

Key Iceberg advantages:

  • Hidden partitioning eliminates partition columns from queries
  • Partition evolution changes partitioning without data rewrite
  • Snapshot isolation enables concurrent readers and writers
  • Open format ensures no vendor lock-in

Additional Theory: Hybrid Table Performance Characteristics

OperationRegular TableHybrid TableImprovement
Single-row INSERTMedium (full scan)Fast (PK lookup)10-100Γ—
Bulk INSERTFastFastSame
Single-row UPDATESlow (full scan)Fast (PK lookup)10-100Γ—
Bulk UPDATEFastFastSame
Single-row DELETESlow (full scan)Fast (PK lookup)10-100Γ—
Complex analyticsFastFastSame
JOIN operationsFastFastSame

See Also

⭐

Premium Content

Iceberg Tables: External Volumes, Hybrid Tables & Open Formats

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Snowflake Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement