Iceberg Tables: External Volumes, Hybrid Tables & Open Formats
Architecture Diagram 1: Iceberg Table Architecture
Architecture Diagram 2: Hybrid Tables Architecture
Architecture Diagram 3: Data Lakehouse Architecture
Architecture Diagram 4: Iceberg Table Metadata Layers
Apache Iceberg Tables are open table format standards that provide ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes. Snowflake supports Iceberg as an external table format, enabling open data lake architectures while leveraging Snowflake's query engine.
The Iceberg metadata layer is a catalog abstraction that tracks table state via metadata files and manifest files. It enables atomic operations (commit, rollback) and snapshot isolation without requiring Spark or Hive β Snowflake reads Iceberg metadata directly.
Iceberg Partition Evolution
- Hidden partitioning: Partitioning defined in metadata, not in query
- Partition evolution: Change partitioning without rewriting data
- Time-based partitioning: Partition by day/month using partition transforms
- Equality transforms: Partition by column ranges (e.g., 100-value buckets)
- Migration cost: One-time metadata rewrite; data blocks unchanged
Use Iceberg for: multi-engine data lakes, vendor independence, compliance-driven data retention, and gradual migration from legacy systems. Use Snowflake native for: performance-critical analytics, managed infrastructure, and simplified operations.
- Open format: Apache Iceberg tables work across Spark, Trino, DuckDB, Snowflake
- ACID transactions: Atomic commits with snapshot isolation on data lakes
- Schema evolution: Add/rename columns without table recreation
- Time travel: Query historical snapshots for auditing and reproducibility
- Snowflake integration: Native Iceberg catalog support (2024+)
Detailed Explanation
What are Iceberg Tables?
Open Apache Iceberg table format providing ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes β integrated natively in Snowflake.
External Volumes and Storage
- Define cloud storage locations (S3, Azure Blob, GCS) with IAM credentials
- Multiple Iceberg tables can share an external volume
- Data remains in your account; Snowflake manages compute and metadata
Hybrid Tables: OLAP + OLTP
| Feature | Regular Table | Hybrid Table |
|---|---|---|
| Single-row INSERT | Medium (full scan) | Fast (PK lookup) |
| Single-row UPDATE | Slow (full scan) | Fast (PK lookup) |
| Primary Key | No | Yes |
| Analytics | Fast | Fast |
- Combines columnar storage with transactional optimizations
- Primary key enforcement for fast single-row operations
Open Table Formats Comparison
| Format | Schema Evolution | Time Travel | Best For |
|---|---|---|---|
| Iceberg | Full (add/rename/drop) | Snapshots + timestamps | Multi-engine, schema evolution |
| Delta Lake | Add columns only | Delta log versions | Spark ecosystem |
| Hudi | Add columns only | Commit timeline | Incremental processing |
Data Lakehouse Architecture
- Combines data lake flexibility with warehouse reliability
- Open formats enable multi-engine access (Snowflake, Spark, Trino)
- Data layers: raw β processed β analytics
- Governance: schema registry, lineage, RBAC, quality
Key Takeaway: Use Iceberg for multi-engine access and vendor independence. Use Snowflake native for performance-critical analytics.
Key Concepts Table
| Feature | Iceberg Tables | Hybrid Tables | Regular Tables |
|---|---|---|---|
| Storage Location | External volume | Snowflake-managed | Snowflake-managed |
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes | Limited |
| Time Travel | Yes | Yes | Yes |
| Primary Key | No | Yes | No |
| Single-row DML | No | Yes | Limited |
| External Access | Yes | No | No |
| Open Format | ACID | Schema Evolution | Time Travel | Best For |
|---|---|---|---|---|
| Iceberg | Yes | Yes | Yes | Multi-engine, schema evolution |
| Delta Lake | Yes | Yes | Yes | Spark ecosystem, schema enforcement |
| Hudi | Yes | Yes | Yes | Incremental processing, real-time |
| Metric | Iceberg | Hybrid | Regular |
|---|---|---|---|
| Single-row INSERT | Slow | Fast | Medium |
| Bulk INSERT | Fast | Fast | Fast |
| Single-row UPDATE | Slow | Fast | Slow |
| Analytics Query | Fast | Fast | Fast |
| Storage Cost | External (lower) | Internal | Internal |
Code Examples
External Volumes and Iceberg Table Setup
-- Example 1: Create external volume
-- External volumes map to cloud storage locations with IAM credentials
-- Parameters:
-- STORAGE_PROVIDER: 'S3', 'AZURE', or 'GCS'
-- STORAGE_AWS_ROLE_ARN: IAM role ARN for S3 access
-- STORAGE_BASE_LOCATIONS: Array of S3 paths (one per Snowflake account)
-- ENCRYPTION_TYPE: Server-side encryption type
CREATE EXTERNAL VOLUME my_iceberg_volume
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeAccessRole'
STORAGE_BASE_LOCATIONS = ('s3://my-bucket/iceberg-data')
ENCRYPTION_TYPE = 'AES_SSE_S3' -- Server-side encryption
COMMENT = 'External volume for Iceberg tables';
-- Verify external volume
DESC EXTERNAL VOLUME my_iceberg_volume;
SHOW EXTERNAL VOLUMES;
-- Example 2: Create Iceberg table
-- Parameters:
-- CATALOG: 'snowflake' (native Iceberg catalog)
-- EXTERNAL_VOLUME: References the external volume created above
-- BASE_LOCATION: Subpath within the volume for this table's data
-- STORAGE_SERIALIZATION_FORMAT: 'PARQUET' (default) or 'ORC'
-- STORAGE_COMPRESSION_TYPE: 'ZSTD' (best compression) or 'SNAPPY' (fast)
CREATE ICEBERG TABLE sales_iceberg (
order_id INTEGER,
customer_id INTEGER,
order_date DATE,
amount DECIMAL(10,2),
region VARCHAR(10)
)
CATALOG = 'snowflake' -- Snowflake-managed Iceberg catalog
EXTERNAL_VOLUME = 'my_iceberg_volume' -- Cloud storage volume
BASE_LOCATION = 'sales' -- Subpath for this table
STORAGE_SERIALIZATION_FORMAT = 'PARQUET' -- Data file format
STORAGE_COMPRESSION_TYPE = 'ZSTD' -- Compression algorithm
COMMENT = 'Sales data in Iceberg format';
Data Operations on Iceberg Tables
-- Example 3: Insert data into Iceberg table
-- Iceberg tables support standard DML: INSERT, UPDATE, DELETE, MERGE
INSERT INTO sales_iceberg
SELECT * FROM raw_sales_data;
-- Example 4: Query Iceberg table with partition pruning
-- Iceberg metadata enables automatic partition pruning
SELECT
region,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM sales_iceberg
WHERE order_date >= '2024-01-01' -- Partition pruning on date column
GROUP BY region;
-- Example 5: UPDATE operations on Iceberg
-- Iceberg supports row-level updates (unlike external tables)
UPDATE sales_iceberg
SET amount = amount * 1.10 -- 10% price increase
WHERE region = 'US' AND order_date >= '2024-06-01';
-- Example 6: DELETE operations on Iceberg
DELETE FROM sales_iceberg
WHERE order_date < '2023-01-01'; -- Remove old data
-- Example 7: MERGE (upsert) operations on Iceberg
MERGE INTO sales_iceberg t
USING staging_sales s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET
amount = s.amount,
region = s.region,
updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, order_date, amount, region)
VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.region);
Time Travel and Schema Evolution
-- Example 8: Query with time travel
-- Iceberg supports time travel via snapshot IDs or timestamps
-- AT (TIMESTAMP): Point-in-time query
-- AT (OFFSET): N seconds ago
-- AT (STATEMENT): Before/after a specific query
SELECT * FROM sales_iceberg
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_TZ)
WHERE region = 'US';
-- Example 9: Time travel with snapshot offset
-- Query state 1 hour ago
SELECT * FROM sales_iceberg
AT (OFFSET => -3600) -- -3600 seconds = 1 hour ago
WHERE region = 'US';
-- Example 10: Schema evolution
-- Add new column (existing data gets NULL for new column)
ALTER TABLE sales_iceberg ADD COLUMN country VARCHAR(10);
-- Rename column (metadata-only operation, no data rewrite)
ALTER TABLE sales_iceberg RENAME COLUMN region TO sales_region;
-- Drop column (metadata-only operation)
ALTER TABLE sales_iceberg DROP COLUMN deprecated_column;
-- Change column data type (if compatible)
ALTER TABLE sales_iceberg ALTER COLUMN amount SET DATA TYPE DECIMAL(12,2);
-- Example 11: Create Iceberg table from existing data (CTAS)
CREATE OR REPLACE ICEBERG TABLE sales_iceberg_v2
CATALOG = 'snowflake'
EXTERNAL_VOLUME = 'my_iceberg_volume'
BASE_LOCATION = 'sales_v2'
STORAGE_SERIALIZATION_FORMAT = 'PARQUET'
AS SELECT
order_id,
customer_id,
order_date,
amount,
region,
CURRENT_TIMESTAMP() AS created_at
FROM raw_sales_data
WHERE order_date >= '2024-01-01';
Hybrid Tables
-- Example 12: Create hybrid table
-- Hybrid tables support both OLAP (analytics) and OLTP (transactional) workloads
-- PRIMARY KEY: Enforces uniqueness and enables fast single-row lookups
-- CLUSTER BY: Optimizes data layout for frequent query patterns
CREATE HYBRID TABLE orders_hybrid (
order_id INTEGER PRIMARY KEY, -- Unique constraint enforced
customer_id INTEGER NOT NULL, -- Not null constraint
order_date DATE NOT NULL,
amount DECIMAL(10,2),
status VARCHAR(20) DEFAULT 'PENDING', -- Default value for new rows
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
CLUSTER BY (order_date, status) -- Cluster key for query optimization
COMMENT = 'Hybrid table for order management';
-- Example 13: Single-row operations on hybrid table
-- Insert: Fast primary key lookup, ACID transaction
INSERT INTO orders_hybrid (order_id, customer_id, order_date, amount, status)
VALUES (1, 101, '2024-01-15', 99.99, 'PENDING');
-- Update: Primary key lookup (no full table scan)
UPDATE orders_hybrid
SET status = 'SHIPPED', updated_at = CURRENT_TIMESTAMP()
WHERE order_id = 1;
-- Delete: Primary key lookup
DELETE FROM orders_hybrid WHERE order_id = 1;
-- Example 14: Upsert operation on hybrid table
MERGE INTO orders_hybrid t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET
amount = s.amount,
status = s.status,
updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, order_date, amount, status)
VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.status);
-- Example 15: Bulk operations on hybrid table
-- Hybrid tables support efficient bulk operations alongside single-row DML
INSERT INTO orders_hybrid
SELECT * FROM staging_orders_bulk;
-- Bulk update
UPDATE orders_hybrid
SET status = 'ARCHIVED'
WHERE order_date < '2023-01-01';
Monitoring and Metadata
-- Example 16: Monitor Iceberg table metadata
SELECT *
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';
-- Example 17: Query Iceberg table metadata
-- View snapshots, partitions, and data files
SELECT * FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';
-- Example 18: Monitor hybrid table performance
-- Check primary key constraint violations
SELECT * FROM TABLE(INFORMATION_SCHEMA.TABLE_CONSTRAINTS())
WHERE table_name = 'ORDERS_HYBRID'
AND constraint_type = 'PRIMARY KEY';
-- Example 19: View Iceberg table storage statistics
SELECT
table_name,
storage_location,
total_rows,
total_bytes
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';
-- Example 20: Compare Iceberg vs regular table performance
EXPLAIN SELECT * FROM sales_iceberg WHERE order_date = '2024-01-15';
EXPLAIN SELECT * FROM regular_sales WHERE order_date = '2024-01-15';
Performance Metrics
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Iceberg Query Latency | < 30s | 30-120s | > 120s |
| Hybrid Single-row Ops | < 10ms | 10-50ms | > 50ms |
| Metadata Read Time | < 1s | 1-5s | > 5s |
| Data File Scan Rate | > 100 MB/s | 50-100 MB/s | < 50 MB/s |
Best Practices
-
Choose appropriate table format: Use Iceberg for multi-engine access and schema evolution, Hybrid for mixed OLAP/OLTP workloads.
-
Optimize external volumes: Place external volumes in the same region as your Snowflake account to minimize data transfer costs.
-
Manage Iceberg snapshots: Configure snapshot retention to balance time travel needs with storage costs.
-
Use hybrid tables wisely: Reserve hybrid tables for workloads requiring single-row transactions. Use regular tables for pure analytics.
-
Implement data partitioning: Use Iceberg's hidden partitioning to optimize query performance without exposing partition columns.
-
Monitor storage costs: Track external storage usage and implement lifecycle policies for old data files.
-
Test multi-engine access: Validate that open table formats work correctly across different compute engines.
-
Implement data quality: Use Iceberg's schema enforcement and constraints to maintain data quality.
-
Plan for scale: Design table structures to handle expected data growth without performance degradation.
-
Document data lineage: Track data provenance and transformations for compliance and debugging.
Additional Theory: Iceberg vs Delta Lake vs Hudi
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Full (add/rename/drop) | Add columns only | Add columns only |
| Time Travel | Snapshots + timestamps | Delta log versions | Commit timeline |
| Partition Evolution | Yes (hidden partitioning) | No (requires rewrite) | Yes (explicit) |
| Engine Support | Spark, Trino, Snowflake, Flink | Spark (primary), Trino | Spark (primary) |
| Metadata Format | Avro + JSON | JSON commit log | Avro + log blocks |
| Upsert Performance | Good | Good | Excellent (record-level) |
| Best For | Multi-engine, schema evolution | Spark ecosystem | Real-time ingestion |
Key Iceberg advantages:
- Hidden partitioning eliminates partition columns from queries
- Partition evolution changes partitioning without data rewrite
- Snapshot isolation enables concurrent readers and writers
- Open format ensures no vendor lock-in
Additional Theory: Hybrid Table Performance Characteristics
| Operation | Regular Table | Hybrid Table | Improvement |
|---|---|---|---|
| Single-row INSERT | Medium (full scan) | Fast (PK lookup) | 10-100Γ |
| Bulk INSERT | Fast | Fast | Same |
| Single-row UPDATE | Slow (full scan) | Fast (PK lookup) | 10-100Γ |
| Bulk UPDATE | Fast | Fast | Same |
| Single-row DELETE | Slow (full scan) | Fast (PK lookup) | 10-100Γ |
| Complex analytics | Fast | Fast | Same |
| JOIN operations | Fast | Fast | Same |
See Also
- 07-External-Tables-and-Data-Lake - External table patterns
- 09-Streams-Tasks - CDC with Iceberg tables
- 04-Time-Travel-and-Cloning - Time travel concepts
- PySpark Iceberg - Deep dive on Iceberg tables
- Delta Lake on Databricks - Delta Lake comparison
- Data Warehouse Concepts - Data warehouse design principles