Snowflake Advanced Β· Interview Prep
Data Lakehouse, External Tables & Iceberg
Difficulty: Hard Β· Commonly asked at Netflix, Uber, Google
Interview Question
"How would you architect a hybrid data lakehouse using Snowflake and S3? Discuss external tables, Iceberg tables, and the trade-offs between Snowflake-native tables and open table formats."
βΉοΈ
Companies Asking This: Netflix (Staff Data Engineer), Uber (Senior Data Engineer), Google (Cloud Data Architect), Amazon (L6 Data Engineer)
Data Lakehouse Architecture
External Tables
External tables query data directly from cloud storage without loading it into Snowflake.
Creating External Tables
-- 1. Create storage integration
CREATE OR REPLACE STORAGE INTEGRATION s3_lake_int
TYPE = EXTERNAL_STAGE
ENABLED = TRUE
STORAGE_PROVIDER = S3
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://data-lake/raw/');
-- 2. Create external stage
CREATE OR REPLACE STAGE lake_stage
URL = 's3://data-lake/raw/'
STORAGE_INTEGRATION = s3_lake_int;
-- 3. Create external table (JSON)
CREATE OR REPLACE EXTERNAL TABLE raw_events_ext (
event_id VARCHAR(100) AS ($1:event_id::VARCHAR),
event_type VARCHAR(50) AS ($1:event_type::VARCHAR),
event_timestamp TIMESTAMP_NTZ AS ($1:timestamp::TIMESTAMP_NTZ),
user_id VARCHAR(100) AS ($1:user_id::VARCHAR),
payload VARIANT AS ($1)
)
LOCATION = @lake_stage
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
AUTO_REFRESH = TRUE
REFRESH_ON_CREATE = TRUE;
-- 4. Create external table (Parquet)
CREATE OR REPLACE EXTERNAL TABLE raw_orders_ext (
order_id VARCHAR(100) AS ($1:order_id::VARCHAR),
customer_id VARCHAR(100) AS ($1:customer_id::VARCHAR),
order_date DATE AS ($1:order_date::DATE),
amount NUMBER(12,2) AS ($1:amount::NUMBER),
status VARCHAR(20) AS ($1:status::VARCHAR)
)
LOCATION = @lake_stage/orders/
FILE_FORMAT = (TYPE = 'PARQUET')
AUTO_REFRESH = TRUE;
-- 5. Query external tables
SELECT
event_type,
COUNT(*) AS event_count,
MIN(event_timestamp) AS earliest,
MAX(event_timestamp) AS latest
FROM raw_events_ext
WHERE event_timestamp >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY 1;
-- 6. Monitor external table refresh
SELECT
table_name,
refresh_time,
state,
bytes_scanned,
files_added,
files_removed
FROM information_schema.external_table_refresh_history
WHERE table_name = 'RAW_EVENTS_EXT'
ORDER BY refresh_time DESC;
Iceberg Tables
Snowflake supports Apache Iceberg, an open table format for large analytic datasets.
Creating Iceberg Tables
-- 1. Create Iceberg table backed by S3
CREATE OR REPLACE EXTERNAL TABLE orders_iceberg
CATALOG = 'SNOWFLAKE_CATALOG'
EXTERNAL_VOLUME = 's3_iceberg_volume'
TABLE_FORMAT = 'ICEBERG'
DATA_FILES_LOCATION = 's3://data-lake/iceberg/orders/'
AS
SELECT
order_id,
customer_id,
order_date,
amount,
status
FROM orders;
-- 2. Create Iceberg table from existing data
CREATE OR REPLACE ICEBERG TABLE orders_iceberg
CATALOG = 'SNOWFLAKE_CATALOG'
EXTERNAL_VOLUME = 's3_iceberg_volume'
BASE_LOCATION = 's3://data-lake/iceberg/orders/'
AS
SELECT * FROM orders;
-- 3. Query Iceberg table
SELECT
order_date,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue
FROM orders_iceberg
WHERE order_date >= '2024-01-01'
GROUP BY 1
ORDER BY 1;
-- 4. Time travel on Iceberg tables
SELECT * FROM orders_iceberg
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_NTZ);
-- 5. Schema evolution on Iceberg tables
ALTER TABLE orders_iceberg ADD COLUMN shipping_address VARCHAR(500);
ALTER TABLE orders_iceberg RENAME COLUMN amount TO total_amount;
Real-World Scenario: Netflix
Question: "How do you manage a data lake with 50PB of data using Snowflake? Discuss the hybrid approach using external tables and native tables."
Solution: Hybrid Lakehouse Strategy
-- 1. Tiered storage strategy
-- Hot data: Snowflake native tables (frequently queried)
CREATE TABLE events_hot CLUSTER BY (event_date, event_type)
AS SELECT * FROM events WHERE event_date >= DATEADD(day, -90, CURRENT_DATE());
-- Warm data: External tables (infrequently queried)
CREATE EXTERNAL TABLE events_warm (
event_id VARCHAR(100) AS ($1:event_id::VARCHAR),
event_date DATE AS ($1:event_date::DATE),
event_type VARCHAR(50) AS ($1:event_type::VARCHAR),
payload VARIANT AS ($1)
)
LOCATION = @lake_stage/events/warm/
FILE_FORMAT = (TYPE = 'PARQUET');
-- Cold data: Iceberg tables (archival)
CREATE OR REPLACE ICEBERG TABLE events_cold
CATALOG = 'SNOWFLAKE_CATALOG'
EXTERNAL_VOLUME = 's3_cold_volume'
BASE_LOCATION = 's3://data-lake/archive/events/'
AS
SELECT * FROM events WHERE event_date < DATEADD(year, -1, CURRENT_DATE());
-- 2. Data lifecycle management
CREATE OR REPLACE PROCEDURE manage_data_lifecycle()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
moved_count NUMBER;
BEGIN
-- Move data from hot to warm (90+ days old)
CREATE OR REPLACE TEMPORARY TABLE events_to_warm AS
SELECT * FROM events_hot
WHERE event_date < DATEADD(day, -90, CURRENT_DATE());
-- Insert into external stage
COPY INTO @lake_stage/events/warm/
FROM events_to_warm
FILE_FORMAT = (TYPE = 'PARQUET')
HEADER = TRUE;
-- Delete from hot
DELETE FROM events_hot
WHERE event_date < DATEADD(day, -90, CURRENT_DATE());
moved_count := SQLROWCOUNT;
RETURN 'Moved ' || moved_count || ' records from hot to warm';
END;
$$;
-- 3. Unified query across tiers
CREATE OR REPLACE VIEW events_unified AS
SELECT 'HOT' AS data_tier, * FROM events_hot
UNION ALL
SELECT 'WARM' AS data_tier, * FROM events_warm
UNION ALL
SELECT 'COLD' AS data_tier, * FROM events_cold;
Best Practices
| Tier | Storage Type | Use Case | Retention |
|---|---|---|---|
| Hot | Snowflake Native | Frequently queried, < 90 days | 90 days |
| Warm | External Tables | Infrequently queried, 90-365 days | 1 year |
| Cold | Iceberg | Archival, > 365 days | Years |
| Archive | S3 Glacier | Compliance, > 7 years | Permanent |
β οΈ
Key Trade-offs:
- Native tables β Best performance, highest cost
- External tables β Good for querying data in place, slower than native
- Iceberg tables β Open format, supports time travel, good for cross-platform access
- Hybrid approach β Balance cost and performance based on access patterns