Snowflake Advanced Β· Interview Prep
Production Best Practices & Migration
Difficulty: Hard Β· Commonly asked at Amazon, Uber, Google
Interview Question
"Walk me through a complete Snowflake migration from a legacy on-premise data warehouse. What are the key phases, risks, and validation strategies?"
βΉοΈ
Companies Asking This: Amazon (L6 Data Engineer), Uber (Staff Data Engineer), Google (Senior Cloud Architect), Microsoft (Principal Data Architect)
Migration Framework
Phase 1: Assessment
-- 1. Source system inventory (for migration planning)
CREATE TABLE migration_inventory (
source_system VARCHAR(100),
table_name VARCHAR(100),
row_count NUMBER,
size_gb NUMBER,
last_updated TIMESTAMP_NTZ,
data_types VARCHAR(500),
dependencies VARCHAR(2000),
migration_priority VARCHAR(20)
);
-- 2. Profile source data
SELECT
table_name,
COUNT(*) AS row_count,
COUNT(DISTINCT column_name) AS column_count,
SUM(CASE WHEN data_type = 'VARCHAR' THEN 1 ELSE 0 END) AS varchar_columns,
SUM(CASE WHEN data_type IN ('NUMBER', 'DECIMAL', 'FLOAT') THEN 1 ELSE 0 END) AS numeric_columns,
SUM(CASE WHEN data_type IN ('DATE', 'TIMESTAMP') THEN 1 ELSE 0 END) AS date_columns
FROM information_schema.columns
WHERE table_schema = 'SOURCE'
GROUP BY table_name
ORDER BY row_count DESC;
-- 3. Estimate migration costs
-- Assuming $10/credit, Large warehouse = 8 credits/hour
SELECT
table_name,
row_count,
size_gb,
CEIL(size_gb / 100) AS estimated_hours, -- Assume 100GB/hour throughput
CEIL(size_gb / 100) * 8 * 10 AS estimated_cost_usd -- 8 credits/hour * $10
FROM migration_inventory
ORDER BY estimated_cost_usd DESC;
Phase 2: Schema Migration
-- 1. Generate Snowflake DDL from source
-- (Example: Oracle to Snowflake mapping)
CREATE TABLE snowflake_table AS
SELECT
column_name,
CASE
WHEN data_type = 'NUMBER' THEN 'NUMBER(' || data_precision || ',' || data_scale || ')'
WHEN data_type = 'VARCHAR2' THEN 'VARCHAR(' || data_length || ')'
WHEN data_type = 'DATE' THEN 'DATE'
WHEN data_type = 'TIMESTAMP' THEN 'TIMESTAMP_NTZ'
WHEN data_type = 'CLOB' THEN 'VARCHAR(16777216)'
WHEN data_type = 'BLOB' THEN 'BINARY(16777216)'
ELSE 'VARCHAR(1000)'
END AS snowflake_type,
CASE WHEN nullable = 'Y' THEN 'TRUE' ELSE 'FALSE' END AS is_nullable
FROM source_table_columns
WHERE table_name = 'SOURCE_TABLE';
-- 2. Create target schema
CREATE SCHEMA IF NOT EXISTS prod.migrated;
-- 3. Create tables with Snowflake-optimized types
CREATE TABLE prod.migrated.orders (
order_id VARCHAR(100) PRIMARY KEY,
customer_id VARCHAR(100),
order_date TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
amount NUMBER(12,2),
status VARCHAR(20),
_source_system VARCHAR(50),
_migration_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- 4. Add clustering keys based on query patterns
ALTER TABLE prod.migrated.orders CLUSTER BY (order_date, customer_id);
Phase 3: Data Migration
-- 1. Create external stage for bulk loading
CREATE OR REPLACE STAGE migration_stage
URL = 's3://migration-bucket/data/'
STORAGE_INTEGRATION = migration_integration;
-- 2. Bulk load with COPY INTO
COPY INTO prod.migrated.orders
FROM @migration_stage/orders/
FILE_FORMAT = (
TYPE = 'PARQUET'
USE_VECTORIZED_ROW_READER = TRUE
)
ON_ERROR = 'CONTINUE'
FORCE = TRUE;
-- 3. Validate row counts
SELECT
'SOURCE' AS system,
COUNT(*) AS row_count
FROM source_db.public.orders
UNION ALL
SELECT
'TARGET' AS system,
COUNT(*) AS row_count
FROM prod.migrated.orders;
-- 4. Validate data samples
SELECT
s.order_id,
s.amount AS source_amount,
t.amount AS target_amount,
CASE
WHEN s.amount = t.amount THEN 'MATCH'
ELSE 'MISMATCH'
END AS validation_status
FROM source_db.public.orders s
JOIN prod.migrated.orders t ON s.order_id = t.order_id
WHERE s.amount != t.amount
LIMIT 100;
Phase 4: Validation
-- 1. Comprehensive validation procedure
CREATE OR REPLACE PROCEDURE validate_migration()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
source_count NUMBER;
target_count NUMBER;
mismatch_count NUMBER;
result VARCHAR;
BEGIN
-- Row count validation
EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM source_db.public.orders' INTO source_count;
EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM prod.migrated.orders' INTO target_count;
-- Data validation
EXECUTE IMMEDIATE '
SELECT COUNT(*)
FROM source_db.public.orders s
JOIN prod.migrated.orders t ON s.order_id = t.order_id
WHERE s.amount != t.amount OR s.status != t.status
' INTO mismatch_count;
result := 'Row counts - Source: ' || source_count || ', Target: ' || target_count ||
' | Mismatches: ' || mismatch_count ||
' | Match rate: ' || ROUND((1 - mismatch_count / source_count) * 100, 2) || '%';
RETURN result;
END;
$$;
-- 2. Run validation
CALL validate_migration();
-- 3. Performance benchmarking
-- Run key queries on both systems and compare
SELECT
query_name,
source_time_ms,
target_time_ms,
ROUND((source_time_ms - target_time_ms) * 100.0 / source_time_ms, 2) AS improvement_pct
FROM performance_benchmark_results
ORDER BY improvement_pct DESC;
Real-World Scenario: Amazon
Question: "How do you handle a zero-downtime migration from Oracle to Snowflake?"
Solution: Blue-Green Migration
-- 1. Set up dual-write pattern
-- Application writes to both source and target
CREATE TABLE prod.migrated.orders_dual_write AS
SELECT * FROM prod.migrated.orders WHERE 1=0;
-- 2. Create change data capture on source
CREATE OR REPLACE STREAM source_cdc
ON TABLE source_db.public.orders
SHOW_INITIAL_ROWS = TRUE;
-- 3. Sync changes to target
MERGE INTO prod.migrated.orders t
USING source_cdc s
ON t.order_id = s.order_id
WHEN MATCHED AND METADATA$ACTION = 'UPDATE' THEN
UPDATE SET
amount = s.amount,
status = s.status
WHEN MATCHED AND METADATA$ACTION = 'DELETE' THEN
DELETE
WHEN NOT MATCHED AND METADATA$ACTION = 'INSERT' THEN
INSERT (order_id, customer_id, order_date, amount, status)
VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.status);
-- 4. Gradual traffic shift
-- Phase 1: 10% to Snowflake
-- Phase 2: 50% to Snowflake
-- Phase 3: 100% to Snowflake
-- 5. Validation during migration
CREATE OR REPLACE PROCEDURE validate_dual_write()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
source_count NUMBER;
target_count NUMBER;
drift_count NUMBER;
BEGIN
SELECT COUNT(*) INTO source_count FROM source_db.public.orders;
SELECT COUNT(*) INTO target_count FROM prod.migrated.orders;
SELECT COUNT(*) INTO drift_count
FROM source_db.public.orders s
FULL OUTER JOIN prod.migrated.orders t ON s.order_id = t.order_id
WHERE s.order_id IS NULL OR t.order_id IS NULL
OR s.amount != t.amount;
RETURN 'Source: ' || source_count || ', Target: ' || target_count ||
', Drift: ' || drift_count;
END;
$$;
Production Best Practices
| Practice | Implementation |
|---|---|
| Monitoring | Set up alerts for query performance, costs, errors |
| Resource management | Use resource monitors, right-size warehouses |
| Security | Implement RBAC, masking, encryption |
| Backup & recovery | Use Time Travel, test recovery procedures |
| Documentation | Maintain data catalog, lineage, runbooks |
| Optimization | Regular performance reviews, clustering tuning |
Operational Runbook Template
-- 1. Incident response procedure
CREATE OR REPLACE PROCEDURE handle_incident(incident_type VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
CASE incident_type
WHEN 'SLOW_QUERIES' THEN
-- Check warehouse queue
RETURN 'Check warehouse queue depth and scale if needed';
WHEN 'COST_OVERRUN' THEN
-- Check resource monitors
RETURN 'Review resource monitors and adjust quotas';
WHEN 'DATA_QUALITY' THEN
-- Run quality checks
RETURN 'Execute data quality validation procedures';
ELSE
RETURN 'Unknown incident type: ' || incident_type;
END CASE;
END;
$$;
-- 2. Health check dashboard
CREATE OR REPLACE VIEW production_health AS
SELECT
'Queries' AS category,
COUNT(*) AS value,
'Queries in last hour' AS metric
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
UNION ALL
SELECT
'Slow Queries' AS category,
COUNT(*) AS value,
'Queries > 30s in last hour' AS metric
FROM snowflake.account_usage.query_history
WHERE total_elapsed_time_ms > 30000
AND start_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
UNION ALL
SELECT
'Credits' AS category,
SUM(credits_used) AS value,
'Credits used today' AS metric
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATE_TRUNC('day', CURRENT_TIMESTAMP())
UNION ALL
SELECT
'Storage' AS category,
SUM(bytes) / (1024*1024*1024*1024) AS value,
'Storage in TB' AS metric
FROM information_schema.tables;
Cost Optimization Checklist
-- 1. Auto-suspend audit
SELECT
warehouse_name,
auto_suspend,
CASE
WHEN auto_suspend = 0 THEN 'CRITICAL: Set auto-suspend'
WHEN auto_suspend < 60 THEN 'WARNING: Very short'
ELSE 'GOOD'
END AS recommendation
FROM information_schema.warehouses;
-- 2. Unused warehouse detection
SELECT
w.warehouse_name,
w.warehouse_size,
COALESCE(m.credits_last_30d, 0) AS credits_used_30d,
CASE
WHEN COALESCE(m.credits_last_30d, 0) = 0 THEN 'UNUSED - DROP'
WHEN m.credits_last_30d < 10 THEN 'LOW USAGE - CONSIDER DOWNSIZE'
ELSE 'ACTIVE'
END AS recommendation
FROM information_schema.warehouses w
LEFT JOIN (
SELECT warehouse_name, SUM(credits_used) AS credits_last_30d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
) m ON w.warehouse_name = m.warehouse_name;
-- 3. Retention optimization
SELECT
table_name,
retention_time,
bytes / (1024*1024*1024) AS size_gb,
CASE
WHEN retention_time > 90 AND size_gb > 100 THEN 'REDUCE RETENTION'
WHEN retention_time > 30 AND size_gb > 1000 THEN 'CONSIDER REDUCING'
ELSE 'APPROPRIATE'
END AS recommendation
FROM information_schema.tables
WHERE retention_time > 30
ORDER BY size_gb DESC;
Key Takeaways
βΉοΈ
Migration Success Factors:
- Thorough assessment β Understand source systems completely
- Phased approach β Don't try to migrate everything at once
- Validation β Continuous validation during and after migration
- Parallel run β Run both systems until confident
- Rollback plan β Always have a way back
- Performance testing β Benchmark before go-live
- User training β Ensure users are ready for the new system