Snowflake Advanced · Interview Prep

Zero-Copy Cloning & Data Branching

Difficulty: Medium-Hard · Commonly asked at Google, Microsoft, Amazon

Interview Question

"Explain how zero-copy cloning works in Snowflake. How would you use it to create isolated dev/test environments from production data without duplicating storage? What are the implications for data modification?"

ℹ️

Companies Asking This: Google (Senior Cloud Data Engineer), Microsoft (Principal Data Architect), Amazon (L5 Data Engineer), Meta (Data Platform Engineer)

Zero-Copy Cloning Fundamentals

Zero-copy cloning creates a new table (or schema, database) that initially shares the same underlying micro-partitions as the source. No data is physically copied — only metadata is created.

How It Works Internally

Basic Cloning Syntax

-- Clone a table (zero-copy initially)
CREATE TABLE orders_dev CLONE orders;

-- Clone with explicit schema
CREATE TABLE dev_schema.orders_dev CLONE production.orders;

-- Clone entire schema
CREATE SCHEMA dev_clone CLONE production;

-- Clone entire database
CREATE DATABASE dev_clone CLONE production_db;

-- Clone with data retention
CREATE TABLE orders_dev CLONE orders
    DATA_RETENTION_TIME_IN_DAYS = 30;

Copy-on-Write Behavior

When either the source or clone is modified, Snowflake uses copy-on-write semantics. Only the affected micro-partitions are physically duplicated.

-- Initial state: source and clone share micro-partitions
SELECT 
    table_name,
    bytes / (1024*1024*1024) AS storage_gb,
    row_count,
    clone_group_id
FROM information_schema.tables
WHERE table_name IN ('ORDERS', 'ORDERS_DEV');

-- After modifying the clone, only new micro-partitions are created
UPDATE orders_dev SET status = 'TEST' WHERE order_id < 1000;

-- Check storage impact
SELECT 
    table_name,
    bytes / (1024*1024*1024) AS storage_gb,
    row_count,
    clone_group_id
FROM information_schema.tables
WHERE table_name IN ('ORDERS', 'ORDERS_DEV');

-- The source table storage remains unchanged
-- The clone now has its own micro-partitions for modified rows

ℹ️

Key Insight: Cloning is metadata-only for reads. Storage divergence only occurs when data is modified in either the source or clone. For read-heavy clone use cases (like reporting), the storage savings are massive.

Real-World Scenario: Google

Question: "How would you set up a complete dev/test environment for a data engineering team using cloning, and what guardrails would you put in place?"

Solution: Environment Cloning Strategy

-- 1. Clone the entire production database for dev
CREATE DATABASE prod_clone 
    CLONE company_db
    DATA_RETENTION_TIME_IN_DAYS = 7;

-- 2. Create a dedicated warehouse for dev work
CREATE WAREHOUSE dev_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

-- 3. Set up access controls for dev environment
GRANT USAGE ON DATABASE prod_clone TO ROLE dev_role;
GRANT USAGE ON SCHEMA prod_clone.public TO ROLE dev_role;
GRANT SELECT ON ALL TABLES IN SCHEMA prod_clone.public TO ROLE dev_role;
GRANT CREATE TABLE ON SCHEMA prod_clone.public TO ROLE dev_role;

-- 4. Create a procedure to refresh dev clone periodically
CREATE OR REPLACE PROCEDURE refresh_dev_clone()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    refresh_start TIMESTAMP_NTZ;
BEGIN
    refresh_start := CURRENT_TIMESTAMP();
    
    -- Drop and recreate the clone
    DROP TABLE IF EXISTS prod_clone.public.orders_dev;
    CREATE TABLE prod_clone.public.orders_dev 
        CLONE company_db.public.orders;
    
    DROP TABLE IF EXISTS prod_clone.public.customers_dev;
    CREATE TABLE prod_clone.public.customers_dev 
        CLONE company_db.public.customers;
    
    RETURN 'Dev clone refreshed at ' || 
           TO_CHAR(refresh_start, 'YYYY-MM-DD HH24:MI:SS');
END;
$$;

-- 5. Schedule daily refresh
CREATE OR REPLACE TASK daily_dev_refresh
    WAREHOUSE = admin_wh
    SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
    CALL refresh_dev_clone();

Branching Strategy for Feature Development

-- Create a branch for a new feature
CREATE TABLE orders_feature_v2 CLONE orders;

-- Make changes on the branch
ALTER TABLE orders_feature_v2 ADD COLUMN discount_pct NUMBER(5,2);
UPDATE orders_feature_v2 
SET discount_pct = ROUND(RANDOM() * 0.3, 2) 
WHERE order_date >= '2024-01-01';

-- Test the feature
SELECT 
    order_id,
    total_amount,
    discount_pct,
    total_amount * (1 - discount_pct / 100) AS final_amount
FROM orders_feature_v2
WHERE discount_pct > 0;

-- If successful, merge changes to production
ALTER TABLE orders ADD COLUMN discount_pct NUMBER(5,2);
UPDATE orders o
SET discount_pct = (
    SELECT discount_pct 
    FROM orders_feature_v2 v 
    WHERE v.order_id = o.order_id
);

-- Clean up the branch
DROP TABLE orders_feature_v2;

Real-World Scenario: Microsoft

Question: "How do you handle schema evolution when cloning tables? If the source table gets a new column, does the clone automatically get it?"

Schema Evolution with Clones

-- Clone a table
CREATE TABLE orders_dev CLONE orders;

-- Add a column to the source
ALTER TABLE orders ADD COLUMN priority VARCHAR(10) DEFAULT 'NORMAL';

-- Check if clone has the new column
DESCRIBE TABLE orders_dev;
-- Result: The clone does NOT automatically get the new column
-- Cloning is a point-in-time operation

-- To get the new column in the clone, you need to re-clone
DROP TABLE orders_dev;
CREATE TABLE orders_dev CLONE orders;

-- Alternative: Manually add the column to the clone
ALTER TABLE orders_dev ADD COLUMN priority VARCHAR(10) DEFAULT 'NORMAL';

-- Update clone with data from source
UPDATE orders_dev d
SET priority = (
    SELECT priority 
    FROM orders o 
    WHERE o.order_id = d.order_id
);

⚠️

Important: Cloning is a point-in-time snapshot. After cloning, changes to the source do NOT automatically propagate to the clone. This is by design — it provides isolation but requires manual sync if you need the latest schema.

Advanced Cloning Patterns

Pattern: Selective Cloning with FILTER

-- Clone only recent data using CTAS + clone
CREATE TABLE orders_recent CLONE orders
    FILTER => DATEADD(day, -30, CURRENT_DATE());  -- Not supported

-- Alternative: Use CTAS for filtered clone
CREATE TABLE orders_recent AS
SELECT * FROM orders
WHERE order_date >= DATEADD(day, -30, CURRENT_DATE());

-- Clone specific partitions
CREATE TABLE orders_q1_2024 CLONE orders
    FILTER => order_date BETWEEN '2024-01-01' AND '2024-03-31';

Pattern: Clone with Transformation

-- Create a transformed clone for analytics
CREATE TABLE orders_analytics AS
SELECT 
    o.*,
    c.customer_segment,
    c.customer_lifetime_value,
    DATEDIFF(day, o.order_date, CURRENT_DATE()) AS days_since_order,
    NTILE(10) OVER (ORDER BY o.total_amount DESC) AS amount_decile
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATEADD(year, -1, CURRENT_DATE());

-- This is not zero-copy (data is transformed), but useful for analytics

Pattern: Cloning for Testing Data Quality Rules

-- Clone production data for testing quality rules
CREATE TABLE orders_quality_test CLONE orders;

-- Test quality rules on the clone
SELECT 
    'NULL_CUSTOMER_ID' AS rule_name,
    COUNT(*) AS violations
FROM orders_quality_test WHERE customer_id IS NULL
UNION ALL
SELECT 
    'NEGATIVE_AMOUNT' AS rule_name,
    COUNT(*) AS violations
FROM orders_quality_test WHERE total_amount < 0
UNION ALL
SELECT 
    'FUTURE_DATE' AS rule_name,
    COUNT(*) AS violations
FROM orders_quality_test WHERE order_date > CURRENT_DATE()
UNION ALL
SELECT 
    'INVALID_STATUS' AS rule_name,
    COUNT(*) AS violations
FROM orders_quality_test 
WHERE status NOT IN ('PENDING', 'SHIPPED', 'DELIVERED', 'CANCELLED');

-- If violations found, fix the rules before applying to production
DROP TABLE orders_quality_test;

Cloning Performance Considerations

-- Cloning is fast because it's metadata-only
-- But monitor the operation for large tables
CREATE OR REPLACE TASK clone_monitor
    WAREHOUSE = admin_wh
    SCHEDULE = 'USING CRON 0 0 1 * * UTC'  -- Monthly
AS
    SELECT 
        table_name,
        bytes / (1024*1024*1024) AS size_gb,
        row_count,
        clone_group_id,
        created_on
    FROM information_schema.tables
    WHERE clone_group_id IS NOT NULL
    ORDER BY bytes DESC;

-- Check clone relationship
SELECT 
    t1.table_name AS source_table,
    t2.table_name AS clone_table,
    t1.bytes / (1024*1024*1024) AS source_size_gb,
    t2.bytes / (1024*1024*1024) AS clone_size_gb,
    ROUND(
        (1 - t2.bytes / NULLIF(t1.bytes, 0)) * 100, 2
    ) AS storage_savings_pct
FROM information_schema.tables t1
JOIN information_schema.tables t2 
    ON t1.clone_group_id = t2.clone_group_id
    AND t1.table_name != t2.table_name
WHERE t1.clone_group_id IS NOT NULL;

Best Practices

Use Case	Approach	Storage Impact
Dev/Test environments	Clone entire database	Zero initially, grows with modifications
Feature branches	Clone specific tables	Zero initially
Data quality testing	Clone + filter	Minimal
Analytics sandbox	Clone + transform	Depends on transformation
Reporting isolation	Clone for BI	Zero if read-only

⚠️

Anti-Patterns:

Cloning for backups — Use Time Travel instead; cloning is not a backup strategy
Never cleaning up clones — Old clones consume storage when modified
Cloning large datasets for small tests — Use FILTER or WHERE to limit data
Expecting auto-sync — Clones are point-in-time, not live replicas

Topic: Zero-Copy Cloning & Data Branching