πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Topic: UDFs, Stored Procedures & Snowpark

Snowflake AdvancedUDFs & Snowpark⭐ Premium

Advertisement

Snowflake Advanced Β· Interview Prep

UDFs, Stored Procedures & Snowpark

Difficulty: Hard Β· Commonly asked at Microsoft, Meta, Amazon

Interview Question

"When would you use a UDF vs a stored procedure vs Snowpark? Walk me through the performance implications and use cases for each. How do you handle state management in UDFs?"

ℹ️

Companies Asking This: Microsoft (Principal Data Engineer), Meta (Staff Data Engineer), Amazon (L6 Data Engineer), Google (Senior Data Engineer)


UDFs (User-Defined Functions)

UDFs are custom functions that can be used in SQL queries. They run per-row (scalar) or per-group (aggregate) and are pushed down to the warehouse.

Scalar UDFs

-- Create a scalar UDF for tax calculation
CREATE OR REPLACE FUNCTION calculate_tax(amount NUMBER, tax_rate NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
    'amount * tax_rate / 100';

-- Use in queries
SELECT 
    order_id,
    total_amount,
    calculate_tax(total_amount, 8.5) AS tax,
    total_amount + calculate_tax(total_amount, 8.5) AS total_with_tax
FROM orders;

-- UDF with more complex logic
CREATE OR REPLACE FUNCTION categorize_amount(amount NUMBER)
RETURNS VARCHAR
LANGUAGE SQL
AS
    $$
    CASE 
        WHEN amount < 100 THEN 'LOW'
        WHEN amount < 1000 THEN 'MEDIUM'
        WHEN amount < 10000 THEN 'HIGH'
        ELSE 'PREMIUM'
    END
    $$;

-- UDF with string manipulation
CREATE OR REPLACE FUNCTION mask_email(email VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
    $$
    CONCAT(
        LEFT(email, 2),
        REPEAT('*', LENGTH(email) - POSITION('@' IN email)),
        '@',
        RIGHT(email, LENGTH(email) - POSITION('@' IN email) + 1)
    )
    $$;

-- Use mask_email
SELECT mask_email('john.doe@example.com') AS masked_email;
-- Result: jd*****@example.com

JavaScript UDFs

-- JavaScript UDF for complex string processing
CREATE OR REPLACE FUNCTION parse_user_agent(ua_string VARCHAR)
RETURNS OBJECT
LANGUAGE JAVASCRIPT
AS
    $$
    var result = {};
    if (ua_string.includes('Chrome')) {
        result.browser = 'Chrome';
    } else if (ua_string.includes('Firefox')) {
        result.browser = 'Firefox';
    } else if (ua_string.includes('Safari')) {
        result.browser = 'Safari';
    } else {
        result.browser = 'Unknown';
    }
    
    if (ua_string.includes('Windows')) {
        result.os = 'Windows';
    } else if (ua_string.includes('Mac')) {
        result.os = 'MacOS';
    } else if (ua_string.includes('Linux')) {
        result.os = 'Linux';
    } else {
        result.os = 'Unknown';
    }
    
    return result;
    $$;

-- Use JavaScript UDF
SELECT 
    user_agent,
    parse_user_agent(user_agent):browser::VARCHAR AS browser,
    parse_user_agent(user_agent):os::VARCHAR AS operating_system
FROM user_sessions;

Aggregate UDFs

-- Create an aggregate UDF for weighted average
CREATE OR REPLACE FUNCTION weighted_avg(value NUMBER, weight NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
    'SUM(value * weight) / NULLIF(SUM(weight), 0)';

-- Use in aggregate context
SELECT 
    product_category,
    weighted_avg(rating, review_count) AS weighted_avg_rating
FROM product_reviews
GROUP BY 1;

Stored Procedures

Stored procedures encapsulate business logic and can perform DDL/DML operations. They support transactions and can call other procedures.

SQL Stored Procedures

-- Create a stored procedure for data quality checks
CREATE OR REPLACE PROCEDURE check_data_quality(table_name VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    total_rows NUMBER;
    null_count NUMBER;
    duplicate_count NUMBER;
    result VARCHAR;
BEGIN
    -- Count total rows
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || table_name INTO total_rows;
    
    -- Count nulls in key columns
    EXECUTE IMMEDIATE 
        'SELECT COUNT(*) FROM ' || table_name || 
        ' WHERE order_id IS NULL OR customer_id IS NULL' 
        INTO null_count;
    
    -- Count duplicates
    EXECUTE IMMEDIATE 
        'SELECT COUNT(*) FROM (SELECT order_id, COUNT(*) c FROM ' || 
        table_name || ' GROUP BY order_id HAVING c > 1)' 
        INTO duplicate_count;
    
    result := 'Table: ' || table_name || 
              ' | Total: ' || total_rows ||
              ' | Nulls: ' || null_count ||
              ' | Duplicates: ' || duplicate_count;
    
    RETURN result;
END;
$$;

-- Call the procedure
CALL check_data_quality('ORDERS');

-- Procedure with error handling
CREATE OR REPLACE PROCEDURE safe_delete(schema_name VARCHAR, table_name VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    row_count NUMBER;
BEGIN
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || schema_name || '.' || table_name 
        INTO row_count;
    
    IF row_count > 1000000 THEN
        RETURN 'ERROR: Table has ' || row_count || ' rows. Use TRUNCATE for large tables.';
    ELSE
        EXECUTE IMMEDIATE 'DELETE FROM ' || schema_name || '.' || table_name;
        RETURN 'Deleted ' || row_count || ' rows from ' || table_name;
    END IF;
EXCEPTION
    WHEN OTHER THEN
        RETURN 'ERROR: ' || SQLERRM;
END;
$$;

Python Stored Procedures (Snowpark)

-- Create Python stored procedure
CREATE OR REPLACE PROCEDURE process_data_python(input_table VARCHAR, output_table VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
AS
$$
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, avg, count, sum as snowpark_sum

def main(session: Session, input_table: str, output_table: str) -> str:
    # Read input data
    df = session.table(input_table)
    
    # Perform transformations
    result = df.groupBy("region", "product_category").agg(
        count("*").alias("transaction_count"),
        snowpark_sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_transaction")
    ).filter(col("total_revenue") > 10000)
    
    # Write to output table
    result.write.mode("overwrite").save_as_table(output_table)
    
    return f"Processed data into {output_table}"
$$;

Real-World Scenario: Microsoft

Question: "How would you implement a real-time data quality framework using UDFs and stored procedures?"

Solution: Data Quality Framework

-- 1. Create quality rules table
CREATE TABLE data_quality_rules (
    rule_id NUMBER AUTOINCREMENT,
    table_name VARCHAR(100),
    column_name VARCHAR(100),
    rule_type VARCHAR(50),
    rule_expression VARCHAR(2000),
    severity VARCHAR(20),
    enabled BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Insert quality rules
INSERT INTO data_quality_rules (table_name, column_name, rule_type, rule_expression, severity)
VALUES 
    ('ORDERS', 'ORDER_ID', 'NOT_NULL', 'ORDER_ID IS NOT NULL', 'CRITICAL'),
    ('ORDERS', 'AMOUNT', 'RANGE', 'AMOUNT > 0 AND AMOUNT < 1000000', 'HIGH'),
    ('ORDERS', 'STATUS', 'ENUM', 'STATUS IN (''PENDING'', ''SHIPPED'', ''DELIVERED'', ''CANCELLED'')', 'MEDIUM'),
    ('ORDERS', 'ORDER_DATE', 'FUTURE_CHECK', 'ORDER_DATE <= CURRENT_DATE()', 'HIGH');

-- 2. Create quality check UDF
CREATE OR REPLACE FUNCTION check_quality_rule(
    column_value VARCHAR, 
    rule_expression VARCHAR
)
RETURNS BOOLEAN
LANGUAGE SQL
AS
    'TRY_CAST(column_value AS NUMBER) IS NOT NULL OR column_value IS NULL';

-- 3. Create quality check procedure
CREATE OR REPLACE PROCEDURE run_data_quality_checks(table_name VARCHAR)
RETURNS VARIANT
LANGUAGE SQL
AS
$$
DECLARE
    results ARRAY DEFAULT ARRAY_CONSTRUCT();
    rule_record RECORD;
    violation_count NUMBER;
    total_count NUMBER;
BEGIN
    -- Get all enabled rules for the table
    FOR rule_record IN 
        SELECT rule_id, column_name, rule_type, rule_expression, severity
        FROM data_quality_rules
        WHERE table_name = :table_name AND enabled = TRUE
    DO
        -- Execute quality check
        EXECUTE IMMEDIATE 
            'SELECT COUNT(*) FROM ' || :table_name || 
            ' WHERE NOT (' || rule_record.rule_expression || ')'
            INTO violation_count;
        
        EXECUTE IMMEDIATE 
            'SELECT COUNT(*) FROM ' || :table_name
            INTO total_count;
        
        -- Add result
        results := ARRAY_APPEND(results, OBJECT_CONSTRUCT(
            'rule_id', rule_record.rule_id,
            'column', rule_record.column_name,
            'rule_type', rule_record.rule_type,
            'severity', rule_record.severity,
            'violations', violation_count,
            'total_rows', total_count,
            'violation_pct', ROUND(violation_count * 100.0 / NULLIF(total_count, 0), 2)
        ));
    END FOR;
    
    -- Log results
    INSERT INTO data_quality_results (table_name, check_timestamp, results)
    VALUES (:table_name, CURRENT_TIMESTAMP(), PARSE_JSON(TO_VARIANT(:results)));
    
    RETURN OBJECT_CONSTRUCT(
        'table', :table_name,
        'rules_checked', ARRAY_SIZE(:results),
        'timestamp', CURRENT_TIMESTAMP()
    );
END;
$$;

-- 4. Schedule quality checks
CREATE OR REPLACE TASK daily_quality_checks
    WAREHOUSE = admin_wh
    SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
    CALL run_data_quality_checks('ORDERS');

Real-World Scenario: Meta

Question: "How do you handle performance issues with UDFs? The UDF is slowing down queries significantly."

UDF Performance Optimization

-- 1. Identify slow UDF calls
SELECT 
    query_id,
    query_text,
    total_elapsed_time_ms,
    compilation_time_ms,
    execution_time_ms
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%calculate_tax%'  -- Your UDF name
  AND start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time_ms DESC;

-- 2. Alternative: Inline UDF logic in SQL
-- Instead of UDF:
SELECT order_id, calculate_tax(amount, 8.5) AS tax FROM orders;

-- Inline the logic:
SELECT 
    order_id, 
    amount * 8.5 / 100 AS tax
FROM orders;

-- 3. Use Table Functions for complex transformations
CREATE OR REPLACE FUNCTION complex_transform(data ARRAY)
RETURNS TABLE (key VARCHAR, value VARCHAR)
LANGUAGE SQL
AS
    $$
    SELECT 
        f.key,
        f.value::VARCHAR
    FROM LATERAL FLATTEN(input => data) f
    $$;

-- 4. Consider Snowpark for complex transformations
-- (Better performance than JavaScript UDFs for complex logic)

Best Practices

Use CaseRecommended Approach
Simple row-level transformsSQL UDFs
Complex string/JSON parsingJavaScript UDFs
DDL/DML operationsStored Procedures
ML/Complex data processingSnowpark (Python/Java)
Data quality checksStored Procedures + UDFs
AggregationsSQL Aggregate UDFs

⚠️

Performance Tips:

  1. Prefer SQL UDFs over JavaScript β€” SQL UDFs are compiled and optimized
  2. Avoid UDFs in WHERE clauses β€” They prevent predicate pushdown
  3. Use lateral flatten sparingly β€” It's expensive in UDFs
  4. Consider inlining β€” Sometimes inline SQL is faster than a UDF call
  5. Test with PROFILE β€” Use PROFILE to identify UDF bottlenecks

Advertisement