Semi-Structured Data: VARIANT, JSON, XML & Array Functions
Architecture Diagram 1: VARIANT Storage Architecture
Architecture Diagram 2: Semi-Structured Data Parsing Pipeline
Architecture Diagram 3: VARIANT Storage Architecture
Architecture Diagram 4: JSON Parsing Pipeline
Architecture Diagram 5: XML Processing Architecture
VARIANT is Snowflake's universal container for semi-structured data (JSON, XML, BSON, Avro, Parquet). It automatically parses and stores hierarchical data in a compressed binary format, enabling SQL queries on nested structures using dot notation and bracket notation without explicit schema definitions.
FLATTEN converts semi-structured arrays into individual rows for relational processing. It supports output modes (OBJECT, ARRAY, BOTH) and preserves array indexes. The function uses lazy evaluation β materializing elements only when accessed by the query.
Path Resolution Complexity
- Dot notation (data:customer.name): O(depth) metadata lookup
- Bracket notation (data:items[0]): O(depth + index) lookup
- FLATTEN: O(array_size) β materializes elements on demand
- Performance: Frequently queried paths benefit from path statistics cache
- Optimization: Extract hot paths into dedicated columns for analytics workloads
Use TRY_PARSE_JSON over PARSE_JSON to handle malformed data gracefully. Extract frequently queried VARIANT fields into dedicated columns using materialized views for high-performance analytics. Prefer Parquet over raw JSON for analytics workloads.
- VARIANT stores JSON/XML/BSON natively with automatic parsing
- Dot notation (data:key) for object access, bracket notation for arrays
- FLATTEN converts arrays to rows for relational processing
- Schema evolution: New fields added without DDL changes (missing fields return NULL)
- Performance tradeoff: VARIANT is flexible but slower than extracted columns for analytics
Detailed Explanation
What is the VARIANT Data Type?
VARIANT is Snowflake's universal container for semi-structured data (JSON, XML, BSON, Avro, Parquet). It automatically parses and stores hierarchical data in compressed binary format.
VARIANT Storage and Access
- Hybrid storage: short strings inline, long strings via pointers (8β16 bytes)
- Two access patterns:
- Dot notation β
data:customer.namefor object navigation - Bracket notation β
data:items[0]for special characters or dynamic paths
- Dot notation β
- Path resolution: O(depth) metadata lookup
- Path statistics cache improves performance for frequently queried paths
JSON Processing Deep Dive
- Handles all standard JSON types: objects, arrays, strings, numbers, booleans, null
- Edge cases: malformed JSON, duplicate keys, non-standard numerics (configurable)
FLATTENconverts arrays to rows for relational processing (lazy evaluation)- Type casting required:
data:amount::NUMBER
XML Processing
- Native via
VARIANT+XMLGETfunction - Load with
TYPE = XMLfile format - Supports namespaces, attributes, mixed content, CDATA sections
- Incremental processing for large documents
Array and Object Functions
| Function | Purpose |
|---|---|
ARRAY_AGG | Aggregate rows into arrays |
ARRAY_CONSTRUCT | Build arrays from elements |
ARRAY_CAT | Concatenate arrays |
OBJECT_CONSTRUCT | Create JSON objects |
OBJECT_KEYS | Extract object keys |
XMLGET | Extract XML elements |
Schema Evolution and Flexibility
- New fields added without DDL changes (missing fields return NULL)
- Tradeoff: flexibility vs. performance β queries on many paths have metadata overhead
- Recommendation: extract hot paths into dedicated columns via materialized views for analytics
Key Takeaway: Use VARIANT for flexible ingestion, extract frequently queried fields into relational columns for performance.
Key Concepts Table
| Function | Purpose | Input | Output |
|---|---|---|---|
| FLATTEN | Convert arrays to rows | ARRAY VARIANT | Table of VARIANT |
| PARSE_JSON | Parse JSON string | STRING | VARIANT |
| TRY_PARSE_JSON | Safe JSON parse | STRING | VARIANT (NULL on error) |
| OBJECT_CONSTRUCT | Create JSON object | Key-value pairs | VARIANT |
| ARRAY_CONSTRUCT | Create JSON array | Elements | VARIANT |
| XMLGET | Extract XML element | VARIANT, path | VARIANT |
| TO_VARIANT | Convert to VARIANT | Any type | VARIANT |
| Data Type | Storage | Access Pattern | Performance |
|---|---|---|---|
| STRING | Raw text | Full scan required | Slow |
| VARIANT | Parsed binary | Path navigation | Fast |
| OBJECT | Parsed binary | Key lookup | Very fast |
| ARRAY | Parsed binary | Index access | Very fast |
| GEOGRAPHY | Binary geometry | Spatial functions | Fast |
| Metric | VARIANT | Flattened Columns | Description |
|---|---|---|---|
| Storage Efficiency | High (compressed) | Medium (redundant) | VARIANT stores once |
| Query Performance | Medium | High | Flattened columns avoid parsing |
| Flexibility | Very High | Low | VARIANT adapts to schema changes |
| Analytics | Limited | Full | Flattened columns enable all SQL |
| Indexing | Limited | Full | Flattened columns support clustering |
Code Examples
-- Example 1: Create table with VARIANT column
CREATE TABLE raw_json_data (
id INTEGER,
data VARIANT,
load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Example 2: Ingest JSON data
COPY INTO raw_json_data (id, data)
FROM @json_stage/
FILE_FORMAT = (
TYPE = JSON
STRIP_OUTER_ARRAY = FALSE
IGNORE_UTF8_ERRORS = TRUE
);
-- Example 3: Query JSON data with dot notation
SELECT
data:order_id::STRING as order_id,
data:customer.name::STRING as customer_name,
data:customer.email::STRING as customer_email,
data:total::NUMBER(10,2) as order_total,
data:currency::STRING as currency
FROM raw_json_data
WHERE data:customer.name::STRING LIKE '%Smith%';
-- Example 4: Flatten JSON arrays
SELECT
data:order_id::STRING as order_id,
f.value:product_id::STRING as product_id,
f.value:name::STRING as product_name,
f.value:quantity::INTEGER as quantity,
f.value:price::NUMBER(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => data:items) f;
-- Example 5: Aggregate flattened data back to arrays
SELECT
data:order_id::STRING as order_id,
ARRAY_AGG(
OBJECT_CONSTRUCT(
'product_id', f.value:product_id::STRING,
'name', f.value:name::STRING,
'quantity', f.value:quantity::INTEGER
)
) as items_array
FROM raw_json_data,
LATERAL FLATTEN(input => data:items) f
GROUP BY 1;
-- Example 6: Create structured table from semi-structured data
CREATE TABLE orders_structured AS
SELECT
data:order_id::STRING as order_id,
data:customer.name::STRING as customer_name,
data:customer.email::STRING as customer_email,
data:total::NUMBER(10,2) as order_total,
data:currency::STRING as currency,
data:timestamp::TIMESTAMP_NTZ as order_timestamp,
COUNT(f.value) as item_count
FROM raw_json_data,
LATERAL FLATTEN(input => data:items) f
GROUP BY 1, 2, 3, 4, 5, 6;
-- Example 7: XML processing
CREATE TABLE xml_data (
id INTEGER,
data VARIANT
);
COPY INTO xml_data
FROM @xml_stage/
FILE_FORMAT = (TYPE = XML);
-- Query XML data
SELECT
XMLGET(data, 'order_id'):t::STRING as order_id,
XMLGET(data, 'customer'):t:name:t::STRING as customer_name,
XMLGET(data, 'total'):t::NUMBER(10,2) as total
FROM xml_data;
-- Example 8: Advanced array manipulation
SELECT
data:order_id::STRING as order_id,
ARRAY_CONSTRUCT('item1', 'item2', 'item3') as static_array,
ARRAY_CAT(
data:tags::ARRAY,
ARRAY_CONSTRUCT('new_tag')
) as extended_tags,
ARRAY_INTERSECTION(
data:tags::ARRAY,
ARRAY_CONSTRUCT('tag1', 'tag3')
) as matching_tags
FROM raw_json_data
WHERE ARRAY_SIZE(data:tags::ARRAY) > 0;
-- Example 9: Object manipulation
SELECT
data:order_id::STRING as order_id,
OBJECT_CONSTRUCT(
'order_id', data:order_id::STRING,
'total', data:total::NUMBER,
'item_count', ARRAY_SIZE(data:items::ARRAY)
) as order_summary,
OBJECT_KEYS(data:customer::OBJECT) as customer_fields
FROM raw_json_data;
-- Example 10: Schema evolution handling
-- Add new field to JSON without schema change
INSERT INTO raw_json_data (id, data)
SELECT
1001,
PARSE_JSON('{
"order_id": "ORD-NEW-001",
"customer": {"name": "New Customer"},
"new_field": "This field didn''t exist before",
"nested": {"level1": {"level2": "deep value"}}
}');
-- Query handles missing fields gracefully
SELECT
data:order_id::STRING,
data:new_field::STRING,
data:nonexistent::STRING, -- Returns NULL
data:nested.level1.level2::STRING
FROM raw_json_data;
Performance Metrics
| Operation | Time Complexity | Typical Latency | Optimization |
|---|---|---|---|
| JSON Parse | O(n) where n = document size | 1-10ms per document | Use STRIP_OUTER_ARRAY |
| Path Resolution | O(log m) where m = depth | 0.1-1ms | Use explicit paths |
| FLATTEN | O(k) where k = array size | 1-100ms | Limit array size |
| XMLGET | O(d) where d = document depth | 1-5ms | Cache path results |
| OBJECT_CONSTRUCT | O(p) where p = properties | 0.5-2ms | Minimize nesting |
| File Format | Ingestion Speed | Query Performance | Storage Efficiency |
|---|---|---|---|
| JSON | Fast | Medium | Medium |
| Parquet | Very Fast | Fast | High |
| Avro | Fast | Medium | High |
| XML | Slow | Slow | Medium |
| CSV | Very Fast | Fast (if structured) | Low |
Best Practices
-
Extract frequently queried fields: For high-performance analytics, extract commonly accessed JSON fields into dedicated columns using materialized views or ETL processes.
-
Use appropriate file formats: Prefer Parquet over JSON for analytics workloads due to built-in compression, schema information, and predicate pushdown.
-
Limit array sizes: Avoid extremely large arrays in VARIANT columns. Use FLATTEN with LIMIT for sampling and consider breaking large arrays into multiple rows.
-
Implement schema validation: Use JSON Schema validation in ingestion pipelines to ensure data quality before loading into VARIANT columns.
-
Use TRY_ functions: Prefer TRY_PARSE_JSON over PARSE_JSON to handle malformed data gracefully without failing the entire load.
-
Optimize FLATTEN usage: Use FLATTEN only when necessary and consider materializing flattened results for repeated queries.
-
Leverage path statistics: Query performance improves with consistent path usage. Avoid dynamic paths that vary per document.
-
Consider data modeling: For frequently joined or aggregated data, normalize semi-structured data into relational tables.
-
Monitor VARIANT column sizes: Large VARIANT documents can impact query performance. Consider splitting very large documents into multiple columns or tables.
-
Use appropriate compression: Enable compression on VARIANT columns to reduce storage costs and improve I/O performance.
See Also
- Snowpipe Ingestion β JSON/XML ingestion patterns
- Optimization Techniques β Materialized views for VARIANT data
- Snowflake Architecture β Micro-partition storage format
- Stored Procedures β Processing semi-structured data
- PySpark Iceberg Tables β Semi-structured data with Iceberg
- Delta Lake Fundamentals β Delta Lake semi-structured support