Snowflake Unstructured Data Handling
Snowflake extends beyond traditional structured data by providing robust support for unstructured and semi-structured data, enabling organizations to store, query, and analyze diverse data types.
Architecture Overview
The Snowflake unstructured data architecture flows through four layers:
Layer 1: Cloud Storage
- Stores raw files in S3, Azure Blob, or Google Cloud Storage
- Supports images, videos, documents, audio, JSON, and XML
- Files remain in native cloud format until staged
Layer 2: Snowflake Stage Layer
- Internal, External, Named, User, or Table stages
- Stages files for efficient loading into Snowflake
- Supports compression and encryption
Layer 3: Processing Layer
- COPY INTO for bulk loading
- Snowpipe for continuous ingestion
- Streams and Tasks for real-time processing
Layer 4: Storage Layer
- Data lands in VARIANT columns for semi-structured data
- Semi-structured tables for JSON/Parquet data
- Hybrid Tables for structured + unstructured references
Key Concepts
What is a VARIANT Data Type?
- Schema-on-read data type for semi-structured data
- Stores JSON, Avro, ORC, Parquet natively in Snowflake
- No need to define schema before loading
- Supports nested structures and arrays
What is a File Format?
- Named object that defines file structure
- Specifies type (JSON, Parquet, CSV, etc.)
- Controls compression, encoding, and error handling
- Reusable across multiple loading operations
File Format Creation
Creating a JSON File Format
CREATE OR REPLACE FILE FORMAT json_format
TYPE = 'JSON'
STRIP_OUTER_ARRAY = TRUE
TRIM_SPACE = TRUE
IGNORE_UTF8_ERRORS = FALSE,
DATE_FORMAT = 'YYYY-MM-DD'
COMMENT = 'Standard JSON format for API responses';
Creating a Parquet File Format
CREATE OR REPLACE FILE FORMAT parquet_format
TYPE = 'PARQUET'
TRIM_SPACE = TRUE
REPLACE_INVALID_CHARACTERS = TRUE
COMMENT = 'Parquet format for columnar data';
File Format Options Comparison
| Format | Compression | Nesting | Schema | Use Case |
|---|---|---|---|---|
| JSON | GZIP/SNAPPY | Deep | Schema-on-read | APIs, logs |
| AVRO | DEFLATE | Nested | Schema-in-file | Streaming |
| ORC | ZLIB | Nested | Schema-in-file | Hadoop ecosystem |
| PARQUET | SNAPPY | Nested | Schema-in-file | Analytics |
Stage Operations
Internal Stage Types
-- User stage (~/)
PUT file://data.json @~/staged/ AUTO_COMPRESS=TRUE;
-- Table stage
PUT file://data.csv @%my_table/;
-- Named stage
CREATE STAGE my_named_stage
FILE_FORMAT = json_format
COMMENT = 'Named stage for daily loads';
PUT file://data.json @my_named_stage/ OVERWRITE=TRUE;
External Stage (S3 Example)
CREATE OR REPLACE STAGE s3_stage
URL = 's3://my-bucket/data/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = json_format
COMMENT = 'S3 bucket for raw data';
LIST @s3_stage;
Loading Unstructured Data
COPY INTO with Variant
CREATE TABLE raw_json_data (
id INT AUTOINCREMENT PRIMARY KEY,
raw_data VARIANT,
file_name VARCHAR,
load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
COPY INTO raw_json_data (raw_data, file_name)
FROM @s3_stage
FILE_FORMAT = json_format
ON_ERROR = 'CONTINUE'
FORCE = FALSE;
Querying Semi-Structured Data
SELECT
raw_data:id::STRING as user_id,
raw_data:name::STRING as user_name,
raw_data:email::STRING as email,
raw_data:address.city::STRING as city,
raw_data:preferences.theme::STRING as theme,
load_timestamp
FROM raw_json_data
WHERE raw_data:active::BOOLEAN = TRUE;
Flattening Nested Arrays
SELECT
raw_data:id::INT as user_id,
f.value:product::STRING as product_name,
f.value:quantity::INT as quantity,
f.value:price::DECIMAL(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => raw_data:orders) f;
Directory Table Pattern
-- Create table with directory table
CREATE TABLE documents (
id INT AUTOINCREMENT PRIMARY KEY,
document_name VARCHAR,
metadata VARIANT
) DIRECTORY = (ENABLE = TRUE);
-- Refresh directory table
ALTER TABLE documents REFRESH;
-- Query directory table metadata
SELECT
directory:relativePath as file_path,
directory:path as file_path_full,
directory:size as file_size,
directory:lastModified as last_modified
FROM (TABLE(INFORMATION_SCHEMA.DIRECTORY_TABLE) WHERE table_name = 'DOCUMENTS');
Data Quality for Unstructured Data
Why Validate Unstructured Data?
- Prevents bad data from entering your warehouse
- Ensures downstream analytics are accurate
- Reduces debugging time for data pipelines
- Supports compliance and data governance requirements
Key Validation Options
SKIP_BLANK_LINESβ Ignores empty lines in filesERROR_ON_COLUMN_COUNT_MISMATCHβ Fails on schema mismatchesTRIM_SPACEβ Removes leading/trailing whitespaceIGNORE_UTF8_ERRORSβ Handles encoding issues gracefully
Key Takeaway: Always validate unstructured data before loading. Use file format options and custom error handling to ensure data quality.
-- Create a validation query
SELECT
$1:id::STRING IS NOT NULL as has_id,
$1:name::STRING IS NOT NULL as has_name,
METADATA$FILENAME as file_name,
COUNT(*) as record_count
FROM @s3_stage
FILE_FORMAT = json_format
GROUP BY has_id, has_name, file_name;
Hybrid Tables for Unstructured References
CREATE HYBRID TABLE product_images (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
image_url VARCHAR(500),
thumbnail_url VARCHAR(500),
metadata VARIANT,
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Insert with metadata
INSERT INTO product_images (product_id, product_name, image_url, metadata)
VALUES
(1, 'Widget A', 's3://images/widget_a.jpg', '{"format":"JPEG","size":1024,"dimensions":{"width":800,"height":600}}');
Summary
What You Learned
- Snowflake stores unstructured data in cloud storage via stages
- VARIANT column type enables schema-on-read for semi-structured data
- FLATTEN function is essential for querying nested arrays
- Directory tables provide automatic metadata tracking
- Hybrid tables combine structured and unstructured data capabilities
Best Practices
- Use named file formats for consistency across loading operations
- Validate data before loading to prevent pipeline failures
- Leverage FLATTEN for nested array queries
- Use directory tables for automatic metadata management
- Consider Hybrid Tables for mixed structured/unstructured use cases