πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Snowflake Unstructured Data Handling

🟒 Free Lesson

Advertisement

Snowflake Unstructured Data Handling

Snowflake extends beyond traditional structured data by providing robust support for unstructured and semi-structured data, enabling organizations to store, query, and analyze diverse data types.

Four-Layer Architecture: Unstructured DataCloud StorageS3/GCS/AzureStage LayerInternal/ExternalProcessingCOPY/SnowpipeStorageVARIANT/ColumnQueryAnalyticsImages/VideoPUT + URL referenceJSON/XMLCOPY INTO VARIANTParquet/ORCSchema evolutionDocumentsPDF/DOC metadata
File Format PipelineJSONParquetAvroORCXMLSemi-structured formats automatically parsed into VARIANT columnsUse COPY INTO with FILE_FORMAT for efficient bulk loading

Architecture Overview

The Snowflake unstructured data architecture flows through four layers:


Layer 1: Cloud Storage

  • Stores raw files in S3, Azure Blob, or Google Cloud Storage
  • Supports images, videos, documents, audio, JSON, and XML
  • Files remain in native cloud format until staged

Layer 2: Snowflake Stage Layer

  • Internal, External, Named, User, or Table stages
  • Stages files for efficient loading into Snowflake
  • Supports compression and encryption

Layer 3: Processing Layer

  • COPY INTO for bulk loading
  • Snowpipe for continuous ingestion
  • Streams and Tasks for real-time processing

Layer 4: Storage Layer

  • Data lands in VARIANT columns for semi-structured data
  • Semi-structured tables for JSON/Parquet data
  • Hybrid Tables for structured + unstructured references

Key Concepts


What is a VARIANT Data Type?

  • Schema-on-read data type for semi-structured data
  • Stores JSON, Avro, ORC, Parquet natively in Snowflake
  • No need to define schema before loading
  • Supports nested structures and arrays

What is a File Format?

  • Named object that defines file structure
  • Specifies type (JSON, Parquet, CSV, etc.)
  • Controls compression, encoding, and error handling
  • Reusable across multiple loading operations

File Format Creation

Creating a JSON File Format

CREATE OR REPLACE FILE FORMAT json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
  TRIM_SPACE = TRUE
  IGNORE_UTF8_ERRORS = FALSE,
  DATE_FORMAT = 'YYYY-MM-DD'
  COMMENT = 'Standard JSON format for API responses';

Creating a Parquet File Format

CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = 'PARQUET'
  TRIM_SPACE = TRUE
  REPLACE_INVALID_CHARACTERS = TRUE
  COMMENT = 'Parquet format for columnar data';

File Format Options Comparison

FormatCompressionNestingSchemaUse Case
JSONGZIP/SNAPPYDeepSchema-on-readAPIs, logs
AVRODEFLATENestedSchema-in-fileStreaming
ORCZLIBNestedSchema-in-fileHadoop ecosystem
PARQUETSNAPPYNestedSchema-in-fileAnalytics

Stage Operations

Internal Stage Types

-- User stage (~/)
PUT file://data.json @~/staged/ AUTO_COMPRESS=TRUE;

-- Table stage
PUT file://data.csv @%my_table/;

-- Named stage
CREATE STAGE my_named_stage
  FILE_FORMAT = json_format
  COMMENT = 'Named stage for daily loads';

PUT file://data.json @my_named_stage/ OVERWRITE=TRUE;

External Stage (S3 Example)

CREATE OR REPLACE STAGE s3_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = json_format
  COMMENT = 'S3 bucket for raw data';

LIST @s3_stage;

Loading Unstructured Data

COPY INTO with Variant

CREATE TABLE raw_json_data (
  id INT AUTOINCREMENT PRIMARY KEY,
  raw_data VARIANT,
  file_name VARCHAR,
  load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

COPY INTO raw_json_data (raw_data, file_name)
FROM @s3_stage
FILE_FORMAT = json_format
ON_ERROR = 'CONTINUE'
FORCE = FALSE;

Querying Semi-Structured Data

SELECT
  raw_data:id::STRING as user_id,
  raw_data:name::STRING as user_name,
  raw_data:email::STRING as email,
  raw_data:address.city::STRING as city,
  raw_data:preferences.theme::STRING as theme,
  load_timestamp
FROM raw_json_data
WHERE raw_data:active::BOOLEAN = TRUE;

Flattening Nested Arrays

SELECT
  raw_data:id::INT as user_id,
  f.value:product::STRING as product_name,
  f.value:quantity::INT as quantity,
  f.value:price::DECIMAL(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => raw_data:orders) f;

Directory Table Pattern

-- Create table with directory table
CREATE TABLE documents (
  id INT AUTOINCREMENT PRIMARY KEY,
  document_name VARCHAR,
  metadata VARIANT
) DIRECTORY = (ENABLE = TRUE);

-- Refresh directory table
ALTER TABLE documents REFRESH;

-- Query directory table metadata
SELECT
  directory:relativePath as file_path,
  directory:path as file_path_full,
  directory:size as file_size,
  directory:lastModified as last_modified
FROM (TABLE(INFORMATION_SCHEMA.DIRECTORY_TABLE) WHERE table_name = 'DOCUMENTS');

Data Quality for Unstructured Data


Why Validate Unstructured Data?

  • Prevents bad data from entering your warehouse
  • Ensures downstream analytics are accurate
  • Reduces debugging time for data pipelines
  • Supports compliance and data governance requirements

Key Validation Options

  • SKIP_BLANK_LINES β€” Ignores empty lines in files
  • ERROR_ON_COLUMN_COUNT_MISMATCH β€” Fails on schema mismatches
  • TRIM_SPACE β€” Removes leading/trailing whitespace
  • IGNORE_UTF8_ERRORS β€” Handles encoding issues gracefully

Key Takeaway: Always validate unstructured data before loading. Use file format options and custom error handling to ensure data quality.

-- Create a validation query
SELECT
  $1:id::STRING IS NOT NULL as has_id,
  $1:name::STRING IS NOT NULL as has_name,
  METADATA$FILENAME as file_name,
  COUNT(*) as record_count
FROM @s3_stage
FILE_FORMAT = json_format
GROUP BY has_id, has_name, file_name;

Hybrid Tables for Unstructured References

CREATE HYBRID TABLE product_images (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  image_url VARCHAR(500),
  thumbnail_url VARCHAR(500),
  metadata VARIANT,
  created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Insert with metadata
INSERT INTO product_images (product_id, product_name, image_url, metadata)
VALUES
  (1, 'Widget A', 's3://images/widget_a.jpg', '{"format":"JPEG","size":1024,"dimensions":{"width":800,"height":600}}');

Summary

What You Learned

  1. Snowflake stores unstructured data in cloud storage via stages
  2. VARIANT column type enables schema-on-read for semi-structured data
  3. FLATTEN function is essential for querying nested arrays
  4. Directory tables provide automatic metadata tracking
  5. Hybrid tables combine structured and unstructured data capabilities

Best Practices

  1. Use named file formats for consistency across loading operations
  2. Validate data before loading to prevent pipeline failures
  3. Leverage FLATTEN for nested array queries
  4. Use directory tables for automatic metadata management
  5. Consider Hybrid Tables for mixed structured/unstructured use cases
⭐

Premium Content

Snowflake Unstructured Data Handling

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Snowflake Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement