Data Formats — JSON, Parquet, Avro, and ORC

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Choosing the right data format directly impacts storage costs, query performance, and pipeline reliability.

Overview

Format Comparison

Feature	JSON	Parquet	Avro	ORC
Structure	Row-oriented, text	Column-oriented, binary	Row-oriented, binary	Column-oriented, binary
Schema	Embedded or external	Embedded (footer)	Embedded in file	Embedded (footer)
Schema Evolution	Difficult	Supported	Excellent	Supported
Compression	Low (text-based)	Excellent (columnar)	Good	Excellent (columnar)
Read Performance	Slow (parse text)	Fast (column projection)	Fast (sequential)	Fast (column projection)
Write Performance	Fast	Medium	Fast	Medium
Splittable	Line-delimited only	Yes	Yes	Yes
Query Engine Support	Universal	Spark, Presto, BigQuery	Spark, Flink	Hive, Spark
Use Case	APIs, logs, config	Analytics, data warehouses	Streaming, data lakes	Hive-based analytics
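The row-versus-column distinction in the table can be made concrete with a small sketch. It uses only the standard library: JSON text and zlib stand in for the real binary encodings, and the records and field names are invented for illustration. It is not how Parquet or ORC encode data, only a way to see why grouping values by column removes repeated keys and places similar values next to each other.

```python
import json
import zlib

# Synthetic records; the field names are illustrative only.
records = [
    {"order_id": i, "status": "shipped" if i % 3 else "pending", "amount": float(i % 50)}
    for i in range(1000)
]

# Row-oriented: one JSON object per line, so every key is repeated per record.
row_bytes = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Column-oriented: each field's values are stored contiguously under one key.
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes = json.dumps(columns).encode("utf-8")

row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

print(f"row layout:    {len(row_bytes):>7} raw, {len(row_compressed):>6} compressed")
print(f"column layout: {len(col_bytes):>7} raw, {len(col_compressed):>6} compressed")
```

The exact numbers depend on the data, so treat the printed sizes as an observation about this sample rather than a benchmark.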

JSON (JavaScript Object Notation)

Structure

{
  "order_id": "ORD-001",
  "customer": {
    "id": 1001,
    "name": "John Doe",
    "email": "john@example.com"
  },
  "items": [
    {"product": "Widget A", "qty": 2, "price": 29.99},
    {"product": "Widget B", "qty": 1, "price": 49.99}
  ],
  "total": 109.97,
  "created_at": "2024-01-15T10:30:00Z"
}
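Nested objects like `customer` above usually have to be flattened before the data fits a tabular or columnar layout. The following is a minimal sketch of one way to do that with the standard library; the `flatten` helper and the dotted key naming are assumptions made for this example, not a standard API.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys.

    Lists are kept as-is; exploding them into rows is a separate decision.
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

order = json.loads("""
{
  "order_id": "ORD-001",
  "customer": {"id": 1001, "name": "John Doe"},
  "items": [{"product": "Widget A", "qty": 2, "price": 29.99}],
  "total": 59.98
}
""")

flat_order = flatten(order)
print(sorted(flat_order))
```

If pandas is available, `pd.json_normalize` covers the same need with more options.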

Newline-Delimited JSON (NDJSON)

import json
import pandas as pd

# Reading NDJSON (one JSON object per line)
df = pd.read_json('events.ndjson', lines=True)

# Writing NDJSON
df.to_json('output.ndjson', orient='records', lines=True)
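Because every NDJSON line is a complete JSON document, a file can also be processed one record at a time without loading it fully into memory. The sketch below shows that idea using only the standard library; `iter_ndjson` is a hypothetical helper name and the sample events are invented.

```python
import io
import json

def iter_ndjson(lines):
    """Yield one parsed record per non-empty line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}: {exc}") from exc

# An in-memory stand-in for an open file handle.
sample = io.StringIO(
    '{"event": "click", "user_id": 1}\n'
    '\n'
    '{"event": "view", "user_id": 2}\n'
)

events = list(iter_ndjson(sample))
print(events)
```

The same generator works on a real file by passing the object returned by `open(path, encoding="utf-8")`.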

JSON Pros and Cons

Pros	Cons
Human-readable	Verbose (large file sizes)
Universal support	No columnar pruning (must parse entire file)
Self-describing schema	Schema evolution is difficult
Great for APIs and streaming	No built-in compression

When to Use JSON

Scenario	Recommendation
API responses	Use JSON — universal format
Configuration files	Use JSON or YAML
Log ingestion	Use NDJSON for streaming
Analytics queries	Convert to Parquet
Data lake storage	Convert to Parquet or Avro

Apache Parquet

Structure

Reading and Writing Parquet

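import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Sample data (illustrative) so the snippets below have a DataFrame to work with
df = pd.DataFrame({
    'order_id': ['ORD-001', 'ORD-002'],
    'amount': [99.99, 149.99],
    'year': [2024, 2024],
    'month': [1, 2],
})

# Pandas: Write Parquet
df.to_parquet('data.parquet', engine='pyarrow', compression='snappy')

# Pandas: Read Parquet (only the listed columns are loaded)
df_subset = pd.read_parquet('data.parquet', columns=['order_id', 'amount'])

# PyArrow: Write with partitioning
# (the partition columns must exist in the table; the S3 path needs credentials)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='s3://bucket/data/',
    partition_cols=['year', 'month']
)

# PyArrow: Read specific partitions
dataset = pq.ParquetDataset(
    's3://bucket/data/',
    filters=[('year', '=', 2024)]
)
df_2024 = dataset.read().to_pandas()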
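Partitioned writes like the one above lay files out in Hive-style `column=value` directories, and a partition filter only has to open the directories that match. The sketch below imitates that layout with plain path strings; `partition_path` and `matches` are hypothetical helpers written for this example, not part of PyArrow.

```python
from pathlib import PurePosixPath

def partition_path(root, record, partition_cols):
    """Build a Hive-style directory path such as root/year=2024/month=1."""
    path = PurePosixPath(root)
    for col in partition_cols:
        path = path / f"{col}={record[col]}"
    return str(path)

def matches(path, column, value):
    """Return True if the path contains the partition segment column=value."""
    return f"{column}={value}" in PurePosixPath(path).parts

records = [
    {"order_id": "ORD-001", "year": 2023, "month": 12},
    {"order_id": "ORD-002", "year": 2024, "month": 1},
    {"order_id": "ORD-003", "year": 2024, "month": 2},
]

paths = [partition_path("data", r, ["year", "month"]) for r in records]

# Partition pruning: only directories for year=2024 need to be read.
selected = [p for p in paths if matches(p, "year", 2024)]
print(selected)
```

Choosing partition columns with low to moderate cardinality keeps the number of directories and small files manageable.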

Parquet Compression

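Codec	Typical Compression Ratio	Speed	Splittable (inside Parquet)	Best For
Snappy	Medium (2-3x)	Very fast	Yes	Default for most workloads
Gzip	High (3-5x)	Slow	Yes	Archival storage
Zstd	High (3-5x)	Fast	Yes	Best overall (modern default)
LZ4	Low (2x)	Very fast	Yes	Speed-critical pipelines
Brotli	Very high (4-6x)	Slow	Yes	Maximum compression

Ratios are rough and depend on the data. Parquet compresses each column page separately, so a file stays splittable at row-group boundaries with any codec. Gzip is unsplittable only when it wraps a whole text file such as CSV or JSON.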

Parquet Column Pruning

import pandas as pd
import pyarrow.parquet as pq

# Reads only 2 of 100 columns, skipping the other column chunks on disk
df = pd.read_parquet(
    'large_file.parquet',
    columns=['order_id', 'amount']
)

# PyArrow: Filtered read (predicate pushdown skips row groups that cannot match)
table = pq.read_table(
    'data.parquet',
    columns=['order_id', 'amount'],
    filters=[('amount', '>', 1000)]
)

Apache Avro

Structure

{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "customer_id", "type": "int"},
    {"name": "status", "type": ["null", "string"], "default": null}
  ]
}

Reading and Writing Avro

import fastavro
import pandas as pd

# Schema definition
schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "customer_id", "type": "int"},
        {"name": "created_at", "type": "string"}
    ]
}

# Write Avro
records = [
    {"order_id": "ORD-001", "amount": 99.99, "customer_id": 1001, "created_at": "2024-01-15"},
    {"order_id": "ORD-002", "amount": 149.99, "customer_id": 1002, "created_at": "2024-01-15"}
]

with open('orders.avro', 'wb') as f:
    fastavro.writer(f, schema, records)

# Read Avro
with open('orders.avro', 'rb') as f:
    reader = fastavro.reader(f)
    records = list(reader)
    df = pd.DataFrame(records)

Avro Schema Evolution

// Version 1 schema
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}

// Version 2 schema (added field with default)
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}

Avro Pros and Cons

Pros	Cons
Excellent schema evolution	Not human-readable
Compact binary encoding	Requires schema registry for best results
Built-in schema with every file	Column projection requires reading full row
Great for streaming (Kafka)	Slower analytics queries vs Parquet

Apache ORC

Structure

Reading and Writing ORC

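import pandas as pd

# Pandas: ORC I/O is backed by pyarrow; DataFrame.to_orc requires pandas 1.5+
# df.to_orc('data.orc')
# df = pd.read_orc('data.orc', columns=['order_id', 'amount'])
# Note: pandas covers basic reads and writes; use PySpark for large or distributed datasets

# PySpark: Write ORC
# df_spark.write.orc("output.orc", compression="zstd")

# PySpark: Read ORC
# df_spark = spark.read.orc("data.orc")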

ORC vs Parquet

Factor	ORC	Parquet
Ecosystem	Hive-centric	Universal (Spark, Presto, BigQuery)
Compression	Excellent (with Zlib)	Excellent (with Zstd)
Index	Built-in bloom filters, min/max	Row-group and page min/max statistics, optional bloom filters
ACID transactions	Yes (Hive 3+)	No (Delta Lake adds this)
Best for	Hive-based warehouses	Multi-engine analytics

Format Selection Guide

Decision Matrix

If you need...	Use this format
Human-readable data exchange	JSON
Efficient columnar analytics	Parquet
Schema evolution in streaming	Avro
Hive-native data lake with ACID	ORC
Maximum compatibility across engines	Parquet
Kafka topic storage	Avro + Schema Registry
Data lake with partitioning	Parquet (hive-partitioned)
Archival with maximum compression	Parquet + Zstd

Schema Evolution Best Practices

Practice	Rationale
Add fields with defaults	New fields must have defaults for backward compatibility
Never remove required fields	Breaking change — use nullability instead
Use a Schema Registry	Central schema management (Confluent Schema Registry for Kafka)
Version your schemas	Include schema version in metadata
Test reader/writer compatibility	Ensure old readers can read new data and vice versa
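The first and last practices can be partly automated with a check that runs before a new schema is deployed. The following is a deliberately simplified sketch in plain Python: `backward_compatibility_issues` is a hypothetical helper, it only compares record fields by name and type, and it ignores Avro features such as type promotion and aliases. A schema registry performs a more complete version of this check.

```python
def backward_compatibility_issues(old_schema, new_schema):
    """List reasons a reader using new_schema could fail on data written with old_schema.

    Simplified: compares top-level record fields only and flags any type change,
    even ones that Avro's promotion rules would allow.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    issues = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old data has no value for this field, so the reader needs a default.
            if "default" not in field:
                issues.append(f"new field '{name}' has no default")
        elif field["type"] != old_fields[name]["type"]:
            issues.append(f"field '{name}' changed type")
    return issues

v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
v2_ok = {"type": "record", "name": "Order", "fields": v1["fields"] + [
    {"name": "currency", "type": "string", "default": "USD"},
]}
v2_bad = {"type": "record", "name": "Order", "fields": v1["fields"] + [
    {"name": "currency", "type": "string"},
]}

print(backward_compatibility_issues(v1, v2_ok))
print(backward_compatibility_issues(v1, v2_bad))
```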

Summary Takeaways

JSON is universal but slow for analytics: use it for APIs and configuration, and convert to Parquet for analytical storage.
Parquet is the default for data lakes: columnar storage enables column projection and predicate pushdown, which can cut I/O substantially on wide tables.
Avro excels at schema evolution: the writer schema is embedded in every file, making it well suited to streaming pipelines and data lakes.
ORC is Hive-centric: use Parquet unless your stack is exclusively Hive-based.
Compression matters: Zstd is the modern default and Snappy is fast. Gzip compresses well but is slow, and a Gzip-compressed text file (CSV, JSON) cannot be split, whereas Parquet files stay splittable with any codec.
Partition your Parquet files: partition by date or business key to avoid scanning irrelevant data.
Use Schema Registry for streaming: it ensures producers and consumers agree on the data format.
Column pruning saves I/O: reading 2 columns from a 100-column Parquet file reads roughly 2% of the data when the columns are similar in size.

Practice Exercises

Format comparison: Convert a 1GB CSV file to Parquet, Avro, and ORC. Compare file sizes, write times, and read times.
Parquet partitioning: Write a script that partitions a dataset by year/month into Parquet files and queries specific partitions efficiently.
Avro schema evolution: Create an Avro file with schema v1, then read it with schema v2 (new field with default). Verify backward compatibility.
Column pruning: Profile the I/O savings of reading 3 columns from a 50-column Parquet file vs reading all columns.
Compression benchmarking: Write the same dataset using Snappy, Zstd, and Gzip. Compare file sizes and decompression speed.
