Data Formats — JSON, Parquet, Avro, and ORC

Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Choosing the right data format directly impacts storage costs, query performance, and pipeline reliability.

[Figure: Data Format Comparison Chart, summarizing JSON, Parquet, Avro, and ORC by structure, schema location, compression, read performance, and best use]

Overview

Format Comparison

| Feature | JSON | Parquet | Avro | ORC |
| --- | --- | --- | --- | --- |
| Structure | Row-oriented, text | Column-oriented, binary | Row-oriented, binary | Column-oriented, binary |
| Schema | Embedded or external | Embedded (footer) | Embedded in file | Embedded (footer) |
| Schema Evolution | Difficult | Supported | Excellent | Supported |
| Compression | Low (text-based) | Excellent (columnar) | Good | Excellent (columnar) |
| Read Performance | Slow (parse text) | Fast (column projection) | Fast (sequential) | Fast (column projection) |
| Write Performance | Fast | Medium | Fast | Medium |
| Splittable | Line-delimited only | Yes | Yes | Yes |
| Query Engine Support | Universal | Spark, Presto, BigQuery | Spark, Flink | Hive, Spark |
| Use Case | APIs, logs, config | Analytics, data warehouses | Streaming, data lakes | Hive-based analytics |
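
To make the row-oriented versus column-oriented distinction in the table concrete, the sketch below stores the same records both ways in plain Python and counts how many values a reader has to touch to project two fields. It is a toy model only: the sample records and the `values_touched_*` helper names are invented for illustration, and real Parquet or ORC files add encoding, compression, and metadata that this ignores.

```python
# Toy model of row-oriented vs column-oriented storage (illustrative only).

rows = [
    {"order_id": "ORD-001", "amount": 99.99, "customer_id": 1001, "status": "paid"},
    {"order_id": "ORD-002", "amount": 149.99, "customer_id": 1002, "status": "open"},
    {"order_id": "ORD-003", "amount": 15.00, "customer_id": 1001, "status": "paid"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_layout(records, wanted):
    """A row reader walks every field of every record, even unwanted ones."""
    return sum(len(record) for record in records)


def values_touched_column_layout(columns, wanted):
    """A columnar reader only walks the requested columns."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(rows)
wanted = ["order_id", "amount"]

row_cost = values_touched_row_layout(rows, wanted)
col_cost = values_touched_column_layout(columns, wanted)
print(f"row layout touches {row_cost} values, column layout touches {col_cost}")
```

The gap widens as tables get wider: the row layout cost grows with the total number of fields, while the columnar cost grows only with the number of requested fields.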

JSON (JavaScript Object Notation)

Structure

{
  "order_id": "ORD-001",
  "customer": {
    "id": 1001,
    "name": "John Doe",
    "email": "john@example.com"
  },
  "items": [
    {"product": "Widget A", "qty": 2, "price": 29.99},
    {"product": "Widget B", "qty": 1, "price": 49.99}
  ],
  "total": 109.97,
  "created_at": "2024-01-15T10:30:00Z"
}

Newline-Delimited JSON (NDJSON)

import pandas as pd

# Reading NDJSON (one JSON object per line)
df = pd.read_json('events.ndjson', lines=True)

# Writing NDJSON
df.to_json('output.ndjson', orient='records', lines=True)
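
For files too large to load into a DataFrame, NDJSON can also be processed one line at a time with the standard library. The following is a minimal sketch under that assumption; the `write_ndjson` and `iter_ndjson` helpers and the sample events are hypothetical names made up for this example, and an in-memory buffer stands in for a real file.

```python
import io
import json


def write_ndjson(records, stream):
    """Write each record as one compact JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, separators=(",", ":")) + "\n")


def iter_ndjson(stream):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


events = [
    {"event": "click", "user_id": 1001},
    {"event": "purchase", "user_id": 1002, "total": 109.97},
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_ndjson(events, buffer)
buffer.seek(0)

parsed = list(iter_ndjson(buffer))
print(parsed)
```

Because every line is a complete JSON document, a reader can start at any line boundary, which is what makes line-delimited JSON splittable while a single large JSON array is not.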

JSON Pros and Cons

| Pros | Cons |
| --- | --- |
| Human-readable | Verbose (large file sizes) |
| Universal support | No columnar pruning (must parse entire file) |
| Self-describing schema | Schema evolution is difficult |
| Great for APIs and streaming | No built-in compression |

When to Use JSON

| Scenario | Recommendation |
| --- | --- |
| API responses | Use JSON (universal format) |
| Configuration files | Use JSON or YAML |
| Log ingestion | Use NDJSON for streaming |
| Analytics queries | Convert to Parquet |
| Data lake storage | Convert to Parquet or Avro |

Apache Parquet

Structure

Reading and Writing Parquet

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Pandas: Write Parquet (assumes df is an existing DataFrame)
df.to_parquet('data.parquet', engine='pyarrow', compression='snappy')

# Pandas: Read Parquet
df = pd.read_parquet('data.parquet', columns=['order_id', 'amount'])

# PyArrow: Write with partitioning
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='s3://bucket/data/',
    partition_cols=['year', 'month']
)

# PyArrow: Read specific partitions
dataset = pq.ParquetDataset(
    's3://bucket/data/',
    filters=[('year', '=', 2024)]
)
df = dataset.read().to_pandas()
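
Partitioned datasets like the one above are laid out as Hive-style `column=value` directories, and a partition filter works by choosing directories rather than by opening files. The sketch below imitates that idea with the standard library only. The helper names (`partition_path`, `parse_partition`, `prune`) and the local `data` root are assumptions for illustration, not part of the PyArrow API.

```python
from pathlib import PurePosixPath


def partition_path(root, record, partition_cols):
    """Build a Hive-style directory such as root/year=2024/month=1."""
    path = PurePosixPath(root)
    for col in partition_cols:
        path = path / f"{col}={record[col]}"
    return str(path)


def parse_partition(path):
    """Recover partition values from the key=value segments of a path."""
    values = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(paths, column, wanted):
    """Keep only paths whose partition value matches, without opening any file."""
    return [p for p in paths if parse_partition(p).get(column) == str(wanted)]


records = [
    {"order_id": "ORD-001", "year": 2023, "month": 12},
    {"order_id": "ORD-002", "year": 2024, "month": 1},
    {"order_id": "ORD-003", "year": 2024, "month": 2},
]
paths = sorted({partition_path("data", r, ["year", "month"]) for r in records})
selected = prune(paths, "year", 2024)
print(selected)
```

Partition values are stored in the path as text, so the comparison here is string-based; real engines infer or are told the partition column types.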

Parquet Compression

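| Codec | Compression Ratio (typical) | Speed | Splittable (inside Parquet) | Best For |
| --- | --- | --- | --- | --- |
| Snappy | Medium (2-3x) | Very fast | Yes | Default for most workloads |
| Gzip | High (3-5x) | Slow | Yes | Archival storage |
| Zstd | High (3-5x) | Fast | Yes | Best overall (modern default) |
| LZ4 | Low (2x) | Very fast | Yes | Speed-critical pipelines |
| Brotli | Very high (4-6x) | Slow | Yes | Maximum compression |

Note: Parquet compresses each column page separately, so a Parquet file stays splittable by row group whichever codec is used. Gzip is only non-splittable when it wraps a whole text file such as CSV or JSON.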
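
The codecs in the table need third-party libraries, but the ratio-versus-speed trade-off can be observed with the standard library alone. The sketch below uses `zlib` (the DEFLATE algorithm that Gzip is built on) at three compression levels on synthetic data. The dataset is hypothetical and highly repetitive, so the ratios it prints are not representative of the figures in the table or of real workloads.

```python
import json
import time
import zlib

# Synthetic, repetitive data (hypothetical); real datasets compress differently.
records = [
    {"order_id": f"ORD-{i:06d}", "status": "paid" if i % 3 else "open", "amount": i % 100}
    for i in range(20_000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")


def measure(data, level):
    """Return (compressed size in bytes, compression time in seconds) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(compressed) == data  # round-trip must be lossless
    return len(compressed), elapsed


results = {level: measure(raw, level) for level in (1, 6, 9)}
for level, (size, seconds) in results.items():
    print(f"level {level}: ratio {len(raw) / size:.1f}x in {seconds * 1000:.1f} ms")
```

Timings vary by machine, so treat the output as a way to compare levels against each other rather than as a benchmark.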

Parquet Column Pruning

# Only reads 2 of 100 columns — massive I/O savings
df = pd.read_parquet(
    'large_file.parquet',
    columns=['order_id', 'amount']
)

# PyArrow: Filtered read (predicate pushdown)
import pyarrow.parquet as pq

table = pq.read_table(
    'data.parquet',
    columns=['order_id', 'amount'],
    filters=[('amount', '>', 1000)]
)
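
Predicate pushdown relies on per-chunk statistics: if the stored minimum and maximum of a chunk rule out a match, the reader skips the chunk without decoding it. The following toy model shows that skipping logic in plain Python. The `build_row_groups` and `scan_greater_than` helpers, the group size, and the sample amounts are all assumptions made for this sketch, not the way PyArrow implements it.

```python
# Toy model of row-group skipping using min/max statistics (illustrative only).

def build_row_groups(values, group_size):
    """Split a column into chunks and record min/max for each chunk."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value here can match, so skip it
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


amounts = [120, 80, 300, 450, 950, 700, 1200, 1500, 990, 2500, 40, 60]
groups = build_row_groups(amounts, group_size=3)
matches, groups_read = scan_greater_than(groups, threshold=1000)
print(f"read {groups_read} of {len(groups)} row groups, matches: {matches}")
```

Skipping is most effective when data is sorted or clustered on the filtered column, because each chunk then covers a narrow value range.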

Apache Avro

Structure

{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "customer_id", "type": "int"},
    {"name": "status", "type": ["null", "string"], "default": null}
  ]
}

Reading and Writing Avro

import fastavro
import pandas as pd

# Schema definition
schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "customer_id", "type": "int"},
        {"name": "created_at", "type": "string"}
    ]
}

# Write Avro
records = [
    {"order_id": "ORD-001", "amount": 99.99, "customer_id": 1001, "created_at": "2024-01-15"},
    {"order_id": "ORD-002", "amount": 149.99, "customer_id": 1002, "created_at": "2024-01-15"}
]

with open('orders.avro', 'wb') as f:
    fastavro.writer(f, schema, records)

# Read Avro
with open('orders.avro', 'rb') as f:
    reader = fastavro.reader(f)
    records = list(reader)
    df = pd.DataFrame(records)

Avro Schema Evolution

// Version 1 schema
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}

// Version 2 schema (added field with default)
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}
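
The two schemas above are compatible because a reader using version 2 can fill in `currency` from its default when the writer never stored it. The sketch below imitates that reader-side resolution for flat records in plain Python. It is a simplification under stated assumptions: the `resolve` helper is hypothetical, and aliases, unions, type promotion, and nested records from the full Avro resolution rules are not modelled.

```python
# Simplified sketch of reader-side schema resolution for flat records.
# Assumption: only top-level fields are handled; this is not the full Avro algorithm.

writer_schema_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

reader_schema_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field unknown to the writer
        else:
            raise ValueError(f"field {name!r} has no default and was not written")
    return resolved  # writer-only fields are dropped


old_record = {"order_id": "ORD-001", "amount": 99.99}
new_view = resolve(old_record, writer_schema_v1, reader_schema_v2)
print(new_view)
```

Reading in the other direction also works in this sketch: a version 1 reader given version 2 data simply drops `currency`.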

Avro Pros and Cons

| Pros | Cons |
| --- | --- |
| Excellent schema evolution | Not human-readable |
| Compact binary encoding | Requires schema registry for best results |
| Built-in schema with every file | Column projection requires reading full row |
| Great for streaming (Kafka) | Slower analytics queries vs Parquet |
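
Part of what makes Avro's binary encoding compact is that `int` and `long` values are written as zig-zag, variable-length integers, so small numbers take one or two bytes. The sketch below implements that encoding for a single value as an illustration. The `encode_long` and `decode_long` names are hypothetical, and a real Avro library should be used for actual files.

```python
def encode_long(n):
    """Encode a signed 64-bit integer as zig-zag + variable-length bytes."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a 64-bit signed integer")
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data):
    """Inverse of encode_long for a single encoded value."""
    z, shift = 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return (z >> 1) ^ -(z & 1)


customer_id = 1001
encoded = encode_long(customer_id)
print(f"{customer_id} as JSON text: {len(str(customer_id))} bytes, as Avro long: {len(encoded)} bytes")
```

Field names are also absent from each encoded record, since the schema already fixes the field order; that is a second source of savings compared with JSON.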

Apache ORC

Structure

Reading and Writing ORC

import pandas as pd

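# Pandas: ORC I/O is provided through pyarrow
# (pd.read_orc since pandas 1.0, DataFrame.to_orc since pandas 1.5)
# df.to_orc('data.orc')
# df = pd.read_orc('data.orc', columns=['order_id', 'amount'])
# Note: for Hive integration or large datasets, PySpark is the more common choice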

# PySpark: Write ORC
# df_spark.write.orc("output.orc", compression="zstd")

# PySpark: Read ORC
# df_spark = spark.read.orc("data.orc")

ORC vs Parquet

| Factor | ORC | Parquet |
| --- | --- | --- |
| Ecosystem | Hive-centric | Universal (Spark, Presto, BigQuery) |
| Compression | Excellent (with Zlib) | Excellent (with Zstd) |
| Index | Built-in min/max indexes, optional bloom filters | Row-group and page min/max statistics, optional bloom filters |
| ACID transactions | Yes (Hive 3+) | No (Delta Lake adds this) |
| Best for | Hive-based warehouses | Multi-engine analytics |

Format Selection Guide

Decision Matrix

| If you need... | Use this format |
| --- | --- |
| Human-readable data exchange | JSON |
| Efficient columnar analytics | Parquet |
| Schema evolution in streaming | Avro |
| Hive-native data lake with ACID | ORC |
| Maximum compatibility across engines | Parquet |
| Kafka topic storage | Avro + Schema Registry |
| Data lake with partitioning | Parquet (hive-partitioned) |
| Archival with maximum compression | Parquet + Zstd |

Schema Evolution Best Practices

| Practice | Rationale |
| --- | --- |
| Add fields with defaults | New fields must have defaults for backward compatibility |
| Never remove required fields | Breaking change; use nullability instead |
| Use a Schema Registry | Central schema management (Confluent Schema Registry for Kafka) |
| Version your schemas | Include schema version in metadata |
| Test reader/writer compatibility | Ensure old readers can read new data and vice versa |
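
The first and last practices in the table can be automated with a small check that runs before a new schema is deployed. The sketch below compares two Avro-style schema dictionaries and reports changes that would stop a reader on the new schema from reading old data. The function name `backward_compatibility_issues` is hypothetical, and the check is deliberately conservative: it flags every type change, although Avro itself permits some promotions such as `int` to `long`.

```python
def backward_compatibility_issues(old_schema, new_schema):
    """List reasons a reader using new_schema could fail on data written with old_schema."""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    issues = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            if "default" not in field:
                issues.append(f"new field {name!r} has no default")
        elif old_fields[name]["type"] != field["type"]:
            # Conservative: real Avro allows certain type promotions.
            issues.append(f"field {name!r} changed type")
    return issues


v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
v2_good = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
]}
v2_bad = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "string"},
    {"name": "currency", "type": "string"},
]}

print(backward_compatibility_issues(v1, v2_good))
print(backward_compatibility_issues(v1, v2_bad))
```

A schema registry performs a more complete version of this check when a new schema version is registered.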

Summary Takeaways

  1. JSON is universal but slow for analytics: use it for APIs and configuration, and convert to Parquet for storage.
  2. Parquet is the default for data lakes: columnar storage enables column projection and predicate pushdown, which can cut I/O dramatically on wide tables.
  3. Avro excels at schema evolution: the schema is embedded in every file, making it ideal for streaming pipelines and data lakes.
  4. ORC is Hive-centric: use Parquet unless your stack is exclusively Hive-based.
  5. Compression matters: Zstd is the modern default, Snappy is fast, and Gzip compresses well but is slow. Inside Parquet every codec stays splittable; Gzip is only non-splittable when applied to a whole text file such as CSV or JSON.
  6. Partition your Parquet files: partition by date or business key to avoid scanning irrelevant data.
  7. Use Schema Registry for streaming: it ensures producers and consumers agree on data format.
  8. Column pruning saves I/O: reading 2 columns from a 100-column Parquet file reads roughly 2% of the data when the columns are similar in size.

Practice Exercises

  1. Format comparison: Convert a 1GB CSV file to Parquet, Avro, and ORC. Compare file sizes, write times, and read times.

  2. Parquet partitioning: Write a script that partitions a dataset by year/month into Parquet files and queries specific partitions efficiently.

  3. Avro schema evolution: Create an Avro file with schema v1, then read it with schema v2 (new field with default). Verify backward compatibility.

  4. Column pruning: Profile the I/O savings of reading 3 columns from a 50-column Parquet file vs reading all columns.

  5. Compression benchmarking: Write the same dataset using Snappy, Zstd, and Gzip. Compare file sizes and decompression speed.
