Data Engineering FoundationsData Engineering Fundamentals🟢 Free Lesson
Advertisement
Data Formats — JSON, Parquet, Avro, and ORC
Data engineers build and maintain the infrastructure that powers data pipelines, warehouses, and analytics systems. Choosing the right data format directly impacts storage costs, query performance, and pipeline reliability.
import pandas as pd
# Pandas: Write ORC (requires pyorc or fastorc)
# Note: pandas ORC support is limited; use PySpark for full support
# PySpark: Write ORC
# df_spark.write.orc("output.orc", compression="zstd")
# PySpark: Read ORC
# df_spark = spark.read.orc("data.orc")
ORC vs Parquet
Factor
ORC
Parquet
Ecosystem
Hive-centric
Universal (Spark, Presto, BigQuery)
Compression
Excellent (with Zlib)
Excellent (with Zstd)
Index
Built-in bloom filters, min/max
Footer-level statistics
ACID transactions
Yes (Hive 3+)
No (Delta Lake adds this)
Best for
Hive-based warehouses
Multi-engine analytics
Format Selection Guide
Decision Matrix
If you need...
Use this format
Human-readable data exchange
JSON
Efficient columnar analytics
Parquet
Schema evolution in streaming
Avro
Hive-native data lake with ACID
ORC
Maximum compatibility across engines
Parquet
Kafka topic storage
Avro + Schema Registry
Data lake with partitioning
Parquet (hive-partitioned)
Archival with maximum compression
Parquet + Zstd
Schema Evolution Best Practices
Practice
Rationale
Add fields with defaults
New fields must have defaults for backward compatibility
Never remove required fields
Breaking change — use nullability instead
Use a Schema Registry
Central schema management (Confluent Schema Registry for Kafka)
Version your schemas
Include schema version in metadata
Test reader/writer compatibility
Ensure old readers can read new data and vice versa
MathSummary Takeaways
JSON is universal but slow for analytics — use it for APIs and configuration; convert to Parquet for storage.
Parquet is the default for data lakes — columnar storage enables column projection and predicate pushdown, reducing I/O by orders of magnitude.
Avro excels at schema evolution — the schema is embedded in every file, making it ideal for streaming pipelines and data lakes.
ORC is Hive-centric — use Parquet unless your stack is exclusively Hive-based.
Compression matters — Zstd is the modern default; Snappy is fast; Gzip is high-compression but not splittable.
Partition your Parquet files — partition by date or business key to avoid scanning irrelevant data.
Use Schema Registry for streaming — ensures producers and consumers agree on data format.
Column pruning saves I/O — reading 2 columns from a 100-column Parquet file reads ~2% of the data.