
Data Lake Architecture

Data lakes store raw data at any scale for diverse analytics. Master the architecture of modern data platforms: storage layers, file formats, schema evolution, and the emerging lakehouse paradigm.

Raw Storage — Store data as-is, schema-on-read
Scalability — Petabytes of data on cheap object storage
Flexibility — Support for structured, semi-structured, and unstructured data

Data lakes turn data hoarding into data-driven decisions.

Data Lake Fundamentals

Definition: Data Lake

A data lake is a centralized repository that stores raw data in its native format at any scale. Unlike data warehouses that require predefined schemas, data lakes support schema-on-read, allowing diverse data types (structured, semi-structured, unstructured) to be stored and processed flexibly.
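Schema-on-read can be illustrated with a minimal sketch. The field names, the `READ_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical, chosen only for illustration: the raw records are stored untouched, and each reader decides which fields and types it needs at query time.

```python
import json

# Raw events stored exactly as they arrived (one JSON document per line).
# The records do not share a fixed set of fields.
RAW_LINES = [
    '{"user_id": 1, "event_type": "click", "value": "3.5"}',
    '{"user_id": 2, "event_type": "view"}',
    '{"user_id": 3, "event_type": "click", "value": "1.5", "referrer": "ad"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
READ_SCHEMA = {
    "user_id": (int, None),
    "event_type": (str, "unknown"),
    "value": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw lines and project them onto the schema chosen by the reader."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            raw = record.get(field)
            # Missing fields fall back to the default; extra fields are ignored.
            row[field] = convert(raw) if raw is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
total_value = sum(row["value"] for row in rows)
```

A different consumer could read the same raw lines with another schema (for example, one that keeps `referrer`) without rewriting the stored data.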

Data Lake vs Data Warehouse vs Lakehouse

Storage Layers

Layer	Purpose	Technologies
Raw/Bronze	Original data, immutable	S3, ADLS, GCS
Cleansed/Silver	Validated, deduplicated	Delta Lake, Iceberg
Curated/Gold	Aggregated, business-ready	Materialized views

The medallion architecture (Bronze → Silver → Gold) is a common pattern for organizing data lake layers. Each layer adds quality and transformation while maintaining the ability to trace back to raw data.
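The flow between layers can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production pipeline: the order records, the `to_silver` and `to_gold` helpers, and the `bronze_index` lineage field are all hypothetical.

```python
from collections import defaultdict

# Bronze: raw records as ingested, never modified.
# The sample deliberately includes a duplicate and an invalid row.
bronze = [
    {"order_id": "A1", "customer": "alice", "amount": "20.00"},
    {"order_id": "A1", "customer": "alice", "amount": "20.00"},  # duplicate
    {"order_id": "A2", "customer": "bob", "amount": "not-a-number"},  # invalid
    {"order_id": "A3", "customer": "alice", "amount": "5.50"},
]


def to_silver(records):
    """Validate and deduplicate, keeping a pointer back to the bronze row."""
    seen = set()
    silver = []
    for index, record in enumerate(records):
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        if record["order_id"] in seen:
            continue  # drop duplicates, keeping the first occurrence
        seen.add(record["order_id"])
        silver.append({
            "order_id": record["order_id"],
            "customer": record["customer"],
            "amount": amount,
            "bronze_index": index,  # lineage back to the raw record
        })
    return silver


def to_gold(silver_rows):
    """Aggregate to a business-ready view: total spend per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because bronze is never mutated, the silver and gold layers can be rebuilt at any time if the validation or aggregation rules change.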

File Formats

Format	Compression	Schema Evolution	Splittable	Use Case
CSV	Low	No	Yes	Simple data exchange
JSON	Moderate	Yes	Only as line-delimited JSON (JSON Lines)	Semi-structured data
Parquet	High	Limited	Yes	Analytics (columnar)
ORC	High	Yes	Yes	Hive ecosystem
Avro	High	Yes	Yes	Row-based, streaming
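The "Splittable" column describes whether independent workers can each process a byte range of one file. A minimal sketch of the idea for a line-delimited text file follows; the `read_split` helper and the sample data are hypothetical, and real engines implement this inside their input readers. Each split owns exactly the lines that start inside its byte range, so every line is processed once.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the complete lines that start inside the byte range [start, end)."""
    pos = start
    if start > 0:
        # A line that began before `start` belongs to the previous split,
        # so move to the first line start at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # The last owned line may run past `end`; it is still read in full.
        lines.append(data[pos:stop].decode("utf-8"))
        pos = stop + 1
    return lines


data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
split_size = 8
splits = [
    read_split(data, offset, min(offset + split_size, len(data)))
    for offset in range(0, len(data), split_size)
]
all_lines = [line for part in splits for line in part]
```

A single JSON document that spans many lines cannot be cut this way, because a reader starting mid-file has no reliable record boundary to synchronize on.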

Columnar Compression Advantage

$$S_{columnar} = \frac{S_{row}}{R_{compression}}, \quad \text{where } R_{compression} \gg 1 \text{ for homogeneous columns}$$

Here,

$S_{columnar}$ = storage with columnar format
$S_{row}$ = storage with row format
$R_{compression}$ = compression ratio for columnar data

Parquet Compression for Analytics

For a 1TB CSV file with columns: user_id (INT), event_type (STRING), timestamp (TIMESTAMP), value (FLOAT):

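Row format (CSV): 1 TB
Columnar format (Parquet): ~100 GB (roughly 10x compression; the actual ratio depends on the data and the codec)

Low-cardinality columns (such as event_type with about 10 distinct values) compress especially well with dictionary encoding. A columnar layout also enables column pruning: in a wider table, a query that needs only 3 of 20 columns reads just those 3 columns from storage instead of every full row.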
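The two ideas above, dictionary encoding and reading only the needed columns, can be sketched in a few lines. This is a conceptual illustration, not Parquet's actual on-disk encoding; the `dictionary_encode`, `dictionary_decode`, and `scan` helpers and the sample table are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the lookup table and the codes."""
    return [dictionary[code] for code in codes]


def scan(table, columns):
    """Column pruning: touch only the requested columns."""
    return {name: table[name] for name in columns}


# Columnar layout: one list per column instead of one record per row.
table = {
    "user_id": [1, 2, 3, 4, 5, 6],
    "event_type": ["click", "view", "click", "click", "purchase", "view"],
    "value": [0.5, 1.0, 0.25, 2.0, 9.75, 1.5],
}

dictionary, codes = dictionary_encode(table["event_type"])
pruned = scan(table, ["event_type", "value"])
```

Each distinct string is stored once and every row keeps only a small integer, which is why columns with few distinct values shrink so much. Real formats typically pack those integers further, for example with run-length or bit-packing schemes.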

The Modern Data Stack

Data Governance

Aspect	Description
Cataloging	Track what data exists and where
Lineage	Track data transformations and provenance
Quality	Validate data meets quality standards
Access Control	Restrict access based on roles
Compliance	GDPR, CCPA, HIPAA requirements

A data lake without governance becomes a data swamp. Raw data without cataloging, quality checks, and access controls can be worse than no data, because it creates false confidence in bad data.
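As a rough sketch of how several of these aspects fit together, the snippet below models a catalog entry with lineage, a role check, and a simple quality report. Everything here is hypothetical (the `CatalogEntry` fields, the bucket path, the role names); real deployments would rely on a dedicated catalog and policy engine instead of in-process objects.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    name: str
    location: str
    owner: str
    contains_pii: bool
    allowed_roles: set = field(default_factory=set)
    upstream: list = field(default_factory=list)  # lineage: source datasets


def can_read(entry, role):
    """Access control: only roles listed on the entry may read it."""
    return role in entry.allowed_roles


def quality_report(rows, required_fields):
    """Quality: count rows that are missing any required field."""
    failed = sum(
        1 for row in rows
        if any(row.get(name) is None for name in required_fields)
    )
    return {"rows": len(rows), "failed": failed}


orders_silver = CatalogEntry(
    name="orders_silver",
    location="s3://example-lake/silver/orders/",  # hypothetical path
    owner="data-platform",
    contains_pii=True,
    allowed_roles={"analyst_pii", "data_engineer"},
    upstream=["orders_bronze"],
)

sample_rows = [
    {"order_id": "A1", "customer": "alice"},
    {"order_id": "A2", "customer": None},
]
report = quality_report(sample_rows, ["order_id", "customer"])
```

Keeping `contains_pii` and `upstream` on the catalog entry makes it possible to answer compliance questions (which datasets hold personal data, and where it came from) without inspecting the data itself.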

Practice Exercises

Architecture Design: Design a data lake architecture for an e-commerce company that ingests data from PostgreSQL, Kafka, and third-party APIs. What storage formats would you use at each layer?
Format Selection: Compare Parquet and ORC for a Spark-based analytics workload. What are the trade-offs in terms of compression, schema evolution, and ecosystem support?
Governance Plan: Design a data governance plan for a data lake containing PII (personally identifiable information). What cataloging, access control, and compliance measures would you implement?
Cost Estimation: Estimate the monthly storage cost for a data lake with 100TB of raw data, 20TB of processed data, and 5TB of materialized views on AWS S3.
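For the cost estimation exercise, the arithmetic can be set up as below. The price of 0.023 USD per GB-month is a placeholder assumption, not a quoted AWS rate, and the sketch ignores tiered pricing, storage classes, request charges, and data transfer. Check the current S3 price list for your region before relying on any figure.

```python
def monthly_storage_cost(sizes_tb, price_per_gb_month, gb_per_tb=1000):
    """Return (total GB, monthly cost) for a flat per-GB storage price."""
    total_gb = sum(sizes_tb.values()) * gb_per_tb
    return total_gb, total_gb * price_per_gb_month


# Data volumes taken from the exercise statement.
sizes_tb = {"raw": 100, "processed": 20, "materialized_views": 5}

# ASSUMPTION: placeholder flat price; replace with the real rate for your region.
ASSUMED_PRICE_PER_GB_MONTH = 0.023

total_gb, cost = monthly_storage_cost(sizes_tb, ASSUMED_PRICE_PER_GB_MONTH)
```

With these assumptions, 125 TB comes to 125,000 GB and about 2,875 USD per month. Moving older raw data to a cheaper storage class would change the result, which is part of what the exercise asks you to reason about.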

Key Takeaways:

Data lakes store raw data in native format with schema-on-read
The medallion architecture (Bronze → Silver → Gold) organizes data layers
Columnar formats (Parquet, ORC) can compress analytics data by roughly 10x compared with raw CSV, depending on the data
The modern data stack separates ingestion, storage, processing, and consumption
Data governance prevents data swamps through cataloging, quality, and access control
The lakehouse paradigm combines data lake flexibility with warehouse reliability

What to Learn Next

-> Batch Processing: MapReduce, Spark, and distributed batch processing.

-> Stream Processing: Real-time data processing with Flink, Spark Streaming, and Kafka Streams.

-> Kafka Deep Dive: Event streaming, partitioning, and exactly-once semantics.

-> Choosing the Right Database: Systematic framework for database selection.

-> Event-Driven Architecture: Event sourcing, CQRS, and message-driven systems.

-> Observability: Logging, metrics, tracing, and monitoring distributed systems.
