Data Systems
Data Lake Architecture
Data lakes store raw data at any scale for diverse analytics. Master the architecture of modern data platforms: storage layers, file formats, schema evolution, and the emerging lakehouse paradigm.
- Raw Storage — Store data as-is, schema-on-read
- Scalability — Petabytes of data on cheap object storage
- Flexibility — Support for structured, semi-structured, and unstructured data
Data lakes turn data hoarding into data-driven decisions.
Data Lake Fundamentals
Definition: Data Lake
A data lake is a centralized repository that stores raw data in its native format at any scale. Unlike data warehouses that require predefined schemas, data lakes support schema-on-read, allowing diverse data types (structured, semi-structured, unstructured) to be stored and processed flexibly.
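Schema-on-read can be illustrated with a minimal sketch. The sample records, the `CLICK_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names chosen for illustration, not part of any particular data lake product. The raw lines are stored untouched, and types and defaults are applied only when a reader asks for them.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user_id": "42", "event_type": "click", "value": "1.5"}',
    '{"user_id": "43", "event_type": "view"}',
    '{"user_id": "44", "event_type": "click", "value": "2.0", "extra": "ignored"}',
]

# Schema chosen by the reader: field name -> (converter, default if missing).
CLICK_SCHEMA = {
    "user_id": (int, None),
    "event_type": (str, None),
    "value": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            raw = record.get(field)
            # Convert present values; fall back to the default otherwise.
            row[field] = convert(raw) if raw is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, CLICK_SCHEMA)
```

A different consumer could read the same raw lines with a different schema, which is the flexibility a warehouse's schema-on-write gives up in exchange for stricter guarantees.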
Data Lake vs Data Warehouse vs Lakehouse
Storage Layers
| Layer | Purpose | Technologies |
|---|---|---|
| Raw/Bronze | Original data, immutable | S3, ADLS, GCS |
| Cleansed/Silver | Validated, deduplicated | Delta Lake, Iceberg |
| Curated/Gold | Aggregated, business-ready | Materialized views |
The medallion architecture (Bronze → Silver → Gold) is a common pattern for organizing data lake layers. Each layer adds quality and transformation while maintaining the ability to trace back to raw data.
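The Bronze → Silver → Gold flow can be sketched in a few lines of plain Python. This is a toy illustration under assumed data: the order records and the `to_silver` and `to_gold` helpers are hypothetical, and a real pipeline would typically run on an engine such as Spark over table formats like those listed above.

```python
from collections import defaultdict

# Bronze: raw records as ingested, never modified (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "DE", "amount": "10.0"},
    {"order_id": "1", "country": "DE", "amount": "10.0"},  # duplicate delivery
    {"order_id": "2", "country": "US", "amount": "oops"},  # invalid amount
    {"order_id": "3", "country": "US", "amount": "5.5"},
    {"order_id": "4", "country": "DE", "amount": "4.5"},
]


def to_silver(records):
    """Validate types and drop duplicates, keeping the first copy of each order."""
    seen = set()
    silver = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # reject rows that fail validation
        if rec["order_id"] in seen:
            continue  # deduplicate on the business key
        seen.add(rec["order_id"])
        silver.append(
            {"order_id": int(rec["order_id"]), "country": rec["country"], "amount": amount}
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleansed rows into a business-ready revenue-per-country view."""
    totals = defaultdict(float)
    for rec in silver_rows:
        totals[rec["country"]] += rec["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because `bronze` is left unchanged, the Silver and Gold layers can be rebuilt from it at any time, which is what makes tracing back to raw data possible.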
File Formats
| Format | Compression | Schema Evolution | Splittable | Use Case |
|---|---|---|---|---|
| CSV | Low | No | Yes | Simple data exchange |
| JSON | Moderate | Yes | Only as line-delimited JSON (JSON Lines) | Semi-structured data |
| Parquet | High | Limited | Yes | Analytics (columnar) |
| ORC | High | Yes | Yes | Hive ecosystem |
| Avro | High | Yes | Yes | Row-based, streaming |
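The difference between the row-based and columnar formats in the table can be sketched with in-memory data structures. This is only an analogy under assumed sample data: real Parquet or ORC files add row groups, encodings, and metadata, and the `to_columnar` helper is a hypothetical name.

```python
# Row layout: each record keeps all of its fields together (as in CSV or Avro).
rows = [
    {"user_id": 1, "event_type": "click", "value": 0.5},
    {"user_id": 2, "event_type": "view", "value": 1.0},
    {"user_id": 3, "event_type": "click", "value": 2.5},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


# Column layout: all values of one field are stored together (as in Parquet or ORC).
columns = to_columnar(rows)

# Summing one field in the row layout walks every record, including unused fields.
row_total = sum(record["value"] for record in rows)

# In the column layout the same query touches only the "value" column.
col_total = sum(columns["value"])
```

Both totals are identical; what changes is how much data has to be read to compute them.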
Columnar Compression Advantage
$$S_{\text{columnar}} = \frac{S_{\text{row}}}{r}$$

Here,
- $S_{\text{columnar}}$ = Storage with columnar format
- $S_{\text{row}}$ = Storage with row format
- $r$ = Compression ratio for columnar data
Parquet Compression for Analytics
For a 1TB CSV file with columns: user_id (INT), event_type (STRING), timestamp (TIMESTAMP), value (FLOAT):
- Row format (CSV): 1TB
- Columnar format (Parquet): ~100GB (10x compression)
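Columns with low cardinality (for example, event_type with 10 distinct values) compress especially well with dictionary encoding. Columnar layout also enables column pruning: in a wider table, an analytics query that touches only 3 of 20 columns reads only those 3 columns from storage.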
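Dictionary encoding can be sketched as follows. This is a simplified illustration, not Parquet's actual on-disk encoding: the `dictionary_encode` and `dictionary_decode` helpers are hypothetical, and the size estimate assumes one byte per code, which only holds for fewer than 256 distinct values.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary of distinct values."""
    dictionary = []  # distinct values in order of first appearance
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# A low-cardinality column: 6,000 values but only 3 distinct strings (sample data).
event_types = ["click", "view", "purchase", "click", "click", "view"] * 1000

dictionary, codes = dictionary_encode(event_types)

# Size if every string is stored in full.
raw_bytes = sum(len(value.encode("utf-8")) for value in event_types)

# Size if each value becomes a 1-byte code plus one copy of each distinct string
# (assumption: fewer than 256 distinct values, so a code fits in one byte).
encoded_bytes = len(codes) + sum(len(value.encode("utf-8")) for value in dictionary)
```

The fewer distinct values a column has relative to its length, the larger the gap between `raw_bytes` and `encoded_bytes`, which is why low-cardinality columns benefit most.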
The Modern Data Stack
Data Governance
| Aspect | Description |
|---|---|
| Cataloging | Track what data exists and where |
| Lineage | Track data transformations and provenance |
| Quality | Validate data meets quality standards |
| Access Control | Restrict access based on roles |
| Compliance | GDPR, CCPA, HIPAA requirements |
A data lake without governance becomes a data swamp. Raw data without cataloging, quality checks, and access controls is worse than no data—it creates false confidence in bad data.
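Three of the aspects in the table (cataloging, quality, access control) can be sketched together in a small example. Everything here is hypothetical: the `CATALOG` structure, the role names, and the `quality_errors` and `read_as` helpers stand in for what a real catalog and policy engine would provide.

```python
# Cataloging: a minimal registry of what each dataset is and which columns are sensitive.
CATALOG = {
    "silver.orders": {
        "owner": "data-platform",
        "pii_columns": {"email"},
        "required_columns": {"order_id", "email", "amount"},
    }
}

# Access control: which roles may see PII in clear text (assumed roles).
ROLE_CAN_SEE_PII = {"analyst": False, "privacy_officer": True}


def quality_errors(dataset, rows):
    """Quality: return (row index, missing columns) for rows lacking required values."""
    required = CATALOG[dataset]["required_columns"]
    errors = []
    for i, row in enumerate(rows):
        missing = sorted(col for col in required if row.get(col) is None)
        if missing:
            errors.append((i, missing))
    return errors


def read_as(role, dataset, rows):
    """Access control: mask PII columns unless the role is explicitly allowed to see them."""
    if ROLE_CAN_SEE_PII.get(role, False):
        return [dict(row) for row in rows]
    pii = CATALOG[dataset]["pii_columns"]
    return [{k: ("***" if k in pii else v) for k, v in row.items()} for row in rows]


orders = [
    {"order_id": 1, "email": "a@example.com", "amount": 9.99},
    {"order_id": 2, "email": None, "amount": 5.0},
]

errors = quality_errors("silver.orders", orders)
analyst_view = read_as("analyst", "silver.orders", orders)
officer_view = read_as("privacy_officer", "silver.orders", orders)
```

Unknown roles fall back to the masked view, so access is denied by default rather than granted by default.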
Practice Exercises
- Architecture Design: Design a data lake architecture for an e-commerce company that ingests data from PostgreSQL, Kafka, and third-party APIs. What storage formats would you use at each layer?
- Format Selection: Compare Parquet and ORC for a Spark-based analytics workload. What are the trade-offs in terms of compression, schema evolution, and ecosystem support?
- Governance Plan: Design a data governance plan for a data lake containing PII (personally identifiable information). What cataloging, access control, and compliance measures would you implement?
- Cost Estimation: Estimate the monthly storage cost for a data lake with 100TB of raw data, 20TB of processed data, and 5TB of materialized views on AWS S3.
Key Takeaways:
- Data lakes store raw data in native format with schema-on-read
- The medallion architecture (Bronze → Silver → Gold) organizes data layers
- Columnar formats (Parquet, ORC) can shrink analytics data by roughly 10x relative to CSV and let queries read only the columns they need
- The modern data stack separates ingestion, storage, processing, and consumption
- Data governance prevents data swamps through cataloging, quality, and access control
- The lakehouse paradigm combines data lake flexibility with warehouse reliability
What to Learn Next
- Batch Processing: MapReduce, Spark, and distributed batch processing.
- Stream Processing: Real-time data processing with Flink, Spark Streaming, and Kafka Streams.
- Kafka Deep Dive: Event streaming, partitioning, and exactly-once semantics.
- Choosing the Right Database: Systematic framework for database selection.
- Event-Driven Architecture: Event sourcing, CQRS, and message-driven systems.
- Observability: Logging, metrics, tracing, and monitoring distributed systems.