Data Lake Architecture: ADLS Gen2 Zone-Based Design
Enterprise data lake design with zone-based architecture, governance, and performance optimization
Zone-Based Data Lake Architecture
ADLS GEN2 ACCOUNT

RAW ZONE (Landing)
  /raw/
  ├── source_system_a/
  │   └── YYYY/MM/DD/
  │       └── *.parquet (original format)
  ├── source_system_b/
  └── source_system_c/

  Retention: 90 days → Archive tier
  Format: Original (CSV, JSON, Parquet)
  Immutability: Write-once, read-many

CURATED ZONE (Analytics-Ready)
  /curated/
  ├── dimensions/
  │   ├── dim_customers/ (Delta)
  │   ├── dim_products/ (Delta)
  │   └── dim_dates/ (Delta)
  ├── facts/
  │   ├── fact_sales/ (Delta, partitioned)
  │   └── fact_inventory/ (Delta)
  └── aggregates/
      └── daily_sales_summary/ (Delta)

  Format: Delta Lake (ACID transactions)
  Schema: Star/snowflake schema
  Partitioning: By query patterns (date, region)

SANDBOX ZONE (Exploration)
  /sandbox/
  ├── user_a/
  ├── user_b/
  └── experiments/

  Retention: 30 days auto-cleanup
  Access: Data scientists, analysts

ARCHIVE ZONE (Compliance)
  /archive/
  ├── 2023/
  └── 2022/

  Tier: Archive access tier
  Retention: 7 years (compliance)
  Access: RESTRICTED (audit only)
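The raw-zone layout above (`/raw/<source system>/YYYY/MM/DD/<file>`) can be sketched as a small path helper. This is a minimal illustration, not part of any SDK: the function name `raw_landing_path` and the sample file name are assumptions introduced here.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_landing_path(source_system: str, load_date: date, file_name: str) -> str:
    """Build a raw-zone path of the form raw/<source>/YYYY/MM/DD/<file>."""
    if not source_system or "/" in source_system:
        raise ValueError("source_system must be a single non-empty path segment")
    path = PurePosixPath(
        "raw",
        source_system,
        f"{load_date:%Y}",  # four-digit year
        f"{load_date:%m}",  # zero-padded month
        f"{load_date:%d}",  # zero-padded day
        file_name,
    )
    return str(path)


print(raw_landing_path("source_system_a", date(2023, 7, 4), "orders.parquet"))
# raw/source_system_a/2023/07/04/orders.parquet
```

Zero-padded date segments keep directories in chronological order when listed, which also makes date-range reads by prefix straightforward.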
Lifecycle Management Policy
{
  "rules": [
    {
      "enabled": true,
      "name": "RawZoneLifecycle",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            },
            "delete": {
              "daysAfterModificationGreaterThan": 365
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["datalake/raw/"]
        }
      }
    },
    {
      "enabled": true,
      "name": "SandboxCleanup",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "delete": {
              "daysAfterModificationGreaterThan": 30
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["datalake/sandbox/"]
        }
      }
    }
  ]
}
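To sanity-check the thresholds in this policy, the rules can be simulated locally. The sketch below is a simplification, not how Azure evaluates policies internally. It assumes the zones are directories inside a container named `datalake` (the file system name used in the SDK example in this document), because a `prefixMatch` value must start with a container name. The names `POLICY`, `ACTION_PRIORITY`, and `expected_action` are introduced here for illustration.

```python
from typing import Optional

# Illustrative copy of the two rules, reduced to a prefix and day thresholds.
POLICY = {
    "RawZoneLifecycle": {
        "prefix": "datalake/raw/",
        "actions": {"tierToCool": 30, "tierToArchive": 90, "delete": 365},
    },
    "SandboxCleanup": {
        "prefix": "datalake/sandbox/",
        "actions": {"delete": 30},
    },
}

# When several actions qualify, the least expensive one is applied:
# delete is cheaper than tierToArchive, which is cheaper than tierToCool.
ACTION_PRIORITY = ["delete", "tierToArchive", "tierToCool"]


def expected_action(blob_path: str, days_since_modified: int) -> Optional[str]:
    """Return the action the simplified policy would apply, or None."""
    for rule in POLICY.values():
        if not blob_path.startswith(rule["prefix"]):
            continue
        for action in ACTION_PRIORITY:
            threshold = rule["actions"].get(action)
            # "GreaterThan" means the age must strictly exceed the threshold.
            if threshold is not None and days_since_modified > threshold:
                return action
    return None


for age in (10, 45, 120, 400):
    print(age, expected_action("datalake/raw/source_system_a/2023/01/01/a.parquet", age))
```

Paths under `curated/` and `archive/` match no rule here, so the policy leaves them untouched.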
Data Lake Security Architecture
LAYER 1: NETWORK SECURITY
  • Private Endpoints (no public access)
  • NSG rules on compute subnets
  • VNet Service Endpoints
  • Firewall rules (IP whitelisting)

LAYER 2: IDENTITY & ACCESS
  • Azure AD authentication
  • Managed Identities for services
  • RBAC at Storage Account/Container level
  • POSIX ACLs for fine-grained access (directory and file level)
  • Azure AD Groups for role management

LAYER 3: DATA PROTECTION
  • Encryption at rest (Microsoft-managed keys)
  • Encryption in transit (TLS 1.2)
  • Customer-managed keys (CMK) in Key Vault
  • Soft delete (recovery from accidental deletion)
  • Versioning (point-in-time recovery)

LAYER 4: MONITORING & AUDIT
  • Diagnostic settings → Log Analytics
  • Storage analytics logs
  • Azure Monitor alerts
  • Microsoft Purview data scanning
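The POSIX ACLs in Layer 2 are expressed as comma-separated entries of the form `scope:identity:permissions`. The helper below is a minimal sketch of building such a string for a zone directory. The function name `build_zone_acl`, the base permissions, and the group object ID are hypothetical placeholders, not values from this document.

```python
from typing import Dict

# The eight possible read/write/execute triplets.
VALID_PERMS = {"---", "r--", "-w-", "--x", "rw-", "r-x", "-wx", "rwx"}


def build_zone_acl(group_perms: Dict[str, str]) -> str:
    """Build an ACL string granting each Azure AD group the given permissions."""
    # Base entries: owning user, owning group, everyone else (assumed defaults).
    entries = ["user::rwx", "group::r-x", "other::---"]
    for object_id, perms in sorted(group_perms.items()):
        if perms not in VALID_PERMS:
            raise ValueError(f"invalid permission triplet: {perms!r}")
        entries.append(f"group:{object_id}:{perms}")
    return ",".join(entries)


# Hypothetical object ID for an analysts group with read/execute access.
acl = build_zone_acl({"00000000-0000-0000-0000-0000000000aa": "r-x"})
print(acl)
# user::rwx,group::r-x,other::---,group:00000000-0000-0000-0000-0000000000aa:r-x
```

With the Python SDK, a string like this could be passed to `DataLakeDirectoryClient.set_access_control(acl=...)`. Entries prefixed with `default:` control what newly created children inherit and are left out here for brevity. Test the exact entry set against a scratch directory before applying it to a real zone.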
Python SDK for Data Lake Management
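from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
import requests

credential = DefaultAzureCredential()
client = DataLakeServiceClient(
    account_url="https://stdatalake001.dfs.core.windows.net",
    credential=credential,
)

# Create zone directories in the "datalake" file system (container)
file_system = client.get_file_system_client("datalake")
zones = ["raw", "curated", "sandbox", "archive"]
for zone in zones:
    file_system.create_directory(zone)
    print(f"Created zone: {zone}")

# Read the lifecycle management policy.
# Lifecycle policies are a management-plane (Azure Resource Manager) setting,
# so the token is scoped to management.azure.com, not storage.azure.com.
# The subscription ID and resource group are placeholders to fill in.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
token = credential.get_token("https://management.azure.com/.default")

# Get current policy (returns 404 if no policy has been set yet)
response = requests.get(
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.Storage"
    "/storageAccounts/stdatalake001/managementPolicies/default",
    params={"api-version": "2023-01-01"},
    headers={"Authorization": f"Bearer {token.token}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())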
Interview Questions
Q1: How do you implement a data lake zone architecture?
A: Create a separate directory or container for each zone (raw, curated, sandbox, archive). Apply lifecycle management rules to tier and expire data. Set ACLs per zone. Store curated data in Delta Lake format. Document each zone's purpose and access policy.

Q2: What are the performance best practices for ADLS Gen2?
A: 1) Enable the hierarchical namespace, 2) Avoid many small files (aim for large files, for example 256 MB to 1 GB+ each), 3) Partition data to match query patterns, 4) Access data through the ABFS driver (abfss://) for Hadoop-compatible workloads, 5) Parallelize bulk uploads (for example, with AzCopy).

Q3: How do you handle data quality in a data lake?
A: 1) Validate schemas at ingestion, 2) Apply data quality rules during transformation, 3) Use Great Expectations for automated validation, 4) Monitor for data drift, 5) Route failed records to a quarantine zone, 6) Alert on quality issues.
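The schema-validation and quarantine steps from Q3 can be sketched in plain Python. This is an illustration of the routing idea only, not Great Expectations. The schema, field names, and function names below are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"order_id": int, "customer_id": int, "amount": float}


def validate(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of schema violations for one record (empty if valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


def split_records(
    records: Iterable[Dict[str, Any]], schema: Dict[str, type] = SCHEMA
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into valid ones and quarantined ones with their errors."""
    valid, quarantined = [], []
    for record in records:
        errors = validate(record, schema)
        if errors:
            # Keep the original record with the reasons so it can be replayed.
            quarantined.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, quarantined


good, bad = split_records([
    {"order_id": 1, "customer_id": 7, "amount": 19.99},
    {"order_id": 2, "amount": "n/a"},
])
print(len(good), len(bad))
# 1 1
```

Storing the failure reasons next to each quarantined record lets the quarantine zone double as input for alerting and later reprocessing.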