Data Lake Architecture: ADLS Gen2 Zone-Based Design

Enterprise data lake design with zone-based architecture, governance, and performance optimization

Zone-Based Data Lake Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    ZONE-BASED DATA LAKE ARCHITECTURE                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    ADLS GEN2 ACCOUNT                         │   │
│  │                                                               │   │
│  │  RAW ZONE (Landing)                                          │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │ /raw/                                                │    │   │
│  │  │ ├── source_system_a/                                 │    │   │
│  │  │ │   └── YYYY/MM/DD/                                  │    │   │
│  │  │ │       └── *.* (original source format)             │    │   │
│  │  │ ├── source_system_b/                                 │    │   │
│  │  │ └── source_system_c/                                 │    │   │
│  │  │                                                      │    │   │
│  │  │ Retention: 90 days → Archive tier                    │    │   │
│  │  │ Format: Original (CSV, JSON, Parquet)                │    │   │
│  │  │ Immutability: Write-once, read-many                  │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  │                                                               │   │
│  │  CURATED ZONE (Analytics-Ready)                              │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │ /curated/                                            │    │   │
│  │  │ ├── dimensions/                                      │    │   │
│  │  │ │   ├── dim_customers/ (Delta)                       │    │   │
│  │  │ │   ├── dim_products/ (Delta)                        │    │   │
│  │  │ │   └── dim_dates/ (Delta)                           │    │   │
│  │  │ ├── facts/                                           │    │   │
│  │  │ │   ├── fact_sales/ (Delta, partitioned)             │    │   │
│  │  │ │   └── fact_inventory/ (Delta)                      │    │   │
│  │  │ └── aggregates/                                      │    │   │
│  │  │     └── daily_sales_summary/ (Delta)                 │    │   │
│  │  │                                                      │    │   │
│  │  │ Format: Delta Lake (ACID transactions)               │    │   │
│  │  │ Schema: Star/snowflake schema                        │    │   │
│  │  │ Partitioning: By query patterns (date, region)       │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  │                                                               │   │
│  │  SANDBOX ZONE (Exploration)                                  │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │ /sandbox/                                            │    │   │
│  │  │ ├── user_a/                                          │    │   │
│  │  │ ├── user_b/                                          │    │   │
│  │  │ └── experiments/                                     │    │   │
│  │  │                                                      │    │   │
│  │  │ Retention: 30 days auto-cleanup                      │    │   │
│  │  │ Access: Data scientists, analysts                    │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  │                                                               │   │
│  │  ARCHIVE ZONE (Compliance)                                   │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │ /archive/                                            │    │   │
│  │  │ ├── 2023/                                            │    │   │
│  │  │ └── 2022/                                            │    │   │
│  │  │                                                      │    │   │
│  │  │ Tier: Archive access tier                            │    │   │
│  │  │ Retention: 7 years (compliance)                      │    │   │
│  │  │ Access: RESTRICTED (audit only)                      │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
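The raw-zone layout in the diagram (source system, then YYYY/MM/DD) can be generated programmatically so that every ingestion job lands files in the same place. The following is a minimal sketch; the helper name raw_landing_path and the example file name are illustrative assumptions, not part of the original design.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_landing_path(source_system: str, ingest_date: date, filename: str) -> str:
    """Build a raw-zone path of the form raw/<source>/YYYY/MM/DD/<file>."""
    # Hypothetical helper: zero-padded month and day keep directories in
    # chronological order when they are listed lexically.
    return str(
        PurePosixPath("raw")
        / source_system
        / f"{ingest_date:%Y}"
        / f"{ingest_date:%m}"
        / f"{ingest_date:%d}"
        / filename
    )


# Example file name is made up for illustration.
print(raw_landing_path("source_system_a", date(2024, 3, 7), "orders.parquet"))
```

Keeping the path logic in one function makes it harder for individual pipelines to drift from the agreed directory convention.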

Lifecycle Management Policy

{
  "rules": [
    {
      "enabled": true,
      "name": "RawZoneLifecycle",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            },
            "delete": {
              "daysAfterModificationGreaterThan": 365
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["raw/"]
        }
      }
    },
    {
      "enabled": true,
      "name": "SandboxCleanup",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "delete": {
              "daysAfterModificationGreaterThan": 30
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["sandbox/"]
        }
      }
    }
  ]
}
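To reason about what the policy above does to a given blob, it can help to model it locally. The sketch below is an approximation under two stated assumptions: thresholds are strictly "greater than" (as the property names suggest), and when several actions match, the least expensive one wins (delete, then archive, then cool). It also illustrates that a prefixMatch value starts with the container name, so raw/ matches a container named raw, while zones kept as directories inside a container named datalake would need datalake/raw/. The function names are hypothetical.

```python
from typing import Dict, List, Optional

# Thresholds copied from the RawZoneLifecycle rule (days since last modification).
RAW_ZONE_ACTIONS: Dict[str, int] = {"tierToCool": 30, "tierToArchive": 90, "delete": 365}

# Least expensive action first; the first matching action is the one applied.
ACTION_PRIORITY: List[str] = ["delete", "tierToArchive", "tierToCool"]


def expected_action(days_since_modification: int,
                    actions: Dict[str, int] = RAW_ZONE_ACTIONS) -> Optional[str]:
    """Return the action this model expects for a blob of the given age, or None."""
    for action in ACTION_PRIORITY:
        threshold = actions.get(action)
        if threshold is not None and days_since_modification > threshold:
            return action
    return None


def matches_prefix(container: str, blob_name: str, prefixes: List[str]) -> bool:
    """Check a blob against prefixMatch values, which begin with the container name."""
    full_path = f"{container}/{blob_name}"
    return any(full_path.startswith(prefix) for prefix in prefixes)


for age in (10, 45, 120, 400):
    print(age, expected_action(age))
```

This is only a local model for sanity-checking thresholds; the actual evaluation is done by the storage service on its own schedule.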

Data Lake Security Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    DATA LAKE SECURITY LAYERS                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LAYER 1: NETWORK SECURITY                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Private Endpoints (no public access)                   │   │
│  │ • NSG rules on compute subnets                           │   │
│  │ • VNet Service Endpoints                                 │   │
│  │ • Firewall rules (IP whitelisting)                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  LAYER 2: IDENTITY & ACCESS                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Azure AD authentication                                │   │
│  │ • Managed Identities for services                        │   │
│  │ • RBAC at storage account/container scope                │   │
│  │ • POSIX ACLs for directory/file-level access             │   │
│  │ • Azure AD Groups for role management                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  LAYER 3: DATA PROTECTION                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Encryption at rest (Microsoft-managed keys)            │   │
│  │ • Encryption in transit (TLS 1.2)                        │   │
│  │ • Customer-managed keys (CMK) in Key Vault               │   │
│  │ • Soft delete (recovery from accidental deletion)        │   │
│  │ • Versioning (point-in-time recovery)                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  LAYER 4: MONITORING & AUDIT                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Diagnostic settings → Log Analytics                    │   │
│  │ • Storage analytics logs                                 │   │
│  │ • Azure Monitor alerts                                   │   │
│  │ • Microsoft Purview data scanning                        │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
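Layer 2 relies on POSIX-style ACLs assigned to groups rather than individual users. As a rough sketch, the helper below builds an ACL string in the short form (user::rwx,group::r-x,...) that the azure-storage-file-datalake SDK accepts in calls such as set_access_control(acl=...). The helper name build_acl and the group object IDs are placeholders invented for illustration; the zone-to-group mapping only follows the access notes in the zone diagram.

```python
from typing import Dict


def _valid_perms(perms: str) -> bool:
    """True for three-character permission strings such as 'rwx' or 'r-x'."""
    return (
        len(perms) == 3
        and perms[0] in "r-"
        and perms[1] in "w-"
        and perms[2] in "x-"
    )


def build_acl(group_permissions: Dict[str, str], owner_perms: str = "rwx") -> str:
    """Build a short-form ACL: owner access, no default access, named groups."""
    if not _valid_perms(owner_perms):
        raise ValueError(f"invalid owner permissions: {owner_perms!r}")
    entries = [f"user::{owner_perms}", "group::---", "other::---"]
    for object_id, perms in sorted(group_permissions.items()):
        if not _valid_perms(perms):
            raise ValueError(f"invalid permissions for {object_id}: {perms!r}")
        entries.append(f"group:{object_id}:{perms}")
    return ",".join(entries)


# Placeholder object IDs; real values would be the directory group object IDs.
sandbox_acl = build_acl({"<data-scientists-group-id>": "rwx", "<analysts-group-id>": "rwx"})
archive_acl = build_acl({"<auditors-group-id>": "r-x"})
print(sandbox_acl)
print(archive_acl)
```

Applying the string would look something like directory_client.set_access_control(acl=sandbox_acl) on a directory client for the zone; that call needs a live account and is left out of the sketch.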

Python SDK for Data Lake Management

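import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "stdatalake001"
# Placeholders (not in the original example): replace with your own values
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"

credential = DefaultAzureCredential()
client = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=credential,
)

# Create one directory per zone in the existing "datalake" file system
file_system = client.get_file_system_client("datalake")
zones = ["raw", "curated", "sandbox", "archive"]
for zone in zones:
    file_system.create_directory(zone)
    print(f"Created zone: {zone}")

# Read the current lifecycle management policy.
# Lifecycle policies are managed through Azure Resource Manager (the
# management plane), so the token scope and endpoint differ from the
# data-plane calls above.
token = credential.get_token("https://management.azure.com/.default")
policy_url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Storage"
    f"/storageAccounts/{ACCOUNT_NAME}/managementPolicies/default"
)
response = requests.get(
    policy_url,
    params={"api-version": "2023-01-01"},
    headers={"Authorization": f"Bearer {token.token}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())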

Interview Questions

Q1: How do you implement a data lake zone architecture? A: Create a separate directory or container for each zone (raw, curated, sandbox, archive). Apply lifecycle management rules for tiering and cleanup, set ACLs per zone, and store curated data in Delta Lake format. Document each zone's purpose and access policy.

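Q2: What are the performance best practices for ADLS Gen2? A: 1) Enable the hierarchical namespace so directory renames and deletes are single metadata operations, 2) Avoid small files (Microsoft's guidance is roughly 256 MB to 100 GB per file; about 1 GB is a common target), 3) Partition data to match query patterns, 4) Access the lake through the ABFS driver (abfss://) from Hadoop-compatible engines, 5) Parallelize bulk uploads, for example with AzCopy.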
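The small-files point in Q2 usually translates into a compaction job. As a rough illustration, the sketch below groups file sizes into batches that each stay at or under a target size, so each batch could be rewritten as one larger file. The function name plan_compaction, the greedy strategy, and the 1024 MB default are assumptions made for the example; engines such as Delta Lake ship their own compaction commands.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 1024) -> List[List[int]]:
    """Greedily group file sizes (largest first) into batches of at most target_mb.

    A single file larger than target_mb ends up in a batch of its own.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


print(plan_compaction([600, 500, 400, 100]))
```

The greedy pass is not an optimal bin packing; it is only meant to show the shape of the planning step.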

Q3: How do you handle data quality in a data lake? A: 1) Schema validation at ingestion, 2) Data quality rules in transformation, 3) Great Expectations for automated validation, 4) Monitoring for data drift, 5) Quarantine zone for failed records, 6) Alerting for quality issues.
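Steps 1, 2, and 5 of the Q3 answer (schema validation, quality rules, quarantine) can be sketched in plain Python. The schema, the field names, and the non-negative amount rule below are hypothetical examples; a production pipeline would more likely express them in a framework such as Great Expectations.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "region": str}

Record = Dict[str, Any]


def validate(record: Record) -> List[str]:
    """Return a list of problems found in the record (empty if it passes)."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # Example quality rule, applied only once the schema check has passed.
    if not errors and record["amount"] < 0:
        errors.append("amount must be non-negative")
    return errors


def route(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into valid ones and quarantined ones with their errors."""
    valid, quarantined = [], []
    for record in records:
        errors = validate(record)
        if errors:
            quarantined.append((record, errors))
        else:
            valid.append(record)
    return valid, quarantined


sample = [
    {"order_id": 1, "amount": 9.5, "region": "EU"},
    {"order_id": "2", "amount": 1.0, "region": "US"},
    {"order_id": 3, "amount": -1.0, "region": "US"},
    {"order_id": 4, "region": "US"},
]
valid, quarantined = route(sample)
print(len(valid), "valid;", len(quarantined), "quarantined")
```

Keeping the error list alongside each quarantined record makes it easier to alert on the most common failure reasons.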