Azure Data Lake Storage Gen2: Hierarchical Namespace, ACLs & POSIX
Enterprise data lake with hierarchical namespace, POSIX ACLs, and high-performance analytics capabilities
ADLS Gen2 Architecture
STORAGE ACCOUNT (StorageV2 with HNS enabled)
└── FILE SYSTEM: datalake
    ├── raw/
    │   └── 2024/
    │       ├── 01/
    │       │   ├── sales/
    │       │   │   ├── day=2024-01-01/
    │       │   │   │   ├── part-00000.parquet
    │       │   │   │   └── part-00001.parquet
    │       │   │   └── day=2024-01-02/
    │       │   │       └── part-00000.parquet
    │       │   ├── inventory/
    │       │   └── customers/
    │       └── 02/
    ├── curated/
    │   ├── dimensions/
    │   └── facts/
    ├── sandbox/
    └── archive/

POSIX ACLs:
    User/Group        Access   Type          Scope
    ----------------  -------  ------------  -------
    dataeng-group     rwx      named group   default
    analysts-group    r-x      named group   default
    other             ---      other         default

    (ACL entries only grant permissions; there is no explicit deny. An entry of --- grants nothing.)
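The date-partitioned layout above can be generated rather than typed by hand. The following is a minimal sketch; `partition_dir` and `part_file` are hypothetical helper names, and the `zone/year/month/dataset/day=YYYY-MM-DD` ordering is simply read off the example tree.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_dir(zone: str, dataset: str, day: date) -> str:
    """Return the directory holding one day's partition of a dataset."""
    return str(
        PurePosixPath(zone)
        / f"{day.year:04d}"
        / f"{day.month:02d}"
        / dataset
        / f"day={day.isoformat()}"
    )


def part_file(zone: str, dataset: str, day: date, index: int) -> str:
    """Return the full path of the index-th Parquet part file in that partition."""
    return f"{partition_dir(zone, dataset, day)}/part-{index:05d}.parquet"


if __name__ == "__main__":
    print(part_file("raw", "sales", date(2024, 1, 1), 0))
```

Keeping path construction in one place makes it harder for writers and readers to disagree about the partition scheme.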
ACL Configuration
# Set the owning group and POSIX permissions on a directory
# (--permissions covers owner, owning group, and other, in that order)
az storage fs access set \
    --path "raw/2024/01" \
    --account-name stdatalake001 \
    --file-system datalake \
    --group "dataeng-group@company.com" \
    --permissions "rwxrwx---"
# Set default ACL (inherited by new items)
az storage fs access set \
    --path "raw/2024/01" \
    --account-name stdatalake001 \
    --file-system datalake \
    --acl "default:user::rwx,default:group::r-x,default:other::---"
# Get current ACL
az storage fs access show \
    --path "raw/2024/01" \
    --account-name stdatalake001 \
    --file-system datalake
# Remove a named group's ACL entry from the directory and everything under it
# (entries are listed without permissions; <group-object-id> is a placeholder)
az storage fs access remove-recursive \
    --path "raw/2024/01" \
    --account-name stdatalake001 \
    --file-system datalake \
    --acl "group:<group-object-id>"
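The ACL strings passed to `--acl` follow a fixed `[default:]type:id:permissions` shape, so they can be assembled and validated in Python before being applied. This is a minimal sketch: `build_acl` and `apply_acl` are hypothetical helpers (not part of any SDK), and `directory_client` is assumed to be a `DataLakeDirectoryClient` from `azure-storage-file-datalake`, whose `set_access_control` method accepts an `acl` string.

```python
from typing import Iterable, Tuple

# (entry type, principal id or "" for the owning user/group/other, permissions)
AclEntry = Tuple[str, str, str]


def build_acl(entries: Iterable[AclEntry], default: bool = False) -> str:
    """Serialize entries into the comma-separated ACL string form."""
    parts = []
    for kind, principal, perms in entries:
        if kind not in {"user", "group", "mask", "other"}:
            raise ValueError(f"unknown ACL entry type: {kind!r}")
        # Each position may only hold its own letter or a dash, e.g. 'r-x'.
        valid = len(perms) == 3 and all(
            ch in allowed for ch, allowed in zip(perms, ("r-", "w-", "x-"))
        )
        if not valid:
            raise ValueError(f"permissions must look like 'r-x': {perms!r}")
        prefix = "default:" if default else ""
        parts.append(f"{prefix}{kind}:{principal}:{perms}")
    return ",".join(parts)


def apply_acl(directory_client, acl: str) -> None:
    """Apply the ACL string; directory_client is assumed to be a DataLakeDirectoryClient."""
    directory_client.set_access_control(acl=acl)
```

For example, `build_acl([("user", "", "rwx"), ("group", "", "r-x"), ("other", "", "---")], default=True)` yields the same string as the default ACL command above.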
POSIX ACL Permission Model
ACL ENTRY TYPES:
    user::rwx          - Owning user permissions
    group::r-x         - Owning group permissions
    other::---         - Permissions for everyone else

    user:<name>:rwx    - Named user
    group:<name>:r-x   - Named group
    mask::rwx          - Upper bound on named user, owning group, and named group entries

PERMISSION EVALUATION:
    1. If user is the owner → apply user:: permissions (the mask is not applied)
    2. If user has a named user entry → apply user:<name>, limited by the mask
    3. If user is in the owning group or a named group → apply group:: or group:<name>, limited by the mask
    4. Otherwise → apply other:: permissions
    5. For named users, the owning group, and named groups: effective permissions = entry ∩ mask

ACCESS CHECK ALGORITHM:
    IF is_owner(user):
        effective = acl_user
    ELIF has_named_user_entry(user):
        effective = (acl_named_user & mask)
    ELIF in_owning_group(user) OR in_named_group(user):
        effective = (acl_group & mask)
    ELSE:
        effective = acl_other

    Return: (effective & requested_permissions) == requested_permissions
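The access check can be sketched in runnable Python. This is an illustration under stated assumptions, not the service's implementation: the `Acl` class and `is_allowed` function are hypothetical names, the owning user is checked without the mask, named users and groups are limited by the mask, and a caller who matches a group entry is decided by group entries alone (as in the ELIF chain above).

```python
from dataclasses import dataclass, field
from typing import Dict, Set

R, W, X = 4, 2, 1  # permission bits, in rwx order


def parse(perms: str) -> int:
    """Convert a string such as 'r-x' into a bitmask."""
    return sum(bit for ch, bit in zip(perms, (R, W, X)) if ch != "-")


@dataclass
class Acl:
    owner: str
    owning_group: str
    user: int                 # user::
    group: int                # group::
    other: int                # other::
    mask: int = R | W | X     # mask::
    named_users: Dict[str, int] = field(default_factory=dict)
    named_groups: Dict[str, int] = field(default_factory=dict)


def is_allowed(acl: Acl, user: str, groups: Set[str], requested: int) -> bool:
    """Return True if every requested bit is granted to the caller."""
    if user == acl.owner:
        # Owning user: the mask is not applied.
        return (acl.user & requested) == requested
    if user in acl.named_users:
        effective = acl.named_users[user] & acl.mask
        return (effective & requested) == requested
    # Owning group and named groups: any matching entry may grant access.
    group_entries = dict(acl.named_groups)
    group_entries[acl.owning_group] = acl.group
    matched = [perms for name, perms in group_entries.items() if name in groups]
    if matched:
        return any((perms & acl.mask & requested) == requested for perms in matched)
    return (acl.other & requested) == requested
```

With a mask of `r--`, a named user holding `rwx` can read but not write, while the owner keeps full `rwx`.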
Data Lake Architecture Zones
INGESTION ZONE (Raw)
    • Raw, unprocessed data
    • Original format preserved
    • Immutable (write-once, read-many)
    • Schema-on-read
    • TTL: 90 days → Archive tier
        │
        ▼ (ETL/ELT)
PROCESSING ZONE (Staging)
    • Intermediate transformations
    • Data quality validation
    • Schema enforcement (schema-on-write)
    • Temporary storage (auto-cleanup)
        │
        ▼ (Transform)
CURATED ZONE (Analytics)
    • Cleaned, validated data
    • Star/snowflake schema
    • Optimized for query performance
    • Partitioned by query patterns
    • Delta Lake format for ACID transactions
        │
        ▼ (Serve)
SERVING ZONE (Consumption)
    • Materialized views
    • Aggregate tables
    • Power BI datasets
    • ML feature stores

SANDBOX ZONE (Exploration)
    • Ad-hoc analysis
    • Data scientist experimentation
    • Self-service analytics
    • TTL: 30 days auto-cleanup
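The two TTLs in the zone descriptions (raw data to the Archive tier after 90 days, sandbox cleanup after 30 days) map naturally onto a storage account lifecycle management policy. The sketch below builds such a policy as a Python dict and prints it as JSON. The rule names and the `datalake/raw/` and `datalake/sandbox/` prefixes are assumptions based on the example file system, and the helper functions are hypothetical.

```python
import json
from typing import Any, Dict


def _rule(name: str, prefix: str, base_blob_actions: Dict[str, Any]) -> Dict[str, Any]:
    """Build one lifecycle rule scoped to block blobs under a path prefix."""
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "actions": {"baseBlob": base_blob_actions},
            # prefixMatch starts with the container (file system) name.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
        },
    }


def tier_to_archive_rule(name: str, prefix: str, days: int) -> Dict[str, Any]:
    """Move data to the Archive tier once it has gone unmodified for `days` days."""
    return _rule(name, prefix, {"tierToArchive": {"daysAfterModificationGreaterThan": days}})


def delete_rule(name: str, prefix: str, days: int) -> Dict[str, Any]:
    """Delete data once it has gone unmodified for `days` days."""
    return _rule(name, prefix, {"delete": {"daysAfterModificationGreaterThan": days}})


policy = {
    "rules": [
        tier_to_archive_rule("rawToArchive", "datalake/raw/", 90),
        delete_rule("sandboxCleanup", "datalake/sandbox/", 30),
    ]
}

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

The printed JSON is in the shape used by lifecycle management policies on the storage account; the staging zone's auto-cleanup is left out because the text gives no TTL for it.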
Python SDK for ADLS Gen2
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate with whatever credential the environment provides
credential = DefaultAzureCredential()
client = DataLakeServiceClient(
    account_url="https://stdatalake001.dfs.core.windows.net",
    credential=credential
)

# Create a directory with metadata
file_system = client.get_file_system_client("datalake")
file_system.create_directory(
    "raw/2024/01/sales",
    metadata={"project": "dataengineering", "retention": "90days"}
)

# Upload a local file, replacing any existing file at that path
file_client = client.get_file_client("datalake", "raw/2024/01/sales/data.parquet")
with open("local_file.parquet", "rb") as f:
    file_client.upload_data(f, overwrite=True)

# List everything under raw/ with basic properties
for path in file_system.get_paths(path="raw", recursive=True):
    print(f"Name: {path.name}")
    print(f"Is Directory: {path.is_directory}")
    print(f"Last Modified: {path.last_modified}")
    print(f"Content Length: {path.content_length}")
Interview Questions
Q1: Explain the difference between POSIX ACLs and Azure RBAC for ADLS Gen2.
A: POSIX-style ACLs grant permissions on individual directories and files through owning user, owning group, named user, named group, mask, and other entries. Azure RBAC assigns roles (for example, Storage Blob Data Reader) at storage account or container scope, or higher; a role assignment cannot target a single directory. RBAC is evaluated first: if a role assignment already grants the operation, ACLs are not checked. Use RBAC for coarse-grained and administrative access, and ACLs for fine-grained access within the lake.
Q2: How do you implement a data lake zone architecture in ADLS Gen2?
A: Create separate containers or directories for each zone (raw, curated, sandbox). Use lifecycle management to tier data. Implement ACLs per zone. Use Delta Lake format in the curated zone for ACID transactions.
Q3: What are the performance best practices for ADLS Gen2?
A: 1) Enable the hierarchical namespace so that directory renames and deletes are single metadata operations, 2) avoid many small files (aim for large files, on the order of 256 MB to 1 GB or more), 3) partition data to match query patterns, 4) access the lake through the ABFS driver (abfss://) from Hadoop and Spark, 5) parallelize bulk uploads, for example with AzCopy or concurrent SDK uploads.
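The small-files advice in Q3 usually turns into a compaction job. As a minimal sketch, the planner below groups undersized files into batches that could each be rewritten as one larger file. The function name, the greedy first-fit strategy, and the 1 GiB default target are assumptions for illustration; it only plans batches and does not read or write any data.

```python
from typing import Dict, List

TARGET_BYTES = 1024 ** 3  # 1 GiB default target; adjust to the engine and workload


def plan_compaction(sizes: Dict[str, int], target: int = TARGET_BYTES) -> List[List[str]]:
    """Group files smaller than `target` into batches of at most `target` bytes.

    Uses greedy first-fit over files sorted from largest to smallest.
    """
    small = sorted(((size, name) for name, size in sizes.items() if size < target), reverse=True)
    batches: List[List[str]] = []
    totals: List[int] = []
    for size, name in small:
        for i, total in enumerate(totals):
            if total + size <= target:
                batches[i].append(name)
                totals[i] += size
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([name])
            totals.append(size)
    # A batch holding a single file gains nothing from compaction.
    return [batch for batch in batches if len(batch) > 1]
```

The file sizes could come from the `content_length` values returned when listing paths with the SDK.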