Azure Cloud Overview & Global Infrastructure
Understanding Microsoft Azure's global footprint, resource management, and foundational services for data engineering
Azure Global Infrastructure
Azure is commonly ranked as the second-largest public cloud by market share, with more than 60 announced regions and services available in roughly 140 countries. As a data engineer, you need to understand this infrastructure to design highly available, low-latency data solutions.
Regions and Availability Zones
┌──────────────────────────────────────────────────────────────────┐
│                   AZURE GLOBAL INFRASTRUCTURE                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐  │
│  │      Region      │ │      Region      │ │      Region      │  │
│  │    East US 2     │ │   West Europe    │ │  Southeast Asia  │  │
│  │                  │ │                  │ │                  │  │
│  │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │  │
│  │ │ AZ-1         │ │ │ │ AZ-1         │ │ │ │ AZ-1         │ │  │
│  │ │ ┌──────────┐ │ │ │ │ ┌──────────┐ │ │ │ │ ┌──────────┐ │ │  │
│  │ │ │  DC/FC   │ │ │ │ │ │  DC/FC   │ │ │ │ │ │  DC/FC   │ │ │  │
│  │ │ └──────────┘ │ │ │ │ └──────────┘ │ │ │ │ └──────────┘ │ │  │
│  │ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │  │
│  │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │  │
│  │ │ AZ-2         │ │ │ │ AZ-2         │ │ │ │ AZ-2         │ │  │
│  │ │ ┌──────────┐ │ │ │ │ ┌──────────┐ │ │ │ │ ┌──────────┐ │ │  │
│  │ │ │  DC/FC   │ │ │ │ │ │  DC/FC   │ │ │ │ │ │  DC/FC   │ │ │  │
│  │ │ └──────────┘ │ │ │ │ └──────────┘ │ │ │ │ └──────────┘ │ │  │
│  │ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │  │
│  │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │  │
│  │ │ AZ-3         │ │ │ │ AZ-3         │ │ │ │ AZ-3         │ │  │
│  │ │ ┌──────────┐ │ │ │ │ ┌──────────┐ │ │ │ │ ┌──────────┐ │ │  │
│  │ │ │  DC/FC   │ │ │ │ │ │  DC/FC   │ │ │ │ │ │  DC/FC   │ │ │  │
│  │ │ └──────────┘ │ │ │ │ └──────────┘ │ │ │ │ └──────────┘ │ │  │
│  │ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │  │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘  │
│                                                                  │
│ DC = Data Center  FC = Fabric Controller  AZ = Availability Zone │
│ Each AZ has independent power, cooling, and networking           │
└──────────────────────────────────────────────────────────────────┘
Key Concepts
| Concept | Description | Data Engineering Impact |
|---|---|---|
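| Region | Set of datacenters within a latency-defined perimeter; zone-enabled regions have at least three AZs | Data residency, latency optimization |
| Availability Zone | Physically separate group of one or more datacenters within a region | HA for Synapse, Databricks clusters |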
| Region Pair | Paired regions for DR | Geo-redundant backup of ADLS |
| Resource Group | Logical container for resources | Organize data engineering assets |
| Subscription | Billing and access boundary | Cost allocation per project |
| Management Group | Policy hierarchy | Enterprise governance |
Pro Tip: When designing data pipelines, place your compute (ADF Integration Runtime, Databricks) in the same region as your data storage (ADLS, Synapse) to avoid inter-region data transfer charges and reduce latency.
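The co-location rule in the tip above can be checked mechanically. The following is a minimal sketch, assuming a hand-maintained inventory of resources and the data flows between them; the resource names, the `links` list, and the function `find_cross_region_links` are all hypothetical and not part of any Azure SDK.

```python
def find_cross_region_links(resources, links):
    """Return the (source, target) links whose endpoints sit in different regions."""
    region_of = {r["name"]: r["region"] for r in resources}
    return [(src, dst) for src, dst in links if region_of[src] != region_of[dst]]


# Hypothetical inventory: two compute resources reading from one data lake.
resources = [
    {"name": "adf-ir", "region": "eastus2"},
    {"name": "dbx-cluster", "region": "westeurope"},
    {"name": "adls", "region": "eastus2"},
]
links = [("adf-ir", "adls"), ("dbx-cluster", "adls")]

for src, dst in find_cross_region_links(resources, links):
    print(f"{src} -> {dst} crosses regions: expect transfer charges and extra latency")
```

In practice the inventory would come from an export of your subscription rather than a literal list; the comparison logic stays the same.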
Resource Hierarchy
Management Group (Enterprise)
├── Subscription: Data Engineering Dev
│   ├── Resource Group: rg-datalake-dev
│   │   ├── Storage Account: stdatalakedev001
│   │   └── Key Vault: kv-secrets-dev
│   └── Resource Group: rg-synapse-dev
│       ├── Synapse Workspace: syn-workspace-dev
│       └── Synapse Managed VNet
└── Subscription: Data Engineering Prod
    ├── Resource Group: rg-datalake-prod
    │   ├── Storage Account: stdatalakeprod001
    │   └── Key Vault: kv-secrets-prod
    └── Resource Group: rg-synapse-prod
        ├── Synapse Workspace: syn-workspace-prod
        └── Synapse Managed VNet
(Storage account names must be globally unique, so dev and prod use different names.)
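Policies and role assignments made at a higher level of this hierarchy are inherited by everything beneath it. The sketch below illustrates that inheritance with a plain parent map. The scope identifiers `mg-enterprise` and `sub-dataeng-dev` and the policy names are hypothetical, and the code only models the idea; it does not call Azure.

```python
# Parent of each scope, mirroring one branch of the hierarchy (identifiers are illustrative).
PARENT = {
    "sub-dataeng-dev": "mg-enterprise",
    "rg-datalake-dev": "sub-dataeng-dev",
    "kv-secrets-dev": "rg-datalake-dev",
}

# Hypothetical policy assignments made directly at a scope.
ASSIGNMENTS = {
    "mg-enterprise": ["require-tags"],
    "rg-datalake-dev": ["deny-public-blob"],
}


def effective_policies(scope):
    """Collect policies assigned at the scope and at every ancestor, root first."""
    chain = []
    while scope is not None:
        chain.append(scope)
        scope = PARENT.get(scope)  # None once the root management group is reached
    policies = []
    for ancestor in reversed(chain):
        policies.extend(ASSIGNMENTS.get(ancestor, []))
    return policies


print(effective_policies("kv-secrets-dev"))
```

The Key Vault picks up both the enterprise-wide tagging policy and the resource-group policy, which is why broad governance rules are usually assigned at the management group.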
ARM Template Example
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageAccountName": {
      "type": "string",
      "metadata": { "description": "ADLS Gen2 storage account name" }
    },
    "location": {
      "type": "string",
      "defaultValue": "[resourceGroup().location]"
    }
  },
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2023-01-01",
      "name": "[parameters('storageAccountName')]",
      "location": "[parameters('location')]",
      "sku": { "name": "Standard_LRS", "tier": "Standard" },
      "kind": "StorageV2",
      "properties": {
        "isHnsEnabled": true,
        "supportsHttpsTrafficOnly": true,
        "minimumTlsVersion": "TLS1_2",
        "accessTier": "Hot",
        "encryption": {
          "services": {
            "blob": { "enabled": true },
            "file": { "enabled": true }
          },
          "keySource": "Microsoft.Storage"
        },
        "networkAcls": {
          "defaultAction": "Deny",
          "virtualNetworkRules": [],
          "ipRules": []
        }
      },
      "tags": {
        "Environment": "Production",
        "Project": "DataEngineering"
      }
    }
  ],
  "outputs": {
    "storageAccountId": {
      "type": "string",
      "value": "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
    }
  }
}
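A template like this is usually deployed together with a parameters file. As a minimal sketch, the helper below builds that file for the two parameters declared above and rejects storage account names that break the naming rule (3 to 24 lowercase letters and digits). The function name `build_parameters` and the sample values are assumptions for illustration.

```python
import json
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def build_parameters(storage_account_name, location=None):
    """Build an ARM parameters document for the template's two parameters."""
    if not STORAGE_NAME_RE.fullmatch(storage_account_name):
        raise ValueError(
            f"invalid storage account name: {storage_account_name!r} "
            "(use 3-24 lowercase letters and digits)"
        )
    params = {"storageAccountName": {"value": storage_account_name}}
    if location is not None:
        # Omit to fall back to the template default, resourceGroup().location.
        params["location"] = {"value": location}
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": params,
    }


print(json.dumps(build_parameters("stdatalake001", "eastus2"), indent=2))
```

Validating the name before deployment gives a faster and clearer failure than waiting for the deployment itself to reject it.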
Core Services for Data Engineering
Compute Services Comparison
| Service | Use Case | Pricing Model | Best For |
|---|---|---|---|
| Azure Functions | Event-driven ETL | Per execution | Lightweight transformations |
| Azure Data Factory | Orchestration | Per activity run | Complex ETL/ELT workflows |
| Azure Databricks | Big data processing | Per DBU | Spark-based transformations |
| Synapse Serverless | Ad-hoc queries | Per TB scanned | Lake exploration |
| Synapse Dedicated | Reserved compute | Per DWU | Data warehousing |
| Azure ML | ML pipelines | Per compute | Feature engineering |
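The "Per TB scanned" model for Synapse Serverless lends itself to a quick back-of-the-envelope estimate. The sketch below assumes a rate of 5 USD per TB and binary terabytes (1024^4 bytes); both figures are assumptions for illustration rather than values from this guide, so check the current price list before relying on them.

```python
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes


def estimate_scan_cost(bytes_scanned, price_per_tb=5.0):
    """Estimate the cost of a serverless query from the bytes it scans.

    price_per_tb is an assumed rate in USD, not an official figure.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# A hypothetical exploration query that scans 512 GiB of Parquet files.
cost = estimate_scan_cost(512 * 1024 ** 3)
print(f"Estimated cost: ${cost:.2f}")
```

Because cost scales with bytes scanned, partition pruning and columnar formats reduce the bill directly.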
Storage Services Comparison
| Service | Throughput | Latency | Use Case |
|---|---|---|---|
| ADLS Gen2 | High | Low | Data lake, analytics |
| Blob Storage | Very High | Very Low | Object storage, media |
| Cosmos DB | Very High | Single-digit ms | NoSQL, real-time |
| Azure Files | Moderate | Low | Shared file systems |
| Azure NetApp Files | Ultra High | Sub-ms | HPC, SAP HANA |
Azure Data Engineering Architecture Pattern
┌─────────────────────────────────────────────────────────────────────┐
│                TYPICAL DATA ENGINEERING ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  DATA SOURCES         INGESTION           PROCESSING                │
│  ┌──────────┐         ┌──────────┐        ┌──────────┐              │
│  │ On-Prem  │────┬───>│   ADF    │───────>│ Synapse  │              │
│  │ Database │    │    │          │        │Serverless│              │
│  └──────────┘    │    └──────────┘        └────┬─────┘              │
│                  │                             │                    │
│  ┌──────────┐    │    ┌──────────┐        ┌────▼─────┐              │
│  │ REST API │────┼───>│  Event   │───────>│ Synapse  │              │
│  │          │    │    │  Hubs    │        │Dedicated │              │
│  └──────────┘    │    └──────────┘        └────┬─────┘              │
│                  │                             │                    │
│  ┌──────────┐    │    ┌──────────┐        ┌────▼─────┐              │
│  │   IoT    │────┴───>│  Stream  │───────>│  Cosmos  │              │
│  │ Devices  │         │Analytics │        │    DB    │              │
│  └──────────┘         └──────────┘        └──────────┘              │
│                                                                     │
│  STORAGE              GOVERNANCE          SERVING                   │
│  ┌──────────┐         ┌──────────┐        ┌──────────┐              │
│  │   ADLS   │<───────>│ Purview  │───────>│ Power BI │              │
│  │   Gen2   │         │          │        │          │              │
│  └────┬─────┘         └──────────┘        └────┬─────┘              │
│       │                                        │                    │
│       │       ┌──────────┐   ┌──────────┐      │                    │
│       └──────>│Key Vault │<──│ Azure AD │<─────┘                    │
│               └──────────┘   └──────────┘                           │
└─────────────────────────────────────────────────────────────────────┘
Azure CLI for Data Engineering Setup
#!/bin/bash
set -euo pipefail  # abort on the first failed command or unset variable

# Create Resource Group
az group create \
  --name "rg-dataengineering-prod" \
  --location "eastus2" \
  --tags Environment=Production Project=DataEngineering

# Create Storage Account with HNS (ADLS Gen2)
az storage account create \
  --name "stdatalakeprodeastus2" \
  --resource-group "rg-dataengineering-prod" \
  --location "eastus2" \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false \
  --https-only true

# Create containers for data lake zones.
# --auth-mode login authenticates with your signed-in identity instead of an
# account key; the identity needs a data-plane role on the storage account.
for zone in raw curated sandbox; do
  az storage container create \
    --name "$zone" \
    --account-name "stdatalakeprodeastus2" \
    --auth-mode login
done

# Create Synapse Workspace.
# The admin password is read from the environment rather than hard-coded;
# SYNAPSE_SQL_ADMIN_PASSWORD is an illustrative variable name.
az synapse workspace create \
  --name "syn-prod-workspace" \
  --resource-group "rg-dataengineering-prod" \
  --location "eastus2" \
  --storage-account "stdatalakeprodeastus2" \
  --file-system "synapsefs" \
  --sql-admin-login-user "sqladmin" \
  --sql-admin-login-password "${SYNAPSE_SQL_ADMIN_PASSWORD:?Set SYNAPSE_SQL_ADMIN_PASSWORD first}"

# Create Synapse SQL Pool (Dedicated)
az synapse sql pool create \
  --name "SQLPool01" \
  --workspace-name "syn-prod-workspace" \
  --resource-group "rg-dataengineering-prod" \
  --performance-level DW100c
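Once the containers exist, Spark and Synapse jobs typically address them through `abfss://` URIs. The helper below is a small sketch of how those URIs are assembled for the `raw`, `curated`, and `sandbox` zones created by the script; the function name `abfss_uri` and the sample folder path are assumptions.

```python
def abfss_uri(container, account, path=""):
    """Build an ADLS Gen2 URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


ACCOUNT = "stdatalakeprodeastus2"  # storage account created by the script

for zone in ("raw", "curated", "sandbox"):
    # "sales/2024/" is a hypothetical folder used only for illustration.
    print(abfss_uri(zone, ACCOUNT, "sales/2024/"))
```

Centralizing URI construction in one helper keeps zone and account names consistent across notebooks and pipelines.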
Important: Always enable HTTPS-only access and TLS 1.2 minimum for all storage accounts. Disable public blob access to prevent data leaks. Use Managed Identities instead of connection strings.
SLA and Performance Guarantees
| Service | Availability SLA | Typical RPO | Typical RTO |
|---|---|---|---|
| ADLS Gen2 (RA-GRS) | 99.99% | <15 min | <30 min |
| Synapse Dedicated Pool | 99.9% | Point-in-time restore | Hours |
| Azure Functions | 99.95% | N/A | Seconds |
| Event Hubs | 99.95% | 0 (with capture) | Minutes |
| Cosmos DB (Multi-region) | 99.999% | 0 | 0 |
| Databricks | 99.9% | N/A | Minutes |
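When a pipeline depends on several services in series, its overall availability is lower than that of any single service, because every component must be up at the same time. The sketch below multiplies three SLA figures from the table to show the effect; treating the services as independent and a month as 30 days are simplifying assumptions.

```python
import math

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def composite_availability(slas):
    """Availability of a serial chain: the product of the component availabilities."""
    return math.prod(slas)


def allowed_downtime_minutes(availability, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Downtime budget implied by an availability figure over the given period."""
    return (1 - availability) * period_minutes


# SLA values taken from the table above.
pipeline = {
    "Event Hubs": 0.9995,
    "ADLS Gen2 (RA-GRS)": 0.9999,
    "Synapse Dedicated Pool": 0.999,
}

combined = composite_availability(pipeline.values())
print(f"Composite availability: {combined:.4%}")
print(f"Downtime budget: {allowed_downtime_minutes(combined):.1f} minutes per 30 days")
```

The chain works out to about 99.84%, or roughly 69 minutes of permitted downtime per 30 days, even though every individual service is at 99.9% or better.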
Pricing Tiers Overview
┌──────────────────────────────────────────────────────────────────┐
│                       AZURE PRICING MODELS                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  PAY-AS-YOU-GO        RESERVED             SPOT/DEV TEST         │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐  │
│  │ Best for:        │ │ Best for:        │ │ Best for:        │  │
│  │ Development      │ │ Production       │ │ Non-prod         │  │
│  │ Testing          │ │ Stable work      │ │ Dev/test         │  │
│  │ Variable         │ │ Predictable      │ │ Batch jobs       │  │
│  │                  │ │                  │ │                  │  │
│  │ Savings: 0%      │ │ Savings: 30-72%  │ │ Savings: 60-90%  │  │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘  │
│                                                                  │
│  HYBRID BENEFIT       COST MANAGEMENT                            │
│  ┌──────────────────┐ ┌──────────────────┐                       │
│  │ Use existing     │ │ Budgets          │                       │
│  │ Windows/SQL      │ │ Alerts           │                       │
│  │ licenses         │ │ Advisor          │                       │
│  │                  │ │ Cost Analysis    │                       │
│  │ Savings: 40%     │ │                  │                       │
│  └──────────────────┘ └──────────────────┘                       │
└──────────────────────────────────────────────────────────────────┘
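To see what these savings ranges mean in money terms, the sketch below applies them to a pay-as-you-go baseline. The 1,000 USD monthly baseline is a made-up figure and `cost_range` is a hypothetical helper; only the percentage ranges come from the diagram.

```python
def cost_range(payg_monthly_cost, min_saving, max_saving):
    """Return (best_case, worst_case) monthly cost after applying a savings range."""
    best = round(payg_monthly_cost * (1 - max_saving), 2)
    worst = round(payg_monthly_cost * (1 - min_saving), 2)
    return best, worst


# Savings ranges from the diagram, expressed as fractions.
SAVINGS = {
    "Reserved": (0.30, 0.72),
    "Spot/Dev Test": (0.60, 0.90),
}

baseline = 1000.0  # hypothetical pay-as-you-go spend in USD per month

for model, (low, high) in SAVINGS.items():
    best, worst = cost_range(baseline, low, high)
    print(f"{model}: between ${best:.2f} and ${worst:.2f} per month")
```

The wide ranges are a reminder that the actual discount depends on commitment term, service, and region, so the upper bound should not be used for budgeting.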
Best Practices Summary
- Always use Managed Identities instead of storage keys or connection strings
- Enable soft delete on storage accounts for accidental deletion protection
- Use Private Endpoints to keep traffic off the public internet
- Tag all resources consistently for cost management and governance
- Use Availability Zones for production workloads requiring high availability
- Implement Azure Policy to enforce security standards across subscriptions
- Monitor costs using Azure Cost Management and set up budget alerts
- Use ARM/Bicep templates for infrastructure as code (IaC) to ensure consistency
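The tagging practice in the list above is easy to audit once resource metadata is exported. The sketch below reports resources that lack the `Environment` and `Project` tags used in this guide's examples; the inventory entries and the `missing_tags` helper are hypothetical.

```python
REQUIRED_TAGS = frozenset({"Environment", "Project"})


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map each non-compliant resource name to the sorted list of tags it lacks."""
    report = {}
    for res in resources:
        missing = set(required) - set(res.get("tags", {}))
        if missing:
            report[res["name"]] = sorted(missing)
    return report


# Hypothetical export of resource metadata.
inventory = [
    {"name": "stdatalakeprodeastus2",
     "tags": {"Environment": "Production", "Project": "DataEngineering"}},
    {"name": "syn-prod-workspace", "tags": {"Environment": "Production"}},
    {"name": "kv-secrets-prod"},
]

print(missing_tags(inventory))
```

An audit like this complements Azure Policy: the policy blocks new untagged resources, while the report finds ones that already exist.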
Interview Questions
Q1: Explain the difference between Azure Regions and Availability Zones.
A: A region is a geographic area containing multiple datacenters. Availability Zones are physically separate locations within a region, each made up of one or more datacenters with independent power, cooling, and networking. For data engineering, use Availability Zones to keep critical services such as Synapse and Databricks clusters highly available.
Q2: Why should you deploy compute and storage in the same Azure region?
A: Keeping both in the same region avoids inter-region bandwidth charges, which add up at scale, and minimizes network latency. For example, an ADF Integration Runtime in East US reading from ADLS in East US incurs no per-GB inter-region transfer fee; the rate for cross-region traffic depends on the regions involved.
Q3: What is the benefit of using Azure Resource Groups for data engineering projects?
A: Resource Groups provide logical organization, simplified access control (RBAC at the resource group level), cost tracking per project, and easy cleanup of resources when a project is complete.