Azure Data Factory: Pipelines, Datasets & Triggers
Enterprise ETL/ELT orchestration with Azure Data Factory pipelines, activities, and monitoring
ADF Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│                   AZURE DATA FACTORY ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                          ADF FACTORY                          │  │
│  │                                                               │  │
│  │ LINKED SERVICES      DATASETS             PIPELINES           │  │
│  │ ┌──────────────┐     ┌──────────────┐     ┌────────────┐      │  │
│  │ │ ADLS Gen2    │────>│ Parquet DS   │────>│ Pipeline   │      │  │
│  │ │ SQL Server   │────>│ CSV DS       │     │            │      │  │
│  │ │ Cosmos DB    │────>│ JSON DS      │     │ Activities │      │  │
│  │ │ Event Hubs   │────>│ Avro DS      │     │            │      │  │
│  │ │ REST API     │────>│ Binary DS    │     │ Triggers   │      │  │
│  │ └──────────────┘     └──────────────┘     └────────────┘      │  │
│  │                                                               │  │
│  │ INTEGRATION RUNTIMES   MONITORING          GIT INTEGRATION    │  │
│  │ ┌──────────────────┐   ┌───────────────┐   ┌──────────────┐   │  │
│  │ │ Auto Resolve IR  │   │ Pipeline Runs │   │ Azure DevOps │   │  │
│  │ │ Self-Hosted IR   │   │ Activity Runs │   │ GitHub       │   │  │
│  │ │ Managed VNet IR  │   │ Trigger Runs  │   │              │   │  │
│  │ │ Azure-SSIS IR    │   │ Alerts        │   │              │   │  │
│  │ └──────────────────┘   └───────────────┘   └──────────────┘   │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  DATA FLOW (Visual ETL):                                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Source ──> Filter ──> Derive ──> Join ──> Aggregate ──> Sink  │  │
│  │ (ADLS)     (Row       (Add       (Lookup) (Group By)    (ADLS)│  │
│  │            Filter)    Columns)                                │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
Pipeline JSON Example
{
  "name": "pl_daily_sales_etl",
  "properties": {
    "activities": [
      {
        "name": "CopySalesData",
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "JsonSource",
            "storeSettings": {
              "type": "AzureBlobFSReadSettings",
              "recursive": true
            },
            "formatSettings": {
              "type": "JsonReadSettings"
            }
          },
          "sink": {
            "type": "ParquetSink",
            "storeSettings": {
              "type": "AzureBlobFSWriteSettings",
              "copyBehavior": "PreserveHierarchy"
            },
            "formatSettings": {
              "type": "ParquetWriteSettings"
            }
          }
        },
        "inputs": [
          { "referenceName": "ds_raw_sales", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "ds_staging_sales", "type": "DatasetReference" }
        ]
      },
      {
        "name": "TransformAndLoad",
        "type": "DatabricksNotebook",
        "linkedServiceName": {
          "referenceName": "<databricks-linked-service>",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "notebookPath": "/Repos/data_engineering/sales_transformation"
        },
        "dependsOn": [
          {
            "activity": "CopySalesData",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "policy": {
          "timeout": "0.01:00:00",
          "retry": 1,
          "retryIntervalInSeconds": 30
        }
      },
      {
        "name": "LoadToSynapse",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
          "referenceName": "<synapse-linked-service>",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "storedProcedureName": "sp_load_fact_sales"
        },
        "dependsOn": [
          {
            "activity": "TransformAndLoad",
            "dependencyConditions": ["Succeeded"]
          }
        ]
      }
    ],
    "parameters": {
      "date": {
        "type": "String",
        "defaultValue": "@formatDateTime(utcNow(), 'yyyy-MM-dd')"
      },
      "source": {
        "type": "String",
        "defaultValue": "sales_api"
      }
    },
    "variables": {
      "retryCount": {
        "type": "String",
        "defaultValue": "0"
      }
    }
  }
}
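The `dependsOn` entries above define a small dependency graph: an activity starts only after every activity it depends on has finished with the listed condition. The following Python sketch shows one way to derive a valid run order from such a definition using Kahn's topological sort. It is an illustration only, not how the ADF service schedules work internally; the `execution_order` helper is hypothetical, and it ignores `dependencyConditions` other than presence.

```python
from collections import deque

def execution_order(activities):
    """Return activity names in an order that respects dependsOn (Kahn's algorithm)."""
    names = [a["name"] for a in activities]
    # Map each activity to the set of upstream activities it still waits on.
    waiting_on = {
        a["name"]: {d["activity"] for d in a.get("dependsOn", [])}
        for a in activities
    }
    unknown = {u for deps in waiting_on.values() for u in deps} - set(names)
    if unknown:
        raise ValueError(f"dependsOn references unknown activities: {sorted(unknown)}")

    ready = deque(n for n in names if not waiting_on[n])
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        # Release every activity whose last outstanding dependency just completed.
        for name in names:
            if current in waiting_on[name]:
                waiting_on[name].discard(current)
                if not waiting_on[name]:
                    ready.append(name)
    if len(order) != len(names):
        raise ValueError("dependency cycle detected")
    return order

# Activities listed out of order on purpose; only names and dependsOn matter here.
pipeline_activities = [
    {"name": "LoadToSynapse", "dependsOn": [{"activity": "TransformAndLoad"}]},
    {"name": "CopySalesData"},
    {"name": "TransformAndLoad", "dependsOn": [{"activity": "CopySalesData"}]},
]
print(execution_order(pipeline_activities))
# ['CopySalesData', 'TransformAndLoad', 'LoadToSynapse']
```

A check like this can run in a CI step against exported pipeline JSON to catch a misspelled activity name or an accidental cycle before deployment.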
Linked Service Configuration
{
  "name": "ls_adls_gen2",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://stdatalake001.dfs.core.windows.net",
      "accountKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "akv_dataengineering",
          "type": "LinkedServiceReference"
        },
        "secretName": "adls-storage-key"
      }
    },
    "connectVia": {
      "referenceName": "AutoResolveIntegrationRuntime",
      "type": "IntegrationRuntimeReference"
    }
  }
}
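The linked service above keeps the storage key out of the factory definition by pointing `accountKey` at a Key Vault secret. A small lint step can enforce that pattern across a repository of linked service files. The sketch below is one possible approach: the `inline_secrets` helper and the `SECRET_FIELDS` list are assumptions made for illustration, not part of any ADF SDK, and a real check would need a field list matched to the connectors in use.

```python
# Assumed list of typeProperties fields that normally carry credentials.
SECRET_FIELDS = {"accountKey", "password", "connectionString", "servicePrincipalKey", "sasUri"}

def inline_secrets(linked_service):
    """Return the secret-like fields that are NOT Azure Key Vault references."""
    props = linked_service["properties"].get("typeProperties", {})
    findings = []
    for field, value in props.items():
        if field not in SECRET_FIELDS:
            continue
        is_key_vault_ref = isinstance(value, dict) and value.get("type") == "AzureKeyVaultSecret"
        if not is_key_vault_ref:
            findings.append(field)
    return sorted(findings)

good = {
    "name": "ls_adls_gen2",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://stdatalake001.dfs.core.windows.net",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": {"referenceName": "akv_dataengineering", "type": "LinkedServiceReference"},
                "secretName": "adls-storage-key",
            },
        },
    },
}
bad = {
    "name": "ls_inline_key",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://stdatalake001.dfs.core.windows.net",
            "accountKey": {"type": "SecureString", "value": "<key>"},
        },
    },
}
print(inline_secrets(good))  # []
print(inline_secrets(bad))   # ['accountKey']
```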
Trigger Types
┌─────────────────────────────────────────────────────────────────┐
│                        ADF TRIGGER TYPES                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCHEDULE TRIGGER                                               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Frequency: Day, Interval: 1, Hours: [2], Minutes: [0]     │  │
│  │ (daily at 2 AM; a recurrence object, not a cron string)   │  │
│  │ Other recurrence example: every 1 hour                    │  │
│  │ Time Zone: UTC or a named local time zone                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  TUMBLING WINDOW TRIGGER                                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Frequency: Hour, Interval: 24 (1-day window)              │  │
│  │ Start Time (anchor): 2024-01-01                           │  │
│  │ Max Concurrency: 10 (windows processed in parallel)       │  │
│  │ Retry: 3 attempts, 5 min interval                         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  STORAGE EVENT TRIGGER (BLOB CREATED)                           │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Event: Blob Created                                       │  │
│  │ Container: raw                                            │  │
│  │ Blob Path Begins With: sales/                             │  │
│  │ Event Type: Microsoft.Storage.BlobCreated                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  STORAGE EVENT TRIGGER (BLOB DELETED)                           │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Event: Blob Deleted                                       │  │
│  │ Event Type: Microsoft.Storage.BlobDeleted                 │  │
│  │ Subject Begins With: /blobServices/default/containers/    │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
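A tumbling window trigger fires once per fixed-size, non-overlapping, contiguous time slice counted from its start time, and each run receives the start and end of its own window. The Python sketch below enumerates those slices for a one-day window anchored at 2024-01-01. It illustrates the windowing arithmetic only; the `tumbling_windows` helper is hypothetical, and concurrency, retries, and dependencies are left out.

```python
from datetime import datetime, timedelta, timezone

def tumbling_windows(start, window, until):
    """Yield (window_start, window_end) for every complete window that ends by `until`."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    current = start
    while current + window <= until:
        yield current, current + window
        current += window

anchor = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 4, 12, 0, tzinfo=timezone.utc)  # assumed "current time" for the example

windows = list(tumbling_windows(anchor, timedelta(days=1), now))
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

With these inputs the loop yields three windows (1 to 2, 2 to 3, and 3 to 4 January); the half-finished window that starts on 4 January is not emitted until it closes. This is the property that makes tumbling windows suitable for backfills: every historical slice is processed exactly once.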
Event Trigger JSON
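{
  "name": "tr_blob_arrival",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/blobs/sales/",
      "blobPathEndsWith": ".json",
      "ignoreEmptyBlobs": true,
      "events": ["Microsoft.Storage.BlobCreated"],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "pl_daily_sales_etl",
          "type": "PipelineReference"
        },
        "parameters": {
          "date": "@triggerBody().fileName",
          "source": "blob_trigger"
        }
      }
    ]
  }
}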
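⚠️ Important: Storage event triggers are delivered through Azure Event Grid. The Microsoft.EventGrid resource provider must be registered in the subscription, and the storage account must be General-purpose v2 (including ADLS Gen2) or Blob Storage. A Managed Virtual Network is not a prerequisite for the trigger itself; treat it as a separate network-isolation decision for production.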
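To reason about which blobs would start the pipeline, the path filters in the trigger definition can be read as a prefix test, a suffix test, and an optional empty-blob check. The Python sketch below mimics that logic locally. It is a simplified model, not the Event Grid implementation: the `should_fire` helper is hypothetical, and it assumes paths are written in the `/<container>/blobs/<folder path>` form.

```python
def should_fire(blob_path, content_length, begins_with="", ends_with="", ignore_empty=True):
    """Return True if a created blob passes the trigger's path and size filters."""
    if ignore_empty and content_length == 0:
        return False  # zero-byte blobs are skipped when ignoreEmptyBlobs is true
    return blob_path.startswith(begins_with) and blob_path.endswith(ends_with)

filters = {"begins_with": "/raw/blobs/sales/", "ends_with": ".json", "ignore_empty": True}

# (path, size in bytes) pairs made up for the example
events = [
    ("/raw/blobs/sales/2024-01-15.json", 2048),
    ("/raw/blobs/sales/2024-01-15.csv", 2048),
    ("/raw/blobs/inventory/2024-01-15.json", 2048),
    ("/raw/blobs/sales/empty.json", 0),
]
for path, size in events:
    print(path, should_fire(path, size, **filters))
```

Only the first event passes all three checks. Running candidate paths through a helper like this before publishing a trigger is a cheap way to confirm that the prefix and suffix are neither too broad nor too narrow.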
Self-Hosted Integration Runtime
{
  "name": "ir-selfhosted-onprem",
  "properties": {
    "type": "SelfHosted",
    "typeProperties": {
      "linkedInfo": {
        "authorizationType": "Key",
        "key": {
          "type": "SecureString",
          "value": "<EncryptedKey>"
        }
      }
    }
  }
}
IR Node Configuration
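# Install the Self-Hosted IR on Windows (MSI from the Microsoft Download Center; adjust the file name)
msiexec /i "IntegrationRuntime.msi" /quiet
# Register the IR node with an authentication key copied from the ADF portal
& "C:\Program Files\Microsoft Integration Runtime\<version>\Shared\dmgcmd.exe" -RegisterNewNode "<key>" "IR-Node-01"
# Check IR status from Azure PowerShell (requires the Az.DataFactory module)
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resource-group>" -DataFactoryName "adf-prod" -Name "ir-selfhosted-onprem" -Status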
Interview Questions
Q1: Explain the difference between Copy Activity and Data Flow in ADF.
A: Copy Activity moves data between stores as-is or with light changes such as column mapping and format conversion. Mapping Data Flows provide visual transformations (filter, derived column, join, aggregate) that run on Spark clusters managed by ADF. Copy Activity is usually faster and cheaper for simple movement; Data Flows are the better fit for complex transformations.
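Q2: How do you handle schema changes in ADF pipelines?
A: In Mapping Data Flows, enable schema drift on the source and sink so that added, removed, or changed columns flow through without redesigning the flow. In Copy Activity, either leave the mapping empty so columns are matched by name, or pass an explicit column mapping in through a pipeline parameter so it can change without editing the pipeline. Where the schema must stay fixed, turn on schema validation so that unexpected changes fail the run instead of loading bad data.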
Q3: What are the best practices for ADF pipeline monitoring?
A: 1) Set up Azure Monitor alerts for failed pipeline, activity, and trigger runs. 2) Use diagnostic settings to send logs to Log Analytics. 3) Implement custom logging that records pipeline parameters and run IDs for each execution. 4) Use Power BI dashboards for pipeline metrics. 5) Configure activity retry policies so that transient failures recover without manual intervention.
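The retry policy mentioned in Q3 corresponds to the `retry` and `retryIntervalInSeconds` settings on an activity: the activity runs once, and on failure it is retried up to `retry` more times with a fixed pause in between. The Python sketch below models that behaviour for an arbitrary callable. It is an analogy for reasoning about attempt counts, not ADF code; `run_with_retry` and `flaky_copy` are hypothetical names, and the pause function is injectable so the example runs instantly.

```python
import time

def run_with_retry(action, retry=1, retry_interval_seconds=30, sleep=time.sleep):
    """Call action(); on failure retry up to `retry` more times with a fixed pause."""
    attempts = retry + 1  # one initial attempt plus the configured retries
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: surface the last failure
            sleep(retry_interval_seconds)

calls = {"count": 0}

def flaky_copy():
    """Fails on the first call and succeeds afterwards, like a transient outage."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("transient failure")
    return "ok"

pauses = []  # record the pauses instead of actually sleeping
result = run_with_retry(flaky_copy, retry=1, retry_interval_seconds=30, sleep=pauses.append)
print(result, calls["count"], pauses)  # ok 2 [30]
```

Fixed-interval retries handle short transient faults well. For failures that persist beyond the retry budget, the alerting and logging practices listed above remain the fallback.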