Microsoft Purview: Data Map, Lineage & Classification
Unified data governance with automated discovery, classification, lineage, and data catalog capabilities
Purview Architecture
┌─────────────────────────────────────────────────────────────┐
│               MICROSOFT PURVIEW ARCHITECTURE                │
├─────────────────────────────────────────────────────────────┤
│ PURVIEW ACCOUNT                                             │
│                                                             │
│ DATA MAP                                                    │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │  Sources           Collections       Assets             │ │
│ │  ┌──────────┐      ┌──────────┐      ┌──────────┐       │ │
│ │  │ ADLS     │ ───> │ Sales    │ ───> │ Tables   │       │ │
│ │  │ Gen2     │      │ Data     │      │ Files    │       │ │
│ │  └──────────┘      └──────────┘      └──────────┘       │ │
│ │  ┌──────────┐      ┌──────────┐      ┌──────────┐       │ │
│ │  │ Synapse  │ ───> │ Finance  │ ───> │ Views    │       │ │
│ │  │ SQL      │      │ Data     │      │ Procs    │       │ │
│ │  └──────────┘      └──────────┘      └──────────┘       │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ DATA CLASSIFICATION                                         │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │  Built-in Classifiers        Custom Classifiers         │ │
│ │  • Email Address             • Employee ID              │ │
│ │  • Phone Number              • Customer Code            │ │
│ │  • Credit Card               • Internal Reference       │ │
│ │  • Social Security #         • Product SKU              │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ LINEAGE                                                     │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │  ADF Pipeline ──> Table ──> View ──> Power BI           │ │
│ │  Databricks Job ──> Table ──> Dashboard                 │ │
│ │  Synapse SQL ──> View ──> Report                        │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ DATA GOVERNANCE                                             │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │  Business Glossary   Access Policies      Policies      │ │
│ │  Data Domains        Sensitivity Labels   Rules         │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Data Classification
# Purview Data Map SDK for classification
# (pip install azure-purview-datamap azure-identity)
# Method names follow azure-purview-datamap 1.0.0b1; verify against your installed version.
from azure.identity import DefaultAzureCredential
from azure.purview.datamap import DataMapClient

credential = DefaultAzureCredential()
client = DataMapClient(
    endpoint="https://purview-prod.purview.azure.com",
    credential=credential,
)

# List the classification types defined in the account (built-in and custom)
type_defs = client.type_definition.get()
print("Available classifications:")
for cls_def in type_defs.classification_defs or []:
    print(f"  - {cls_def.name}: {cls_def.description}")

# Apply classifications to an entity, identified by its GUID
# (for example, an asset of type azure_datalake_gen2_path)
entity_guid = "entity-guid-here"
client.entity.add_classifications(
    entity_guid,
    [
        # Placeholder names: each typeName must match a classification
        # type that exists in the account
        {"typeName": "Microsoft.DataFactory.Sensitivity"},
        {"typeName": "PII.Email"},
    ],
)

# Get the classifications currently attached to the entity
classifications = client.entity.get_classifications(entity_guid)
Lineage Tracking
┌─────────────────────────────────────────────────────────────────┐
│                        DATA LINEAGE FLOW                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SOURCE             PROCESSING             DESTINATION          │
│  ┌──────────┐       ┌──────────────┐       ┌──────────────┐     │
│  │ ADLS     │       │ ADF Pipeline │       │ Synapse      │     │
│  │ Gen2     │ ────> │              │ ────> │ SQL Pool     │     │
│  │          │       │ • Copy       │       │              │     │
│  │ raw/     │       │ • Transform  │       │ curated/     │     │
│  │ sales/   │       │ • Validate   │       │ fact_sales   │     │
│  └──────────┘       └──────────────┘       └──────┬───────┘     │
│                                                   │             │
│                                                   ▼             │
│                                            ┌──────────────┐     │
│                                            │ Power BI     │     │
│                                            │ Dashboard    │     │
│                                            │              │     │
│                                            │ Sales Report │     │
│                                            └──────────────┘     │
│                                                                 │
│  LINEAGE METADATA:                                              │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Pipeline: pl_sales_etl                                  │    │
│  │ │                                                       │    │
│  │ ├─ Source: /raw/sales/day=2024-01-15/*.parquet          │    │
│  │ │    Type: Azure Data Lake Gen2 Path                    │    │
│  │ │    Format: Parquet                                    │    │
│  │ │                                                       │    │
│  │ ├─ Transform: spark_sales_transformation                │    │
│  │ │    Type: Azure Databricks Notebook                    │    │
│  │ │    Output: Delta Lake format                          │    │
│  │ │                                                       │    │
│  │ └─ Sink: /curated/fact_sales/                           │    │
│  │      Type: Azure Data Lake Gen2 Path                    │    │
│  │      Format: Delta Parquet                              │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
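The flow above can be treated as a directed graph, which is what makes impact analysis possible: given one asset, find everything downstream that a change could affect. The following is a minimal sketch in plain Python, not a call into Purview. The node labels are transcribed from the diagram, and the `LINEAGE` mapping and `downstream` helper are hypothetical names chosen for illustration.

```python
from collections import deque

# Edges transcribed from the lineage diagram: upstream node -> downstream nodes.
# The "system:asset" labels are illustrative, not Purview qualified names.
LINEAGE = {
    "adls:/raw/sales/": ["adf:pl_sales_etl"],
    "adf:pl_sales_etl": ["synapse:curated.fact_sales"],
    "synapse:curated.fact_sales": ["powerbi:Sales Report"],
}


def downstream(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # Everything that could be affected by a change to the raw sales files.
    for asset in downstream(LINEAGE, "adls:/raw/sales/"):
        print(asset)
```

Reversing the edges and running the same traversal would give the upstream view, which is the usual starting point for root-cause analysis of a broken report.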
Purview Scanning Configuration
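# Scans are managed through the scanning SDK, which has its own client and endpoint
# (pip install azure-purview-scanning azure-identity).
# Method and payload names follow azure-purview-scanning 1.0.0b2; verify against your version.
import uuid

from azure.identity import DefaultAzureCredential
from azure.purview.scanning import PurviewScanningClient

scan_client = PurviewScanningClient(
    endpoint="https://purview-prod.scan.purview.azure.com",
    credential=DefaultAzureCredential(),
)

data_source_name = "adls-sales"  # placeholder name for the registered source
collection = {"referenceName": "SalesData", "type": "CollectionReference"}

# Register the ADLS Gen2 account as a data source in the SalesData collection
scan_client.data_sources.create_or_update(
    data_source_name,
    body={
        "kind": "AdlsGen2",
        "properties": {
            "endpoint": "https://stdatalake001.dfs.core.windows.net/",
            "collection": collection,
        },
    },
)

# Create scan for ADLS Gen2 using the system scan rule set and managed identity
scan_client.scans.create_or_update(
    data_source_name,
    "adls-sales-scan",
    body={
        "kind": "AdlsGen2Msi",
        "properties": {
            "scanRulesetName": "AdlsGen2",
            "scanRulesetType": "System",
            "collection": collection,
        },
    },
)

# A recurring schedule (for example, daily at 02:00) is attached separately
# as a scan trigger; its recurrence payload is not shown here.

# Run scan (each run needs a unique run ID)
scan_client.scan_result.run_scan(data_source_name, "adls-sales-scan", str(uuid.uuid4()))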
Pro Tip: Use Purview's automatic classification to discover sensitive data. Configure custom classifiers for domain-specific data patterns (e.g., internal customer codes).
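To make the custom-classifier idea concrete, here is a minimal sketch in plain Python of how a regex-based rule behaves: a column is treated as a match when the share of its non-empty values that fit the pattern reaches a minimum threshold. It only simulates the matching logic and does not call Purview. The `CUST-` code format, the 60% threshold, and the function name `matches_classifier` are assumptions for illustration.

```python
import re

# Hypothetical internal customer code format, e.g. "CUST-004217".
CUSTOMER_CODE = re.compile(r"CUST-\d{6}")


def matches_classifier(values, pattern, min_match_ratio=0.6):
    """Return True when enough non-empty values fully match the pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return False
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty) >= min_match_ratio


if __name__ == "__main__":
    # Three of the four non-empty sample values match the pattern.
    column = ["CUST-004217", "CUST-118830", "n/a", "CUST-000001", ""]
    print(matches_classifier(column, CUSTOMER_CODE))
```

A threshold below 100% tolerates a few dirty values in a column, while a threshold that is too low increases false positives on columns that only occasionally contain a code-like string.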
Interview Questions
Q1: How does Purview help with data governance? A: Purview provides automated data discovery, classification of sensitive data, end-to-end lineage tracking, a business glossary for standardization, and access policies for data protection, all in a unified portal.
Q2: Explain the difference between technical metadata and the business glossary in Purview. A: Technical metadata (schemas, tables, columns, data types) is captured automatically by scans and stored in the Data Map. The business glossary defines domain-specific terms and the relationships between them. Assigning glossary terms to technical assets bridges the gap between technical and business stakeholders.
Q3: How do you implement Purview scanning for a large data lake? A: Split each data source into multiple scoped scans with their own schedules and filters, use collection hierarchies to organize assets, configure scan rule sets for the specific file types you need classified, and use managed private endpoints for secure scanning.
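As a rough illustration of the scan-splitting strategy in Q3, the sketch below builds one scoped scan definition per top-level folder, each with its own collection and schedule. It is plain Python that only produces a plan. The folder names, the `FinanceData` collection, the schedule labels, and the `ScanPlan` fields are hypothetical, and they would still need to be mapped onto the scan, filter, and trigger payloads that your Purview SDK or REST API version expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanPlan:
    scan_name: str
    collection: str
    include_prefix: str  # URI prefix the scan is scoped to
    schedule: str


def plan_scans(account_url, folders):
    """Build one scoped scan per folder.

    `folders` maps a folder path to a (collection, schedule) pair.
    """
    base = account_url.rstrip("/")
    plans = []
    for folder, (collection, schedule) in sorted(folders.items()):
        path = folder.strip("/")
        plans.append(
            ScanPlan(
                scan_name=f"adls-{path.replace('/', '-')}-scan",
                collection=collection,
                include_prefix=f"{base}/{path}/",
                schedule=schedule,
            )
        )
    return plans


if __name__ == "__main__":
    plans = plan_scans(
        "https://stdatalake001.dfs.core.windows.net",
        {
            "raw/sales": ("SalesData", "daily"),
            "curated/finance": ("FinanceData", "weekly"),
        },
    )
    for plan in plans:
        print(plan.scan_name, plan.collection, plan.include_prefix, plan.schedule)
```

Keeping the plan as data makes it easy to review scan scopes and schedules before any scan is created, and to spot overlapping prefixes that would cause the same files to be scanned twice.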