Data Catalog: Purview, Metadata Scanning & Glossary
Enterprise data cataloging with Purview for metadata discovery, classification, and business glossary management
Data Catalog Architecture
┌───────────────────────────────────────────────────────────────┐
│                   DATA CATALOG ARCHITECTURE                   │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  DATA SOURCES        SCANNING             CATALOG             │
│  ┌──────────┐        ┌──────────────┐     ┌──────────────┐    │
│  │ ADLS     │───────>│ Purview      │────>│ Data Map     │    │
│  │ Gen2     │        │ Scanner      │     │              │    │
│  └──────────┘        └──────────────┘     │ • Assets     │    │
│                                           │ • Lineage    │    │
│  ┌──────────┐        ┌──────────────┐     │ • Classif.   │    │
│  │ Synapse  │───────>│ Purview      │────>│              │    │
│  │ SQL      │        │ Scanner      │     │ Collections  │    │
│  └──────────┘        └──────────────┘     │              │    │
│                                           │ • Domain A   │    │
│  ┌──────────┐        ┌──────────────┐     │ • Domain B   │    │
│  │ Power BI │───────>│ Purview      │────>│ • Domain C   │    │
│  │          │        │ Integration  │     └──────────────┘    │
│  └──────────┘        └──────────────┘                         │
│                                                               │
│  DISCOVERY & GOVERNANCE:                                      │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                                                         │  │
│  │  BUSINESS GLOSSARY           DATA CLASSIFICATION        │  │
│  │  ┌──────────────────┐        ┌──────────────────┐       │  │
│  │  │ Sales Revenue    │        │ PII.Email        │       │  │
│  │  │ Customer Segment │        │ PII.Phone        │       │  │
│  │  │ Product Category │        │ Financial.Card   │       │  │
│  │  │ Order Status     │        │ Custom.Code      │       │  │
│  │  └──────────────────┘        └──────────────────┘       │  │
│  │                                                         │  │
│  │  SEARCH & DISCOVERY          ACCESS POLICIES            │  │
│  │  ┌──────────────────┐        ┌──────────────────┐       │  │
│  │  │ Full-text search │        │ RBAC per asset   │       │  │
│  │  │ Tag-based        │        │ Sensitivity      │       │  │
│  │  │ Column-level     │        │ labels           │       │  │
│  │  │ Impact analysis  │        │ Compliance       │       │  │
│  │  └──────────────────┘        └──────────────────┘       │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
Purview Scanning Setup
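from azure.purview.datamap import PurviewDataMapClient
from azure.identity import DefaultAzureCredential

# Authenticate with whichever credential is available (CLI login, managed identity, ...)
credential = DefaultAzureCredential()

# Azure SDK clients are addressed by account endpoint rather than bare account name
client = PurviewDataMapClient(
    endpoint="https://purview-prod.purview.azure.com",
    credential=credential
)

# NOTE: the client class and the operation names used below
# (collection.create_collection, scan.create_scan, scan.run_scan) are kept as
# written in this guide. Confirm them against the Purview SDK packages and
# versions you have installed before running.

# Create collection for data domain
collection = client.collection.create_collection(
    collection={
        "name": "SalesData",
        "parentCollectionName": "Root",
        "description": "Sales domain data assets"
    }
)

# Create scan for ADLS Gen2
scan = client.scan.create_scan(
    scan_name="adls-sales-scan",
    collection_name="SalesData",
    properties={
        "dataSource": {
            "type": "AzureDataLakeStorageGen2",
            "properties": {
                "url": "https://stdatalake001.dfs.core.windows.net",
                "tenantId": "tenant-id"
            }
        },
        # Use the built-in (system) rule set for this source type
        "scanRuleset": {
            "type": "System",
            "name": "AzureDataLakeStorageGen2"
        },
        # Run every day at 02:00
        "schedule": {
            "frequency": "Daily",
            "time": "02:00"
        }
    }
)

# Run scan
client.scan.run_scan(
    collection_name="SalesData",
    scan_name="adls-sales-scan"
)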
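Before a scan definition like the one above is submitted, it can help to validate it locally. The following is a minimal sketch, not part of any Purview SDK: `build_scan_properties` and `ALLOWED_FREQUENCIES` are hypothetical names, the set of frequencies is an assumption, and the dictionary shape simply mirrors the payload shown in this guide.

```python
import re
from typing import Any, Dict

# Assumed set of schedule frequencies, for illustration only.
ALLOWED_FREQUENCIES = {"Daily", "Weekly", "Monthly"}


def build_scan_properties(url: str, tenant_id: str, ruleset: str,
                          frequency: str = "Daily",
                          time: str = "02:00") -> Dict[str, Any]:
    """Return a scan definition shaped like the one in this guide, after basic checks."""
    if frequency not in ALLOWED_FREQUENCIES:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    # Expect a 24-hour HH:MM string such as "02:00".
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", time):
        raise ValueError(f"time must be HH:MM (24-hour), got {time!r}")
    if not url.startswith("https://"):
        raise ValueError("data source URL must use https")
    return {
        "dataSource": {
            "type": "AzureDataLakeStorageGen2",
            "properties": {"url": url, "tenantId": tenant_id},
        },
        "scanRuleset": {"type": "System", "name": ruleset},
        "schedule": {"frequency": frequency, "time": time},
    }


props = build_scan_properties(
    url="https://stdatalake001.dfs.core.windows.net",
    tenant_id="tenant-id",
    ruleset="AzureDataLakeStorageGen2",
)
print(props["schedule"])
```

Catching a malformed schedule or URL before the request is sent keeps configuration mistakes out of the catalog and makes scan definitions easy to unit test.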
Business Glossary Management
# Create glossary term
# (operation names are kept as written in this guide; confirm them against the
# SDK version you have installed)
term = client.glossary.create_glossary_term(
    glossary_name="BusinessGlossary",
    glossary_term={
        "name": "Sales Revenue",
        "description": "Total revenue from product sales",
        "abbreviation": "Rev",
        "termStatus": "Approved",
        "steward": "data-team@company.com",
        "relatedTerms": ["Net Revenue", "Gross Revenue"],
        "synonyms": ["Sales Income", "Revenue"]
    }
)

# Link term to data asset
# A term-to-asset link is a semantic assignment; a term-to-term relationship
# type would not attach the term to the asset.
client.relationship.create_relationship(
    entity1_type="AtlasGlossaryTerm",
    entity1_guid="term-guid",
    entity2_type="azure_datalake_gen2_path",
    entity2_guid="asset-guid",
    relationshipType="AtlasGlossarySemanticAssignment"
)
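The term-to-asset link can also be expressed as a plain Apache Atlas style relationship body, which is the model Purview's data map builds on. The sketch below only constructs that dictionary; `semantic_assignment` is a hypothetical helper, and it assumes the Atlas relationship type `AtlasGlossarySemanticAssignment` with the glossary term on end 1 and the asset on end 2.

```python
from typing import Any, Dict


def semantic_assignment(term_guid: str, asset_guid: str,
                        asset_type: str) -> Dict[str, Any]:
    """Build an Atlas-style relationship body assigning a glossary term to an asset."""
    if not term_guid or not asset_guid:
        raise ValueError("both GUIDs are required")
    return {
        "typeName": "AtlasGlossarySemanticAssignment",
        # end1 is the glossary term, end2 is the asset that receives the meaning
        "end1": {"typeName": "AtlasGlossaryTerm", "guid": term_guid},
        "end2": {"typeName": asset_type, "guid": asset_guid},
    }


body = semantic_assignment("term-guid", "asset-guid", "azure_datalake_gen2_path")
print(body["typeName"])
```

Building the body separately from the client call makes it easy to log, review, or batch the links before they are written to the catalog.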
Pro Tip: Use Purview's automated classification to discover sensitive data. Create custom classifiers for domain-specific patterns (e.g., internal customer codes, product SKUs).
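A custom classifier of this kind usually boils down to a regular expression plus a minimum share of sampled values that must match. The sketch below prototypes that idea locally in plain Python before a rule is defined in Purview. The code formats (`CUST-` plus six digits, three letters plus four digits), the label names, and the 60% threshold are all assumptions made up for illustration.

```python
import re
from typing import Dict, List

# Hypothetical formats, invented for illustration:
#   customer code: "CUST-" followed by 6 digits
#   product SKU:   3 uppercase letters, a dash, 4 digits
PATTERNS: Dict[str, re.Pattern] = {
    "Custom.CustomerCode": re.compile(r"CUST-\d{6}"),
    "Custom.ProductSKU": re.compile(r"[A-Z]{3}-\d{4}"),
}


def classify_column(values: List[str], min_match_ratio: float = 0.6) -> List[str]:
    """Return the labels whose pattern fully matches enough of the sampled values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= min_match_ratio:
            labels.append(label)
    return labels


sample = ["CUST-004211", "CUST-118734", "n/a", "CUST-000917"]
print(classify_column(sample))
```

Testing patterns against real column samples this way helps tune the threshold so that a few placeholder values such as "n/a" do not block a classification, while unrelated columns stay unlabeled.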
Interview Questions
Q1: How does Purview differ from traditional data catalogs? A: Purview is cloud-native, integrates with Azure services, provides automated scanning/classification, and supports hybrid environments. Traditional catalogs are often on-premises with manual metadata entry.
Q2: What is the benefit of linking glossary terms to data assets? A: Linking ties business terminology to technical assets, so non-technical users can find relevant data by its business name. It also supports impact analysis when a term's definition changes and gives governance decisions business context.
Q3: How do you implement data catalog governance? A: 1) Define ownership per collection/domain, 2) Establish scanning schedules, 3) Configure classification rules, 4) Create business glossary, 5) Set up access policies, 6) Monitor catalog usage and quality.
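The impact analysis mentioned in Q2 can be illustrated with a small in-memory model: start from the assets linked to a glossary term, then follow lineage downstream. This is a sketch under stated assumptions, not how Purview implements it. The asset names, the `LINKS` and `DOWNSTREAM` structures, and `impacted_assets` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (term, asset) links; the asset names are made up for illustration.
LINKS: List[Tuple[str, str]] = [
    ("Sales Revenue", "adls://sales/orders.parquet"),
    ("Sales Revenue", "synapse://dw/fact_sales"),
    ("Customer Segment", "synapse://dw/dim_customer"),
]

# asset -> assets that read from it (a tiny stand-in for catalog lineage)
DOWNSTREAM: Dict[str, List[str]] = {
    "adls://sales/orders.parquet": ["synapse://dw/fact_sales"],
    "synapse://dw/fact_sales": ["powerbi://reports/revenue_dashboard"],
}


def impacted_assets(term: str) -> Set[str]:
    """Return assets linked to the term plus everything downstream of them."""
    by_term = defaultdict(list)
    for linked_term, asset in LINKS:
        by_term[linked_term].append(asset)
    seen: Set[str] = set()
    stack = list(by_term[term])
    # Depth-first walk over the lineage graph, skipping assets already visited.
    while stack:
        asset = stack.pop()
        if asset in seen:
            continue
        seen.add(asset)
        stack.extend(DOWNSTREAM.get(asset, []))
    return seen


print(sorted(impacted_assets("Sales Revenue")))
```

In this toy model, changing the definition of "Sales Revenue" flags the Power BI dashboard as well, even though the term was never linked to it directly, which is the practical value of combining glossary links with lineage.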