Data Catalog Architecture
Implementation
Tag Templates
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Create tag template for PII classification.
# The Python client exposes the proto field `type` as `type_`.
template = client.create_tag_template(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "tag_template_id": "pii_classification",
        "tag_template": {
            "display_name": "PII Classification",
            "fields": {
                # Enumerated field: a tag must use one of these values
                "pii_type": {
                    "display_name": "PII Type",
                    "type_": {
                        "enum_type": {
                            "allowed_values": [
                                {"display_name": "None"},
                                {"display_name": "Email"},
                                {"display_name": "Phone"},
                                {"display_name": "SSN"},
                                {"display_name": "Credit Card"},
                            ]
                        }
                    },
                },
                # Free-text owner contact
                "data_owner": {
                    "display_name": "Data Owner",
                    "type_": {"primitive_type": "STRING"},
                },
                # Freshness target in hours
                "sla_hours": {
                    "display_name": "SLA (hours)",
                    "type_": {"primitive_type": "DOUBLE"},
                },
            },
        },
    }
)
# Apply tag to a BigQuery table.
# Entry IDs in the @bigquery entry group are system-generated, so resolve the
# entry from the table's resource name instead of hand-building its path.
entry = client.lookup_entry(
    request={
        "linked_resource": "//bigquery.googleapis.com/projects/my-project/datasets/analytics/tables/users"
    }
)
tag = client.create_tag(
    request={
        "parent": entry.name,
        "tag": {
            "template": template.name,
            "fields": {
                "pii_type": {"enum_value": {"display_name": "Email"}},
                "data_owner": {"string_value": "data-team@company.com"},
                "sla_hours": {"double_value": 24.0},
            },
        },
    }
)
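A tag's enum value has to match one of the template's allowed values, so a typo such as "email" only surfaces as an API error. The sketch below is one way to catch that locally before calling create_tag. The helper name build_pii_tag_fields and the ALLOWED_PII_TYPES set are hypothetical; the set simply mirrors the pii_classification template defined above, and the check runs without any Google Cloud dependency.

```python
# Hypothetical pre-flight check; the values mirror the template's enum field.
ALLOWED_PII_TYPES = {"None", "Email", "Phone", "SSN", "Credit Card"}


def build_pii_tag_fields(pii_type: str, data_owner: str, sla_hours: float) -> dict:
    """Return the "fields" mapping for a tag, rejecting invalid values early."""
    if pii_type not in ALLOWED_PII_TYPES:
        raise ValueError(
            f"unknown pii_type {pii_type!r}; expected one of {sorted(ALLOWED_PII_TYPES)}"
        )
    if sla_hours < 0:
        raise ValueError("sla_hours must be non-negative")
    return {
        "pii_type": {"enum_value": {"display_name": pii_type}},
        "data_owner": {"string_value": data_owner},
        "sla_hours": {"double_value": float(sla_hours)},
    }


fields = build_pii_tag_fields("Email", "data-team@company.com", 24)
print(fields)
```

The returned mapping can be passed as the "fields" value of the tag in a create_tag request.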
Data Lineage with Dataplex
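from datetime import datetime, timezone

from google.cloud import datacatalog_lineage_v1

# The Data Lineage client ships in the google-cloud-datacatalog-lineage package
lineage_client = datacatalog_lineage_v1.LineageClient()
parent = "projects/my-project/locations/us-central1"
start = datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc)
end = datetime(2025, 1, 15, 10, 5, tzinfo=timezone.utc)

# Register the pipeline as a process
process = lineage_client.create_process(
    request={"parent": parent, "process": {"display_name": "sales-etl-pipeline"}}
)

# Record one execution of that process as a run
run = lineage_client.create_run(
    request={
        "parent": process.name,
        "run": {
            "state": datacatalog_lineage_v1.Run.State.COMPLETED,
            "start_time": start,
            "end_time": end,
        },
    }
)

# Create lineage event linking the source to the target
lineage_event = lineage_client.create_lineage_event(
    request={
        "parent": run.name,
        "lineage_event": {
            "start_time": start,
            "end_time": end,
            "links": [
                {
                    # Source string kept from the original example; check the
                    # fully qualified name format expected for Cloud Storage.
                    "source": {"fully_qualified_name": "gs://my-data-lake/bronze/sales/"},
                    "target": {"fully_qualified_name": "bigquery:my-project.analytics.sales"},
                }
            ],
        },
    }
)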
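The Data Lineage API identifies assets by fully qualified name rather than by resource path; for a BigQuery table that name has the form bigquery:{project}.{dataset}.{table}. The following is a minimal sketch of the conversion from a resource path such as projects/my-project/datasets/analytics/tables/sales. The helper name bigquery_fqn is hypothetical and not part of any Google library.

```python
def bigquery_fqn(resource_path: str) -> str:
    """Convert 'projects/P/datasets/D/tables/T' to 'bigquery:P.D.T'."""
    parts = resource_path.strip("/").split("/")
    # Expect exactly: projects/<p>/datasets/<d>/tables/<t>
    if len(parts) != 6 or parts[0::2] != ["projects", "datasets", "tables"]:
        raise ValueError(f"not a BigQuery table path: {resource_path!r}")
    project, dataset, table = parts[1::2]
    return f"bigquery:{project}.{dataset}.{table}"


fqn = bigquery_fqn("projects/my-project/datasets/analytics/tables/sales")
print(fqn)
```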
Best Practice: Implement tag templates for PII classification, data ownership, and SLA tracking. Use Dataplex for automated data discovery and lineage. Create business glossary terms for consistent terminology. Regularly audit metadata for accuracy.
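A regular metadata audit can be as simple as checking that every catalogued asset carries the required tag fields. The sketch below works on a plain in-memory inventory instead of calling the Data Catalog API; the audit_tags function, the inventory contents, and the choice of required fields are illustrative assumptions based on the PII template above.

```python
# Fields every tagged asset is expected to carry (assumed policy).
REQUIRED_FIELDS = ("pii_type", "data_owner", "sla_hours")


def audit_tags(entries: dict) -> dict:
    """Map each entry name to the required fields it is missing or left empty."""
    findings = {}
    for name, fields in entries.items():
        missing = [f for f in REQUIRED_FIELDS if fields.get(f) in (None, "")]
        if missing:
            findings[name] = missing
    return findings


# Hypothetical inventory: entry name -> tag field values
inventory = {
    "analytics.users": {
        "pii_type": "Email",
        "data_owner": "data-team@company.com",
        "sla_hours": 24.0,
    },
    "analytics.sales": {"pii_type": "None", "data_owner": ""},
}

report = audit_tags(inventory)
print(report)
```

In practice the inventory would be filled from the catalog itself, for example by listing the tags attached to each entry.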
Common Interview Questions
Q1: What is the purpose of Data Catalog?
Answer: Data Catalog provides centralized metadata management for discovering, understanding, and governing data assets. It enables search across all data resources, tracks lineage, and supports tag-based classification for compliance and governance.
Q2: How does Dataplex integrate with Data Catalog?
Answer: Dataplex automatically discovers and catalogs data assets in BigQuery, GCS, and Bigtable. It creates entries in Data Catalog with quality scores, lineage information, and technical metadata. Dataplex zones provide logical grouping for access control.
Q3: What are tag templates used for?
Answer: Tag templates define custom metadata schemas (PII type, data owner, SLA, quality score). They make metadata searchable and enforce consistent classification. Use tags for compliance (GDPR, HIPAA), data governance, and discovery.
Q4: How do you implement data lineage on GCP?
Answer: Use the Data Lineage API, surfaced through Dataplex, to record data movement between systems. Integrate lineage tracking into Dataflow and Dataproc pipelines, visualize the resulting graph in the Dataplex UI, and use it for impact analysis and compliance auditing.
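Impact analysis comes down to a downstream traversal of lineage links: start at the asset that changed and follow every source-to-target edge. Here is a minimal sketch using breadth-first search over an in-memory list of links rather than the Lineage API. The function name downstream_assets and the sales_daily, revenue, and users asset names are hypothetical.

```python
from collections import defaultdict, deque


def downstream_assets(links, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    graph = defaultdict(list)
    for source, target in links:
        graph[source].append(target)

    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Each pair is (source, target), as a lineage event would record it.
links = [
    ("gs://my-data-lake/bronze/sales/", "bigquery:my-project.analytics.sales"),
    ("bigquery:my-project.analytics.sales", "bigquery:my-project.analytics.sales_daily"),
    ("bigquery:my-project.analytics.sales", "bigquery:my-project.reporting.revenue"),
    ("bigquery:my-project.analytics.users", "bigquery:my-project.reporting.revenue"),
]

impacted = downstream_assets(links, "gs://my-data-lake/bronze/sales/")
print(impacted)
```

Reversing each pair before building the graph gives the upstream view, which is the one usually needed when auditing where a table's data came from.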
Q5: What are the benefits of data discovery?
Answer: 1) Find relevant datasets quickly, 2) Understand data context and quality, 3) Reduce data duplication, 4) Enable self-service analytics, 5) Support compliance requirements, 6) Improve data trust and adoption.