Data Catalog & Metadata Management

Master Google Cloud Data Catalog, including metadata management, Dataplex integration, data lineage, and discovery patterns.

14 min read · Intermediate

Data Catalog Architecture

🏗️ GCP Data Engineering Reference Architecture

Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

Implementation

Tag Templates

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Create tag template for PII classification
template = client.create_tag_template(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "tag_template_id": "pii_classification",
        "tag_template": {
            "display_name": "PII Classification",
            "fields": {
                "pii_type": {
                    "display_name": "PII Type",
                    "type": {
                        "enum_type": {
                            "allowed_values": [
                                {"display_name": "None"},
                                {"display_name": "Email"},
                                {"display_name": "Phone"},
                                {"display_name": "SSN"},
                                {"display_name": "Credit Card"}
                            ]
                        }
                    }
                },
                "data_owner": {
                    "display_name": "Data Owner",
                    "type": {"primitive_type": "STRING"}
                },
                "sla_hours": {
                    "display_name": "SLA (hours)",
                    "type": {"primitive_type": "DOUBLE"}
                }
            }
        }
    }
)

# Look up the BigQuery table's catalog entry (entry IDs are system-generated,
# so the entry name cannot be assembled by hand)
entry = client.lookup_entry(
    request={
        "linked_resource": "//bigquery.googleapis.com/projects/my-project/datasets/analytics/tables/users"
    }
)

# Apply tag to the BigQuery table
tag = client.create_tag(
    request={
        "parent": entry.name,
        "tag": {
            "template": template.name,
            "fields": {
                "pii_type": {"enum_value": {"display_name": "Email"}},
                "data_owner": {"string_value": "data-team@company.com"},
                "sla_hours": {"double_value": 24.0}
            }
        }
    }
)
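Data Catalog is expected to reject a tag whose fields do not match its template, so it can help to check values locally before calling create_tag. The following is a minimal sketch rather than part of the client library: `PII_TEMPLATE_SCHEMA` and `validate_tag_fields` are hypothetical names, and the schema is a hand-written mirror of the pii_classification template defined above.

```python
# Hypothetical local mirror of the pii_classification template (not fetched from the API).
PII_TEMPLATE_SCHEMA = {
    "pii_type": {"kind": "enum", "allowed": {"None", "Email", "Phone", "SSN", "Credit Card"}},
    "data_owner": {"kind": "string"},
    "sla_hours": {"kind": "double"},
}


def validate_tag_fields(fields, schema=PII_TEMPLATE_SCHEMA):
    """Return a list of problems found in a create_tag 'fields' dict."""
    problems = []
    for field_id, value in fields.items():
        spec = schema.get(field_id)
        if spec is None:
            problems.append(f"{field_id}: not defined in template")
            continue
        if spec["kind"] == "enum":
            name = value.get("enum_value", {}).get("display_name")
            if name not in spec["allowed"]:
                problems.append(f"{field_id}: {name!r} is not an allowed value")
        elif spec["kind"] == "string":
            if not isinstance(value.get("string_value"), str):
                problems.append(f"{field_id}: expected string_value")
        elif spec["kind"] == "double":
            number = value.get("double_value")
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(number, bool) or not isinstance(number, (int, float)):
                problems.append(f"{field_id}: expected numeric double_value")
    return problems
```

An empty list means the fields look consistent with the local schema; the service remains the final authority on whether the tag is accepted.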

Data Lineage with Dataplex

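from datetime import datetime, timezone

# The Data Lineage API client ships in the google-cloud-datacatalog-lineage package
from google.cloud import datacatalog_lineage_v1

client = datacatalog_lineage_v1.LineageClient()
parent = "projects/my-project/locations/us-central1"
start = datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc)
end = datetime(2025, 1, 15, 10, 5, tzinfo=timezone.utc)

# A process represents the pipeline; a run is one execution of it
process = client.create_process(
    request={"parent": parent, "process": {"display_name": "sales-etl-pipeline"}}
)
run = client.create_run(
    request={
        "parent": process.name,
        "run": {"state": "COMPLETED", "start_time": start, "end_time": end},
    }
)

# Create lineage event linking source to target by fully qualified name (FQN).
# The FQN formats below are assumptions; confirm them against the FQN reference.
lineage_event = client.create_lineage_event(
    request={
        "parent": run.name,
        "lineage_event": {
            "start_time": start,
            "end_time": end,
            "links": [
                {
                    "source": {"fully_qualified_name": "gcs:my-data-lake.bronze/sales"},
                    "target": {"fully_qualified_name": "bigquery:my-project.analytics.sales"}
                }
            ]
        }
    }
)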


Best Practice: Implement tag templates for PII classification, data ownership, and SLA tracking. Use Dataplex for automated data discovery and lineage. Create business glossary terms for consistent terminology. Regularly audit metadata for accuracy.
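A regular metadata audit can be as simple as checking each entry for required tag fields and a recent review date. The sketch below works on plain dictionaries rather than live API responses; `REQUIRED_FIELDS`, `audit_entries`, the `last_reviewed` key, and the 90-day threshold are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed governance policy: every entry needs these tag fields.
REQUIRED_FIELDS = ("pii_type", "data_owner")


def audit_entries(entries, now, max_age_days=90):
    """Flag entries with missing required tag fields or an overdue review.

    Each entry is a plain dict with "name", "tag_fields", and "last_reviewed".
    """
    findings = {}
    cutoff = now - timedelta(days=max_age_days)
    for entry in entries:
        issues = [f"missing {field}" for field in REQUIRED_FIELDS
                  if field not in entry["tag_fields"]]
        if entry["last_reviewed"] < cutoff:
            issues.append("review overdue")
        if issues:
            findings[entry["name"]] = issues
    return findings


now = datetime(2025, 1, 15, tzinfo=timezone.utc)
sample_entries = [
    {"name": "analytics.users",
     "tag_fields": {"pii_type": "Email", "data_owner": "data-team@company.com"},
     "last_reviewed": now - timedelta(days=10)},
    {"name": "analytics.sales",
     "tag_fields": {"pii_type": "None"},
     "last_reviewed": now - timedelta(days=120)},
]
report = audit_entries(sample_entries, now)
```

Here `report` contains only `analytics.sales`, which lacks a data owner and was last reviewed more than 90 days ago.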


Common Interview Questions

Q1: What is the purpose of Data Catalog?

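Answer: Data Catalog provides centralized metadata management for discovering, understanding, and governing data assets. It offers search across cataloged resources such as BigQuery tables and Pub/Sub topics, stores technical and business metadata, and supports tag-based classification for compliance and governance. Lineage is captured separately through the Data Lineage API and shown alongside catalog entries.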

Q2: How does Dataplex integrate with Data Catalog?

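Answer: Dataplex discovery scans data in Cloud Storage and BigQuery that is organized into lakes and zones, and registers technical metadata such as schemas and partitions so the assets become searchable through the catalog. Data quality scan results and lineage information can be viewed alongside those entries. Dataplex zones provide logical grouping (for example, raw versus curated data) for access control.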

Q3: What are tag templates used for?

Answer: Tag templates define custom metadata schemas (PII type, data owner, SLA, quality score). They make metadata searchable and enforce consistent classification. Use tags for compliance (GDPR, HIPAA), data governance, and discovery.
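To illustrate how tags make metadata searchable, here is a rough sketch that builds a tag-based search query and passes it to the catalog search API. `build_tag_query` and `search_by_tag` are hypothetical helpers, and the `tag:template.field:value` query form is an assumption that should be checked against the Data Catalog search syntax reference (values containing spaces may need additional quoting).

```python
def build_tag_query(template_id, field_id, value=None):
    """Build a search query for entries tagged with a template field."""
    query = f"tag:{template_id}.{field_id}"
    if value is not None:
        # Assumed syntax: a colon followed by the value to match.
        query += f":{value}"
    return query


def search_by_tag(project_id, query):
    """Return linked resource names of entries matching the query."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    results = client.search_catalog(
        request={"scope": {"include_project_ids": [project_id]}, "query": query}
    )
    return [result.linked_resource for result in results]


email_query = build_tag_query("pii_classification", "pii_type", "Email")
```

Calling `search_by_tag("my-project", email_query)` would then list the resources tagged as containing email addresses, provided credentials and the client library are available.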

Q4: How do you implement data lineage on GCP?

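Answer: Enable the Data Lineage API, which is part of Dataplex. BigQuery jobs such as queries, loads, and copies are then recorded automatically. For other pipelines, including Dataflow or Dataproc jobs without a built-in integration, report custom processes, runs, and lineage events through the API. Visualize lineage in the Dataplex UI, and use it for impact analysis and compliance auditing.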
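Impact analysis amounts to walking the lineage graph downstream from a changed asset. The sketch below does this with a breadth-first search over (source, target) pairs held in memory; `downstream_assets` and the asset names are hypothetical, and in practice the pairs would come from lineage links retrieved through the lineage API.

```python
from collections import deque


def downstream_assets(links, start):
    """Return every asset reachable from start, in breadth-first order."""
    children = {}
    for source, target in links:
        children.setdefault(source, []).append(target)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:  # guards against cycles and repeated paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage links for illustration.
links = [
    ("gcs:bronze/sales", "bq:analytics.sales"),
    ("bq:analytics.sales", "bq:analytics.daily_revenue"),
    ("bq:analytics.sales", "bq:analytics.customer_ltv"),
    ("bq:analytics.daily_revenue", "dashboard:revenue"),
]
impacted = downstream_assets(links, "gcs:bronze/sales")
```

Every asset in `impacted` would need review before changing the schema of the source data.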

Q5: What are the benefits of data discovery?

Answer: 1) Find relevant datasets quickly, 2) Understand data context and quality, 3) Reduce data duplication, 4) Enable self-service analytics, 5) Support compliance requirements, 6) Improve data trust and adoption.
