Dataplex Deep Dive

Master Dataplex including data lakes, zones, assets, lineage tracking, data quality, and governance patterns.

18 min read · Advanced

Dataplex Architecture

🏗️ GCP Data Engineering Reference Architecture

Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

Implementation

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

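# Create lake. create_lake returns a long-running operation,
# so call result() to wait for the finished Lake resource.
lake = client.create_lake(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "lake_id": "analytics-lake",
        "lake": {
            "display_name": "Analytics Lake",
            "description": "Central analytics data lake"
        }
    }
).result()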

# Create raw zone inside the lake (also a long-running operation)
raw_zone = client.create_zone(
    request={
        "parent": lake.name,
        "zone_id": "raw-zone",
        "zone": {
            "display_name": "Raw Zone",
            "type_": dataplex_v1.Zone.Type.RAW,
            "resource_spec": {
                "location_type": dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
            }
        }
    }
).result()

# Create asset for GCS bucket (also a long-running operation)
asset = client.create_asset(
    request={
        "parent": raw_zone.name,
        "asset_id": "raw-data-bucket",
        "asset": {
            "display_name": "Raw Data Bucket",
            "resource_spec": {
                # Bucket reference format: projects/{project_number}/buckets/{bucket_id}
                "name": "projects/my-project/buckets/my-raw-bucket",
                "type_": dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET
            },
            "discovery_spec": {
                "enabled": True,
                # Cron expression: run discovery every six hours
                "schedule": "0 */6 * * *"
            }
        }
    }
).result()

Data Quality Rules

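# Data scans are managed by DataScanServiceClient, not DataplexServiceClient
scan_client = dataplex_v1.DataScanServiceClient()

# Create data quality scan (long-running operation; result() waits for it)
scan = scan_client.create_data_scan(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "data_scan_id": "sales-quality",
        "data_scan": {
            "display_name": "Sales Quality Scan",
            "data": {
                # BigQuery tables are referenced by their full resource name
                "resource": "//bigquery.googleapis.com/projects/my-project/datasets/analytics/tables/sales"
            },
            "data_quality_spec": {
                "rules": [
                    {
                        # Every order_id must be non-null
                        "dimension": "COMPLETENESS",
                        "column": "order_id",
                        "threshold": 1.0,
                        "non_null_expectation": {}
                    },
                    {
                        # order_id values must not repeat
                        "dimension": "UNIQUENESS",
                        "column": "order_id",
                        "threshold": 1.0,
                        "uniqueness_expectation": {}
                    }
                ]
            }
        }
    }
).result()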
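
To make the threshold field on the completeness rule concrete, here is a minimal pure-Python sketch of how a row-level rule can be read: the share of rows that pass the check is compared with the threshold. It does not call Dataplex. The function names and the sample rows are hypothetical, and treating an empty table as fully passing is an assumption of this sketch.

```python
def non_null_ratio(rows, column):
    """Return the fraction of rows whose `column` value is not None."""
    if not rows:
        return 1.0  # assumption of this sketch: nothing fails in an empty table
    passing = sum(1 for row in rows if row.get(column) is not None)
    return passing / len(rows)


def rule_passes(ratio, threshold=1.0):
    """A row-level rule passes when the passing ratio reaches the threshold."""
    return ratio >= threshold


# Hypothetical sample: one of four rows is missing order_id.
sample_rows = [
    {"order_id": "A1"},
    {"order_id": "A2"},
    {"order_id": None},
    {"order_id": "A4"},
]

ratio = non_null_ratio(sample_rows, "order_id")
print(ratio)                    # 0.75
print(rule_passes(ratio, 1.0))  # False: threshold 1.0 tolerates no nulls
print(rule_passes(ratio, 0.7))  # True: a looser threshold tolerates some
```

With threshold 1.0, as in the scan above, a single null order_id is enough to fail the rule.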

Best Practice: Organize data into zones based on sensitivity and processing stage. Enable auto-discovery for new assets. Implement data quality rules for each zone. Track lineage for compliance and impact analysis. Use policy tags for column-level security.

Common Interview Questions

Q1: What is the purpose of zones in Dataplex?

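Answer: Zones are logical groupings inside a lake that organize data by sensitivity, processing stage, or domain, so each group can have its own access controls and quality rules. Dataplex has two zone types: Raw for unprocessed data in any format, and Curated for validated, structured data (for example Parquet, Avro, or ORC files in Cloud Storage, or BigQuery tables).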

Q2: How does Dataplex auto-discovery work?

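Answer: Dataplex scans the data in an asset to discover technical metadata such as table names, schemas, and data types, and creates entries in Data Catalog so the data can be searched and queried. Discovery runs on a schedule (hourly, daily, or a custom cron expression such as the 0 */6 * * * used in the asset example above) or on demand. PII identification and data profiling are not part of discovery itself; they rely on Sensitive Data Protection and on separate data profile scans.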
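
As a rough illustration of the discovery schedule, the sketch below expands the minute and hour fields of the cron expression from the asset example into the times of day it matches. The helper expand_cron_field is hypothetical and covers only the common field forms (`*`, `*/n`, single values, ranges, and comma lists); it is not how Dataplex parses schedules.

```python
def expand_cron_field(field, low, high):
    """Expand one cron field into the sorted list of values it matches.

    Supports '*', '*/n', 'a', 'a-b', 'a-b/n', and comma-separated lists.
    """
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


schedule = "0 */6 * * *"  # the schedule used in the asset example
minute_field, hour_field = schedule.split()[:2]

minutes = expand_cron_field(minute_field, 0, 59)
hours = expand_cron_field(hour_field, 0, 23)
run_times = [f"{h:02d}:{m:02d}" for h in hours for m in minutes]
print(run_times)  # ['00:00', '06:00', '12:00', '18:00']
```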

Q3: What is the difference between assets and zones?

Answer: Zones are logical groupings with access controls and quality rules. Assets are physical resources (BigQuery datasets, GCS buckets) attached to zones. Multiple assets can exist in a zone.
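
The lake, zone, and asset hierarchy is visible in the resource names themselves. The sketch below splits an asset name into its parts, reusing the IDs from the implementation example. The parse_asset_name helper is hypothetical and not part of the client library.

```python
import re

# Asset names nest inside a zone, which nests inside a lake.
ASSET_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/lakes/(?P<lake>[^/]+)"
    r"/zones/(?P<zone>[^/]+)"
    r"/assets/(?P<asset>[^/]+)$"
)


def parse_asset_name(name):
    """Split a full asset resource name into its hierarchy components."""
    match = ASSET_NAME.match(name)
    if match is None:
        raise ValueError(f"not an asset resource name: {name!r}")
    return match.groupdict()


parts = parse_asset_name(
    "projects/my-project/locations/us-central1"
    "/lakes/analytics-lake/zones/raw-zone/assets/raw-data-bucket"
)
print(parts["lake"], parts["zone"], parts["asset"])
# analytics-lake raw-zone raw-data-bucket
```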

Q4: How do you implement data lineage in Dataplex?

Answer: Use the Data Lineage API to record how data moves between systems. Integrate lineage tracking in Dataflow and Dataproc pipelines, then visualize the resulting graph in the Dataplex UI. Use lineage for impact analysis (what breaks downstream if a table changes) and for compliance (where a given dataset came from).
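
Impact analysis over lineage amounts to a graph traversal. The sketch below is a simplified in-memory model, not the Lineage API: lineage is a list of hypothetical (source, target) table pairs, and a breadth-first search collects everything downstream of a given table.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return all tables reachable from `start`, in breadth-first order."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical lineage edges: (source table, target table)
edges = [
    ("raw.sales", "curated.sales"),
    ("curated.sales", "reports.daily_revenue"),
    ("curated.sales", "ml.features"),
    ("raw.customers", "curated.customers"),
]

impacted = downstream(edges, "raw.sales")
print(impacted)
# ['curated.sales', 'reports.daily_revenue', 'ml.features']
```

A schema change to raw.sales would therefore need to be checked against three downstream tables, while curated.customers is unaffected.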

Q5: What are the benefits of Dataplex for governance?

Answer: 1) Centralized data discovery, 2) Automated quality checks, 3) Lineage tracking, 4) Column-level security via policy tags, 5) Integration with Data Catalog, 6) Compliance reporting.
