Dataform Architecture
SQLX Transformations
-- models/staging/stg_orders.sqlx
config {
type: "view",
schema: "staging",
description: "Staged orders data"
}
WITH source AS (
SELECT * FROM {{ ref("raw_orders") }}
),
transformed AS (
SELECT
order_id,
customer_id,
SAFE_CAST(amount AS FLOAT64) as amount,
order_date,
status,
CURRENT_TIMESTAMP() as loaded_at
FROM source
WHERE order_id IS NOT NULL
)
SELECT * FROM transformed
-- models/marts/daily_sales.sqlx
config {
type: "table",
schema: "marts",
description: "Daily sales summary",
tags: ["daily", "sales"]
}
WITH orders AS (
SELECT * FROM {{ ref("stg_orders") }}
),
daily_summary AS (
SELECT
DATE(order_date) as order_date,
COUNT(*) as order_count,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value,
COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY 1
)
SELECT * FROM daily_summary
Assertions
-- assertions/order_quality.sqlx
config {
type: "assertion",
dataset: "marts",
description: "Validate daily sales data quality"
}
SELECT *
FROM {{ ref("daily_sales") }}
WHERE total_revenue < 0
OR order_count = 0
OR unique_customers < 0
Scheduling
# Schedule Dataform workflow via API
from google.cloud import dataform_v1beta1
client = dataform_v1beta1.DataformClient()
# Create workflow invocation
invocation = client.create_workflow_invocation(
request={
"parent": "projects/my-project/locations/us-central1/repositories/my-repo/workspaces/main",
"workflow_invocation": {
"workflow_config": {
"name": "daily_etl"
}
}
}
)
β¨
Best Practice: Use SQLX for modular, reusable transformations. Implement assertions for data quality checks. Use ref() for dependency management. Schedule workflows via Cloud Composer or Cloud Scheduler. Version control all SQLX files.
Common Interview Questions
Q1: What is SQLX in Dataform?
Answer: SQLX is Dataform's templating language for SQL transformations. It supports Jinja templating, ref() for dependencies, config blocks for metadata, and assertions for data quality. SQLX compiles to standard SQL for BigQuery execution.
Q2: How does Dataform handle dependencies?
Answer: Dataform automatically resolves dependencies using the ref() function. When you reference {{ ref("model_name") }, Dataform compiles this to the correct table/view reference. Dependencies are visualized in the DAG.
Q3: What are assertions in Dataform?
Answer: Assertions are data quality checks that validate your transformations. They return rows that violate the assertion. If any rows are returned, the assertion fails. Use assertions to check completeness, validity, and business rules.
Q4: How do you schedule Dataform workflows?
Answer: Use Cloud Composer (Airflow) with the Dataform operator, Cloud Scheduler with the Dataform API, or Dataform's built-in scheduling. Cloud Composer is recommended for complex orchestration with dependencies.
Q5: What is the benefit of Dataform over raw SQL?
Answer: 1) Version control via Git, 2) Dependency management with ref(), 3) Built-in data quality assertions, 4) Compilation and validation, 5) Documentation generation, 6) Integration with BigQuery.