Cloud Data Prep Overview
Cloud Data Prep (Google Cloud Dataprep by Trifacta) is a visual data preparation service on Google Cloud. It lets data engineers and analysts explore, clean, and transform data without writing code.
Key Features
Data Preparation Workflow
Transformation Recipes
Common Transformations
// Data Prep recipe steps in Wrangle-style syntax (Trifacta's recipe language), simplified for illustration
// 1. Rename columns
rename col: `old_name` to: `new_name`
// 2. Change data types
settype col: `amount` as: float
// 3. Filter rows
filter rowtype: `status` == "active"
// 4. Extract date parts
derive col: `year` as: year(`order_date`)
derive col: `month` as: month(`order_date`)
// 5. Conditional transformation
derive col: `category` as:
    if(`amount` > 1000, "premium",
        if(`amount` > 100, "standard", "basic"))
// 6. String operations
derive col: `domain` as: extract(`email`, "@(.+)$")
// 7. Aggregation
aggregate value: sum(`amount`) group: `product_category`
// 8. Join datasets
join col: `user_id` with: `users` on: `user_id`
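For readers more at home in code, steps 5 and 6 above can be sketched in plain Python. This only illustrates the logic and is not how Data Prep executes recipes. The function names `categorize` and `extract_domain` are hypothetical, and the sketch assumes the extract step returns the text captured after the "@".

```python
import re
from typing import Optional

# Same pattern as the recipe's extract step: capture everything after the "@"
DOMAIN_PATTERN = re.compile(r"@(.+)$")


def categorize(amount: float) -> str:
    """Mirror the nested if() from step 5 (thresholds are exclusive)."""
    if amount > 1000:
        return "premium"
    if amount > 100:
        return "standard"
    return "basic"


def extract_domain(email: str) -> Optional[str]:
    """Mirror step 6: return the part after '@', or None when nothing matches."""
    match = DOMAIN_PATTERN.search(email)
    return match.group(1) if match else None
```

Note that an amount of exactly 1000 falls into "standard" and exactly 100 into "basic", because both comparisons are strict.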
Data Quality Rules
// Validation rules
validate col: `email` regex: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
validate col: `phone` regex: `^\+?[1-9]\d{1,14}$`
validate col: `amount` range: 0 to 1000000
validate col: `date` format: "yyyy-MM-dd"
// Custom validation
validate col: `status` in: ["active", "inactive", "pending"]
// Row-level validation
validate row: `amount` > 0 and `quantity` > 0
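The same rules can be approximated in plain Python, for example to test them outside the tool. This is a minimal sketch that reuses the patterns and limits listed above. The record layout, the rule names, and the function `validate_record` are assumptions made for illustration.

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
PHONE_RE = re.compile(r"^\+?[1-9]\d{1,14}$")
ALLOWED_STATUS = {"active", "inactive", "pending"}


def is_iso_date(value: str) -> bool:
    """Check that value parses as year-month-day (strptime tolerates missing zero padding)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def validate_record(record: dict) -> list:
    """Return the names of the rules the record fails (an empty list means valid)."""
    failures = []
    if not EMAIL_RE.fullmatch(record["email"]):
        failures.append("email")
    if not PHONE_RE.fullmatch(record["phone"]):
        failures.append("phone")
    if not 0 <= record["amount"] <= 1_000_000:
        failures.append("amount_range")
    if not is_iso_date(record["date"]):
        failures.append("date_format")
    if record["status"] not in ALLOWED_STATUS:
        failures.append("status")
    # Row-level rule: both values must be strictly positive
    if not (record["amount"] > 0 and record["quantity"] > 0):
        failures.append("row_positive")
    return failures
```

Returning the list of failed rule names, rather than a single boolean, makes it easier to report which rule rejected a row.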
Integration with Data Engineering
Export to BigQuery
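# Data Prep API for programmatic access
import requests


def create_dataprep_job(api_key, recipe_id, output_table):
    """Create a Data Prep job via the API and return the parsed JSON response."""
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    # Endpoint and payload shape as presented in this article; check the
    # field names against the current Dataprep API reference before use.
    job_config = {
        "wrangledDataset": {
            "id": recipe_id
        },
        "execution": {
            "runsOn": {
                "type": "dataflow",  # recipes run on Dataflow workers (see Cost Optimization)
                "projectId": "my-project",
                "region": "us-central1"
            }
        },
        "outputs": [{
            "type": "bigquery",
            "table": output_table
        }]
    }
    response = requests.post(
        "https://api.dataprep.trifacta.com/v4/job",
        headers=headers,
        json=job_config,
        timeout=30
    )
    # Raise on 4xx/5xx instead of returning an error body as if it succeeded
    response.raise_for_status()
    return response.json()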
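The function above hardcodes the project and region. One possible refactor, shown here as a sketch, separates payload construction from the HTTP call so the payload can be inspected or unit-tested without network access. `build_job_config` is a hypothetical helper, and its field names simply mirror the payload used in this article rather than a verified API schema.

```python
def build_job_config(recipe_id, output_table, project_id, engine, region="us-central1"):
    """Build the job payload shown in this article (field names are unverified assumptions)."""
    return {
        "wrangledDataset": {"id": recipe_id},
        "execution": {
            "runsOn": {"type": engine, "projectId": project_id, "region": region}
        },
        "outputs": [{"type": "bigquery", "table": output_table}],
    }


# Example values are placeholders, not real resources
config = build_job_config(
    42, "analytics.clean_orders", project_id="my-project", engine="dataflow"
)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`.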
Monitor your BigQuery costs with the INFORMATION_SCHEMA jobs views (for example, the total_bytes_billed column in INFORMATION_SCHEMA.JOBS), and set up budget alerts at 50%, 80%, and 100% of your budget.
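In practice, Cloud Billing budgets send these alerts for you. The threshold logic itself is simple, and the sketch below shows it for a given spend and budget. The function name `crossed_thresholds` and the example figures are assumptions for illustration.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


# Example: 850 spent against a budget of 1000 has passed the 50% and 80% marks
alerts = crossed_thresholds(850, 1000)
```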
Cost Optimization
# Data Prep pricing components
pricing = {
    "dataprep_units": {
        "description": "DPUs consumed per transformation",
        "cost_per_unit": "$0.05 per DPU-hour"
    },
    "dataflow_workers": {
        "description": "Compute for running recipes",
        "cost": "Standard Dataflow pricing"
    },
    "storage": {
        "description": "GCS for intermediate data",
        "cost": "Standard GCS pricing"
    }
}
# Cost optimization strategies
strategies = {
    "profile_before_transform": "Understand data quality before processing",
    "use_sampling": "Profile on samples, run full on production",
    "incremental_processing": "Process only new/changed data",
    "right_size_workers": "Don't over-provision Dataflow workers",
    "schedule_off_peak": "Run jobs during off-peak hours for lower costs"
}
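The incremental processing strategy is often implemented with a watermark: remember the latest timestamp already processed and select only rows newer than it. The following is a minimal sketch of that idea, not a Data Prep feature. The field name `updated_at` and the function `select_new_rows` are hypothetical.

```python
from datetime import datetime


def select_new_rows(rows, last_processed_at, timestamp_field="updated_at"):
    """Return rows changed after the watermark, plus the watermark for the next run."""
    new_rows = [r for r in rows if r[timestamp_field] > last_processed_at]
    # If nothing is new, keep the old watermark unchanged
    new_watermark = max(
        (r[timestamp_field] for r in new_rows), default=last_processed_at
    )
    return new_rows, new_watermark


rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
    {"id": 3, "updated_at": datetime(2024, 1, 5)},
]
new_rows, watermark = select_new_rows(rows, datetime(2024, 1, 2))
```

Persisting the returned watermark between runs is what keeps each job from reprocessing rows it has already handled.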
Cost Tip: Data Prep charges per DPU-hour (Data Preparation Unit). Profile your data first to understand complexity, then optimize transformations. Use sampling for initial exploration and schedule production jobs during off-peak hours.
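To see why sampling helps, the DPU charge can be estimated with simple arithmetic. The sketch below uses the $0.05 per DPU-hour rate quoted in this article, which should be checked against current pricing. The DPU-hour figures are invented for illustration, and `estimate_job_cost` is a hypothetical helper.

```python
def estimate_job_cost(dpu_hours, rate_per_dpu_hour=0.05):
    """Estimate the DPU portion of a job's cost in dollars (excludes Dataflow and GCS charges)."""
    if dpu_hours < 0:
        raise ValueError("dpu_hours must be non-negative")
    return round(dpu_hours * rate_per_dpu_hour, 2)


# Assumed workload: 10 exploratory runs on a sample at 0.2 DPU-hours each,
# followed by one full production run at 20 DPU-hours
exploration = estimate_job_cost(10 * 0.2)
production = estimate_job_cost(20)
```

Under these assumed numbers, ten exploratory runs on a sample cost a tenth of a single full run, which is the argument for iterating on samples first.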
Common Interview Questions
Q1: When would you use Cloud Data Prep vs. writing code?
Answer: Cloud Data Prep is ideal for ad-hoc data exploration, business analyst self-service, and quick prototyping. Use code (Python/SQL) for production pipelines requiring version control, testing, and complex logic. Data Prep excels at visual data profiling and interactive transformations.
Q2: How does Data Prep integrate with BigQuery?
Answer: Data Prep can read directly from BigQuery tables and write transformed data back. Use the BigQuery connection for seamless integration. For large datasets, export to GCS first, then load into BigQuery for optimal performance.
Q3: What is the DPU model in Data Prep?
Answer: DPU (Data Preparation Unit) measures computational effort per transformation. Complex operations like joins consume more DPUs than simple filters. Understanding DPU consumption helps optimize costs and plan resource allocation.
Q4: How do you version control Data Prep recipes?
Answer: Data Prep maintains built-in version history for all recipe changes. For external version control, export recipes as JSON and store in Git. Use the Data Prep API to automate recipe deployment across environments.
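When storing exported recipes in Git, serializing them deterministically keeps diffs small and reviewable. This is a minimal sketch that assumes the export is already loaded as a Python dictionary. The function `serialize_recipe` and the sample keys are hypothetical and do not reflect the actual export schema.

```python
import json


def serialize_recipe(recipe: dict) -> str:
    """Serialize a recipe with sorted keys and fixed indentation for stable Git diffs."""
    return json.dumps(recipe, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Two dictionaries with the same content but different key order
first = serialize_recipe({"steps": ["rename", "filter"], "name": "clean_orders"})
second = serialize_recipe({"name": "clean_orders", "steps": ["rename", "filter"]})
```

Because both calls produce identical text, reordering keys in an export does not show up as a change in version control.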
Q5: What data formats does Data Prep support?
Answer: Data Prep supports CSV, JSON, Parquet, Avro, Excel, and fixed-width formats. For best performance with large datasets, prefer a binary format: Parquet is columnar and suits analytical reads, while Avro is a compact row-oriented format. For web applications, JSON is commonly used.