πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Amazon Athena for Data Engineers

AWS Data EngineeringAthena Serverless Queries & Federated Access⭐ Premium

Advertisement

πŸ” Amazon Athena

Master Athena serverless queries, federated access, workgroups, CTAS, and cost optimization.

Module: AWS Data Engineering β€’ Topic 12 of 65 β€’ Premium Content

Athena Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    AMAZON ATHENA ARCHITECTURE                                β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  QUERY CLIENTS                                                      β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚    β”‚
β”‚  β”‚  β”‚ Athena   β”‚  β”‚ QuickSightβ”‚  β”‚ Jupyter  β”‚  β”‚ JDBC/ODBCβ”‚          β”‚    β”‚
β”‚  β”‚  β”‚ Console  β”‚  β”‚          β”‚  β”‚ Notebook β”‚  β”‚ Driver   β”‚          β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜          β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚          β–Ό              β–Ό              β–Ό              β–Ό                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  ATHENA ENGINE (Serverless, Auto-scaling)                           β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  Query Processing                                              β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ SQL parsing and optimization                                β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Distributed execution (Presto/Trino)                        β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Auto-scaling compute                                        β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Result caching                                              β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  DATA CATALOG (Glue Data Catalog)                             β”‚  β”‚    β”‚
β”‚  β”‚  β”‚                                                               β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Database definitions                                       β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Table schemas                                              β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Partition metadata                                         β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ SerDe information                                          β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                β”‚                                           β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚              β–Ό                 β–Ό                 β–Ό                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚  S3 Data Lake   β”‚  β”‚  Federated      β”‚  β”‚  Results        β”‚            β”‚
β”‚  β”‚  (Primary)      β”‚  β”‚  Sources        β”‚  β”‚  (S3)           β”‚            β”‚
β”‚  β”‚                 β”‚  β”‚                 β”‚  β”‚                 β”‚            β”‚
β”‚  β”‚  Parquet/ORC    β”‚  β”‚  RDS/Redshift   β”‚  β”‚  Query results  β”‚            β”‚
β”‚  β”‚  JSON/CSV       β”‚  β”‚  DynamoDB       β”‚  β”‚  cached 30 days β”‚            β”‚
β”‚  β”‚  Avro           β”‚  β”‚  ElastiCache    β”‚  β”‚                 β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Athena Query Examples

Basic Queries

-- Standard SQL query on S3 data
SELECT
    customer_id,
    product_category,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM orders
WHERE year = 2024 AND month = 1
GROUP BY customer_id, product_category
ORDER BY total_amount DESC
LIMIT 100;

-- Query with complex predicates
SELECT
    DATE_TRUNC('hour', event_time) as hour_bucket,
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users
FROM clickstream_events
WHERE year = 2024
  AND month = 1
  AND day = 15
  AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY DATE_TRUNC('hour', event_time), event_type
ORDER BY hour_bucket;

CTAS (Create Table As Select)

-- Create optimized table from query results
CREATE TABLE analytics_db.daily_sales_summary
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['year', 'month', 'day'],
    external_location = 's3://data-lake-curated/daily-sales-summary/'
) AS
SELECT
    sale_date,
    customer_id,
    product_id,
    SUM(amount) as total_amount,
    COUNT(*) as transaction_count,
    AVG(amount) as avg_amount
FROM raw_db.sales
WHERE year = 2024
GROUP BY sale_date, customer_id, product_id;

-- Create table with specific distribution
CREATE TABLE analytics_db.customer_metrics
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 16,
    external_location = 's3://data-lake-curated/customer-metrics/'
) AS
SELECT
    customer_id,
    COUNT(DISTINCT order_id) as total_orders,
    SUM(amount) as lifetime_value,
    MIN(order_date) as first_order_date,
    MAX(order_date) as last_order_date
FROM raw_db.orders
GROUP BY customer_id;

UNLOAD (Export Query Results)

-- Export query results to S3
UNLOAD (
    'SELECT * FROM analytics_db.daily_sales_summary WHERE year = 2024 AND month = 1'
)
TO 's3://data-export/sales-summary/2024/01/'
FORMAT PARQUET
PARTITIONED BY (day)
IAM_ROLE 'arn:aws:iam::123456789012:role/AthenaUnloadRole'
MAX_FILE_SIZE = '128 MB';

Athena Workgroups

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ATHENA WORKGROUPS                                          β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  WORKGROUP: analytics-team                                          β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  QUERY CONFIGURATION                                          β”‚  β”‚    β”‚
β”‚  β”‚  β”‚                                                               β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Engine version: Athena v3                                  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Results location: s3://athena-results/analytics/           β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Encryption: SSE-S3 or SSE-KMS                              β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Bytes scanned cutoff: 10 GB                                β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Query timeout: 30 minutes                                  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Enforcement: Required (prevent full scans)                 β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  QUERY METRICS                                                β”‚  β”‚    β”‚
β”‚  β”‚  β”‚                                                               β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Query count                                                β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Data scanned (GB)                                          β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Execution time (ms)                                        β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Cost estimate                                              β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ Query plan (in CloudWatch)                                 β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  WORKGROUP: ad-hoc-analysis                                         β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β€’ Higher cutoff: 100 GB                                           β”‚    β”‚
β”‚  β”‚  β€’ No enforcement (for exploration)                                 β”‚    β”‚
β”‚  β”‚  β€’ Separate billing                                                β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Workgroup Configuration

import boto3

athena = boto3.client('athena')

# Create workgroup
response = athena.create_work_group(
    Name='analytics-team',
    Description='Workgroup for analytics team',
    Configuration={
        'ResultConfiguration': {
            'OutputLocation': 's3://athena-results/analytics/',
            'EncryptionConfiguration': {
                'EncryptionOption': 'SSE_S3'
            }
        },
        'EnforceWorkGroupConfiguration': True,
        'PublishCloudWatchMetricsEnabled': True,
        'BytesScannedCutoffPerQuery': 10737418240,  # 10 GB
        'RequesterPaysEnabled': False,
        'EngineVersion': {
            'SelectedEngineVersion': 'AUTO',
            'EffectiveEngineVersion': 'Athena engine version 3'
        }
    },
    Tags={'Team': 'analytics', 'Environment': 'production'}
)

Athena Federated Query

# Federated query to RDS MySQL
federated_query = """
SELECT
    r.customer_id,
    r.customer_name,
    r.email,
    a.order_count,
    a.total_spend
FROM mysql_catalog.production_db.customers r
LEFT JOIN (
    SELECT
        customer_id,
        COUNT(*) as order_count,
        SUM(amount) as total_spend
    FROM athenaanalytics_db.orders
    WHERE year = 2024
    GROUP BY customer_id
) a ON r.customer_id = a.customer_id
WHERE r.created_at > DATE('2023-01-01')
"""

# Execute federated query
response = athena.start_query_execution(
    QueryString=federated_query,
    QueryExecutionContext={
        'Database': 'default'
    },
    WorkGroup='analytics-team'
)

# Federated query to DynamoDB
dynamodb_query = """
SELECT
    d.customer_id,
    d.profile_data,
    o.order_count
FROM dynamodb_catalog.customer_profiles d
JOIN athenaanalytics_db.orders_summary o
ON d.customer_id = o.customer_id
WHERE d.status = 'active'
"""

ℹ️

Pro Tip: Use federated queries sparingly. They scan external data sources directly, which can be slow and expensive. For frequent queries, use ETL to materialize results in S3.

Cost Optimization

StrategyImplementationSavings
Columnar FormatsUse Parquet/ORC50-90% less data scanned
PartitioningPartition by date keysReduce scanned partitions
Predicate FilteringAlways use WHERE clausesAvoid full table scans
BucketingBucket by high-cardinality keysReduce shuffle
Result CachingAthena caches results automaticallyFree
Workgroup LimitsSet bytes scanned cutoffPrevent runaway costs
-- Cost-optimized query example
-- Good: Uses partitions and filters
SELECT *
FROM orders
WHERE year = 2024
  AND month = 1
  AND day = 15
  AND customer_id = 'CUST-001';

-- Bad: Full table scan (expensive!)
SELECT *
FROM orders
WHERE customer_id = 'CUST-001';

⚠️

Cost Warning: Athena charges 5perTBscanned.Afullscanofa1TBtablecosts5 per TB scanned. A full scan of a 1 TB table costs5. Always partition data and filter queries to minimize cost.

Interview Questions & Answers

Q1: How does Athena pricing work?

Answer: Athena charges $5 per TB of data scanned. Key cost optimization strategies:

  • Use columnar formats (Parquet/ORC)
  • Partition data by query patterns
  • Always use WHERE clauses to filter
  • Set workgroup bytes scanned limits

Q2: What is CTAS and when should you use it?

Answer: CTAS (Create Table As Select) creates a new table from query results. Use it to:

  • Materialize aggregated results
  • Create optimized table formats
  • Partition data for faster queries
  • Convert formats (CSV to Parquet)

Q3: What is the difference between Athena v2 and v3?

Answer:

  • Athena v2: Presto 0.217, standard features
  • Athena v3: Trino 351, improved performance, new SQL functions

Athena v3 offers 2-10x better performance for most queries.

Q4: How do federated queries work in Athena?

Answer: Federated queries use Lambda connectors to query external data sources:

  1. Install connector (e.g., MySQL, DynamoDB)
  2. Register data source in Glue Catalog
  3. Query using catalog prefix (e.g., mysql_catalog.db.table)

Data is scanned at the source and filtered before returning.

Q5: What is the maximum query result size in Athena?

Answer:

  • Console: 10 MB display limit (can download full results)
  • API: 100 MB result set limit
  • UNLOAD: No limit (writes directly to S3)

For large result sets, use UNLOAD to export to S3.

Summary

Amazon Athena is the go-to serverless query service for data lakes. Key takeaways:

  • Serverless: No infrastructure management, auto-scaling
  • Pricing: $5 per TB scanned - optimize with columnar formats
  • Partitioning: Essential for cost and performance
  • CTAS: Materialize query results as optimized tables
  • Federated: Query external databases via Lambda connectors
  • Workgroups: Manage access, costs, and configurations
  • Formats: Prefer Parquet/ORC for 50-90% cost reduction

Advertisement