Streaming Data Interview Q&A
25 interview questions on real-time streaming, Event Hubs, Stream Analytics, and processing patterns
Question 1: What is the difference between Event Hubs and Event Grid?
Answer: Event Hubs: High-throughput event streaming (millions of events/sec), time-retained log. Event Grid: Reactive event routing (smart routing, filtering), push-based. Use Event Hubs for data ingestion pipelines; Event Grid for event-driven architectures.
Question 2: How do you ensure message ordering in Event Hubs?
Answer: Event Hubs guarantees ordering within a partition using sequence numbers. Use the same partition key for related events. For global ordering, use a single partition (limits throughput).
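To illustrate why a shared partition key preserves order, the sketch below routes events to partitions with a stable hash. The `assign_partition` and `route` helpers and the SHA-256 hash are assumptions for illustration only; Event Hubs uses its own internal hashing, which is not reproduced here.

```python
import hashlib

def assign_partition(partition_key: str, partition_count: int) -> int:
    """Map a key to a partition with a stable hash (illustrative, not the service's hash)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

def route(events, partition_count):
    """Append each (key, payload) event to its partition, preserving arrival order."""
    partitions = {i: [] for i in range(partition_count)}
    for key, payload in events:
        partitions[assign_partition(key, partition_count)].append((key, payload))
    return partitions

events = [("device-1", "a"), ("device-2", "x"), ("device-1", "b"), ("device-1", "c")]
partitions = route(events, partition_count=4)

# All events for one key land in one partition, so their relative order is kept.
device_1 = [p for part in partitions.values() for k, p in part if k == "device-1"]
```

Events for different keys may land in different partitions, and no order is defined between those partitions.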
Question 3: What is the purpose of Consumer Groups in Event Hubs?
Answer: A consumer group is an independent view of the event stream. Multiple applications (for example, a real-time dashboard and an archiver) can read the same Event Hub at their own pace, each tracking its own offset per partition.
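A minimal in-memory sketch of independent offsets per consumer group follows. The `EventLog` class is hypothetical and models a single partition; with the real service, checkpoints are typically stored by the client (for example in a checkpoint store), not inside the log as shown here.

```python
class EventLog:
    """In-memory stand-in for one partition's retained log."""

    def __init__(self):
        self.events = []
        self.offsets = {}  # consumer group name -> next index to read

    def send(self, event):
        self.events.append(event)

    def receive(self, group, max_events=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.events[start:start + max_events]
        self.offsets[group] = start + len(batch)
        return batch

log = EventLog()
for e in ["e1", "e2", "e3"]:
    log.send(e)

first_dashboard = log.receive("dashboard", max_events=2)  # reads e1, e2
archive = log.receive("archive")                          # reads e1, e2, e3
second_dashboard = log.receive("dashboard")               # continues at e3
```

Reading with one group never removes events or moves another group's position.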
Question 4: How do you handle late-arriving data in Stream Analytics?
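Answer: Use TIMESTAMP BY to process by event time, then set the event ordering policy: a late arrival tolerance window (default 5 seconds) and an out-of-order tolerance window. Events outside the tolerance are either dropped or have their timestamp adjusted, depending on the configured action. Stream Analytics does not retroactively revise window results it has already emitted, so size the tolerances against the output latency you can accept.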
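The drop-or-adjust decision can be sketched as below. The function name and signature are hypothetical, and the logic is a simplified model of a late arrival policy, not Stream Analytics code.

```python
from datetime import datetime, timedelta

def apply_late_arrival_policy(event_time, arrival_time, tolerance, action="adjust"):
    """Return the timestamp to process the event with, or None if it is dropped."""
    delay = arrival_time - event_time
    if delay <= tolerance:
        return event_time  # within tolerance: keep the original event time
    if action == "drop":
        return None
    # "adjust": move the timestamp to the oldest time still inside the tolerance
    return arrival_time - tolerance

arrival = datetime(2024, 1, 1, 12, 0, 10)
tolerance = timedelta(seconds=5)

on_time = apply_late_arrival_policy(datetime(2024, 1, 1, 12, 0, 8), arrival, tolerance)
adjusted = apply_late_arrival_policy(datetime(2024, 1, 1, 12, 0, 0), arrival, tolerance)
dropped = apply_late_arrival_policy(datetime(2024, 1, 1, 12, 0, 0), arrival, tolerance, "drop")
```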
Question 5: What is the difference between Tumbling, Hopping, and Session windows?
Answer: Tumbling: Fixed-size, non-overlapping. Hopping: Fixed-size with overlap. Session: Dynamic size based on activity gaps. Use tumbling for periodic aggregations; session for user/device activity.
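The three window types can be compared with a small sketch. It assumes integer timestamps in seconds and half-open windows [start, end), which is a simplification; the helper names are hypothetical and this is not how Stream Analytics implements windowing.

```python
def tumbling_window(ts, size):
    """The single fixed, non-overlapping window that contains ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """Every window of length size, starting on a multiple of hop, that contains ts."""
    windows = []
    start = (ts // hop) * hop
    while start > ts - size:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions that close after more than gap seconds of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

tumbling = tumbling_window(7, size=5)
hopping = hopping_windows(7, size=10, hop=5)
sessions = session_windows([1, 2, 3, 20, 21, 50], gap=5)
```

An event belongs to exactly one tumbling window, to size/hop hopping windows, and to one session whose length depends on the data.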
Question 6: How do you scale Event Hubs for higher throughput?
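Answer: Increase the capacity unit for the tier in use: throughput units (TUs) on Standard, processing units (PUs) on Premium, or capacity units (CUs) on Dedicated. Choose enough partitions up front (the count cannot be changed after creation on the Standard tier), spread load with well-distributed partition keys, and scale out consumers to match the partition count. Monitor throughput and throttling metrics to decide when to scale.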
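A back-of-the-envelope sizing calculation might look like the sketch below. The per-TU limits (1 MB/s or 1,000 events/s ingress, 2 MB/s egress) are assumptions taken from commonly documented Standard-tier figures, and `required_throughput_units` is a hypothetical helper.

```python
import math

# Assumed per-TU limits on the Standard tier.
INGRESS_MB_PER_TU = 1.0
INGRESS_EVENTS_PER_TU = 1000
EGRESS_MB_PER_TU = 2.0

def required_throughput_units(ingress_mb_s, ingress_events_s, egress_mb_s):
    """Smallest TU count that satisfies every limit at once."""
    return max(
        1,
        math.ceil(ingress_mb_s / INGRESS_MB_PER_TU),
        math.ceil(ingress_events_s / INGRESS_EVENTS_PER_TU),
        math.ceil(egress_mb_s / EGRESS_MB_PER_TU),
    )

# 3.5 MB/s in, 2,500 events/s in, 9 MB/s out across all consumer groups.
tus = required_throughput_units(3.5, 2500, 9.0)
```

Here egress is the binding constraint, which is common when several consumer groups read the same stream.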
Question 7: What is the benefit of Event Hubs Capture?
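Answer: Automatically archives events to ADLS or Blob Storage as Avro files, on a configurable time or size window. Enables batch analytics alongside real-time processing without writing a separate consumer. Capture is billed as an add-on in the Standard tier and is included in Premium and Dedicated.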
Question 8: How do you handle backpressure in streaming pipelines?
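Answer: Event Hubs acts as the buffer: events are retained (up to 7 days on Standard, longer on Premium and Dedicated), so consumers can fall behind without losing data. Stream Analytics reads at the pace its Streaming Units (SUs) allow. Monitor lag metrics such as watermark delay and backlogged input events, scale SUs when lag grows, and use circuit breakers to protect slow downstream sinks.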
Question 9: What is the difference between at-least-once and exactly-once delivery?
Answer: At-least-once: Message delivered one or more times (duplicates possible). Exactly-once: Message delivered exactly once (no duplicates). Event Hubs provides at-least-once; use deduplication for exactly-once semantics.
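A minimal sketch of consumer-side deduplication follows, assuming each message carries a unique ID assigned by the producer. The `Deduplicator` class is hypothetical, and its bounded in-memory cache only catches duplicates that arrive within the last `max_ids` messages.

```python
from collections import OrderedDict

class Deduplicator:
    """Remember recently seen message IDs so redelivered messages can be skipped."""

    def __init__(self, max_ids=10000):
        self.seen = OrderedDict()
        self.max_ids = max_ids

    def is_new(self, message_id):
        if message_id in self.seen:
            return False
        self.seen[message_id] = True
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict the oldest ID
        return True

def process(messages, dedup):
    """Return the bodies of messages not seen before, in arrival order."""
    return [body for msg_id, body in messages if dedup.is_new(msg_id)]

delivered = [("m1", "a"), ("m2", "b"), ("m1", "a")]  # m1 is redelivered
unique = process(delivered, Deduplicator())
```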
Question 10: How do you implement anomaly detection in Stream Analytics?
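Answer: Use the built-in functions AnomalyDetection_SpikeAndDip (temporary spikes and dips) and AnomalyDetection_ChangePoint (persistent changes in level or trend). Both run over a sliding window defined with OVER (LIMIT DURATION(...)) and take a confidence level and a history size as parameters.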
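To show the general idea of scoring a value against recent history, here is a simple z-score check. This is a swapped-in technique for illustration only; it is not the algorithm behind the built-in Stream Analytics functions, and `spike_or_dip` and its threshold are hypothetical.

```python
from statistics import mean, pstdev

def spike_or_dip(history, value, threshold=3.0):
    """Flag value if it lies more than threshold standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

history = [10, 11, 9, 10, 10, 11, 9, 10]
is_spike = spike_or_dip(history, 25)
is_normal = spike_or_dip(history, 10.5)
```

The history list plays the role of the history size parameter, and the threshold loosely corresponds to a confidence setting.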
Question 11: What is the maximum number of Streaming Units per job?
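Answer: The limit depends on the SU version and has changed over time; recent Azure documentation lists 66 SU V2 per job (earlier limits were stated in SU V1). Throughput per SU is not fixed, because it depends on query complexity, partitioning, and event size. The 1 MB/s ingress and 2 MB/s egress figures describe an Event Hubs throughput unit, not a Streaming Unit.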
Question 12: How do you handle message ordering across partitions?
Answer: Cross-partition ordering is not guaranteed. Use partition keys for ordering within partitions. For global ordering, use single partition (limits throughput) or implement application-level ordering.
Question 13: What is the benefit of temporal joins in Stream Analytics?
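Answer: Temporal joins match events from two streams only when they occur within a bounded time of each other, which keeps join state finite on unbounded streams. Use DATEDIFF in the ON clause to define that bound. Joins against reference data are also supported and do not need a DATEDIFF condition.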
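The time-bounded matching can be sketched in a few lines. Events are assumed to be (key, timestamp, value) tuples with integer timestamps in seconds; `temporal_join` is a hypothetical helper, not Stream Analytics syntax.

```python
def temporal_join(left, right, max_diff):
    """Join events with equal keys where 0 <= right_ts - left_ts <= max_diff."""
    by_key = {}
    for key, ts, value in right:
        by_key.setdefault(key, []).append((ts, value))

    matches = []
    for key, l_ts, l_val in left:
        for r_ts, r_val in by_key.get(key, []):
            if 0 <= r_ts - l_ts <= max_diff:
                matches.append((key, l_val, r_val))
    return matches

entries = [("car1", 0, "enter"), ("car2", 10, "enter")]
exits = [("car1", 30, "exit"), ("car2", 200, "exit")]

# Only exits within 60 seconds of the entry are matched.
joined = temporal_join(entries, exits, max_diff=60)
```

Because of the time bound, a streaming engine only has to retain events for `max_diff` seconds rather than indefinitely.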
Question 14: How do you handle poison messages in Event Hubs?
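Answer: Event Hubs has no built-in dead-letter queue, so implement the pattern in the consumer: retry with exponential backoff, and after a bounded number of attempts route the failing event to a separate event hub, queue, or storage container for manual inspection, then checkpoint past it. Monitor error rates and alert on repeated failures.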
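A minimal sketch of retry with exponential backoff plus a dead-letter destination follows. The `process_with_retry` helper is hypothetical, and the dead-letter store is a plain list standing in for a separate event hub, queue, or storage container.

```python
import time

def process_with_retry(event, handler, dead_letters,
                       max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Try handler up to max_attempts times; dead-letter the event if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return handler(event)
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of retries: park the event so the partition is not blocked.
                dead_letters.append({"event": event, "error": str(exc)})
                return None
            sleep(base_delay * (2 ** attempt))  # delays double on each retry

def handler(event):
    if event == "bad":
        raise ValueError("cannot parse")
    return event.upper()

delays = []        # records requested delays instead of actually sleeping
dead_letters = []
ok = process_with_retry("good", handler, dead_letters, sleep=delays.append)
failed = process_with_retry("bad", handler, dead_letters, base_delay=1, sleep=delays.append)
```

After dead-lettering, the consumer can checkpoint past the event so one bad message does not stall the rest of the partition.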
Question 15: What is the difference between Stream Analytics and Azure Functions?
Answer: Stream Analytics: SQL-like windowed aggregations, temporal joins, CEP. Azure Functions: Custom logic, API calls, simple transformations. Use Stream Analytics for analytics; Functions for actions.
Question 16: How do you implement exactly-once processing?
Answer: Use checkpointing in Stream Analytics, idempotent operations in consumers, deduplication with unique keys, and transactional writes to downstream systems.
Question 17: What is the benefit of partitioning in Event Hubs?
Answer: Enables parallel consumption, provides ordering within partitions, scales throughput, and isolates consumers. Choose partition count based on throughput requirements.
Question 18: How do you handle schema changes in streaming?
Answer: Use schema registry (Event Hubs), schema evolution in Stream Analytics, and flexible consumers that handle unknown fields. Validate schema at ingestion.
Question 19: What is the difference between pull and push models?
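Answer: Pull: The consumer requests events at its own pace and tracks its own position. Push: The service delivers events to a subscriber endpoint as they occur. Event Hubs is a pull-style log that consumers read by partition and offset; Event Grid is push-based. The two integrate when Event Hubs Capture publishes a file-created event to Event Grid.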
Question 20: How do you test streaming pipelines?
Answer: Use test data generators, replay from Event Hubs Capture files, simulate load with Azure Load Testing, and validate window results against expected outputs.
Question 21: What is the benefit of windowed aggregations?
Answer: Provides time-bounded analytics (e.g., 5-minute averages), reduces data volume, enables trend analysis, and supports real-time dashboards.
Question 22: How do you handle idempotent processing?
Answer: Use unique message IDs, checkpoint processing, deduplication logic, and transactional writes. Ensure processing can be safely replayed without side effects.
Question 23: What is the difference between Event Hubs Basic and Standard tier?
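Answer: Both tiers encrypt traffic with TLS and scale to 40 TUs. Basic: 1 consumer group, 1-day retention, no Capture, no Kafka endpoint. Standard: up to 20 consumer groups, up to 7-day retention, Capture, Kafka endpoint, and Schema Registry. Use Standard for production; Basic for development/testing.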
Question 24: How do you monitor streaming pipeline health?
Answer: Monitor Event Hub metrics (throughput, lag), Stream Analytics SU utilization, checkpoint progress, and custom metrics. Set up alerts for anomalies.
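Consumer lag is usually derived rather than read directly: it is the gap between the last enqueued sequence number and the last checkpointed one, per partition. The sketch below assumes both values have already been collected into dictionaries; the helper names and the threshold are hypothetical.

```python
def partition_lag(last_enqueued_seq, checkpointed_seq):
    """Events not yet processed per partition (a checkpoint of -1 means nothing processed)."""
    return {
        partition: last_enqueued_seq[partition] - checkpointed_seq.get(partition, -1)
        for partition in last_enqueued_seq
    }

def lagging_partitions(lag, threshold):
    """Partitions whose backlog exceeds the alert threshold."""
    return sorted(p for p, backlog in lag.items() if backlog > threshold)

last_enqueued = {"0": 1500, "1": 900, "2": 9}
checkpointed = {"0": 1490, "1": 100}  # partition 2 has no checkpoint yet

lag = partition_lag(last_enqueued, checkpointed)
alerts = lagging_partitions(lag, threshold=500)
```

Tracking the trend of this number over time is more useful than a single reading, since steadily growing lag indicates the consumer cannot keep up.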
Question 25: What is the difference between batch and stream processing?
Answer: Batch: Process large datasets periodically. Stream: Process events continuously. Use batch for historical analysis; stream for real-time insights.