Streaming Data Interview Q&A

25 interview questions on real-time streaming, Event Hubs, Stream Analytics, and processing patterns

Question 1: What is the difference between Event Hubs and Event Grid?

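Answer: Event Hubs is a high-throughput event streaming service (millions of events per second) that stores events in a partitioned, time-retained log from which consumers pull. Event Grid is a push-based event routing service that filters discrete events and delivers them to subscribers. Use Event Hubs for data ingestion pipelines and Event Grid for reactive, event-driven architectures.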

Question 2: How do you ensure message ordering in Event Hubs?

Answer: Event Hubs guarantees ordering within a partition using sequence numbers. Use the same partition key for related events. For global ordering, use a single partition (limits throughput).
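A minimal sketch of why a shared partition key preserves per-key ordering. The hash below is an illustrative stand-in, not the algorithm the service uses, and assign_partition is a hypothetical helper name.

```python
import hashlib

def assign_partition(partition_key: str, partition_count: int) -> int:
    """Map a key to a partition index with a stable hash (illustrative only)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

# Events are appended to a partition in send order, so all events that
# share a key keep their relative order inside that partition.
events = [("device-1", "on"), ("device-2", "on"), ("device-1", "off")]
partitions = {}
for key, payload in events:
    partitions.setdefault(assign_partition(key, 4), []).append((key, payload))
```

Events for different keys may land on different partitions, and no order is implied between them.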

Question 3: What is the purpose of Consumer Groups in Event Hubs?

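Answer: A consumer group is an independent view of the event hub: each group tracks its own position (offset) in every partition, so several applications, such as a real-time dashboard and an archiver, can read the same stream at their own pace without interfering. Parallelism within one application comes from reading partitions concurrently, not from adding consumer groups.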

Question 4: How do you handle late-arriving data in Stream Analytics?

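Answer: Set the job's event ordering policies: a late arrival tolerance (how far an event's arrival may trail its application timestamp, configurable up to 20 days) and an out-of-order tolerance (configurable up to just under one hour). Events outside a tolerance are either dropped or have their timestamp adjusted, depending on the configured action. Larger tolerances increase the watermark delay and therefore output latency. Stream Analytics does not retroactively revise window results it has already emitted.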
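The drop-or-adjust decision for a late event can be sketched as follows. This is a simplified model of a late arrival policy under the assumption that "adjust" moves the timestamp to the oldest allowed time; apply_late_arrival_policy and its parameters are hypothetical names, not a Stream Analytics API.

```python
from datetime import datetime, timedelta
from typing import Optional

def apply_late_arrival_policy(
    event_time: datetime,
    arrival_time: datetime,
    tolerance: timedelta,
    action: str = "adjust",
) -> Optional[datetime]:
    """Return the timestamp to process the event with, or None if it is dropped."""
    oldest_allowed = arrival_time - tolerance
    if event_time >= oldest_allowed:
        return event_time      # within tolerance: keep the original timestamp
    if action == "drop":
        return None            # too late: discard the event
    return oldest_allowed      # too late: move the timestamp forward
```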

Question 5: What is the difference between Tumbling, Hopping, and Session windows?

Answer: Tumbling: Fixed-size, non-overlapping. Hopping: Fixed-size with overlap. Session: Dynamic size based on activity gaps. Use tumbling for periodic aggregations; session for user/device activity.
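The three window types can be illustrated with plain Python over integer timestamps in seconds. This sketch assumes half-open windows [start, end) and omits the maximum duration that a real session window also carries; the function names are hypothetical.

```python
def tumbling_window(ts: int, size: int) -> tuple:
    """Each timestamp belongs to exactly one fixed-size window."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts: int, size: int, hop: int) -> list:
    """With hop < size, windows overlap and a timestamp belongs to several."""
    windows = []
    start = (ts // hop) * hop          # latest window start at or before ts
    while start > ts - size:           # keep windows that still contain ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

def session_windows(timestamps: list, gap: int) -> list:
    """Group timestamps into sessions separated by more than `gap` of silence."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

For example, with a 10-second window hopping every 5 seconds, an event at t=12 falls into both (5, 15) and (10, 20).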

Question 6: How do you scale Event Hubs for higher throughput?

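Answer: Add capacity for the tier in use: throughput units (Standard, optionally with auto-inflate), processing units (Premium), or capacity units (Dedicated). Provision enough partitions for the consumer parallelism you need, choosing the count up front because it cannot be changed after creation in the Standard tier. Pick partition keys that spread load evenly to avoid hot partitions, and scale out consumers, typically to one active reader per partition per consumer group. Monitor throttling and utilization metrics to decide when to scale.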
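A rough sizing sketch for Standard-tier throughput units, assuming the commonly published per-TU figures of 1 MB/s or 1,000 events/s ingress and 2 MB/s egress. estimate_throughput_units is a hypothetical helper; treat the result as a starting point to validate against observed throttling.

```python
import math

def estimate_throughput_units(
    ingress_mb_per_s: float,
    ingress_events_per_s: float,
    egress_mb_per_s: float,
) -> int:
    """Return the smallest TU count that satisfies every per-TU limit."""
    by_ingress_bytes = math.ceil(ingress_mb_per_s / 1.0)       # 1 MB/s per TU (assumed)
    by_ingress_events = math.ceil(ingress_events_per_s / 1000)  # 1,000 events/s per TU (assumed)
    by_egress_bytes = math.ceil(egress_mb_per_s / 2.0)         # 2 MB/s per TU (assumed)
    return max(1, by_ingress_bytes, by_ingress_events, by_egress_bytes)
```

Egress often dominates when several consumer groups read the same stream, because each group's reads count toward the egress limit.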

Question 7: What is the benefit of Event Hubs Capture?

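Answer: Capture automatically archives incoming events to Azure Blob Storage or ADLS as Avro files, in time- or size-based batches, so batch analytics can run on the same data that real-time consumers read. Parquet output is available by writing through Stream Analytics rather than from Capture itself. Capture is included in the Premium and Dedicated tiers but is billed separately in the Standard tier.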

Question 8: How do you handle backpressure in streaming pipelines?

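Answer: Event Hubs acts as the buffer: events are retained for the configured period (up to 7 days in the Standard tier, longer in Premium and Dedicated), so a slow consumer falls behind rather than losing data, provided it catches up within retention. Stream Analytics processes at the pace its Streaming Units allow. Monitor lag (watermark delay and backlogged input events) and scale SUs when it grows. Protect slow downstream sinks with retries and circuit breakers.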

Question 9: What is the difference between at-least-once and exactly-once delivery?

Answer: At-least-once: Message delivered one or more times (duplicates possible). Exactly-once: Message delivered exactly once (no duplicates). Event Hubs provides at-least-once; use deduplication for exactly-once semantics.
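Deduplication on top of at-least-once delivery can be sketched as a consumer that remembers the IDs it has already handled. This assumes each event carries a unique "id" field set by the producer; DeduplicatingConsumer is a hypothetical class, and a real system would bound or persist the seen-ID set.

```python
class DeduplicatingConsumer:
    """Processes each event ID at most once, even if it is delivered again."""

    def __init__(self):
        self._seen_ids = set()
        self.processed = []

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen_ids:
            return False               # redelivery: skip the side effect
        self._seen_ids.add(event_id)
        self.processed.append(event["body"])
        return True

consumer = DeduplicatingConsumer()
for evt in [{"id": 1, "body": "a"}, {"id": 2, "body": "b"}, {"id": 1, "body": "a"}]:
    consumer.handle(evt)
```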

Question 10: How do you implement anomaly detection in Stream Analytics?

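Answer: Use the built-in functions AnomalyDetection_SpikeAndDip (temporary spikes and dips) and AnomalyDetection_ChangePoint (persistent changes in level or trend). Each takes a confidence level and a history size and is evaluated over a sliding window defined with OVER (LIMIT DURATION(...)); SpikeAndDip also takes a mode that selects spikes, dips, or both.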

Question 11: What is the maximum number of Streaming Units per job?

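Answer: The per-job maximum is a service limit that has changed over time and depends on the SU version (V1 or V2), so the often-quoted figure of 120 SU may be out of date; check the current Azure limits before citing a number. An SU is a blend of CPU and memory, not a fixed throughput: the 1 MB/s ingress and 2 MB/s egress figures describe an Event Hubs throughput unit, not a Streaming Unit. Size a job by query complexity, input partitioning, and observed SU utilization.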

Question 12: How do you handle message ordering across partitions?

Answer: Cross-partition ordering is not guaranteed. Use partition keys for ordering within partitions. For global ordering, use single partition (limits throughput) or implement application-level ordering.

Question 13: What is the benefit of temporal joins in Stream Analytics?

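Answer: Temporal joins correlate events from two streams only when they occur within a bounded time of each other, for example a click within 10 seconds of an impression. In Stream Analytics, a stream-to-stream JOIN must bound the time difference with DATEDIFF in the ON clause, which also limits how much state the job keeps. Joins against reference data are lookups and do not need a DATEDIFF bound.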
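The effect of a time-bounded join condition can be sketched in Python. This is an illustration of the semantics only, assuming events are (timestamp_seconds, key, value) tuples; temporal_join is a hypothetical helper and the nested loop is not how a streaming engine evaluates the join.

```python
def temporal_join(left: list, right: list, max_delay_s: int) -> list:
    """Pair left and right events with the same key where the right event
    occurs 0 to max_delay_s seconds after the left event."""
    matches = []
    for l_ts, l_key, l_val in left:
        for r_ts, r_key, r_val in right:
            if l_key == r_key and 0 <= r_ts - l_ts <= max_delay_s:
                matches.append((l_key, l_val, r_val))
    return matches

impressions = [(0, "ad1", "impression"), (100, "ad2", "impression")]
clicks = [(5, "ad1", "click"), (130, "ad2", "click")]
joined = temporal_join(impressions, clicks, max_delay_s=10)
```

The ad2 click arrives 30 seconds after its impression, so it falls outside the 10-second bound and is not joined.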

Question 14: How do you handle poison messages in Event Hubs?

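Answer: Event Hubs has no built-in dead-letter queue, so implement the pattern in the consumer: retry transient failures with exponential backoff, and after a fixed number of attempts write the event and its error to a separate event hub, queue, or storage container, then checkpoint past it so the partition is not blocked. Monitor error rates, alert on repeated failures, and inspect dead-lettered events manually.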
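Retry with exponential backoff followed by dead-lettering can be sketched in the consumer like this. The dead-letter destination is modeled as a plain list, and process_with_dead_letter, its parameters, and the injected sleep function are hypothetical names chosen for illustration.

```python
import time

def process_with_dead_letter(
    events,
    handler,
    dead_letters: list,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep=time.sleep,
) -> None:
    """Try each event up to max_attempts times, then set it aside and move on."""
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(event)
                break                                  # success: next event
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up: record the event and error for manual inspection.
                    dead_letters.append({"event": event, "error": repr(exc)})
                else:
                    # Back off 0.1 s, 0.2 s, 0.4 s, ... before retrying.
                    sleep(base_delay_s * 2 ** (attempt - 1))
```

Passing the sleep function as a parameter keeps the backoff testable without real delays.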

Question 15: What is the difference between Stream Analytics and Azure Functions?

Answer: Stream Analytics: SQL-like windowed aggregations, temporal joins, CEP. Azure Functions: Custom logic, API calls, simple transformations. Use Stream Analytics for analytics; Functions for actions.

Question 16: How do you implement exactly-once processing?

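Answer: Combine replayable input with idempotent output. Stream Analytics checkpoints its state internally and guarantees exactly-once processing, but delivery to most outputs is at-least-once, so the sink must tolerate replays. In custom consumers, checkpoint after processing, deduplicate on a unique event key, and use upserts or transactional writes downstream.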

Question 17: What is the benefit of partitioning in Event Hubs?

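Answer: Partitions are the unit of parallelism: events within a partition are ordered, different partitions can be read concurrently by different consumers, and throughput scales with the number of partitions in use. Choose the partition count from the expected peak throughput and the number of concurrent readers needed.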

Question 18: How do you handle schema changes in streaming?

Answer: Use schema registry (Event Hubs), schema evolution in Stream Analytics, and flexible consumers that handle unknown fields. Validate schema at ingestion.
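A consumer that tolerates schema changes can be sketched as a parser that validates required fields, defaults newer optional ones, and ignores anything it does not recognize. The field names and the REQUIRED_FIELDS table are hypothetical examples, not part of any Azure schema.

```python
import json

# Hypothetical contract: field name -> type used to coerce the value.
REQUIRED_FIELDS = {"device_id": str, "temperature": float}

def parse_reading(raw: str) -> dict:
    """Validate required fields, default optional ones, ignore unknown ones."""
    data = json.loads(raw)
    reading = {}
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing required field: {name}")
        reading[name] = expected_type(data[name])
    # Optional field added in a later schema version; older producers omit it.
    reading["unit"] = data.get("unit", "C")
    return reading
```

A payload with an extra field such as "firmware" parses without error, while one missing "temperature" is rejected at ingestion.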

Question 19: What is the difference between pull and push models?

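Answer: Pull: the consumer requests events at its own pace and tracks its own position, which makes replay and backpressure simple. Push: the service delivers events to a subscriber endpoint as they occur. Event Hubs consumers follow the pull model, reading from partitions by offset; Event Grid is push-based. The two integrate, for example when Event Hubs Capture raises an Event Grid event each time a capture file is written.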

Question 20: How do you test streaming pipelines?

Answer: Use test data generators, replay from Event Hubs Capture files, simulate load with Azure Load Testing, and validate window results against expected outputs.

Question 21: What is the benefit of windowed aggregations?

Answer: Provides time-bounded analytics (e.g., 5-minute averages), reduces data volume, enables trend analysis, and supports real-time dashboards.

Question 22: How do you handle idempotent processing?

Answer: Use unique message IDs, checkpoint processing, deduplication logic, and transactional writes. Ensure processing can be safely replayed without side effects.
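Safe replay can be sketched by combining a checkpoint with keyed upserts: events at or below the checkpoint are skipped, and writing the same event twice leaves the store unchanged. The store is modeled as a dict and replay is a hypothetical helper; events are assumed to be (sequence_number, event_id, value) tuples in sequence order.

```python
def replay(events: list, store: dict, checkpoint: int) -> int:
    """Apply events after `checkpoint` and return the new checkpoint."""
    for sequence_number, event_id, value in events:
        if sequence_number <= checkpoint:
            continue                   # already covered by the checkpoint
        store[event_id] = value        # upsert: repeating it has no extra effect
        checkpoint = sequence_number
    return checkpoint

events = [(1, "order-1", 10), (2, "order-2", 20), (3, "order-1", 15)]
store = {}
replay(events[:2], store, checkpoint=0)          # crash before the checkpoint is saved
final_checkpoint = replay(events, store, checkpoint=0)  # restart from the old checkpoint
```

After the restart the first two events are applied a second time, yet the store ends in the same state as a single clean run.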

Question 23: What is the difference between Event Hubs Basic and Standard tier?

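Answer: Both tiers encrypt traffic with TLS and are sold in throughput units. Basic is limited to 1 consumer group, 1-day retention, and fewer connections, and has no Capture or Kafka endpoint. Standard adds up to 20 consumer groups, retention of up to 7 days, Capture (billed separately), the Kafka endpoint, and auto-inflate. Use Standard or higher for production; Basic suits development and testing.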

Question 24: How do you monitor streaming pipeline health?

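Answer: For Event Hubs, watch incoming and outgoing messages and bytes, throttled requests, and consumer lag (the gap between each partition's last enqueued sequence number and the consumer's checkpoint). For Stream Analytics, watch SU utilization, watermark delay, backlogged input events, and input and runtime errors. Add custom metrics for business-level checks and alert on sustained deviations.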

Question 25: What is the difference between batch and stream processing?

Answer: Batch: Process large datasets periodically. Stream: Process events continuously. Use batch for historical analysis; stream for real-time insights.