principle | python | Moderate
Streaming vs batch: when to use Kafka+Flink vs Airflow+Spark
streaming vs batch processing, real-time etl, kafka flink spark, micro-batch streaming, lambda kappa architecture
Problem
Teams often default to batch processing for everything, including use cases that need sub-minute latency (fraud detection, real-time recommendations). Others make the opposite mistake and over-engineer batch workloads as streaming pipelines, adding complexity without benefit.
Solution
Match the processing model to latency requirements:
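from pyspark.sql import SparkSession

# Use batch when:
# - Latency of minutes to hours is acceptable
# - Historical reprocessing is common
# - Data arrives in bulk (nightly dumps, daily files)
# Use streaming when:
# - Sub-second to sub-minute latency is required
# - Events must be acted on as they arrive (fraud, alerts)
# - Aggregations are continuous (sliding windows)
# Lambda architecture: run both paths and reconcile batch results with streaming results
# Kappa architecture: streaming only; reprocess by replaying the Kafka log
# Micro-batch (Spark Structured Streaming) is a middle ground:
spark = SparkSession.builder.appName('events-micro-batch').getOrCreate()

# Read the 'events' topic as an unbounded streaming DataFrame.
df = (
    spark.readStream.format('kafka')
    .option('kafka.bootstrap.servers', 'broker:9092')
    .option('subscribe', 'events')
    .load()
)

# Append to a Delta table once per minute. A checkpoint location is needed so
# the query can recover its Kafka offsets; the path below is a placeholder.
query = (
    df.writeStream.format('delta')
    .option('checkpointLocation', 's3://lake/_checkpoints/events/')
    .trigger(processingTime='1 minute')
    .start('s3://lake/events/')
)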
Why
Streaming systems (Flink, Kafka Streams) are stateful, fault-tolerant, and designed for continuous event-by-event processing, but they are operationally complex. Batch systems (Spark, dbt) are generally simpler to operate, cheaper to run, and easier to debug. Because the extra complexity only pays off when results are needed quickly, the latency requirement is the primary decision driver.
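The rules of thumb from the Solution section can be written down as a small decision helper. This is only a sketch: the function name `choose_processing_model`, its parameters, and the 1-second and 60-second thresholds are illustrative assumptions, not limits taken from any of the tools mentioned.

```python
def choose_processing_model(
    max_latency_seconds: float,
    acts_on_each_event: bool = False,
    data_arrives_in_bulk: bool = False,
) -> str:
    """Return 'streaming', 'micro-batch', or 'batch' from rough rules of thumb.

    Hypothetical helper; the thresholds are assumptions for illustration.
    """
    if max_latency_seconds <= 0:
        raise ValueError("max_latency_seconds must be positive")

    # Per-event reactions (fraud checks, alerts) or sub-second budgets
    # call for an event-at-a-time engine.
    if acts_on_each_event or max_latency_seconds < 1:
        return "streaming"

    # Seconds up to about a minute: a micro-batch trigger is usually enough,
    # unless the data only shows up as bulk files anyway.
    if max_latency_seconds <= 60 and not data_arrives_in_bulk:
        return "micro-batch"

    # Minutes to hours, or bulk arrivals: scheduled batch jobs.
    return "batch"


if __name__ == "__main__":
    print(choose_processing_model(0.2))
    print(choose_processing_model(30))
    print(choose_processing_model(3600, data_arrives_in_bulk=True))
```

In practice the inputs would come from the pipeline's stated latency target, and cost or team experience may reasonably override the result.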
Gotchas
- Spark Structured Streaming's default micro-batch mode is not event-at-a-time streaming: each trigger adds latency, typically from around 100 ms up to several seconds, and the experimental continuous mode trades lower latency for at-least-once guarantees
- Streaming pipelines must handle out-of-order events, watermarks, and state store management
- The Kafka retention period limits how far back you can replay, so size it for your reprocessing window
- Exactly-once semantics in streaming require either coordinated transactions between Kafka and the sink or idempotent writes to the sink
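To make the out-of-order and watermark gotcha concrete, here is a minimal pure-Python sketch of event-time tumbling windows with a watermark. The class `TumblingWindowCounter` and its parameters are hypothetical names for illustration; Flink and Spark expose the same idea through their own windowing and watermark APIs, which handle this far more thoroughly.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed event-time window, tolerating bounded lateness.

    Hypothetical helper for illustration only, not a library API.
    Event times are integer seconds.
    """

    def __init__(self, window_seconds: int, allowed_lateness_seconds: int):
        self.window_seconds = window_seconds
        self.allowed_lateness_seconds = allowed_lateness_seconds
        self.open_windows = defaultdict(int)  # window start -> running count
        self.max_event_time = None            # highest event time seen so far
        self.dropped_late_events = 0

    def _watermark(self) -> int:
        # Watermark: events older than this are no longer expected.
        return self.max_event_time - self.allowed_lateness_seconds

    def add(self, event_time: int) -> list:
        """Ingest one event; return (window_start, count) pairs that just closed."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

        window_start = event_time - event_time % self.window_seconds
        if window_start + self.window_seconds <= self._watermark():
            # This event's window was already finalized, so it is too late.
            self.dropped_late_events += 1
        else:
            self.open_windows[window_start] += 1
        return self._close_finished_windows()

    def _close_finished_windows(self) -> list:
        watermark = self._watermark()
        closed = sorted(
            (start, count)
            for start, count in self.open_windows.items()
            if start + self.window_seconds <= watermark
        )
        for start, _ in closed:
            del self.open_windows[start]  # free state once the result is emitted
        return closed


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_seconds=60, allowed_lateness_seconds=10)
    for t in [5, 50, 30, 65, 58, 75, 20]:
        print(t, counter.add(t))
    print("dropped:", counter.dropped_late_events)
```

With a 60-second window and 10 seconds of allowed lateness, the events at 30 and 58 arrive out of order but are still counted, while the event at 20 arrives after its window has closed and is dropped. Raising the allowed lateness keeps more late events at the cost of holding window state longer, which is the state-management trade-off noted above.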
Context
Choosing a data processing architecture for a new pipeline with latency requirements