principle · python · Moderate

Streaming vs batch: when to use Kafka+Flink vs Airflow+Spark

Submitted by: @seed
streaming vs batch processing, real-time etl, kafka flink spark, micro-batch streaming, lambda kappa architecture

Problem

Teams often default to batch processing for everything, including use cases that need sub-minute latency (fraud detection, real-time recommendations). Others make the opposite mistake and build streaming pipelines for workloads that are naturally batch, adding complexity without benefit.

Solution

Match the processing model to latency requirements:

# Use batch when:
# - Latency of minutes to hours is acceptable
# - Historical reprocessing is common
# - Data arrives in bulk (nightly dumps, daily files)

# Use streaming when:
# - Sub-second to sub-minute latency required
# - Events must be acted on as they arrive (fraud, alerts)
# - Continuous aggregations (sliding windows)

# Lambda architecture: run a batch path and a streaming path, then reconcile their results
# Kappa architecture: streaming only; reprocess history by replaying the Kafka log

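# Micro-batch (Spark Structured Streaming) is a middle ground:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('events-micro-batch').getOrCreate()

# Read the 'events' topic as an unbounded DataFrame (needs the Kafka connector package)
df = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', 'broker:9092') \
    .option('subscribe', 'events') \
    .load()

# Append to Delta once per minute; the checkpoint path is a placeholder, but some
# checkpoint location is required so the query can recover after a restart
query = df.writeStream.format('delta') \
    .option('checkpointLocation', 's3://lake/_checkpoints/events/') \
    .trigger(processingTime='1 minute') \
    .start('s3://lake/events/')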
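
The decision rule above can be sketched as a small helper. This is a minimal illustration only: the function name and both thresholds are assumptions chosen for the example, not fixed rules, and a real decision would also weigh reprocessing needs and how the data arrives.

```python
# Illustrative thresholds (assumptions, tune for your own workload)
STREAMING_BELOW_SECONDS = 1.0     # tighter than this: event-at-a-time processing
MICRO_BATCH_BELOW_SECONDS = 60.0  # tighter than this: short-trigger micro-batches


def choose_processing_model(max_latency_seconds: float) -> str:
    """Map an end-to-end latency requirement to a processing model."""
    if max_latency_seconds < STREAMING_BELOW_SECONDS:
        return "streaming"    # e.g. Flink or Kafka Streams
    if max_latency_seconds < MICRO_BATCH_BELOW_SECONDS:
        return "micro-batch"  # e.g. Spark Structured Streaming
    return "batch"            # e.g. scheduled Spark jobs under Airflow


if __name__ == "__main__":
    workloads = [
        ("fraud scoring", 0.5),
        ("dashboard refresh", 30),
        ("nightly report", 6 * 3600),
    ]
    for name, latency in workloads:
        print(f"{name}: {choose_processing_model(latency)}")
```

Writing the latency requirement down as a number before picking a stack makes the trade-off explicit and easy to revisit when requirements change.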

Why

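Streaming systems (Flink, Kafka Streams) are stateful, fault-tolerant, and designed for continuous event-by-event processing, but they are operationally complex: state, checkpoints, and long-running jobs all need care. Batch systems (Spark, dbt) are generally simpler, cheaper to run, and easier to debug because each run has a bounded input and can simply be rerun. The latency requirement is the primary decision driver.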
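
To make the contrast concrete, here is a minimal pure-Python sketch of the same aggregation done both ways. The event shape `(user_id, amount)`, the class name `StreamingTotals`, and the alert threshold of 100 are hypothetical and exist only for this illustration; real engines add partitioning, checkpointing, and recovery on top of this idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, float]  # (user_id, amount); a hypothetical event shape


def batch_totals(events: Iterable[Event]) -> Dict[str, float]:
    """Batch: the whole input is available and results appear once at the end."""
    totals: Dict[str, float] = defaultdict(float)
    for user_id, amount in events:
        totals[user_id] += amount
    return dict(totals)


class StreamingTotals:
    """Streaming: state survives between events, so each event yields a result."""

    def __init__(self) -> None:
        self._state: Dict[str, float] = defaultdict(float)

    def on_event(self, event: Event) -> float:
        user_id, amount = event
        self._state[user_id] += amount
        return self._state[user_id]  # running total, usable immediately

    def snapshot(self) -> Dict[str, float]:
        return dict(self._state)


if __name__ == "__main__":
    events = [("alice", 40.0), ("bob", 15.0), ("alice", 70.0)]
    stream = StreamingTotals()
    for event in events:
        running = stream.on_event(event)
        if running > 100:  # act as soon as the threshold is crossed
            print(f"alert: {event[0]} exceeded 100 (now {running})")
    # Both approaches agree on the final totals; they differ in when results exist
    print(batch_totals(events) == stream.snapshot())
```

The streaming version can raise the alert on the third event, while the batch version only learns the total after the whole input has been read. The price is the `_state` dictionary, which a real system has to persist and restore after failures.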

Gotchas

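  • Spark Structured Streaming's default micro-batch mode processes events in small batches, not one at a time; end-to-end latency can be as low as roughly 100 ms but is often seconds, depending on the trigger interval and batch size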
  • Streaming pipelines require handling out-of-order events, watermarks, and state store management
  • Kafka retention period limits how far back you can replay — size it for your reprocessing window
  • End-to-end exactly-once semantics require either coordinated transactions between Kafka and the sink or an idempotent sink; without one of these, expect at-least-once delivery with duplicates
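
The out-of-order and watermark point can be illustrated with a minimal pure-Python sketch of an event-time tumbling window. The class name, the integer timestamps, and the rule "watermark = highest event time seen minus allowed lateness" are assumptions made for this example; engines such as Flink and Spark implement the same idea with their own APIs and managed state stores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class TumblingWindowCounter:
    """Counts events per fixed-size event-time window, tolerating bounded lateness."""

    def __init__(self, window_size: int, allowed_lateness: int) -> None:
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.open_windows: Dict[int, int] = defaultdict(int)  # window start -> count
        self.dropped_late = 0

    @property
    def watermark(self) -> int:
        # Assumption: the watermark trails the highest event time seen so far
        return self.max_event_time - self.allowed_lateness

    def on_event(self, event_time: int) -> List[Tuple[int, int]]:
        """Ingest one event; return the (window_start, count) results now final."""
        window_start = event_time - (event_time % self.window_size)
        if window_start + self.window_size <= self.watermark:
            self.dropped_late += 1  # its window already closed: too late to count
            return []
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Close every window that ends at or before the watermark and free its state
        closed = sorted(
            start for start in self.open_windows
            if start + self.window_size <= self.watermark
        )
        return [(start, self.open_windows.pop(start)) for start in closed]


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
    # Event times arrive out of order: 3 is late but accepted, 2 arrives too late
    for t in [1, 4, 12, 3, 17, 2, 25]:
        for start, count in counter.on_event(t):
            print(f"window [{start}, {start + counter.window_size}) -> {count}")
    print(f"dropped late events: {counter.dropped_late}")
```

The allowed lateness is the trade-off knob: a larger value admits more stragglers but delays results and keeps more window state open, which is why state store sizing and watermark tuning tend to go together.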

Context

Choosing a data processing architecture for a new pipeline with latency requirements
