
Stream Processing & Real-Time Data

Batch pipelines process data that has already arrived; stream processing handles data as it arrives. When the business needs fraud decisions in milliseconds, recommendation updates in seconds, or dashboards reflecting activity from the last minute, batch isn't fast enough. Kafka provides durable, replayable event streams; Flink provides stateful computation over those streams with exactly-once semantics.

How It Works

Stream windowing — 15 events over 60 seconds

Tumbling windows are fixed-size and non-overlapping; each event belongs to exactly one window. With 20-second tumbling windows over a 60-second span, the 15 events group into three buckets: [0–20s] n=5 sum=66, [20–40s] n=5 sum=95, [40–60s] n=5 sum=96.

The window type you choose determines what question you can answer:

  • Tumbling: "how many events per 20-second period?" — non-overlapping, clean buckets
  • Sliding: "how many events in any rolling 20-second span?" — can detect spikes that straddle bucket boundaries
  • Session: "how many events per user session?" — session gaps define boundaries naturally
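The bucket assignment above can be sketched directly — a minimal illustration (not Flink's implementation) of how tumbling and sliding windows assign events, assuming 20-second windows and a 10-second slide:

```python
def tumbling_window(ts: float, size: float = 20.0) -> tuple[float, float]:
    """Each event belongs to exactly one fixed bucket."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts: float, size: float = 20.0,
                    slide: float = 10.0) -> list[tuple[float, float]]:
    """An event belongs to every window whose span covers it."""
    starts, start = [], (ts // slide) * slide
    while start > ts - size:
        starts.append(start)
        start -= slide
    return [(s, s + size) for s in sorted(starts) if s >= 0]
```

An event at t=25s falls in exactly one tumbling bucket, [20s, 40s), but in two sliding windows, [10s, 30s) and [20s, 40s) — which is why sliding windows can catch spikes that straddle tumbling-bucket boundaries.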

Batch pipelines process a file of yesterday's data; stream processing answers "what is happening right now?" These are not faster and slower versions of the same thing — they require different infrastructure choices because stream processing must handle events that arrive out of order, must maintain state across millions of concurrent event sequences, and must provide exactly-once guarantees despite node failures. Kafka and Flink solve these problems differently from any batch system.

Kafka: durable event streams

Apache Kafka is a distributed log, not a queue. The key distinction: messages are not deleted when consumed. They persist for a configurable retention period (typically 7–30 days) and can be replayed from any offset. This makes Kafka the backbone for stream processing pipelines: multiple consumers can each maintain their own read position independently, and a consumer that fails can restart from its last committed offset without losing events.

Data is organized into topics (logical streams of related events), which are split into partitions. Partitions are the unit of parallelism: each partition is an ordered, append-only log stored on a broker. A consumer group distributes partitions across its members — with N consumers and M partitions, each consumer handles M/N partitions. Kafka guarantees ordering within a partition, not across partitions. To ensure all events for a given key (e.g., user_id) are processed in order by the same consumer, produce events with that key so Kafka routes them to a consistent partition.
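Key-to-partition routing works like this in principle — a sketch only: the Java client uses a murmur2 hash and kafka-python has its own default partitioner; crc32 here is just a deterministic stand-in:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Same key -> same partition, so per-key ordering is preserved."""
    return zlib.crc32(key) % num_partitions

# Producing with a key routes all of a user's events to one partition:
#   producer.send("user-events", key=b"user-42", value=payload)
```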

Offsets and consumer groups: each consumer tracks its position in each partition via an offset — a monotonically increasing integer. Consumer groups commit offsets to Kafka (or an external store) so that on restart, processing resumes from the last committed position, not the beginning of the log.

Retention vs compaction: time-based retention (log.retention.hours=168) keeps all events for a fixed window, then deletes old segments. Log compaction keeps only the latest value for each key — useful for topics where each message represents the current state of an entity (like a CDC topic), where you only need the latest value per key, not the full history.

Flink: stateful stream computation

Apache Flink processes Kafka (or other) streams with stateful operators: computations that aggregate or track information across multiple events. Flink manages state in memory, backed by a distributed snapshot (checkpoint) stored in object storage. If a Flink job crashes and restarts, it recovers from the last checkpoint with exactly-once semantics — each event is processed exactly once in the final output, even in the presence of failures.

The fundamental Flink operator for analytical workloads is the window: grouping events into bounded time ranges before computing an aggregation.

Watermarks and windows are necessary together because "event time" (when the event happened) and "processing time" (when Flink saw it) diverge in distributed systems. Without watermarks, Flink would never know when a window is complete — late events could always arrive. The watermark is Flink's estimate of "how far behind are the latest events I haven't seen yet?" It makes the window-closing decision tractable by accepting that events beyond the allowed lateness are either handled specially or dropped, rather than waiting forever.

Watermarks and late data: events arrive out of order. A mobile app event generated at 12:00:05 might arrive at the Flink processor at 12:00:15 due to network delay. Flink's watermark mechanism advances event time — a watermark at time T says "all events before T have arrived." When a window closes (watermark passes the window end), Flink emits the aggregate. You configure an allowed lateness — events arriving after the window closed but within the lateness bound update the previously emitted result; events beyond the bound are dropped or sent to a side output for manual handling.
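The mechanism can be sketched under a bounded-out-of-orderness assumption (the common watermark strategy; the 10-second bound here is illustrative): the watermark trails the maximum event time seen by a fixed bound, and a window fires once the watermark passes its end:

```python
class BoundedWatermark:
    """Watermark = max event time seen, minus an out-of-orderness bound."""
    def __init__(self, max_out_of_orderness: float):
        self.bound = max_out_of_orderness
        self.max_event_time = float("-inf")

    def on_event(self, event_time: float) -> float:
        self.max_event_time = max(self.max_event_time, event_time)
        return self.max_event_time - self.bound

def window_closed(window_end: float, watermark: float) -> bool:
    """A window [start, end) fires when the watermark reaches its end."""
    return watermark >= window_end
```

With a 10-second bound, an event at t=35s advances the watermark to 25s — the [0s, 20s) window fires, while [20s, 40s) stays open for stragglers.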

Kafka Streams and ksqlDB

For simpler stream processing that doesn't require the full Flink runtime, Kafka Streams is a Java library that runs the processing logic inside your application — no separate cluster to operate. It handles stateful aggregations, joins, and windowing using Kafka itself as the state backend. ksqlDB provides a SQL interface on top of Kafka Streams, letting you write streaming queries in SQL syntax against Kafka topics.

Design Tradeoffs

Where Your Intuition Breaks

The instinct when adopting streaming is to convert all batch pipelines to streaming for freshness. This creates systems that are far more complex to operate without meaningful business benefit. The question is not "can we make this faster?" but "what decision or action does fresher data enable?" A fraud model that runs in 50ms changes behavior — fraud can be blocked before the transaction completes. An MRR dashboard that refreshes every 5 minutes instead of hourly does not change any decisions the finance team makes. The streaming system costs 5× more to operate and introduces exactly-once complexity, watermark tuning, and state management — all for a dashboard nobody checks more than once a day. Match stream processing investment to the latency at which behavior actually changes.

When to use streaming vs batch

                       Batch              Micro-batch (Spark Streaming)  True streaming (Flink/Kafka Streams)
Latency                Minutes to hours   Seconds to minutes             Milliseconds to seconds
Throughput             Highest            High                           High
Exactly-once           Easy               Moderate                       Complex (requires checkpointing)
Operational overhead   Low                Medium                         High
State management       External DB        Checkpoint store               Built-in + checkpoint store

Use streaming when: a decision or action needs to happen within seconds of an event (fraud detection, rate limiting, real-time personalization). Use micro-batch when: you need fresher data than hourly batch but true sub-second latency isn't required. Stay with batch when: data consumers are fine with hourly or daily freshness — it's much simpler to operate.

Processing guarantees

Stream processors offer three guarantee levels:

  • At-most-once: events may be lost on failure, never duplicated. Fastest.
  • At-least-once: events are never lost, may be processed multiple times. Outputs must be idempotent (writing the same event twice produces the same result).
  • Exactly-once: each event affects the output exactly once, even across failures. Requires distributed transactions or idempotent sinks. Flink achieves this with two-phase commit between Kafka and the sink.

Most streaming systems are at-least-once by default. Exactly-once is available but adds latency (checkpoint intervals) and requires compatible sinks. For many use cases (dashboards, feature computation), at-least-once with idempotent writes is sufficient and simpler to operate.
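The idempotent-write pattern, sketched with a dict standing in for the sink table (in a real sink this would be an UPSERT / INSERT ... ON CONFLICT keyed on a unique event id):

```python
# Keying writes on a unique event_id means redelivered events overwrite
# themselves instead of double-counting.
sink: dict[str, dict] = {}

def write_idempotent(event: dict) -> None:
    sink[event["event_id"]] = event         # replay-safe: same key, same row

for e in [{"event_id": "e1", "amount": 10},
          {"event_id": "e1", "amount": 10},  # redelivered after a crash
          {"event_id": "e2", "amount": 5}]:
    write_idempotent(e)

# total reflects each event once despite the duplicate delivery
total = sum(row["amount"] for row in sink.values())
```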

The state size problem

Stateful computations (windowed aggregations, user activity tracking, join tables) require Flink to maintain state in memory, backed by RocksDB for large states. State size grows with: number of unique keys × state per key × window duration. A session window tracking 10 million active users with 1KB of state per user requires 10GB of state. Design state schemas carefully — store only what the computation needs, and use TTL to expire state for inactive keys.
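The arithmetic and the TTL remedy, as a sketch (the numbers are the text's illustrative ones — measure your own key cardinality and per-key value size):

```python
def state_size_gb(num_keys: int, bytes_per_key: int) -> float:
    """Rough state footprint: unique keys x state per key."""
    return num_keys * bytes_per_key / 1e9

state_size_gb(10_000_000, 1_000)            # 10 GB for the session example

def expire(state: dict[str, tuple[float, dict]],
           now: float, ttl: float) -> dict[str, tuple[float, dict]]:
    """Drop state for keys not updated within the TTL."""
    return {k: (t, v) for k, (t, v) in state.items() if now - t < ttl}
```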

In Practice

Lambda vs Kappa architecture

Lambda architecture runs two pipelines in parallel: a batch layer (Hadoop/Spark, runs daily for correctness) and a speed layer (Kafka Streams/Flink, runs in real-time for freshness). Consumers merge results from both. The problem: maintaining two codebases for the same logic — and keeping them in sync — is expensive.

Kappa architecture eliminates the batch layer: use streaming for everything, but design the streaming pipeline to be replayable. When you need to reprocess historical data (bug fix, schema change), replay Kafka's retained logs through the streaming job from the beginning. This works when Kafka retention covers the history you need and the streaming job is designed for efficient replay.

Modern practice leans Kappa: maintain one streaming pipeline, keep at least 30 days of Kafka retention, and replay for corrections.

Kafka in production

Partition count is critical and difficult to change after creation: partitions can be added but never removed, and adding them changes key-to-partition routing, which breaks per-key ordering for keyed topics. More partitions = more parallelism but more overhead. A rule of thumb: aim for partitions where each will handle under 10MB/s throughput, with a ceiling of ~10× your expected peak consumer count. Start with 12–48 partitions for most topics.
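The rule of thumb as a calculation — a sketch using the text's 10MB/s-per-partition figure, with illustrative inputs:

```python
import math

def min_partitions(peak_mb_per_s: float,
                   per_partition_mb_per_s: float = 10.0,
                   peak_consumers: int = 1) -> int:
    """Enough partitions for throughput, and at least one per consumer."""
    by_throughput = math.ceil(peak_mb_per_s / per_partition_mb_per_s)
    return max(by_throughput, peak_consumers)

min_partitions(240.0, peak_consumers=12)    # -> 24 partitions
```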

Consumer lag monitoring is the most important operational metric: how far behind is each consumer group from the latest offset? A growing lag means the consumer can't keep up with production. Alert on lag exceeding a threshold that would impact downstream SLAs.

Schema registry: as Kafka topics evolve, producers and consumers must agree on message schemas. A schema registry (Confluent Schema Registry) stores Avro or Protobuf schemas and enforces compatibility: consumers can always deserialize messages produced by any past producer version.

Production Patterns

Kafka consumer group configuration (Python)

Use kafka-python with explicit offset management. Set enable_auto_commit=False and commit after processing — auto-commit can advance the offset before your handler has successfully written results, making retries unsafe.

python
from kafka import KafkaConsumer
import json
 
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=["kafka-broker-1:9092", "kafka-broker-2:9092"],
    group_id="feature-pipeline-v2",
    auto_offset_reset="earliest",       # replay from beginning if no committed offset
    enable_auto_commit=False,           # commit manually after successful processing
    max_poll_records=500,               # tune to your processing latency budget
    session_timeout_ms=30_000,
    heartbeat_interval_ms=10_000,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
 
for message in consumer:
    try:
        process_event(message.value)
        consumer.commit()               # commit only after successful write
    except Exception as exc:
        # dead-letter or alert; do NOT commit — Kafka will redeliver on restart
        send_to_dlq(message, exc)

Pin the group_id to a versioned name (feature-pipeline-v2) so you can run a new version in parallel against the same topic before cutting over. Changing group_id starts from auto_offset_reset — use it intentionally for replay, not by accident.

Flink windowed aggregation

Flink's Python DataStream API mirrors the Java API. For production jobs, use TUMBLE windows via the Table API — it compiles to optimized bytecode and integrates with Flink SQL.

python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings
 
env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)        # checkpoint every 60 s for exactly-once
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = StreamTableEnvironment.create(env, settings)
 
# Kafka source — schema matches Avro/JSON topic
t_env.execute_sql("""
    CREATE TABLE user_events (
        user_id     STRING,
        event_type  STRING,
        ts          TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '10' SECOND   -- allow 10 s late arrival
    ) WITH (
        'connector' = 'kafka',
        'topic'     = 'user-events',
        'properties.bootstrap.servers' = 'kafka-broker-1:9092',
        'format'    = 'json'
    )
""")
 
# Tumbling 5-minute window per user
result = t_env.sql_query("""
    SELECT
        user_id,
        TUMBLE_START(ts, INTERVAL '5' MINUTE) AS window_start,
        COUNT(*)                               AS event_count
    FROM user_events
    GROUP BY user_id, TUMBLE(ts, INTERVAL '5' MINUTE)
""")
 
# Sink to Kafka or a JDBC target — the sink table must be declared with its
# own CREATE TABLE statement before this insert runs
result.execute_insert("user_event_aggregates")

Set parallelism to match your Kafka partition count — one Flink subtask per partition avoids cross-thread coordination overhead.

Consumer lag alerting with Prometheus

Expose consumer lag as a Prometheus metric and alert when it exceeds your SLA buffer. The kafka-python library doesn't expose lag natively; query the broker directly.

python
from kafka import KafkaConsumer, TopicPartition
from prometheus_client import Gauge, start_http_server
import time

CONSUMER_LAG = Gauge(
    "kafka_consumer_lag",
    "Messages behind latest offset",
    ["topic", "partition", "group_id"],
)

def record_lag(bootstrap_servers: list[str], group_id: str, topic: str):
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers, group_id=group_id)
    try:
        end_offsets = consumer.end_offsets(
            [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
        )
        committed = {
            tp: (consumer.committed(tp) or 0) for tp in end_offsets
        }
        for tp, end in end_offsets.items():
            lag = end - committed[tp]
            CONSUMER_LAG.labels(topic=tp.topic, partition=tp.partition, group_id=group_id).set(lag)
    finally:
        consumer.close()
 
if __name__ == "__main__":
    start_http_server(8000)             # scrape at :8000/metrics
    while True:
        record_lag(["kafka-broker-1:9092"], "feature-pipeline-v2", "user-events")
        time.sleep(30)

Prometheus alert rule — fire when lag has been growing for 5 minutes:

yaml
groups:
  - name: kafka_consumer
    rules:
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_lag{group_id="feature-pipeline-v2"} > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag exceeds 50k messages on {{ $labels.topic }}/{{ $labels.partition }}"
          runbook: "https://wiki.internal/runbooks/kafka-consumer-lag"

A lag threshold of 50k is a starting point — calibrate to your throughput. If your consumer processes 10k events/s and your SLA requires data within 2 minutes, your alert threshold should be under 1.2M messages (120s × 10k/s), with a warning at 50% of that.
