Kafka Streams (Stateful Processing, Windowing and Aggregations)

Instead of waiting for large datasets to accumulate, stream processing analyzes events as they arrive. Applications continuously transform, aggregate, enrich, and react to incoming data, enabling businesses to make decisions within seconds rather than hours.

Apache Kafka provides this capability through Kafka Streams, a lightweight Java library that allows developers to build real-time stream processing applications directly on top of Kafka.

What is Kafka Streams?

Kafka Streams is a Java library designed for building stream processing applications.

Unlike frameworks such as Apache Spark or Apache Flink, Kafka Streams does not require a separate cluster. A Kafka Streams application runs as a normal Java process and relies on Kafka itself for scalability, durability, and fault tolerance.

At a high level, a Kafka Streams application reads data from one or more Kafka topics, processes that data continuously, and writes the results back into Kafka topics or external systems.

Input Topic
      ↓
Kafka Streams Application
      ↓
Output Topic

This seemingly simple architecture enables a wide variety of real-time processing patterns.

For example, an e-commerce platform might continuously calculate revenue metrics, update inventory counts, enrich order events with customer information, and detect suspicious transactions as events arrive.

All of these operations can be implemented using Kafka Streams.

Consider a large online marketplace. Every second thousands of events are generated:

Order Created
Payment Received
Order Shipped
Order Delivered
Product Viewed
Product Added To Cart
Customer Logged In

Without stream processing, these events might be stored in a database and analyzed later using nightly batch jobs. With Kafka Streams, the business can process events immediately.

As soon as an order is placed:

- Revenue dashboards can be updated.
- Inventory counts can be adjusted.
- Recommendation engines can learn from customer behavior.
- Fraud detection systems can evaluate transactions.
- Notifications can be generated.

This ability to react in real time is the primary reason stream processing has become so popular.

KStream vs KTable

A KStream represents a continuous sequence of events. Imagine the lifecycle of an order:

Order Created
Order Confirmed
Payment Received
Order Shipped
Order Delivered

Each event is important. Each event is processed. The stream preserves the complete history. This makes KStreams ideal for analytics, auditing, event sourcing, and event-driven architectures.

A KTable takes a different approach. Instead of preserving every event, it maintains only the latest state for a given key. For example:

Order-1001 → Delivered
Order-1002 → Processing
Order-1003 → Processing

The order may have gone through many states, but the KTable only exposes the most recent value. A useful way to remember the difference is:

KStream = Event History
KTable = Current State

Creating Your First Kafka Stream

Creating a Kafka Streams application is straightforward. A developer creates a stream builder and subscribes to a topic.

StreamsBuilder builder =
    new StreamsBuilder();

KStream<String, String> orders =
    builder.stream("orders");

Once the stream is created, records begin flowing through the processing topology automatically as they arrive in Kafka. The application can now perform filtering, transformations, joins, aggregations, and other operations.

Stateless Stream Processing

The simplest form of stream processing is known as stateless processing. In stateless processing, each event is processed independently. The application does not need to remember anything about previous events.

Imagine an e-commerce platform that wants to identify high-value orders.

orders.filter(
    (key, value) ->
        value.contains("HIGH_VALUE")
);

Each order is evaluated independently. No historical information is required. No state is stored. This makes stateless processing simple, scalable, and efficient.

Many common operations are stateless:

- Filtering
- Mapping
- Transformation
- Routing
- Data Enrichment

However, many business problems require knowledge of past events. This introduces stateful processing.

Stateful Stream Processing

Consider a fraud detection system. A customer typically spends between $100 and $500 per transaction. Suddenly a transaction worth $50,000 appears.

Transaction 1 = $150
Transaction 2 = $220
Transaction 3 = $300
Transaction 4 = $50,000

Determining whether the fourth transaction is suspicious requires historical context. The system must remember previous transactions. This is the essence of stateful processing.

Instead of processing each event independently, the application maintains information across events and uses that information when future events arrive. Examples include:

- Running Totals
- Order Counts
- Customer Activity
- Fraud Scores
- Inventory Levels
- Session Information

Stateful processing powers some of the most valuable stream processing use cases in modern systems.

State Stores

Kafka Streams manages state using structures called State Stores. A state store is essentially a local database maintained by the application.

Kafka Stream
      ↓
State Store
      ↓
Processing Logic

Whenever new events arrive, Kafka Streams updates the state store. For example, suppose we are tracking total orders per customer. Initially:

Customer-A = 2 Orders

A new order arrives. The state store updates automatically:

Customer-A = 3 Orders

This allows applications to maintain continuously updated business metrics.

One obvious concern is what happens when an application crashes. If state is stored locally, wouldn't it be lost?

Kafka Streams solves this problem using Changelog Topics. Every state update is also written to Kafka. If an application instance fails:

Application Crash
       ↓
Restart
       ↓
Replay Changelog
       ↓
Restore State

Kafka Streams automatically rebuilds the state store by replaying the changelog topic. This provides durability without sacrificing performance.

Example

The following Kafka Streams example groups transactions by customer and continuously maintains a running total.

StreamsBuilder builder = new StreamsBuilder();

KStream transactions =
        builder.stream("transactions");

KTable customerTotals =
        transactions.groupByKey()
                    .reduce(Double::sum);

customerTotals.toStream()
              .filter((customerId, totalAmount) ->
                      totalAmount > 10000)
              .foreach((customerId, totalAmount) ->
                      System.out.println(
                              "Potential Fraud Detected: "
                              + customerId
                              + " Total Amount = "
                              + totalAmount));

As transactions arrive, Kafka Streams automatically updates the state store.

Transaction 1 = $150
Running Total = $150

Transaction 2 = $220
Running Total = $370

Transaction 3 = $300
Running Total = $670

Transaction 4 = $50,000
Running Total = $50,670

Fraud Alert Triggered

Notice that the application is able to make a decision only because it remembers previous transactions. This stored information is the application's state.

Windowing (Events Over Time)

Many business questions involve time. For example:

"How many orders were placed in the last five minutes?"
"What was revenue during the previous hour?"
"How many failed login attempts occurred today?"

Kafka topics represent infinite streams of events. Without boundaries, calculating time-based metrics becomes impossible.

Windowing solves this problem. Windows divide an infinite stream into manageable chunks of time. Once events are grouped into windows, calculations can be performed.

Tumbling Windows

A tumbling window divides time into fixed, non-overlapping intervals. Suppose management wants to know how many orders were placed every 5 minutes. A tumbling window creates fixed, non-overlapping windows:

10:00 - 10:05 
10:05 - 10:10 
10:10 - 10:15

Each order belongs to exactly one window.

KStream<String, Order> orders = builder.stream("orders");

KTable<Windowed<String>, Long> orderCounts = orders
    .groupBy((key, order) -> "ALL_ORDERS")
    .windowedBy(
        TimeWindows.ofSizeWithNoGrace(
            Duration.ofMinutes(5)
        )
    )
    .count();

If the following orders arrive:

10:01 Order-1 
10:02 Order-2 
10:04 Order-3 
10:07 Order-4

The result becomes:

10:00 - 10:05 = 3 Orders 
10:05 - 10:10 = 1 Order

This type of window is commonly used for dashboards, reporting, and business metrics.

Sliding Windows

A sliding window continuously moves through time. Suppose a fraud detection system wants to identify customers making more than 5 transactions within any 5-minute period.

Unlike tumbling windows, sliding windows continuously move forward.

KStream<String, Transaction> transactions =
    builder.stream("transactions");

KTable<Windowed<String>, Long> transactionCounts =
    transactions
        .groupByKey()
        .windowedBy(
            SlidingWindows.ofTimeDifferenceWithNoGrace(
                Duration.ofMinutes(5)
            )
        )
        .count();

Consider these transactions:

10:01 Transaction-1 
10:02 Transaction-2 
10:03 Transaction-3 
10:04 Transaction-4 
10:05 Transaction-5 
10:06 Transaction-6

At 10:05:

Window: 10:00 - 10:05 
Count = 5

At 10:06:

Window: 10:01 - 10:06 
Count = 6

Sliding windows are ideal for fraud detection, anomaly detection, and operational monitoring because calculations continuously update as new events arrive.

Session Windows

Session windows group events based on user activity rather than fixed time boundaries. Suppose an e-commerce company wants to understand customer browsing sessions.

A session starts when the customer becomes active and ends after 30 minutes of inactivity.

KStream<String, UserEvent> userEvents =
    builder.stream("user-events");

KTable<Windowed<String>, Long> sessions =
    userEvents
        .groupByKey()
        .windowedBy(
            SessionWindows.ofInactivityGapWithNoGrace(
                Duration.ofMinutes(30)
            )
        )
        .count();

Consider a user's activity:

10:00 Login 
10:02 View Product 
10:05 Add To Cart 
10:08 Checkout

Kafka Streams treats these as one session:

Session 1 10:00 - 10:08

Now suppose the user returns later:

11:00 Login 
11:03 Browse Product

A new session begins:

Session 2 11:00 - 11:03

Session windows are commonly used for web analytics, user behavior analysis, clickstream processing, and customer journey tracking.

Aggregations

Aggregation is one of the most common stream processing operations.

Aggregation Example (Revenue Calculation)

Suppose a sales dashboard needs to display total revenue. Incoming orders:

Order-1 = $100 
Order-2 = $250 
Order-3 = $500

Kafka Streams can continuously maintain the running total.

KStream<String, Order> orders =
    builder.stream("orders");

KTable<String, Double> revenue =
    orders
        .groupBy(
            (key, order) -> "TOTAL_REVENUE"
        )
        .aggregate(
            () -> 0.0,
            (key, order, totalRevenue) ->
                totalRevenue + order.getAmount()
        );

As events arrive:

Revenue = $100 
Revenue = $350 
Revenue = $850

This pattern is commonly used for real-time dashboards and business reporting.

Joining Streams

One of the most powerful features of Kafka Streams is the ability to join multiple streams together and create richer events. Consider an e-commerce platform. The Orders topic contains:

{
    "orderId": 1001,
    "customerId": 500,
    "amount": 2500
}

The Customers topic contains:

{
    "customerId": 500,
    "name": "John Doe",
    "city": "Delhi"
}

Individually, neither event provides the complete business picture. Kafka Streams can combine them.

 Order Event + Customer Event = Enriched Order Event

The result becomes:

{
    "orderId": 1001,
    "customerId": 500,
    "customerName": "John Doe",
    "city": "Delhi",
    "amount": 2500
}

KStream-KTable Join Example

A very common pattern is enriching events using reference data stored in a KTable.

KStream<String, Order> orders =
    builder.stream("orders");

KTable<String, Customer> customers =
    builder.table("customers");

KStream<String, EnrichedOrder> enrichedOrders =
    orders.join(
        customers,
        (order, customer) ->
            new EnrichedOrder(
                order,
                customer
            )
    );

The resulting stream contains complete business information and can be consumed by analytics, reporting, recommendation engines, or downstream microservices.

This pattern is extremely common because it eliminates the need for database lookups during event processing. This reduces latency, improves scalability, and creates self-contained business events that are easier for downstream systems to consume.

Scaling Kafka Streams Applications

One of Kafka Streams' biggest advantages is its scalability model. Kafka Streams scales through partitions. Suppose a topic contains four partitions.

Partition-0
Partition-1
Partition-2
Partition-3

Kafka Streams can distribute processing across four application instances.

Instance-1
Instance-2
Instance-3
Instance-4

As traffic grows, additional partitions and application instances can be added. This allows Kafka Streams applications to process millions of events per second while maintaining fault tolerance.

A common question is how a global aggregation such as total revenue is calculated when different Kafka Streams instances process different partitions?

Kafka Streams first computes local aggregates on each partition and then automatically combines those partial results through an internal repartitioning and aggregation stage. This allows dashboards to display a single global metric while still benefiting from parallel processing across partitions.

Suppose the orders topic has 4 partitions.

Orders Topic 

Partition-0 
Partition-1 
Partition-2 
Partition-3

And Kafka Streams is running on 4 instances.

Partition-0 -> Instance-1 
Partition-1 -> Instance-2 
Partition-2 -> Instance-3 
Partition-3 -> Instance-4

Now imagine orders arrive:

Partition-0 ------------ $100 $200 
Partition-1 ------------ $300 
Partition-2 ------------ $400 $500 
Partition-3 ------------ $600

Each instance computes a local aggregation.

Instance-1 = $300 
Instance-2 = $300 
Instance-3 = $900 
Instance-4 = $600

These are only partial results. How do we get $2100 as the final answer? The answer is that Kafka Streams creates an internal repartition topic and performs a second aggregation stage. Think MapReduce.

Stage 1:

Partition Aggregation 
P0 -> $300 
P1 -> $300 
P2 -> $900 
P3 -> $600

Stage 2:

Combine Aggregates 
$300 $300 $900 $600 

Total Revenue = $2100

Kafka Streams handles this automatically. For example:

KTable<String, Double> revenue =
    orders
        .groupBy(
            (key, order) -> "TOTAL_REVENUE"
        )
        .aggregate(
            () -> 0.0,
            (key, order, total) ->
                total + order.getAmount()
        );

Notice something important: (key, order) -> "TOTAL_REVENUE". Every record is being grouped into the same key. Kafka Streams automatically repartitions all records having the key: TOTAL_REVENUE onto a single aggregation partition.

Eventually one instance owns: TOTAL_REVENUE and maintains: Revenue = $2100.

Conclusion

Kafka Streams transforms Kafka from a messaging platform into a real-time data processing engine.

By combining stream processing, state management, windowing, aggregations, joins, and fault-tolerant recovery mechanisms, Kafka Streams enables organizations to build applications that react to events immediately rather than waiting for batch jobs.