Apache Kafka provides this capability through Kafka Streams, a lightweight Java library that allows developers to build real-time stream processing applications directly on top of Kafka.
What is Kafka Streams?
Kafka Streams is a Java library designed for building stream processing applications.Unlike frameworks such as Apache Spark or Apache Flink, Kafka Streams does not require a separate cluster. A Kafka Streams application runs as a normal Java process and relies on Kafka itself for scalability, durability, and fault tolerance.
At a high level, a Kafka Streams application reads data from one or more Kafka topics, processes that data continuously, and writes the results back into Kafka topics or external systems.
Input Topic
โ
Kafka Streams Application
โ
Output Topic
This seemingly simple architecture enables a wide variety of real-time processing patterns.
For example, an e-commerce platform might continuously calculate revenue metrics, update inventory counts, enrich order events with customer information, and detect suspicious transactions as events arrive.
All of these operations can be implemented using Kafka Streams.
Consider a large online marketplace. Every second thousands of events are generated:
Order Created
Payment Received
Order Shipped
Order Delivered
Product Viewed
Product Added To Cart
Customer Logged In
Without stream processing, these events might be stored in a database and analyzed later using nightly batch jobs. With Kafka Streams, the business can process events immediately.
As soon as an order is placed:
- Revenue dashboards can be updated.
- Inventory counts can be adjusted.
- Recommendation engines can learn from customer behavior.
- Fraud detection systems can evaluate transactions.
- Notifications can be generated.
This ability to react in real time is the primary reason stream processing has become so popular.
KStream vs KTable
A KStream represents a continuous sequence of events. Imagine the lifecycle of an order:Order Created
Order Confirmed
Payment Received
Order Shipped
Order Delivered
Each event is important. Each event is processed. The stream preserves the complete history. This makes KStreams ideal for analytics, auditing, event sourcing, and event-driven architectures.
A KTable takes a different approach. Instead of preserving every event, it maintains only the latest state for a given key. For example:
Order-1001 โ Delivered
Order-1002 โ Processing
Order-1003 โ Processing
The order may have gone through many states, but the KTable only exposes the most recent value. A useful way to remember the difference is:
KStream = Event History
KTable = Current State
Creating Your First Kafka Stream
Creating a Kafka Streams application is straightforward. A developer creates a stream builder and subscribes to a topic.StreamsBuilder builder =
new StreamsBuilder();
KStream<String, String> orders =
builder.stream("orders");
Once the stream is created, records begin flowing through the processing topology automatically as they arrive in Kafka. The application can now perform filtering, transformations, joins, aggregations, and other operations.
Stateless Stream Processing
The simplest form of stream processing is known as stateless processing. In stateless processing, each event is processed independently. The application does not need to remember anything about previous events.Imagine an e-commerce platform that wants to identify high-value orders.
orders.filter(
(key, value) ->
value.contains("HIGH_VALUE")
);
Each order is evaluated independently. No historical information is required. No state is stored. This makes stateless processing simple, scalable, and efficient.
Many common operations are stateless:
- Filtering
- Mapping
- Transformation
- Routing
- Data Enrichment
However, many business problems require knowledge of past events. This introduces stateful processing.
Stateful Stream Processing
Consider a fraud detection system. A customer typically spends between $100 and $500 per transaction. Suddenly a transaction worth $50,000 appears.Transaction 1 = $150
Transaction 2 = $220
Transaction 3 = $300
Transaction 4 = $50,000
Determining whether the fourth transaction is suspicious requires historical context. The system must remember previous transactions. This is the essence of stateful processing.
Instead of processing each event independently, the application maintains information across events and uses that information when future events arrive. Examples include:
- Running Totals
- Order Counts
- Customer Activity
- Fraud Scores
- Inventory Levels
- Session Information
Stateful processing powers some of the most valuable stream processing use cases in modern systems.
State Stores
Kafka Streams manages state using structures called State Stores. A state store is essentially a local database maintained by the application.Kafka Stream
โ
State Store
โ
Processing Logic
Whenever new events arrive, Kafka Streams updates the state store. For example, suppose we are tracking total orders per customer. Initially:
Customer-A = 2 Orders
A new order arrives. The state store updates automatically:
Customer-A = 3 Orders
This allows applications to maintain continuously updated business metrics.
One obvious concern is what happens when an application crashes. If state is stored locally, wouldn't it be lost?
Kafka Streams solves this problem using Changelog Topics. Every state update is also written to Kafka. If an application instance fails:
Application Crash
โ
Restart
โ
Replay Changelog
โ
Restore State
Kafka Streams automatically rebuilds the state store by replaying the changelog topic. This provides durability without sacrificing performance.
Example
The following Kafka Streams example groups transactions by customer and continuously maintains a running total.StreamsBuilder builder = new StreamsBuilder();
KStream transactions =
builder.stream("transactions");
KTable customerTotals =
transactions.groupByKey()
.reduce(Double::sum);
customerTotals.toStream()
.filter((customerId, totalAmount) ->
totalAmount > 10000)
.foreach((customerId, totalAmount) ->
System.out.println(
"Potential Fraud Detected: "
+ customerId
+ " Total Amount = "
+ totalAmount));
As transactions arrive, Kafka Streams automatically updates the state store.
Transaction 1 = $150
Running Total = $150
Transaction 2 = $220
Running Total = $370
Transaction 3 = $300
Running Total = $670
Transaction 4 = $50,000
Running Total = $50,670
Fraud Alert Triggered
Notice that the application is able to make a decision only because it remembers previous transactions. This stored information is the application's state.
Windowing (Events Over Time)
Many business questions involve time. For example:"How many orders were placed in the last five minutes?"
"What was revenue during the previous hour?"
"How many failed login attempts occurred today?"
Kafka topics represent infinite streams of events. Without boundaries, calculating time-based metrics becomes impossible.
Windowing solves this problem. Windows divide an infinite stream into manageable chunks of time. Once events are grouped into windows, calculations can be performed.
Tumbling Windows
A tumbling window divides time into fixed, non-overlapping intervals. Suppose management wants to know how many orders were placed every 5 minutes. A tumbling window creates fixed, non-overlapping windows:10:00 - 10:05
10:05 - 10:10
10:10 - 10:15
Each order belongs to exactly one window.
KStream<String, Order> orders = builder.stream("orders");
KTable<Windowed<String>, Long> orderCounts = orders
.groupBy((key, order) -> "ALL_ORDERS")
.windowedBy(
TimeWindows.ofSizeWithNoGrace(
Duration.ofMinutes(5)
)
)
.count();
If the following orders arrive:
10:01 Order-1
10:02 Order-2
10:04 Order-3
10:07 Order-4
The result becomes:
10:00 - 10:05 = 3 Orders
10:05 - 10:10 = 1 Order
This type of window is commonly used for dashboards, reporting, and business metrics.
Sliding Windows
A sliding window continuously moves through time. Suppose a fraud detection system wants to identify customers making more than 5 transactions within any 5-minute period.Unlike tumbling windows, sliding windows continuously move forward.
KStream<String, Transaction> transactions =
builder.stream("transactions");
KTable<Windowed<String>, Long> transactionCounts =
transactions
.groupByKey()
.windowedBy(
SlidingWindows.ofTimeDifferenceWithNoGrace(
Duration.ofMinutes(5)
)
)
.count();
Consider these transactions:
10:01 Transaction-1
10:02 Transaction-2
10:03 Transaction-3
10:04 Transaction-4
10:05 Transaction-5
10:06 Transaction-6
At 10:05:
Window: 10:00 - 10:05
Count = 5
At 10:06:
Window: 10:01 - 10:06
Count = 6
Sliding windows are ideal for fraud detection, anomaly detection, and operational monitoring because calculations continuously update as new events arrive.
Session Windows
Session windows group events based on user activity rather than fixed time boundaries. Suppose an e-commerce company wants to understand customer browsing sessions.A session starts when the customer becomes active and ends after 30 minutes of inactivity.
KStream<String, UserEvent> userEvents =
builder.stream("user-events");
KTable<Windowed<String>, Long> sessions =
userEvents
.groupByKey()
.windowedBy(
SessionWindows.ofInactivityGapWithNoGrace(
Duration.ofMinutes(30)
)
)
.count();
Consider a user's activity:
10:00 Login
10:02 View Product
10:05 Add To Cart
10:08 Checkout
Kafka Streams treats these as one session:
Session 1 10:00 - 10:08
Now suppose the user returns later:
11:00 Login
11:03 Browse Product
A new session begins:
Session 2 11:00 - 11:03
Session windows are commonly used for web analytics, user behavior analysis, clickstream processing, and customer journey tracking.
Aggregations
Aggregation is one of the most common stream processing operations.Aggregation Example (Revenue Calculation)
Suppose a sales dashboard needs to display total revenue. Incoming orders:Order-1 = $100
Order-2 = $250
Order-3 = $500
Kafka Streams can continuously maintain the running total.
KStream<String, Order> orders =
builder.stream("orders");
KTable<String, Double> revenue =
orders
.groupBy(
(key, order) -> "TOTAL_REVENUE"
)
.aggregate(
() -> 0.0,
(key, order, totalRevenue) ->
totalRevenue + order.getAmount()
);
As events arrive:
Revenue = $100
Revenue = $350
Revenue = $850
This pattern is commonly used for real-time dashboards and business reporting.
Joining Streams
One of the most powerful features of Kafka Streams is the ability to join multiple streams together and create richer events. Consider an e-commerce platform. The Orders topic contains:{
"orderId": 1001,
"customerId": 500,
"amount": 2500
}
The Customers topic contains:
{
"customerId": 500,
"name": "John Doe",
"city": "Delhi"
}
Individually, neither event provides the complete business picture. Kafka Streams can combine them.
Order Event + Customer Event = Enriched Order Event
The result becomes:
{
"orderId": 1001,
"customerId": 500,
"customerName": "John Doe",
"city": "Delhi",
"amount": 2500
}
KStream-KTable Join Example
A very common pattern is enriching events using reference data stored in a KTable.KStream<String, Order> orders =
builder.stream("orders");
KTable<String, Customer> customers =
builder.table("customers");
KStream<String, EnrichedOrder> enrichedOrders =
orders.join(
customers,
(order, customer) ->
new EnrichedOrder(
order,
customer
)
);
The resulting stream contains complete business information and can be consumed by analytics, reporting, recommendation engines, or downstream microservices.
This pattern is extremely common because it eliminates the need for database lookups during event processing. This reduces latency, improves scalability, and creates self-contained business events that are easier for downstream systems to consume.
Scaling Kafka Streams Applications
One of Kafka Streams' biggest advantages is its scalability model. Kafka Streams scales through partitions. Suppose a topic contains four partitions.Partition-0
Partition-1
Partition-2
Partition-3
Kafka Streams can distribute processing across four application instances.
Instance-1
Instance-2
Instance-3
Instance-4
As traffic grows, additional partitions and application instances can be added. This allows Kafka Streams applications to process millions of events per second while maintaining fault tolerance.
A common question is how a global aggregation such as total revenue is calculated when different Kafka Streams instances process different partitions?
Kafka Streams first computes local aggregates on each partition and then automatically combines those partial results through an internal repartitioning and aggregation stage. This allows dashboards to display a single global metric while still benefiting from parallel processing across partitions.
Suppose the orders topic has 4 partitions.
Orders Topic
Partition-0
Partition-1
Partition-2
Partition-3
And Kafka Streams is running on 4 instances.
Partition-0 -> Instance-1
Partition-1 -> Instance-2
Partition-2 -> Instance-3
Partition-3 -> Instance-4
Now imagine orders arrive:
Partition-0 ------------ $100 $200
Partition-1 ------------ $300
Partition-2 ------------ $400 $500
Partition-3 ------------ $600
Each instance computes a local aggregation.
Instance-1 = $300
Instance-2 = $300
Instance-3 = $900
Instance-4 = $600
These are only partial results. How do we get $2100 as the final answer? The answer is that Kafka Streams creates an internal repartition topic and performs a second aggregation stage. Think MapReduce.
Stage 1:
Partition Aggregation
P0 -> $300
P1 -> $300
P2 -> $900
P3 -> $600
Stage 2:
Combine Aggregates
$300 $300 $900 $600
Total Revenue = $2100
Kafka Streams handles this automatically. For example:
KTable<String, Double> revenue =
orders
.groupBy(
(key, order) -> "TOTAL_REVENUE"
)
.aggregate(
() -> 0.0,
(key, order, total) ->
total + order.getAmount()
);
Notice something important: (key, order) -> "TOTAL_REVENUE". Every record is being grouped into the same key. Kafka Streams automatically repartitions all records having the key: TOTAL_REVENUE onto a single aggregation partition.
Eventually one instance owns: TOTAL_REVENUE and maintains: Revenue = $2100.
Conclusion
Kafka Streams transforms Kafka from a messaging platform into a real-time data processing engine.By combining stream processing, state management, windowing, aggregations, joins, and fault-tolerant recovery mechanisms, Kafka Streams enables organizations to build applications that react to events immediately rather than waiting for batch jobs.