In this blog post I’m going to write about how to use Kafka Streams for stateful stream processing. I’ll assume that the reader knows the basics of Apache Kafka (producers, consumers, brokers, topics).
Problem Statement: We needed a system that would consume messages, perform fast, stateful processing on them in real time, and forward the processed messages downstream, all while providing scalability, fault tolerance, high throughput, and millisecond-level processing latency.
Historically at Yesware, we had used RabbitMQ as our messaging system, with good results. But at the time this requirement came up, the latest release of Apache Kafka (0.10) had just introduced a new feature called Kafka Streams. So our options were:
- Use RabbitMQ (which we were quite familiar with) for messaging, do the stateful processing at the downstream consumer level with a fast in-memory data store (like Redis, which we had also used extensively), and forward the processed messages to their final destination
- Use Kafka for messaging (which we had used only sparingly), and try our luck with this new thing called Kafka Streams, which looked quite promising – mainly because it builds on the highly scalable, elastic, distributed, fault-tolerant capabilities integrated natively within Kafka
We decided to choose Kafka for messaging, both because we expected a large volume of message traffic per second and because we wanted to expand our boundaries. Now, with Kafka as our messaging service, we still could have used a stream-processing framework that lives outside the Kafka ecosystem, but that would have meant a lot of added complexity (picture on the left), so choosing Kafka Streams, which runs as a plain library inside the application, was quite tempting indeed (picture on the right).
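To make the "library, not a separate cluster" point concrete, here is a minimal sketch of what a stateful Kafka Streams application looks like: a per-key message count backed by a local state store that Kafka replicates to a changelog topic. This is an illustrative sketch, not our production code; the application id and topic names (`example-counter-app`, `input-topic`, `output-topic`) are hypothetical placeholders.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical values -- substitute your own app id and brokers.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-counter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // Stateful step: count messages per key. The counts live in a local
        // state store that Kafka backs up to a changelog topic, which is what
        // gives fault tolerance without an external data store like Redis.
        KTable<String, Long> counts = input.groupByKey().count();

        counts.toStream().to("output-topic",
                Produced.with(Serdes.String(), Serdes.Long()));

        Topology topology = builder.build();
        // Running the app is just starting a library inside this JVM --
        // no separate processing cluster to deploy or operate.
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Scaling out is simply running more instances of this same process; Kafka rebalances topic partitions (and the state that goes with them) across the instances automatically.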
So let’s quickly go over the basic concepts before we dive into the code snippets.