Looking ahead to Kafka Summit '22
Hopefully you are as excited as we are about the upcoming agenda for Kafka Summit 2022. Since a good portion of attendees will be attending for the first time, with the goal of learning the basics of Kafka, we thought we'd share some of the pros and cons you'll want to consider before kicking off your project. This post comes to you courtesy of our very own David Manzanares. We hope you enjoy it!
When to use Kafka (and what to consider first)
Kafka is a popular open source system for stream processing. There are many similar message broker systems, such as AWS Kinesis, Redpanda, and ZeroMQ. Most software systems are born without a message broker, just as most are born without a load balancer or good metrics. However, as a system scales in both complexity and usage, these extra pieces sometimes become necessary. In this post, I compare the tradeoffs between using Kafka and not using a message broker system at all.
Data loss avoidance
When sending information directly from system A to system B, a failure in system B could result in data loss. The cause could be any number of issues, from a problem with the infrastructure to a bug which escaped testing and code review.
It’s here where Kafka shines. Kafka acts as the middleman: system A sends information to Kafka, while system B pulls that information from Kafka. If system B fails, Kafka still retains everything system A has sent. As soon as the failure in system B is resolved, it will ‘catch up’ by resuming consumption from the last message it processed.
The outright failure of system B is the least of our worries if, instead, system B is misprocessing information and generating bad results, leaving the processed data essentially useless. Kafka saves the day here too. In this situation, the fix entails stopping system B, fixing the issue, removing all the misprocessed information, and then replaying the messages from system A that are still stored in Kafka. It’s like going back in time: Kafka can simply reprocess all information from a known point. This is known as resetting the commit offsets for a consumer group ID.
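The rewind-and-replay idea can be sketched with a toy in-memory log. This is illustrative only, not the Kafka client API; in real Kafka you would reset offsets with `kafka-consumer-groups.sh --reset-offsets` or the consumer’s `seek()` method.

```python
# Illustrative sketch: an append-only log with per-group committed
# offsets, showing how resetting an offset replays messages.

class ToyLog:
    """Minimal stand-in for a single-partition Kafka topic."""
    def __init__(self):
        self.messages = []   # append-only storage, like a topic partition
        self.committed = {}  # committed offset per consumer group ID

    def produce(self, msg):
        self.messages.append(msg)

    def consume(self, group_id, max_messages=10):
        start = self.committed.get(group_id, 0)
        batch = self.messages[start:start + max_messages]
        self.committed[group_id] = start + len(batch)  # commit new offset
        return batch

    def reset_offset(self, group_id, offset):
        # "Going back in time": future consumes replay from this offset.
        self.committed[group_id] = offset

log = ToyLog()
for i in range(5):
    log.produce(f"event-{i}")

first = log.consume("system-b")     # system B processes event-0..event-4
log.reset_offset("system-b", 2)     # a bug is found: rewind to offset 2
replayed = log.consume("system-b")  # replays event-2..event-4
```

The key property is that the log itself never changes; only the consumer’s position does, which is why replay is cheap and safe.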
When it comes to sending information directly from system A to system B, there’s a possibility that system A may generate more information than what system B is capable of processing - a temporary surge lasting just a few seconds or minutes. System B is overwhelmed, and the result could be data loss or complete system failure - a costly exercise.
Kafka saves the day again. Kafka doesn’t push the information to system B. Instead, system B pulls the data from Kafka. Kafka decouples the writing pace of system A from the reading pace of system B. If system B is slower than system A, there will be a lag, but system B won't be overwhelmed, nor will there be any data loss during a sudden surge from system A. This is the basis of flow control.
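A minimal sketch of this decoupling, using a plain list as a stand-in for the broker (toy code, not the Kafka client API): the producer surges, the consumer lags, and nothing is dropped.

```python
# Illustrative sketch: pull-based consumption lets a slow consumer
# lag behind a bursty producer without losing data.

broker = []    # stands in for a Kafka topic partition
offset = 0     # system B's position in the log
consumed = []

# System A surges: it writes 100 messages at once.
broker.extend(range(100))

def poll(max_records=10):
    """System B pulls at its own pace, like a Kafka consumer's poll()."""
    global offset
    batch = broker[offset:offset + max_records]
    offset += len(batch)
    return batch

consumed.extend(poll())
lag = len(broker) - offset   # system B is 90 messages behind...
while offset < len(broker):  # ...but it eventually catches up,
    consumed.extend(poll())  # having lost nothing
```

Because system B controls how much it takes per poll, the surge shows up as lag rather than as dropped messages or an overloaded process.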
An important benefit of using Kafka is that it allows different teams to work on the same data without stepping on each other's toes. Kafka "consumer group IDs" allow multiple systems to consume messages from the same topic (a Kafka message queue), each tracking its own position independently. That means multiple systems can read and write the same data without any additional coordination. This is, in part, why Kafka has been so successful: teams don’t have to decide in advance what they will do with the data they capture. They can start consuming it at any time, finding new uses for it over time.
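The independence of consumer groups can be sketched like this (illustrative toy code, not the Kafka client API; the group IDs are made up for the example):

```python
# Illustrative sketch: each consumer group ID tracks its own committed
# offset, so independent teams read the same topic without interfering.

messages = ["signup", "purchase", "refund"]  # one shared topic
offsets = {}                                 # committed offset per group ID

def consume(group_id, n=1):
    start = offsets.get(group_id, 0)
    batch = messages[start:start + n]
    offsets[group_id] = start + len(batch)
    return batch

analytics = consume("analytics", n=3)  # one team reads everything
fraud = consume("fraud-detection", n=1)  # a team added months later starts
                                         # from the beginning, at its own pace
```

Neither group’s progress affects the other, which is what lets a new team start consuming existing data at any time.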
Kafka has scalability built in from the ground up. Scaling Kafka is as simple as increasing partitions, and adding more brokers and consumers.
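Partition-based scaling rests on key hashing: messages with the same key always land on the same partition, so adding partitions (each owned by one consumer in a group) scales throughput while preserving per-key order. Kafka’s default partitioner hashes the key with murmur2; the sketch below substitutes a stable md5-based hash for illustration.

```python
import hashlib

# Illustrative sketch of key-based partitioning (not Kafka's actual
# murmur2 partitioner): same key -> same partition, every time.

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

NUM_PARTITIONS = 6
for user in ["alice", "bob", "alice"]:
    print(user, "-> partition", partition_for(user, NUM_PARTITIONS))
```

Note one caveat this implies: increasing the partition count changes which partition a key maps to, so keyed ordering guarantees only hold within a fixed partition count.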
As complexity increases, so does the chance of introducing bugs, issues, and incidents. This is something to bear in mind when adding any new component, and Kafka does come with some complexity.
The key question to ask before introducing Kafka is whether data loss avoidance is critical to your use case. If it’s acceptable in your use case to process 99.99% of your data, there may be simpler alternatives that still provide great sharing capabilities and scalability.
Kafka brokers are difficult to maintain, especially as you scale, and a bad setup will undermine any attempt to add reliability to your system. There are cloud-based services that manage this for you, for a price you may or may not find acceptable. Whether you run your own Kafka cluster or rely on a managed service provider, Kafka will always be more expensive than direct communication. There’s no such thing as a “free lunch” in this regard.
Subpar CLI tooling
There are plenty of reasons to love Kafka, but the official CLI tools are not one of them. Admin operations require the CLI, and it will get in your way: from hard-to-diagnose Java stack traces triggered by a simple typo, to the need to override many default parameters you'd take for granted.
Painful Access Control
Security is always important, and some of the Kafka authentication methods can be a pain to work with. Moving certificates between teams is not a good experience.
Where to go next
We are going to be in attendance at Kafka Summit and hope to meet you there. If you’d like to set up dedicated time to meet with us about productizing Kafka streams, or anything else related to realtime data products you are building, simply fill out this form.
Last but not least, we’ve written a few other posts on Kafka in the past, so we’re resharing them here in case you’d like to review them: