Jun 25, 2021

Starting with Kafka

Javier Santana
Co-founder

I just want to share my thoughts on Kafka after using it for a few months, always from a practical point of view. I don’t know anything more than the basics about Kafka internals. I knew the principles of log processing and so on but had never got my hands dirty. So this is just the opinion of a newcomer, not an expert.

Everything began because we started seeing event-gathering use cases: a website with a reasonably high amount of traffic that wants to gather events to measure what’s going on in real time. Think of it as a customized version of Google Analytics, but without sampling. We’ve spent the last few months working on an integration with Kafka to cover these use cases.
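To make it concrete, the producer side of that use case is tiny. This is just a sketch using the kafka-python client; the broker address, topic name, and event shape are made up for the example, not our actual setup.

```python
# Minimal event producer sketch (kafka-python).
# Broker, topic, and event fields are placeholders for the example.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One pageview event; keying by session keeps a visitor's events in the
# same partition, so they stay ordered.
event = {"type": "pageview", "path": "/pricing", "ts": time.time()}
producer.send("events", key=b"session-123", value=event)
producer.flush()
```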

I have to admit that I was biased against Kafka: I thought it was a large, complex piece of technology that added nothing but an extra layer of complexity.

I still think the same: it is complex, and adding Kafka to your stack will add a bunch of services and some extra decisions to make about configuration and so on. In most cases, a single-process app sending data to a file and a small HTTP server serving it will perform much better with less downtime. Compare the burden of understanding, setting up, configuring, and starting to use Kafka with, for example, Redis: it’s a massive amount of work.

Of course, when things get difficult is when Kafka shines: all that complexity gives you the flexibility to tackle massive amounts of load and complex producer, consumer, and processor use cases. But even in that situation, you’ll need to know much more than just Kafka 101.

To be honest, this is true no matter what tech you use; a simple load balancer can become a nightmare to maintain under certain circumstances. High-load software projects always get complex somehow: in the same way that an F1 car needs a highly complex system, a good bunch of engineers and mechanics, and constant maintenance to go fast, so does a software project.

The key here is how soon you hit the complexity; ideally, a tool should become complex as your project does.

Anyway, the most interesting thing about Kafka is that it enables something pretty useful for projects at scale: a clear way to coordinate, measure, and scale data consumption. I wonder why HTTP load balancers didn’t adopt Kafka’s consumer approach. And the same kind of metrics. And a non-HTTP protocol and a way to coordinate workers :)

Coordinate: Kafka doesn’t rely only on the server to do things; clients know things like which partitions they should consume or how to balance load between brokers.

Scale: when consuming, it is clear when you need more computing power just by looking at the number of unprocessed messages. This sounds obvious, but it’s not, at least to me. It’s a pretty old queueing problem that is useful to understand and measure (there is a lot of theory about queueing from the early days of telecommunication systems), but it’s usually hidden in most pieces of software.
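Here is roughly what both ideas look like from the client side, as a sketch with kafka-python; the topic, consumer group name, and broker address are assumptions for the example.

```python
# Sketch: join a consumer group and measure how far behind we are.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics-workers",  # consumers in the same group split the partitions
    auto_offset_reset="earliest",
)

# Poll once so the group coordinator assigns this client its share of partitions.
consumer.poll(timeout_ms=5000)

# Lag per partition = last offset in the partition - next offset we will read.
# When this number keeps growing, it's time to add consumers (up to the
# partition count) or more processing power.
end_offsets = consumer.end_offsets(list(consumer.assignment()))
for tp, end in end_offsets.items():
    lag = end - consumer.position(tp)
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
```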

Kafka is not the only tool that does these things; I guess all message brokers do, but I’ve never used them.

Anyway, some practical information that I didn’t know but wish I had known:

  • You need to know more things than I expected to be able to use it: partitions, messages, assignments, brokers, consumers, different kinds of timeouts... and that’s just the surface.
  • Setting up a cluster by yourself is a pain in the ass. Using Confluent Cloud is expensive, but it just works; that’s why people use it.
  • Even for a local install, Kafka is a PITA. I recommend using Redpanda: a single command and you have a Kafka-compatible API up and running. No ZooKeeper, no docker-compose up, no Kubernetes.
  • Kafka does not have a good HTTP interface (for a good reason). The HTTP Proxy is the default choice, but Confluent Cloud does not provide it (Redpanda does). I think using a custom protocol is the right choice, but there is an impedance mismatch between HTTP-based systems and Kafka (although not in all cases). See the sketch after this list for what the HTTP route looks like.
  • Partition number, compression, and all the settings become relevant when you are in trouble.
  • Managing errors in consumers is hard. The good part is that the data is in Kafka, so you can replay it; the bad part is that if the problem is a logic problem (not a transient glitch), you are in bigger trouble. This article explains it pretty well.
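For the record, the HTTP route mentioned above looks roughly like this against the Kafka REST Proxy API (the same API Redpanda exposes); the proxy URL and topic name are assumptions for the example, not a recommendation of our setup.

```python
# Sketch: produce a record over HTTP via the Kafka REST Proxy (v2 API).
import requests

resp = requests.post(
    "http://localhost:8082/topics/events",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"value": {"type": "pageview", "path": "/pricing"}}]},
)
resp.raise_for_status()
print(resp.json())  # the proxy answers with the partition and offset of each record
```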

So, if you ask me, you need to know what you are doing before you add Kafka to your stack, but it does bring really good architectural patterns when your platform usage starts to grow.

We have been happily ingesting millions of events per minute for some months now, without any major problems. Actually, Kafka saved us from some complicated situations. A fast HTTP server writing batches to a file would probably also have worked, but I think that when you want to scale things up without much pain, Kafka is a good solution.