Flink has been around for years, adopted primarily by Alibaba and other tech giants. It wasn’t until recently that it began to be marketed to regular tech companies as the ultimate tool for ultra-low latency, complex event processing with exactly-once semantics. On paper, those sound like must-haves. Who wouldn’t want ultra-low latency?
In practice, shaving off those extra 50 milliseconds costs you massive complexity and operational pain. Everyone loves being paged at 4:00 AM because some distributed system went sideways, right?
This isn’t new. Aiven already pointed out Kafka’s 80% problem: Kafka is powerful, but overkill for most use cases. Flink is a 95% problem: even fewer people need it.
Why? Because most “real-time” problems don’t require academic-level engineering:
~65% can be solved with a simple HTTP service + Postgres.
~25% need something faster, and an OLAP database (like ClickHouse or even DuckDB) would be a good fit.
~5% build custom solutions because they want control.
Only ~5% actually need Flink. Think of very strict, mission-critical use cases like some of Uber’s workloads.
Even Flink’s creators admit it doesn’t work that well for many money-making workloads. This makes sense: a monolithic system that tries to do it all ends up doing nothing superbly. One of the creators of Apache Samza (an older-generation stream processor) fears Flink is repeating the same “do everything” trap that Samza fell into. Privately, Flink developers admit the platform is too complicated.
After years of effort, they sold the project to two different companies, and neither seems to be building a huge business on it. In the last few months, Managed Flink (Immerok) was acquired by Confluent after just 9 months of operation, and Decodable (Flink-powered) was acquired by Redis 3 years after its Series A. These companies would not have sold so early if there had been significant promise. Confluent hasn’t shown significant Flink revenue so far either: $10M ARR out of $1B ARR as per the latest public numbers, when they should be at $30M if they kept up the 3x growth they advertise.
In software engineering, chasing “perfect” often turns small problems into complex monsters, usually justified by “everyone else does it this way.” K8S is a rare example where being complex from the start didn’t prevent it from winning the market (maybe that’s what made it attractive), but this is not the norm.
Going from good enough to really good is not a linear effort. You need an engineering mindset to choose the right tradeoff between effort and reward.
That’s why most of us don’t need Flink.
But let’s go through the use cases Flink is good for one by one:
Ultra-low latency: Rare in real life. Examples include fraud detection, HFT, and gaming telemetry. The reality is that most apps are fine under 1 second, and databases like ClickHouse can already deliver <100ms end to end. All you need is good old SQL, no new rules to learn. I’ve seen genuinely low-latency needs (under 20ms end to end) in some industries (gaming, finance, transportation, telecommunications, ads), and all of them have ad-hoc solutions for those needs. They don’t use Flink.
Windowed aggregations & ETL: Native SQL engines like ClickHouse handle them more simply and with less infra. A single ClickHouse instance can process millions of records per second while doing somewhat complex logic. It’s true that late-arrival handling is harder, but needing it is usually overengineering. In cases where you do need to handle late arrivals because the use case is “critical”, you should rely on OLTP (see exactly-once).
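To make that concrete, here is a minimal sketch of a one-minute tumbling aggregation in plain ClickHouse SQL. The `events` table and its columns are hypothetical, just something to aggregate over:

```sql
-- Hypothetical table: events(ts DateTime, user_id UInt64, amount Float64)
-- One-minute tumbling windows over the last hour, computed at query time
SELECT
    toStartOfMinute(ts) AS minute,
    count()             AS events,
    sum(amount)         AS total_amount
FROM events
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```

If the table grows large you can pre-aggregate the same query with a materialized view, but either way it stays plain SQL: no watermarks, no job graph, no checkpoints.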
Complex event processing (CEP): Flink has nice SQL functionality for these use cases (see MATCH_RECOGNIZE), but (a) the use cases are niche, and (b) ClickHouse can solve them with, again, simple SQL that takes a few minutes to write and that everybody can understand later.
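As an illustration (one possible ClickHouse equivalent, not Flink’s MATCH_RECOGNIZE), a funnel-style pattern over a hypothetical `events` table can be expressed with the built-in windowFunnel aggregate:

```sql
-- Hypothetical table: events(ts DateTime, user_id UInt64, event String)
-- Users who viewed, added to cart, and purchased within 10 minutes, in that order
SELECT
    user_id,
    windowFunnel(600)(ts,
        event = 'view',
        event = 'add_to_cart',
        event = 'purchase') AS steps
FROM events
GROUP BY user_id
HAVING steps = 3;
```

It’s a query anyone on the team can read, run ad hoc, and tweak without redeploying a streaming job.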
Exactly-once: It’s an expensive guarantee. Applications that require it usually rely on Postgres or MySQL: battle-tested technologies, integrated with most programming frameworks, supported by a large developer pool, with plenty of offerings and an easy-to-understand transactional model. You may then do CDC into Kafka, but the real application logic lives somewhere else. On the analytics side of things, you can use (again) ClickHouse with ReplacingMergeTree and FINAL to get the same behavior. Most use cases are fine with a few duplicates or missing values from time to time. In real-life systems there are a lot of parts involved from end to end, and getting exactly-once right in every one of them is nearly impossible. It’s hard to get right in a single system (just check the Jepsen analyses), let alone across a bunch of chained services. Systems have dupes and missing values from time to time; whoever says the opposite hasn’t spent enough time in this industry.
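For the analytics-side deduplication mentioned above, a minimal sketch with ReplacingMergeTree and FINAL (the `orders` table is hypothetical) looks like this:

```sql
-- Hypothetical table: duplicate rows collapse to the latest version per order_id
CREATE TABLE orders
(
    order_id UInt64,
    status   String,
    amount   Float64,
    version  UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY order_id;

-- FINAL deduplicates at read time, even before background merges have run
SELECT count() AS orders, sum(amount) AS revenue
FROM orders FINAL
WHERE status = 'paid';
```

Re-delivered rows from Kafka or CDC simply overwrite each other by key, which is effectively idempotent writes instead of distributed exactly-once machinery.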
Why would you complicate your stack?
Running Flink means a whole new system, new skills, and super-expensive engineers. You can find people who know regular SQL in any Starbucks (with LLMs, the entry barrier is near zero today). And it does not end there:
A totally new API to learn
A whole new platform to build - to go beyond a POC, you need to buy/build a bunch of things (UI, lineage graphs, data catalogs, CI/CD glue)
Up to two new systems to deploy:
- You likely need Kafka if you’re going to use Flink. Running Postgres or ClickHouse without Kafka is common, so that’s one less thing to learn.
- HA Flink needs ZooKeeper or Kubernetes. If you aren’t running on k8s, or decide to run Flink in standalone mode for whatever reason, you need to deploy ZooKeeper.
- Now copy all of this to dev and staging too
Schema evolution troubles: Kafka is well-known for being schemaless by default and difficult to deal with during schema evolution. Flink inherits that from its data source.
Config complexity hell - Kafka is said to have over 300 knobs, but Flink exposes 700+. These configurations are a real tax on anybody who is not an expert and is not using the system to its full capacity.
Observability. With so much complexity and points of failure (source data schema, Kafka, Flink, ZooKeeper), you need to hire a whole SRE team to figure out which metrics to graph, how to tune actionable alerts on them, and how to deal with them.
- In stream processing, the semantics require more specific monitoring. Things like input data lag, watermarks, checkpoint health, and local state growth all need to be alerted on.
Testing and upgrade treadmill - in the worst case, with 3 new systems, you overcomplicate your end-to-end testing. Upgrades become tricky too, as one version bump requires others for compatibility - you have to worry about the inter-dependencies of rolling upgrades (what goes first). Rollback also becomes more difficult.
Flink may or may not be here in 10 years; SQL will be. You could argue there could be different implementations of the Flink API/SQL, but every new streaming engine creates its own standard (as opposed to SQL), and just look at how much the landscape has changed in the last decade.
Flink is JVM-based, so you normally write apps in Java (or another JVM language). Nothing against Java, but it’s a blocker for many non-Java shops. A PyFlink Python client exists too, but it’s a second-class citizen that lacks feature parity.
App complexity can explode further if you end up using different Flink APIs for different jobs (Java, SQL, Python), for example because one client doesn’t support what you want to do.
Even if you have experienced developers, there are a few challenges that are not really well solved with Flink, and the Decodable people explain them pretty well in this post. Or, if you want some deeper thoughts on internals, there is this other one.
But there is also a more powerful point: would you rather solve these problems using the programming language of your choice (or a simple SQL database), or use Flink?
Every major programming language has a solid Kafka library for consuming topics, so doing the processing there will solve most problems. There’s no need to teach your team new things, and the tools are the same as in other parts of the stack. I can’t emphasize enough how important it is to keep the tooling as clean as possible in a small-to-medium-sized company (under 100 developers).
And if you want to do analysis, a ClickHouse cluster is near real-time, a well-known pattern for most developers (regular SQL, pull queries), and will solve the same use cases. It is cost-efficient, works better with large amounts of data, and also enables non-developers (analysts, data scientists, BI people) to process and access the data - so much so that somebody built a stream processor on top of ClickHouse (check out Timeplus Proton).
I wouldn’t say “never use Flink.” There are rare cases where it makes sense, but not every company is Netflix or Uber. It really needs to be worth it, and in my view, Flink isn’t the general-purpose processing framework we should rely on. Confluent’s public report also talks a little bit about this, as I mentioned before, and just $10M ARR for a company doing $1B+ ARR is peanuts. At that size, with thousands of customers, even a small push on sales engineering should generate way over $10M ARR in upsells.