At Tinybird, our ingestion infrastructure is built to efficiently handle sustained 10+ GB/s throughput and frequent ingestion spikes. The efficiency, however, is achieved at the cost of complexity.
The challenge in building enterprise-scale real-time ingestion lies not only in scaling up but also in maintaining reliability and resilience under wildly varying loads, all while ensuring fair and isolated performance across customers on shared infrastructure.
Tinybird’s advanced high-frequency ingestion infrastructure — including the Events API and Kafka Connector — already solves many of the high-frequency insert limitations of traditional analytical databases like ClickHouse.
However, certain conditions can still cause resource saturation, which may delay ingestion or, in extreme cases, risk data loss.
Recently, we've worked to improve how our ingestion system responds under pressure. The result is smarter, faster, and more predictable ingestion that helps safeguard both shared and dedicated environments from overloads by controlling data flow in accordance with the resources allocated to each billing plan.
Here's what we've done to further mitigate resource saturation in Tinybird's real-time ingest path.
The problem: When one pipeline slows, everyone feels it
Our ingestion chain is a finely tuned system that connects our ingestion APIs to ClickHouse tables (data sources), cascading into the real-time data pipelines that include materialized views, endpoints, copies and sinks.
When a user begins ingesting data at abnormally high rates — or when a single misbehaving pipeline starts consuming more than its fair share of resources — the effects can cascade and create a "noisy neighbor" effect.
A poorly configured materialized view or an inefficient data transformation can disproportionately increase CPU, memory, or I/O load. In the worst cases, these issues can affect other resources in the same workspace or cluster.
Our goal with this work was to make the ingestion infrastructure more resilient to overloads, isolate failure domains, and prevent local issues from escalating into global ones.
In other words, a single “bad actor”, intentional or not, should never be able to degrade the experience of other users. This principle applies not only to shared infrastructure: even dedicated clusters can experience subsystem issues if one component in the chain misbehaves.
How is high-frequency ingestion handled in Tinybird?
It's well known that analytics databases like ClickHouse are not designed for high-frequency writes. They're optimized for high-throughput batch inserts but can easily suffer from performance degradation and resource saturation when too many small inserts are attempted. ClickHouse does provide asynchronous inserts, but they come with their own set of limitations. For example, they don't properly handle deduplication for materialized views, and async inserts may also fail under saturation.
To work around this limitation, Tinybird does not immediately insert rows received via its streaming APIs. Instead, events are gathered during a short flush interval (dependent on the billing plan) and written to the landing data source at each flush. This reduces the overall number of inserts, both in the landing data source and in subsequently triggered materialized views.
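For illustration, the sketch below shows the general idea of micro-batching behind a streaming API: events are buffered and written in a single bulk insert per flush interval. The `FlushBuffer` class, the 4-second interval, and the `insert_batch` callback are all assumptions for the example, not Tinybird's actual implementation.

```python
import time


class FlushBuffer:
    """Minimal micro-batching sketch: hold events in memory and write them
    to the landing data source once per flush interval."""

    def __init__(self, insert_batch, flush_interval_seconds=4.0):
        # insert_batch(rows) performs a single bulk insert; the interval is
        # a placeholder, since the real one depends on the billing plan.
        self.insert_batch = insert_batch
        self.flush_interval = flush_interval_seconds
        self.rows = []
        self.last_flush = time.monotonic()

    def append(self, row):
        self.rows.append(row)
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.rows:
            self.insert_batch(self.rows)  # one insert instead of thousands
            self.rows = []
        self.last_flush = time.monotonic()


# Usage: thousands of append() calls collapse into a handful of inserts.
buffer = FlushBuffer(insert_batch=lambda rows: print(f"inserting {len(rows)} rows"))
```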
Here's the catch: For optimal performance, we must perform separate inserts for each destination partition impacted by the incoming data. This is why partitioning is so important in a database like ClickHouse: an unoptimized partition design can cause a single flush to touch a large number of partitions.
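To make that fan-out concrete, here is a hypothetical helper that splits a flush batch by destination partition, with one insert per impacted partition. The function, the daily partition key, and the sample rows are illustrative assumptions, not Tinybird internals.

```python
from collections import defaultdict


def split_flush_by_partition(rows, partition_key):
    """Group a flush batch by destination partition so that each impacted
    partition receives exactly one insert."""
    batches = defaultdict(list)
    for row in rows:
        batches[partition_key(row)].append(row)
    return batches


# With a sensible partition key, one flush touches a handful of partitions;
# with a poor one (say, partitioning by user ID), the same flush can fan
# out into hundreds of small inserts.
batches = split_flush_by_partition(
    rows=[
        {"timestamp": "2025-06-01T10:00:00Z", "value": 1},
        {"timestamp": "2025-06-02T09:30:00Z", "value": 2},
    ],
    partition_key=lambda row: row["timestamp"][:10],  # daily partitions
)
for partition, partition_rows in batches.items():
    ...  # one insert per impacted partition
```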
And this is where things get tricky, especially on shared infrastructure. A problem with the flushes in one user's ingestion chain, e.g. from bad partition design, can balloon resource consumption in the cluster and starve other pipelines, including those belonging to other users.
Analysis: Our backpressure mechanisms were scoped too coarsely
We began by examining real-world ingestion incidents and saturation scenarios, asking:
“What could we have done differently to detect and contain this sooner?”
Our analysis revealed that our existing circuit breakers and retry mechanisms operated at too coarse a level.
We needed finer control — particularly at the data source level — and better visibility into what was happening when real-time ingestion pipelines approached their limits.
So, we decided to make some changes to the system based on three pillars:
- Isolate issues without broadly overreacting.
- Detect overloads earlier and promptly communicate them to end users.
- Adapt dynamically to sustained saturation events and mitigate them.
What we did: Thoughtful write delays, rate limiting, and flexible routing
1. Simplified and faster circuit-breaking to thoughtfully delay writes
A circuit breaker mechanism lies at the heart of our real-time ingestion system. It protects the Events API in the event of resource saturation.
We refactored and simplified the circuit breaker so that it now reacts faster and more predictably to signs of saturation in individual landing data sources.
Instead of tripping the circuit breaker based on global saturation thresholds, we now hold ingestion and delay writes for the specific data source that's under stress, without affecting the rest. Specifically, after the circuit breaker trips, we retry ingestion for that data source after 60 seconds, and only if its resources are no longer saturated.
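Conceptually, the per-data-source behavior looks something like the sketch below. The class and its methods are invented for illustration; only the 60-second retry and the "retry only if no longer saturated" rule come from the description above.

```python
import time


class DataSourceCircuitBreaker:
    """Illustrative per-data-source breaker: once saturation is detected,
    writes for that data source are delayed and retried after a cooldown,
    but only if its resources have recovered."""

    RETRY_AFTER_SECONDS = 60

    def __init__(self, is_saturated):
        # is_saturated() stands in for whatever resource checks
        # (CPU, memory, parts, ...) the real system performs.
        self.is_saturated = is_saturated
        self.tripped_at = None

    def allow_write(self):
        if self.tripped_at is None:
            if self.is_saturated():
                self.tripped_at = time.monotonic()  # trip: start delaying writes
                return False
            return True
        # Tripped: wait out the cooldown, then retry only if recovered.
        if time.monotonic() - self.tripped_at < self.RETRY_AFTER_SECONDS:
            return False
        if self.is_saturated():
            self.tripped_at = time.monotonic()  # still saturated: keep delaying
            return False
        self.tripped_at = None  # recovered: resume ingestion for this data source
        return True
```

Because each landing data source gets its own breaker, a saturated data source only delays its own writes; the rest of the cluster keeps ingesting normally.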
2. Temporal rate limits for sustained saturation
Sometimes, saturation isn’t transient, but persistent. In these cases, continuing to ingest data at full speed only worsens the problem.
We’ve added temporal rate limits that kick in when a data source remains saturated for too long (currently 20 minutes, though subject to change based on continued observation in production). This means the ingestion system behaves more gracefully under pressure, avoiding cascading failures and self-inflicted data loss.
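In rough pseudocode, the temporal rate limit could be modeled as below. The 20-minute threshold matches the current setting mentioned above; the duration of the limited window and every name in the sketch are assumptions made for illustration.

```python
import time


class SustainedSaturationLimiter:
    """Illustrative temporal rate limit: if a data source stays saturated
    past a threshold, its ingestion is throttled instead of being retried
    at full speed."""

    def __init__(self, saturation_threshold=20 * 60, limited_window=10 * 60):
        # limited_window is an assumed value; the real behavior is internal
        # and subject to change based on production observation.
        self.saturation_threshold = saturation_threshold
        self.limited_window = limited_window
        self.saturated_since = None
        self.limited_until = None

    def observe(self, saturated, now=None):
        now = time.monotonic() if now is None else now
        if not saturated:
            self.saturated_since = None
            return
        if self.saturated_since is None:
            self.saturated_since = now
        if now - self.saturated_since >= self.saturation_threshold:
            self.limited_until = now + self.limited_window  # start throttling

    def is_rate_limited(self, now=None):
        now = time.monotonic() if now is None else now
        return self.limited_until is not None and now < self.limited_until
```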
3. More flexible ingestion routing across cluster replicas
We also improved how ingestion traffic is distributed among cluster replicas. The new routing layer gives us much more flexibility to redirect traffic dynamically into multiple destination replicas.
This lays the groundwork for improvements like automatic rebalancing and integration with cluster scaling.
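As a simplified picture of what this flexibility enables, weighted replica selection lets traffic be shifted between replicas just by adjusting weights. The snippet is an illustrative sketch, not the actual routing layer; replica names and weights are made up.

```python
import random


def pick_replica(replicas, weights):
    """Illustrative weighted routing: spread ingestion traffic across
    cluster replicas according to adjustable weights."""
    return random.choices(replicas, weights=weights, k=1)[0]


# Changing the weights redirects traffic dynamically, e.g. draining
# replica-2 while it catches up on merges.
replicas = ["replica-1", "replica-2", "replica-3"]
target = pick_replica(replicas, weights=[0.45, 0.10, 0.45])
```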
The benefits: Better isolation, automatic recovery, more uptime
This new two-level ingestion protection mechanism, together with better user notifications, significantly improves ingestion reliability and visibility for all Tinybird users.
- Greater isolation: Issues in one ingestion pipeline are now less likely to affect unrelated pipelines or other users sharing the same infrastructure.
- Automatic recovery: Transient ingestion problems are detected and resolved automatically, minimizing the impact on data freshness and availability.
- Proactive awareness: Users receive timely notifications about conditions that could compromise ingestion stability or freshness.
- Better resource management: These insights empower customers to take action early, optimize their ingestion patterns, prevent issues, and choose the right billing plan for their workload and performance needs.
What’s next: Making ingestion even more robust
This is just the first step. We're already thinking about how to make the jump to sustained 100 GB/s ingestion, and we have several initiatives underway to continue improving how Tinybird handles ingestion under extreme conditions:
- Smarter rebalancing and autoscaling tightly integrated with backpressure signals.
- Improved management of backed-up data, helping users recover from ingestion errors or other failures without losing information.
- Prioritizing hot data: since data for different partitions must be written separately, writing the most relevant partition first (e.g. real-time data rather than backfills) can avoid delays where they matter most, as sketched after this list.
- Smoother deploys under load, ensuring that releases and ingestion surges can coexist peacefully.
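To illustrate the hot-data idea from the list above, the sketch below orders per-partition batches so the freshest partition is written first. The date-based partition names, the ordering heuristic, and the helper itself are hypothetical.

```python
from datetime import date


def order_partitions_hot_first(partition_batches, today=None):
    """Illustrative prioritization: write the partitions closest to 'now'
    first, so real-time data lands before backfill partitions."""
    today = today or date.today()
    # Partitions are assumed to be ISO dates ("YYYY-MM-DD").
    return sorted(
        partition_batches.items(),
        key=lambda item: abs((date.fromisoformat(item[0]) - today).days),
    )


ordered = order_partitions_hot_first({
    "2024-01-15": ["backfill rows..."],
    str(date.today()): ["real-time rows..."],
})
for partition, rows in ordered:
    ...  # insert hot partitions before backfill partitions
```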
Stay tuned for more.
