We were paying for petabytes of S3 objects nothing was reading. Last month we cleaned them up and our object storage bill dropped ~45%. We also almost lost real data along the way. Here's what happened.
How do we deal with orphaned cloud objects?
At Tinybird, we run large-scale ClickHouse® clusters backed by object storage. Like many teams operating distributed storage systems at scale, we’ve spent a lot of time thinking about replication, consistency, and failure recovery.
One issue that kept growing in the background was cloud storage garbage: objects that were no longer being used, but also never deleted. Just sitting there, racking up costs.
Over the last month, we investigated where this garbage was coming from, improved our cleanup tooling, and recovered tens of thousands of dollars in monthly storage costs. But to be honest, the best part wasn't the savings; it was how much we strengthened our operational safety and recovery procedures along the way.
Why do cloud objects become orphaned?
When using zero-copy replication, ClickHouse stores data remotely and allows multiple replicas to reference the same files.
To coordinate this safely, ClickHouse maintains metadata in ZooKeeper describing which replicas are using each data part. In practice, these references behave like a distributed reference counter: replicas create references when new parts are attached or replicated and remove them when parts disappear, and an object can only be deleted once no references remain.
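The reference-counting behavior described above can be modeled in a few lines. This is a toy in-memory sketch, not ClickHouse's implementation: the `ZeroCopyRefs` class and its method names are hypothetical, and the real references live in ZooKeeper rather than in a Python dictionary.

```python
from collections import defaultdict


class ZeroCopyRefs:
    """Toy model of per-object replica references (hypothetical, not ClickHouse code)."""

    def __init__(self):
        # object key -> set of replica names currently referencing it
        self._refs = defaultdict(set)

    def add_ref(self, obj, replica):
        """Record that a replica attached or replicated a part using this object."""
        self._refs[obj].add(replica)

    def remove_ref(self, obj, replica):
        """Drop one reference; return True only if the object is now deletable."""
        holders = self._refs.get(obj)
        if holders is None:
            return False  # unknown object: nothing to release
        holders.discard(replica)
        if holders:
            return False  # another replica still needs the object
        del self._refs[obj]
        return True
```

In this model an object shared by two replicas only becomes deletable after the second `remove_ref` call. A reference that is never removed keeps the object alive indefinitely.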
For each replicated table, ClickHouse also tracks the active replicas separately. Under normal conditions, the list of active replicas and the list of replicas referencing data parts are the same.
But we found situations where they weren't.
When a replica is removed (now a common operation with our self-service cluster management tooling), its replication metadata disappears correctly, but under some conditions some zero-copy references get left behind. The replica no longer exists, yet the object still appears referenced. That stale reference prevents the object from ever reaching a reference count of zero.
At that point, the object is effectively orphaned: no live replica needs it, but storage cleanup will never delete it.
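One way to picture the detection step is to compare the two lists directly: an object is a candidate orphan when every replica still referencing it is absent from the active replica list. The sketch below assumes both lists have already been read into plain Python collections; `find_orphaned_objects` and its input shapes are illustrative, not part of our tooling or of ClickHouse.

```python
def find_orphaned_objects(refs_by_object, active_replicas):
    """Return objects whose remaining references all belong to removed replicas.

    refs_by_object: mapping of object key -> iterable of replica names (assumed shape).
    active_replicas: iterable of replica names that currently exist.
    """
    active = set(active_replicas)
    orphaned = {}
    for obj, holders in refs_by_object.items():
        holders = set(holders)
        # Skip objects with no references at all: normal cleanup handles those.
        if holders and not (holders & active):
            orphaned[obj] = sorted(holders)
    return orphaned
```

An object referenced by at least one active replica is never reported, even if it also carries stale references from removed replicas.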
Over time, these inconsistencies accumulated into a very large amount of undeleted storage.
Why this problem becomes expensive
At small scale, orphaned objects are mostly harmless. At large scale, they accumulate slowly and are hard to detect until the bill becomes extremely expensive.
After improving our garbage identification tooling and running it across clusters, we found several petabytes of deletable storage objects spread across different environments. In total, the cleanup represented a reduction of roughly 45% in cloud storage costs.
This is how our cloud storage looked pre and post change.
Cost reduction was only part of the story. The much harder challenge was determining which objects were actually safe to delete.
Building a garbage collector
We already had an internal garbage collector designed to identify orphaned remote objects. The process worked in three phases:
- collect all metadata from clusters,
- analyze which objects appeared unused,
- delete confirmed garbage.
However, after revisiting the tooling, we discovered several gaps:
- some metadata formats were not fully supported,
- automation around execution had been deprioritized,
- and visibility into collection completeness was insufficient.
After fixing those issues, we started running the collector cluster by cluster and finally obtained a much more accurate picture of storage garbage accumulation.
When cleanup goes wrong
Shortly after rolling out the cleanup process broadly, we encountered the failure we were most concerned about: we (temporarily) deleted legitimate data.
We'd been confident in our backup recovery capabilities going into this. The whole point was to ensure the operation wouldn't cause data loss even if something went wrong. And technically, we were right: we did recover everything. But what we underestimated was just how complex that recovery would actually be.
The deletion logic itself wasn’t wrong, but the collector relies on building a complete snapshot of active remote objects before it starts analyzing anything. During some runs, the collection step timed out silently while gathering metadata from clusters. That produced incomplete snapshots of the live dataset.
Later stages continued operating anyway. As a result, some objects were incorrectly classified as unused simply because the collector had not seen them during snapshot generation.
We didn't realize that valid data had been marked as garbage until ClickHouse attempted to read it and object storage started returning "file not found" errors.
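A fail-closed guard between the collection and analysis phases addresses this failure mode. The following is a minimal sketch under assumed data shapes, not our collector's code: `build_snapshot`, `find_garbage`, and `IncompleteSnapshotError` are hypothetical names, and a timed-out replica is represented as `None`.

```python
class IncompleteSnapshotError(RuntimeError):
    """Raised when at least one expected replica did not report its objects."""


def build_snapshot(expected_replicas, collected):
    """Merge per-replica object sets, refusing to continue if any are missing.

    collected: mapping of replica name -> set of object keys, or None when
    collection failed or timed out for that replica (assumed convention).
    """
    missing = sorted(r for r in expected_replicas if collected.get(r) is None)
    if missing:
        raise IncompleteSnapshotError(f"no metadata collected from: {missing}")
    live = set()
    for replica in expected_replicas:
        live |= set(collected[replica])
    return live


def find_garbage(all_objects, live_objects):
    """Objects present in storage that no replica in the snapshot references."""
    return set(all_objects) - set(live_objects)
```

The design choice is that a missing replica aborts the run instead of shrinking the live set. Without the guard, every object known only to the missing replica would appear in the output of `find_garbage`.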
Recovering deleted data
Recovery turned out to be just as hard as deciding what to delete in the first place.
The core difficulty was that remote objects are represented by opaque identifiers that bear no relation to their original file paths. Reconstructing the relationships between S3 blobs, ClickHouse parts, backups, mutations, and deduplicated files required combining metadata from multiple systems.
For example, on disk ClickHouse stores the part under the table UUID:
disks/<disk_name>/store/<table_uuid[0:2]>/<table_uuid>/<part_name>/
And under data/, the same part is represented by the database and table:
data/<database_name>/<table_name>/<part_name>/
Backups follow the second structure, while runtime storage paths normally use the first one.
During recovery, a blob reported as missing from object storage cannot be mapped directly to a backup path. First, we need to translate from the runtime layout, which uses the table UUID, to the backup layout, which uses the database and table names. Then we need to locate the corresponding part and file inside the backup chain.
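The layout translation step can be sketched as a small path rewrite. This assumes the missing blob has already been mapped to its runtime file path and that a lookup from table UUID to database and table name is available; `runtime_to_backup_path` and `tables_by_uuid` are hypothetical names, and the UUID-prefix directory is skipped without being validated.

```python
from pathlib import PurePosixPath


def runtime_to_backup_path(runtime_path, tables_by_uuid):
    """Rewrite store/<prefix>/<table_uuid>/<part>/... as data/<db>/<table>/<part>/...

    tables_by_uuid: mapping of table UUID -> (database_name, table_name),
    assumed to be collected from cluster metadata beforehand.
    """
    parts = PurePosixPath(runtime_path).parts
    if "store" not in parts:
        raise ValueError(f"not a runtime store path: {runtime_path}")
    remainder = parts[parts.index("store") + 1:]
    if len(remainder) < 3:
        raise ValueError(f"path too short to contain a part: {runtime_path}")
    # The first component is the UUID prefix directory; it carries no extra information.
    _prefix, table_uuid, *rest = remainder
    database, table = tables_by_uuid[table_uuid]  # KeyError for unknown tables
    return str(PurePosixPath("data", database, table, *rest))
```

This only produces the path to look for. Finding which backup in the chain actually holds the file is a separate search.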
Several additional complications made recovery even messier:
- backup layouts differed from runtime storage layouts,
- part mutations changed part names over time,
- and backup chains contained multiple generations of files.
In some cases, restoring one missing object only revealed additional missing objects from the same part afterward.
To recover safely, we had to:
- rebuild complete snapshots of remote object mappings,
- correlate deleted blobs with active parts,
- search across entire backup chains,
- account for mutations and renamed parts,
- and identify deduplicated files indirectly through metadata and checksums.
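The search across a backup chain, including the checksum fallback for renamed or deduplicated files, might look roughly like the sketch below. It assumes each backup can be listed as a mapping of file path to checksum and that the chain is ordered newest first; `locate_in_backup_chain` and these data shapes are illustrative assumptions, not the format of any real backup tool.

```python
def locate_in_backup_chain(chain, path, checksum):
    """Find a backup holding the wanted file content.

    chain: list of (backup_name, {file_path: checksum}) tuples, newest first
    (assumed shape). Returns (backup_name, file_path) or None.
    """
    # First pass: same path and same content, preferring the newest backup.
    for name, files in chain:
        if files.get(path) == checksum:
            return name, path
    # Second pass: same content under another path, e.g. after a mutation
    # renamed the part or when the file was deduplicated against another part.
    for name, files in chain:
        for candidate, candidate_sum in files.items():
            if candidate_sum == checksum:
                return name, candidate
    return None
```

A `None` result means the content is not in the chain under any name, which is the case that needs manual investigation.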
The investigation involved reconstructing relationships across millions of objects and many independent metadata sources. Tooling assistance (including LLMs) significantly accelerated parts of the recovery process during the incident response.
In the end, we got all the affected data back and validated our recovery procedures. The whole ordeal made our process considerably more robust.
Lessons learned
The biggest challenge in this project was not identifying garbage. It was proving that data was truly unused before we dared to delete it.
In ClickHouse clusters using replicated tables and zero-copy replication, deletion workflows are only as reliable as the metadata snapshots they depend on. In our case, incomplete collection snapshots caused legitimate objects to be classified as garbage, even though the deletion logic itself was correct.
That experience forced us to improve several parts of the system:
- stronger snapshot validation,
- safer orchestration between collection phases,
- better observability around storage metadata,
- and more robust recovery procedures when things go wrong.
We’re also rethinking how we operate the garbage collector itself. Right now, we’re still figuring out the right execution frequency to balance storage savings against operational risk.
Meanwhile, fixes to prevent the creation of new orphaned objects are already being rolled out across clusters.
