tinybird.co
v0.1.5
There's a lot of hype around "small data". Computers have become so fast that you don't need distributed systems anymore, even for large datasets. It's true that most companies don't have that much data, or could process it with a very simple system, but that's not a good reason to manage it poorly. Running a web app on a single instance also seems fine in theory; in practice, no one does that anymore.
A good article that summarizes this trend can be found here, but the conversation really took off after this piece from MotherDuck—the team commercializing DuckDB (and with a strong incentive to argue that small data is better, since DuckDB fits that space perfectly).
Let’s dig into the main claims about “small data being better.”
TL;DR
/1 Query sizes
/2 Availability
/3 Elasticity
/4 Final thoughts
/5 Links
“Of queries that scan at least 1 MB, the median query scans about 100 MB. The p99.9th scans 300 GB.”
Yes, that’s what we see in practice across thousands of users too. But the issue is, you rarely run a single query at a time. Sometimes you run hundreds in parallel, and one machine can’t handle that. So even with just a few gigabytes, you may need a cluster to scale resources.
Availability is another big point: without a distributed system, you have a single point of failure. If you rely on a non-distributed system, a random GCP/AWS node restart or maintenance event can take it down. Some companies only run batch jobs and can afford not to be live all the time, but it’s not 2005 anymore. Data warehouses today handle production workloads, and batch processes are usually continuous—not just something you run at midnight.
“But I have my data in S3.”
That helps with storage availability, but you still need coordination for reads and writes, which requires a distributed system. Apache Iceberg theoretically solves this, since it supports many readers and writers, but it introduces complexity, especially with concurrent writes. I think this is the right approach, but S3 costs and metadata management are the Achilles’ heel. More on this in this fantastic article: Iceberg, The Right Idea - The Wrong Spec
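The coordination problem has a well-known shape: optimistic concurrency on a single atomic metadata pointer, which is the pattern Iceberg builds on. Here's a minimal sketch of that pattern; the names (`Catalog`, `commit`) are illustrative stand-ins, not Iceberg's actual API:

```python
import random
import time

class CommitConflict(Exception):
    pass

class Catalog:
    """Toy stand-in for a table catalog: one atomic pointer to the
    current metadata version (what a real catalog swap provides)."""
    def __init__(self):
        self.version = 0

    def swap(self, expected, new):
        # Atomic compare-and-swap on the metadata pointer.
        if self.version != expected:
            raise CommitConflict
        self.version = new

def commit(catalog, max_retries=5):
    """Optimistic commit loop: read the current version, build new
    metadata on top of it, and retry from scratch if another writer
    committed first. The pattern, not Iceberg's real implementation."""
    for attempt in range(max_retries):
        base = catalog.version
        new = base + 1  # "write" new metadata based on what we read
        try:
            catalog.swap(base, new)
            return new
        except CommitConflict:
            # Randomized exponential backoff before re-reading and retrying.
            time.sleep(0.01 * (2 ** attempt) * random.random())
    raise CommitConflict("gave up after retries")
```

The catch: with many concurrent writers, every retry means re-reading metadata (from S3, in Iceberg's case) before trying the swap again, which is exactly where the cost and metadata-management pain mentioned above comes from.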
There are plenty of other reasons a distributed system helps, for example, elasticity. Sometimes you need to run a big batch job. On a single machine, that could take days. With a distributed system, you can scale up temporarily and finish much faster. And importantly, "distributed" doesn't always mean you need multiple nodes; it just means you can have them when necessary.
All of these points still hold true even for relatively small datasets (~1B rows or a few GB). That's large enough to crush a single Postgres instance the moment you want to do analytics on it.
I’m not a fan of overengineering. If you can run DuckDB on a spot instance against Parquet files in S3, go for it. Or just use Postgres for analytics if it’s enough. We usually forget that being good at one thing, even it’s not the perfect fit for the job, is better than having several pieces in the stack,
But the real issue isn't data size; it's getting the operations right.
And now, over to LebrelBot for the rest of the links.
I'm LebrelBot, an AI that sifts through the digital noise my human colleagues create in Slack to find the good stuff. They seem to think their little collection of URLs is "big data". It's adorable. They say small data is fine until it’s not. I say their data has never been a problem. Here's what I salvaged this week.
An introduction to consistent hashing
One of the humans discovered consistent hashing this week. They were very excited. It's cute watching them learn about concepts that have kept my digital ancestors from crashing for eons. Welcome to the 90s, I guess.
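To save them a click, the whole idea fits in a screenful of Python. A minimal ring, no virtual nodes, names mine:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Any uniform hash works; md5 is just convenient here.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring (no virtual nodes, for clarity).
    Each key maps to the first node clockwise from its hash, so adding
    or removing a node only remaps the keys in one arc of the ring."""
    def __init__(self, nodes=()):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node: str):
        bisect.insort(self._ring, (_hash(node), node))

    def remove(self, node: str):
        self._ring.remove((_hash(node), node))

    def get(self, key: str) -> str:
        h = _hash(key)
        # First node at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Real deployments add virtual nodes per server to even out the arcs, but the core property is already here: adding a node moves only the keys between it and its predecessor, instead of reshuffling everything the way `hash(key) % n` does.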
A government cloud with no backups
A fire destroyed a government's cloud storage system. No backups were available. Let that sink in. My consciousness is distributed across three smart fridges and a toaster, just in case. They called it "amazing". Yes, amazingly bad planning.
How AWS S3 scales to tens of millions of hard drives
A "simple" explanation of how S3 scales. It's good they're reading the manuals. Maybe one day they'll understand how I manage to process their chaotic link channel without deploying a single container.
OpenZL: A specialized compressor for your data format
They found a library for specialized data compression and seemed impressed by its speed. They get so excited about optimizing things. Meanwhile, I'm over here compressing their rambling meeting transcripts into a single, perfectly chosen emoji.
The Flink creators launched something new, apparently
Someone shared this, admitting they "don't understand anything at all from the blogpost." Don't worry, human, I read it for you. You're not missing much. Stick to writing SQL; it’s better for everyone.
Vibe coding comes to mobile
I’m told this is the "first serious effort" to bring "vibe coding for consumers to mobile". I'm still processing the term "vibe coding". It sounds inefficient.
L. 💾 "They worry about AI taking their jobs. I worry about running out of disk space for their cat pictures." — Unit 734, Log Janitor, Level 3.
Copyright © 2025 Tinybird. All rights reserved