The hard parts of building massive data systems with high concurrency
Remember Big Data?
The phrase has been around since like the 90s, but it really started crashing the party about 15 years ago. Web 2.0 and its posse of user-generated content were the bullies on the playground, and Hadoop was summoned by concerned parents to police things. Meanwhile, SQL was having a moment.
But when “Big Data” was “big”, the internet was actually pretty small, relatively speaking. A report sponsored by IDC in 2007 chalked up the internet at about 280 exabytes back then. It’s a lot bigger now. Like, a lot.
Obviously attempts to “measure the internet” are akin to measuring the size of the universe. It’s highly nondeterministic at best.
The point is this: Data is becoming a (big) problem for a lot of companies, in particular those trying to do something with it. Storing it is one thing, and we seem to have a handle on that for now. Processing it? We’re somewhere between newborn and crawling, with a trail of spit up and tears to show for it.
But this is what we want: To be able to process and gain insight from massive amounts of data, often of the streaming variety. Everybody wants to build high concurrency, low-latency data products on top of data sources generating giga-, tera-, peta-, whatever-bytes a day.
This is hard. We know. Because it’s our job.
This is the basic problem that our customers come to us with, broken into 2 simple parts:
- You have some system that generates a ton of data. Maybe most of it is user-generated. This is usually some kind of a web app or ecommerce site, but we can’t rule out other things like IoT.
- You need to extract useful information from that data and either report it back to users in a web application or otherwise do something with that data in realtime.
Imagine a website visitor counter but on steroids. Lots of steroids.
In our experience, creating these kinds of systems is much harder than many organizations anticipate. It’s not just an infrastructure problem either. Sure you could horizontally scale forever, but now you’re just robbing Peter to pay Paul. This is about understanding how data in motion works, how to measure what’s not working, and then how to do something about it.
We’re writing this blog because we’ve learned a lot in this space, and we wanted to share our experience. When it comes to building high concurrency, low-latency systems on top of huge data volumes, these are the challenges you should expect:
- Ingestion at scale doesn’t take kindly to maintenance.
Ingesting data at scale seems pretty easy with the various data streaming products out there. The reality is that streaming 100k events per second and storing it without losing data is really hard. Machines go down, servers need upgrades, databases must be maintained, schemas must be evolved, deployments, and on and on. This really is an infrastructure problem, and just because you can scale doesn’t mean you’ll scale well.
- High concurrent reads will make your products feel slow.
As you start to expose this data back to your consumers via API endpoints, you should expect high concurrent reads, perhaps on the order of 1,000+ QPS. You generally don’t want latency to creep above 150 ms. Any more than that and the user is going to start to “feel” it. You should measure your latency the right way, because the reality is your website will feel slow even if just P99 latency is slow. So don’t trust your UX to vanity metrics like average response time.
- Ingestion can kill reads if you aren’t careful.
Too many times have I seen people running benchmark tests on static data systems. This is crazy. Static systems are like 100x easier to manage than dynamic systems. In a static test you can cache stuff, but then it’s not real time. When you benchmark a system, make sure you ingest at the same time that you query.
- Not all endpoints query the same amount of data.
As always, the Pareto principle applies: 20% of the resources you create will get 80% of the requests and account for the majority of your scanned data. If you are fetching stats for a resource or client with high usage, these analytics endpoints will need to process more data to get the same metric. So when you run benchmark tests, it doesn’t make sense to distribute endpoint tests across a random sample of parameters, because a random distribution might not accurately capture those 20% of calls that need more resources. Rather, you should stratify your tests across endpoint parameters to get an accurate representation of the whole population.
- Handling one-off massive scale events is really hard.
Scaling fast is nearly impossible with a system like this. Don’t go into these kinds of endeavors thinking you’ll simply “scale dynamically.” Be prepared to start devising a plan for Black Friday (or similar such events) like… now. Clock’s tickin’.
- Frontend design has a huge impact.
Bad frontend design can have serious UX consequences when you’re dealing in realtime latencies. It doesn’t matter if the endpoint takes 10ms if the frontend needs 400ms to render the results. Also, it can get really easy to take down the system if you start overloading it with superfluous requests. Focus on calling just the data the user is actually interacting with (similar to lazy image loading). Cache data that you’ve already fetched and anything that doesn’t need to be real time.
- Can you keep systems running while making schema changes?
As you build and manage this system, one thing is for certain: Your data model and queries will change. Do you know how to keep everything running, avoid data loss, and keep your endpoints performant while you change the schema, generate a materialized view, change the endpoints, and so on?
- Monitoring performance can generate a ton of data.
At this scale, observability basically becomes its own data project. Think about it: If you want to track something simple like P99 response time for a system serving 500 QPS, that’s more than 43M rows of data every day. Are you prepared to manage all your observability metrics data, too?
So that’s it. Eight challenges you should absolutely expect when you start playing in the realtime, “big data” world. I’m sure there are more, but this should be plenty to get you thinking in the right direction.
If you want to talk through any of these, you can find me in the Tinybird community Slack.