What if I told you I could save you $40,000 a day? Would you think Iâm crazy?
$40,000 may be an astronomical number, but with what some companies are spending on cloud data warehouse egress costs, the order of magnitude perhaps isnât so far out of the realm of reality.
Regardless of the dollar amount you assign to the problem, the point is this: if you donât take care to craft your data pipelines, youâre going to overspend. $40K a day is extreme, but bad data management has a cost. A chorus of data engineers and data practitioners including people like Chad Sanderson and Lauren Balik remind us of this almost daily.
How you build analytical queries - and specifically how you manage the amount of data that they process to get the answers you want - can have a huge impact on your cloud bill. Itâs critically important that you have robust observability measures to alert you to requests consuming excess data - and the âtaxâ youâre paying to run them.
At Tinybird we always say to work backwards. Build a functional endpoint, then work hard and fast to optimize it so it processes as little data as possible when itâs queried in production.
So here is a real - albeit extreme - example of putting that to work, and how it impacts costs.
If video is more your thing, you can watch my talk on this subject right below:
The data
Consider an example data source: An audit log containing 258M rows and nearly 14 GB of product usage data sorted and indexed by a company_id
. This is a very realistic data set: If my product had 10,000 daily active users generating 100 events per day, Iâd easily hit that size in less than a year.
I want to know how many events of a particular type occurred during a particular time period, grouped by country. Seems pretty simple, right?
Hereâs a diagram of how I might go about that:
I start with the events data, join it to a table containing country data on an author_id
, aggregate, then filter to the timestamps and event types I want. Easy peasy, and hereâs how that looks in a Tinybird Pipe (a chain of nodes containing iterative SQL queries):
But hold on a minute. If you took the time to read through those queries, or if youâre a savvy data engineer, you probably already spotted the problems. This approach is really inefficient. In fact, this query scans 6.67 GB every time itâs called and takes 2.41 s to complete. Yikes.
If I built an application to visualize this result and refreshed it every second, I would spend $40,340.16/day on Tinybirdâs Pro-plan. đł
How do we fix it?
Whether youâre a data engineer or not, there are some fundamental rules you can follow to not only spend significantly less but also improve the performance of your queries. Weâve described them in detail in one of our guides, and Iâll list them here:
- Rule â 1 ⶠThe best data is the data you donât write.
- Rule â 2 ⶠThe second best data is the data you donât read.
- Rule â 3 ⶠSequential reads are 100x faster.
- Rule â 4 ⶠThe less data you process after reading, the better.
- Rule â 5 ⶠDo complex operations later in the processing pipeline.
My original query breaks all five of these rules. Whoops. So hereâs how I fixed it:
Filter first, join later
The first and simplest fix is to apply filters first and do the join later. By doing so, I speed up my query by following rule 1. Instead of writing a massive join to memory, Iâm filtering first, so that subsequent results are smaller.
While that speed boost should deliver a better frontend UX, I still have a problem: The amount of data I read to complete the query hasnât changed. Iâm still spending roughly the equivalent of a 2021 Tesla Model Y every day.
What to do? Letâs take a look at the schema.
Indices and data types
If you peek at the schema, youâll see a few problems.
First, Iâm using expensive String
data types where a LowCardinality(String)
is more appropriate, because columns like event
, device_OS
, and device_browser
have only a limited number of options. We can grab back some memory just by assigning appropriate data types.
Second, my sorting keys are incomplete based on the filters Iâm trying to apply.
Because Iâm filtering on event
and datetime
columns, I should add them to the sorting keys.
By making these changes to the schema of my data source, Iâm following rules 2 and 3: Reading less data - and sequential data - to filter, since now my filters are applied to indexed columns.
Just making these simple changes has massive implications on cost, because now instead of reading every row in the table, the query can target the granules in the table that fit within my event
and datetime
filters.
The resulting query now reads only 76 MB in less than 400 ms, Iâve cut costs by 100X to just $400 a day, and Iâve given the frontend UX a nice boost with a much lower response latency.
Materialized views: the ace up our sleeve
So Iâve just saved $39,600 and itâd be easy to call it a day. But thereâs one more thing I can do to squeeze as much as I can out of this query.
In Tinybird, materialized views can easily be created using Pipes. Instead of terminating a Pipe in an API endpoint, you can materialize the results, and reap the benefits later. Materialized views update on ingestion in Tinybird, so theyâre always up to date.
The best part? Thereâs no processing cost to incrementally update the materialized views, and the storage cost is negligible (in this case less than one-thousandth of a cent).
To improve query performance even more, Iâll materialize the join and aggregation first. Instead of calling the result below through an API endpoint, I materialize it into a new data source called join_data_mv.
Now, I can query that materialized view instead of the original data source:
The final result scans only 333 kB in about 1 ms.
I got the same result as my first query, but with a blazing fast frontend and a bill of just $18/day.
Obviously, compared to the reduction that I got in our third optimization, from $40K to $400, this might seem like only an incremental gain. But this isnât just about what I saved on this query. Itâs about what Iâll save on this query and every query I build after it. By implementing simple best practices, and using materialized views, I can not only justify my value as a data practitioner but also have a real shot at generating a meaningful ROI from this analysis.
Do this enough, and maybe your company will just go ahead and buy you a Tesla to say thanks.
â