🚀Live Coding Session: Simple Real-Time Analytics with KafkaRegister Now
May 08, 2023

The 5 rules for writing faster SQL queries

If you're building real-time analytics, you need your SQL to be fast. Fast SQL queries improve performance and reduce cost. Here are the 5 rules to follow to write the fastest SQL queries of your life.
Aitana Azcona
Data Engineer

When you’re trying to process huge amounts of data in real-time, there are some best practices to follow. These are hard-learned lessons from years of building real-time systems at huge scale. Now, we're sharing them with you.

The 5 Rules of Fast Queries

At Tinybird, we want our users to be able to go fast. Both when building queries, and then when those queries run. So, we've created a cheatsheet of rules you can consult whenever you need to go fast.

We like to call them The 5 Rules of Fast Queries:

  • Rule № 1 ⟶ The best data is the data you don’t write.
  • Rule № 2 ⟶ The second best data is the data you don’t read.
  • Rule № 3 ⟶ Sequential reads are 100x faster.
  • Rule № 4 ⟶ The less data you process (after read), the better.
  • Rule № 5 ⟶ Move complex operations later in the processing pipeline.

Let’s go one by one, analyzing the performance improvement after the implementation of each rule using data stored in Tinybird. We will use the well-known NYC Taxi Trip dataset. You can get a sample here and import it directly into Tinybird by creating a new Data Source from your dashboard.

Rule 1: The best data is the data you don’t write

This rule seems obvious, but it’s not always followed. There is no reason to save data that you don’t need: it will impact the memory needed (and the money!) and the queries will take more time, so it only has disadvantages.

Rule 2: The second best data is the data you don’t read

To avoid reading data that you don’t need, you should apply filters as soon as possible.

Filtering is the fastest operation, so you should always filter first.

Using the NYC taxi data, let’s suppose we want a list of the trips whose distance is greater than 10 miles and that took place between ‘2017-01-31 14:00:00’ and ‘2017-01-31 15:00:00’. Additionally, we want to get those trips ordered by date.

Let’s see the difference in performance when we apply the filters at the end of our query execution versus the beginning.

First, let’s start the first approach by ordering all the data by date:

Once the data is sorted, we filter it:

In Tinybird, you can run multiple nodes of SQL and query over prior nodes. This node runs over the filtered SQL from the previous query above.

This first approach takes around 30-60 ms, if you add the processed time of both nodes.

Pay attention to the statistics: The first node scanned 139.26k rows processed 10.31 MB of data. The second node scanned 24.58k rows and 1.82 MB of data. Why would we scan 139.26k rows in the first place if we just really need to scan 24.58k?

It’s important to be aware that these two values directly impact the query execution time and also affect other queries you may be running at the same time. IO bandwidth is also something you need to keep in mind.

Now, let’s see what happens if the filter is applied before the sorting:

As you can see, if the filter is applied before the sorting, the query takes < 10 ms. If you take a look at the size of the data read, it’s 1.82MB, while the number of rows read is 24.58k. Compared to the previous approach, these are much smaller and more efficient.

This significant difference happens because in the first approach, we are sorting all the data available (even the data that we don’t need for our query) while in the second approach, we are sorting just the rows we need.

Filtering is the fastest operation, so always filter first.

Rule 3: Sequential reads are 100x faster

To be able to carry out sequential reads, it’s essential to define indexes correctly. These indexes should be defined based on the queries that we're going to perform. In this example, we'll simulate indexing data by sorting our tables before we query them to see how filtering by non-indexed columns affects performance.

Always index the columns by which you plan to filter. Sequential reads can be up to 100x faster when working with large amounts of data.

For example, if we want to query the data and filter by tpep_pickup_time, let’s compare what happens when the data is sorted by tpep_pickup_time versus when it’s sorted by any other column.

In the first approach, we will sort the data by another column, for instance, passenger_count:

Once we have the data sorted by passenger_count, we filter it by tpep_pickup_time:

This node filters by date over data that hasn't been sorted by date. The results are suboptimal.

This approach takes around 5-10 ms, the number of scanned rows is 26.73k and the size of data is 1.98MB.

For the second approach, we'll sort the data by tpep_pickup_time:

And once it’s sorted by date, we filter it:

This node filters by date over data that has been sorted by date. The query scans much less data and responds much faster.

We can see that if the data is sorted by tpep_pickup_time and the query uses tpep_pickup_time for filtering, it just takes 1-2 ms, scans only 10.35k rows, and processes only 765.53 KB. The first approach, filtering by another column, takes around 5-10 ms, scans 26.73k rows, and processes 1.98 MB.

It’s important to highlight that the more data we have, the greater the difference between both approaches. When dealing with tons of data, sequential reads can be up 100x faster or more.

Therefore, it’s essential to define the indexes taking into account the queries that will be made.

Rule 4: The less data you process (after read), the better

The less data you process, the faster and cheaper your queries will be. So, if you just need two columns, only retrieve two columns in your SELECT.

Let’s suppose that for our use case, we just need three columns: vendorid, tpep_pickup_datetime, and trip_distance.

Don't run query operations over columns you don't need.

Let’s analyze the difference between selecting all the columns versus just the ones we need.

When we select all the columns, the query takes around 140-180 ms and processes 718.55 MB of data:

However, when we select just the columns we need, the query only takes around 35-60 ms and process ~20% of the data:

As we mentioned before, you can check how the size of scanned data is much less, now just 155.36MB. With analytical databases, if you do not need to retrieve a column, those files are not read and it is much more efficient.

Therefore, you should only process the data that you need.

Rule 5: Move complex operations later in the processing pipeline

Complex operations, such as joins or aggregations, should be performed as late as possible in the processing pipeline. This is because in the first steps you should filter all the data, so the number of rows at the end will be less than at the beginning and, therefore, the cost of executing complex operations will be lower.

So first, let’s aggregate the data:

Now, let’s apply the filter:

This query filters over aggregated data. The results are suboptimal.

If the aggregations are performing before filtering, the query takes around 50-70 ms in total, and it scans 9.71 million rows and processes 77.68 MB.

In general, do SQL operations in this order: 1. Filter - 2. JOIN - 3. Aggregate.

Let’s see what happens if we filter first and then aggregate:

This approach takes only 20-40 ms even though the number of scanned rows and the size of data is the same as in the previous approach.

Therefore, you should perform complex operations as late as possible in the processing pipeline.

Some additional guidance

In addition to these 5 rules, here’s some more general advice for optimal queries in Tinybird:

Avoid Full Scans

The less data you read in your queries, the faster they are. There are different strategies you can follow in Tinybird to avoid reading all the data in a data source (doing a full scan) from your queries:

  • Always filter first
  • Use indices by setting a proper ENGINE_SORTING_KEY in the Data Source.
  • The column names present in the ENGINE_SORTING_KEY should be the ones you will use for filtering in the WHERE clause. You don’t need to sort by all the columns you use for filtering, only the ones to filter first.
  • The order of the columns in the ENGINE_SORTING_KEY is important: from left to right ordered by relevance (the more important ones for filtering) and cardinality (less cardinality goes first).

Avoid Big Joins

When doing a JOIN in Tinybird, the data in the right Data Source is loaded in memory to perform the JOIN. Therefore, it’s recommended to avoid joining big Data Sources by filtering the data in the right Data Source.

JOINs over tables of more than 1M rows might lead to MEMORY_LIMIT errors when used in Materialized Views, affecting ingestion.

A common pattern to improve JOIN performance is the one below:

New to Tinybird?

Tinybird is a real-time data platform for developers and data teams. Ingest data from anywhere, query it with SQL, and publish your (optimized) queries as low-latency APIs in a click. You can sign up for free with no time limit and no credit card required.

Do you like this post?
Build fast data products, faster.
Try Tinybird and bring your data sources together and enable engineers to build with data in minutes. No credit card required, free to get started.
Need more? Contact sales for Enterprise support.