Avoiding Full Scans

The less data you read in your queries, the faster they are. There are several strategies you can follow to avoid reading all the data in a Data Source (doing a full scan):

  • Always filter first

  • Use indices by setting a proper ENGINE_SORTING_KEY in the Data Source.

  • The column names present in the ENGINE_SORTING_KEY should be the ones you use for filtering in the WHERE clause. You don’t need to sort by every column you filter on, only by the ones you filter on first.

  • The order of the columns in the ENGINE_SORTING_KEY matters: from left to right, order them by relevance (the more important a column is for filtering, the earlier it goes) and by cardinality (lower cardinality goes first).

For example, given the following Data Source:

Data Source: data_source_sorted_by_date
SCHEMA >
  `id` Int64,
  `amount` Int64,
  `date` DateTime

ENGINE "MergeTree"
ENGINE_SORTING_KEY "id, date"
BAD: Not filtering by any column present in the ENGINE_SORTING_KEY
SELECT *
FROM data_source_sorted_by_date
WHERE amount > 30
GOOD: Filtering first by columns present in the ENGINE_SORTING_KEY
SELECT *
FROM data_source_sorted_by_date
WHERE
  id = 135246 AND
  date > now() - INTERVAL 3 DAY AND
  amount > 30
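
As a sketch of the cardinality rule above, consider a hypothetical Data Source (the name and columns are illustrative) where a low-cardinality country column goes before the higher-cardinality user_id and timestamp columns in the sorting key:

Data Source: events_sorted_by_cardinality
SCHEMA >
  `country` LowCardinality(String),
  `user_id` Int64,
  `timestamp` DateTime

ENGINE "MergeTree"
ENGINE_SORTING_KEY "country, user_id, timestamp"

With a key like this, queries that filter by country, or by country and user_id, can skip most of the data.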

Avoiding Big Joins

When doing a JOIN, the data in the right Data Source is loaded into memory to perform the JOIN. Therefore, it’s recommended to avoid joining big Data Sources, filtering the data in the right Data Source first whenever possible.

A common pattern to improve JOIN performance is prefiltering the right Data Source with the keys present in the left one:

BAD: Doing a JOIN with a Data Source with too many rows
SELECT
    left.id AS id,
    left.date AS day,
    right.response_id AS response_id
FROM left_data_source AS left
INNER JOIN big_right_data_source AS right ON left.id = right.id
GOOD: Prefilter the joined Data Source for better performance
SELECT
    left.id AS id,
    left.date AS day,
    right.response_id AS response_id
FROM left_data_source AS left
INNER JOIN (
  SELECT id, response_id
  FROM big_right_data_source
  WHERE id IN (SELECT id FROM left_data_source)
) AS right ON left.id = right.id

Memory limit reached

Sometimes, you reach the memory limit when running a query. This is usually because:

  • A lot of columns are used: try to reduce the number of columns used in the query. When that’s not possible, consider changing data types or merging some columns.

  • A CROSS JOIN or another operation that generates a huge number of rows: this can happen when the CROSS JOIN involves two Data Sources with a large number of rows, so try to rewrite the query to avoid it; see the sketches after this list.

  • A massive GROUP BY: try to filter out rows before executing the GROUP BY, as sketched below.
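
The following sketches show both rewrites, assuming hypothetical events and users Data Sources (the names and columns are illustrative):

BAD: CROSS JOIN that generates the full cross product before filtering
SELECT events.id, users.name
FROM events
CROSS JOIN users
WHERE events.user_id = users.id
GOOD: The same result as an INNER JOIN, without generating the cross product
SELECT events.id, users.name
FROM events
INNER JOIN users ON events.user_id = users.id
BAD: GROUP BY over the whole Data Source
SELECT user_id, count() as visits
FROM events
GROUP BY user_id
GOOD: Filtering rows before the GROUP BY
SELECT user_id, count() as visits
FROM events
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY user_id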

If the problem persists, reach out to us at support@tinybird.co to see if we can help you improve the query.

Nested Aggregate Functions

It’s not possible to nest aggregate functions, nor to reference the alias of an aggregate function from inside another aggregate function.

BAD: Error when nesting aggregate functions
SELECT max(avg(number)) as max_avg_number FROM my_datasource
BAD: Error when using an aggregate function’s alias inside another aggregate function
SELECT avg(number) as avg_number, max(avg_number) as max_avg_number FROM my_datasource

Instead, you should use a subquery:

GOOD: Using aggregate functions in a subquery
SELECT
  avg_number as number,
  max_number
FROM (
  SELECT
    avg(number) as avg_number,
    max(number) as max_number
  FROM numbers(10)
)
GOOD: Nesting aggregate functions using a subquery
SELECT
  max(avg_number) as number
FROM (
  SELECT
    avg(number) as avg_number,
    max(number) as max_number
  FROM numbers(10)
)

Merging Aggregate Functions

Columns with AggregateFunction types, such as AggregateFunction(count) or AggregateFunction(avg), store precalculated aggregated values as intermediate states. When you query those columns, you have to add the -Merge combinator to the aggregate function to get the final aggregated result. Read more at Understanding State and Merge combinators for aggregates.

It’s recommended to -Merge aggregated states as late in the pipeline as possible; read the best practices for writing faster SQL queries to learn why.
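
For example, a minimal sketch assuming a hypothetical sales_by_day Materialized View with a sales column of type AggregateFunction(count): filter on the sorting key first, and apply -Merge only in the final aggregation.

GOOD: Filtering first and merging states at the end
SELECT
  day,
  countMerge(sales) as total_sales
FROM sales_by_day
WHERE day > today() - 7
GROUP BY day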

Intermediate states are stored in a binary format, which is why you might see strange characters when selecting columns with an AggregateFunction type, as shown below:

Getting ‘result’ as an aggregate function
SELECT result FROM my_datasource

result (AggregateFunction(count))
---------------------------------
@33M@
�o�@

When selecting columns with an AggregateFunction type, you need to -Merge the intermediate states to get the actual aggregated result for that column. This operation may have to combine states from many rows, which is why it’s advisable to -Merge as late in the pipeline as possible.

Getting ‘result’ as UInt64
-- Getting the 'result' column aggregated using countMerge. Values are UInt64
SELECT countMerge(result) as result FROM my_datasource

result (UInt64)
---------------
1646597