Avoiding Full Scans
The less data you read in your queries, the faster they are. There are several strategies you can follow to avoid reading all the data in a Data Source (doing a full scan) from your queries:
Always filter first.
Use indices by setting a proper ENGINE_SORTING_KEY in the Data Source. The column names present in the ENGINE_SORTING_KEY should be the ones you will use for filtering in the WHERE clause. You don't need to sort by all the columns you use for filtering, only by the ones you filter first.
The order of the columns in the ENGINE_SORTING_KEY is important: from left to right, ordered by relevance (the most important ones for filtering) and by cardinality (lower cardinality goes first).
SCHEMA >
    `id` Int64,
    `amount` Int64,
    `date` DateTime

ENGINE "MergeTree"
ENGINE_SORTING_KEY "id, date"
-- Doesn't filter by any ENGINE_SORTING_KEY column: full scan
SELECT * FROM data_source_sorted_by_date WHERE amount > 30
-- Filters by the sorting key columns (id, date) first: avoids a full scan
SELECT * FROM data_source_sorted_by_date WHERE id = 135246 AND date > now() - INTERVAL 3 DAY AND amount > 30
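To illustrate the ordering rule (lower cardinality first), a hypothetical schema whose queries mostly filter by country and then by id could place the low-cardinality column at the front of the sorting key. This is a sketch, not taken from the original example:

```sql
SCHEMA >
    `country` LowCardinality(String),
    `id` Int64,
    `amount` Int64,
    `date` DateTime

ENGINE "MergeTree"
-- country has few distinct values, so it goes before the high-cardinality id
ENGINE_SORTING_KEY "country, id, date"
```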
Avoiding Big Joins
When doing a JOIN, the data in the right Data Source is loaded into memory to perform the JOIN. Therefore, it's recommended to avoid joining big Data Sources by filtering the data in the right Data Source first:
JOINs over tables of more than 1M rows might lead to MEMORY_LIMIT errors when used in Materialized Views, affecting ingestion.
A common pattern to improve JOIN performance is the one below:
-- Loads the whole right Data Source in memory to perform the JOIN
SELECT
    left.id AS id,
    left.date AS day,
    right.response_id AS response_id
FROM left_data_source AS left
INNER JOIN big_right_data_source AS right ON left.id = right.id
-- Pre-filters the right Data Source to the ids present in the left one
SELECT
    left.id AS id,
    left.date AS day,
    right.response_id AS response_id
FROM left_data_source AS left
INNER JOIN
(
    SELECT id, response_id
    FROM big_right_data_source
    WHERE id IN (SELECT id FROM left_data_source)
) AS right ON left.id = right.id
Memory limit reached
Sometimes, you reach the memory limit when running a query. This is usually because:
Many columns are used: try to reduce the number of columns used in the query. When that's not possible, try changing data types or merging some columns.
A CROSS JOIN, or some other operation that generates a lot of rows: this might happen when the CROSS JOIN involves two Data Sources with a large number of rows, so try to rewrite the query to avoid the CROSS JOIN.
GROUP BY: try to filter out rows before executing the GROUP BY.
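For instance, filtering before aggregating reduces the number of rows (and groups) the GROUP BY has to hold in memory. A minimal sketch, assuming a hypothetical events table with a date column:

```sql
-- Aggregates every row in the table: more groups, more memory
SELECT date, count() AS c
FROM events
GROUP BY date

-- Filters first, so the GROUP BY only processes recent rows
SELECT date, count() AS c
FROM events
WHERE date > now() - INTERVAL 7 DAY
GROUP BY date
```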
If you are getting a memory error while populating a Materialized View, the solutions are still the same, but take into account that the populate process is executed in 1M-row chunks (so not that many rows at a time), so if you hit memory limits it is likely because:
There is a JOIN and the right table is large.
There is an ARRAY JOIN with a huge array that makes the number of rows explode.
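To see why an ARRAY JOIN can make the row count explode: each input row is expanded into one output row per array element. A minimal sketch with a synthetic row (not from the original examples):

```sql
-- One input row with a 1,000-element array becomes 1,000 output rows,
-- so a 1M-row chunk with arrays like this expands to 1B rows
SELECT id, tag
FROM (SELECT 1 AS id, range(1000) AS tags)
ARRAY JOIN tags AS tag
```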
To check whether a populate process could break, a good practice is to create a Pipe with the same query as the Materialized View, replacing the source table with a node that gets just 1M rows from it. This would be an example:
NODE materialized
SQL >
    SELECT date, count() AS c
    FROM source_table
    GROUP BY date
NODE limited
SQL >
    SELECT * FROM source_table LIMIT 1000000

NODE materialized
SQL >
    SELECT date, count() AS c
    FROM limited
    GROUP BY date
If the problem persists, reach out to us at firstname.lastname@example.org to see if we can help you improve the query.
Nested Aggregate Functions
It's not possible to nest aggregate functions, or to use the alias of an aggregate function inside another aggregate function:
-- Error: nested aggregate functions are not allowed
SELECT max(avg(number)) AS max_avg_number FROM my_datasource
-- Error: the avg_number alias can't be used inside another aggregate function
SELECT avg(number) AS avg_number, max(avg_number) AS max_avg_number FROM my_datasource
Instead, you should use a subquery:
SELECT avg_number AS number, max_number
FROM
(
    SELECT avg(number) AS avg_number, max(number) AS max_number
    FROM numbers(10)
)
SELECT max(avg_number) AS number
FROM
(
    SELECT avg(number) AS avg_number, max(number) AS max_number
    FROM numbers(10)
)
Merging Aggregate Functions
AggregateFunction types such as avg… precalculate their aggregated values using intermediate states. When you query those columns, you have to add the -Merge combinator to the aggregate function to get the final aggregated results. Read more at Understanding State and Merge combinators for aggregates.
It is recommended to -Merge aggregated states as late in the pipeline as possible; read the best practices for writing faster SQL queries to learn why.
Intermediate states are stored in a binary format, which is why you might see weird characters when selecting columns with the AggregateFunction type, as shown below:
-- 'result' is an AggregateFunction column, so this returns binary intermediate states
SELECT result FROM my_datasource
When selecting columns with the AggregateFunction type, you need to -Merge the intermediate states to get the actual aggregated result for that column. This operation might need to compute several rows, which is why it's advisable to -Merge as late in the pipeline as possible.
-- Getting the 'result' column aggregated using countMerge. Values are UInt64
SELECT countMerge(result) AS result FROM my_datasource
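As a sketch of how the two sides fit together (hypothetical table names; assuming the result column above was produced with the matching -State combinator, countState):

```sql
-- In the Materialized View query: store intermediate states with -State
SELECT date, countState() AS result
FROM source_table
GROUP BY date

-- At query time: collapse the states with the matching -Merge combinator
SELECT date, countMerge(result) AS result
FROM my_datasource
GROUP BY date
```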