Speeding up slow joins on ClickHouse with the Join engine

When you have hundreds of millions or billions of rows, joins might be slow even on ClickHouse. Here we'll show you how to speed up joins over large amounts of data, making them up to 100x faster, using the Join table engine.

It is very common to model your data with a star schema for analytics. In it, you will have facts and dimensions related to each other. Typically, fact tables are much larger than dimension tables, and you will have more of the latter.

An example of the star schema. Source: Wikipedia

If you followed the Ingesting data guide, you'll have these two Data Sources in your account:

  • events, with 100M rows. This would be the fact table in the star schema. Here, we'll actually use the events_mat_cols table, created in the Materialized columns guide, that contains a few extra materialized columns so that less data has to be calculated on query time.
  • products, with ~2M rows. This would be a dimension table.

This is what the data in the events_mat_cols Data Source looks like:

And this is what the products Data Source looks like:

The problem: joining big MergeTree tables is too slow

At some point, you'll want to join different fact and dimension tables. In our case, you'll want to join the events (or events_mat_cols) and products Data Sources.

If you're used to OLTP databases like Postgres, the natural way to do it would be with a regular ``JOIN`` query (ClickHouse supports joins, and the syntax is very similar to the SQL standard).
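The original query isn't reproduced here, but a representative version might look like the sketch below. The column names (``sku`` as the join key, ``section`` as a product attribute) are assumptions for illustration; adjust them to your actual schema.

```sql
-- Plain MergeTree-to-MergeTree join (slow at this scale).
-- `sku` and `section` are assumed column names.
SELECT
    p.section AS section,
    count() AS views
FROM events_mat_cols AS e
LEFT JOIN products AS p ON e.sku = p.sku
GROUP BY section
ORDER BY views DESC
```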

The problem is that when you have hundreds of millions of rows or more, joins might not give you the performance that real-time use cases require. As the timing above shows, this ``JOIN`` query takes ~2s to return a result.

The solution: using the Join engine

ClickHouse has a Join engine, designed for exactly this problem: making joins faster.

To use it, we have to:

  1. Create a new Data Source with a Join engine for each dimension Data Source we want to join with fact Data Sources.
  2. Create a Materialized View to populate each Join Data Source.
  3. Rewrite the queries where the data is joined, using either the ``JOIN`` clause or ``joinGet``.

{% tip-box title="Join Data Sources are always stored in RAM" %}Join Data Sources will behave in a similar way to a hash map stored in RAM, where the keys are the hashed values of the join keys. As they are in RAM, these dimension tables shouldn't have more than hundreds of thousands of rows, or a few million. If you need to create bigger Join Data Sources than that, please contact us.{% tip-box-end %}

There are a few parameters you need to specify when creating a Join Data Source:

  • ``ENGINE_KEY_COLUMNS``: The column or columns that will be used for the join operation.
  • ``ENGINE_JOIN_TYPE``: Can be any of these values: ``INNER|LEFT|RIGHT|FULL|CROSS``. You'll typically use ``LEFT``.
  • ``ENGINE_JOIN_STRICTNESS``: Can be any of these values: ``ANY|ALL|SEMI|ANTI|ASOF``. You probably want ``ANY``: it keeps only the first value seen for each key.

Create a new Join Data Source

It can have the same columns as the original dimension Data Source, or fewer. Remember that Join engine tables keep all their data in RAM, so if you're not going to use all the columns, it's a good idea to include only the ones you need in the new Join Data Source.

Big Join Data Sources can degrade your experience. In general, Join Data Sources that take more than a few hundred MBs on disk are not advised.

We'll use all the columns in our case because the products table doesn't have many.

We're going to use our CLI. If you haven't yet, after running ``tb auth``, run ``tb init`` to create the folder structure in the directory you're at to keep your Pipes and Data Sources organized. Then define a new Data Source like this in the ``datasources`` folder:
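A sketch of what that datafile could look like, assuming ``sku`` is the join key and ``section`` and ``price`` are columns in the ``products`` schema (substitute your real columns):

```
# datasources/products_join_sku.datasource
# Column names below are assumptions for illustration.

SCHEMA >
    `sku` String,
    `section` String,
    `price` Float32

ENGINE "Join"
ENGINE_JOIN_STRICTNESS "ANY"
ENGINE_JOIN_TYPE "LEFT"
ENGINE_KEY_COLUMNS "sku"
```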

Create a Pipe to populate the Join Data Source

Create a new file in your ``pipes`` folder like this. It will read data from the products Data Source (which uses a ``MergeTree`` engine) and populate the products_join_sku Data Source (which uses a ``Join`` engine). Joining against a Data Source that uses the Join engine will be much faster.
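A sketch of the Pipe datafile (node and file names are assumptions):

```
# pipes/populate_products_join_sku.pipe

NODE populate_join
SQL >
    SELECT * FROM products

TYPE materialized
DATASOURCE products_join_sku
```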

Then push and populate the Data Source and the Pipe in your account by running this:
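Something along these lines, assuming the file names used above (``--populate`` backfills the Join Data Source with the rows already in ``products``):

```shell
tb push datasources/products_join_sku.datasource
tb push pipes/populate_products_join_sku.pipe --populate
```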

Rewrite the original query

You can do it using the ``JOIN`` clause, as follows:
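For example, reusing the assumed ``sku`` and ``section`` columns from before, and now joining against the Join Data Source:

```sql
-- Same query, but against the Join engine Data Source.
-- Note the explicit ANY LEFT, matching the engine settings.
SELECT
    p.section AS section,
    count() AS views
FROM events_mat_cols AS e
ANY LEFT JOIN products_join_sku AS p ON e.sku = p.sku
GROUP BY section
ORDER BY views DESC
```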

You'll have to explicitly add to the query the same join strictness (``ANY``) and type (``LEFT``) that you used to create the Data Source, or you'll get an error.

Another, even more performant option (2 to 10x faster than using the ``JOIN`` clause) is using ``joinGet`` to fetch only specific columns from the Join table.
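With ``joinGet(table, value_column, join_key)``, the lookup happens per row against the in-memory hash table, with no ``JOIN`` clause at all. Again, ``sku`` and ``section`` are assumed column names:

```sql
-- joinGet('<join_table>', '<value_column>', <key>) looks up a single
-- column from the Join Data Source for each row of events_mat_cols.
SELECT
    joinGet('products_join_sku', 'section', sku) AS section,
    count() AS views
FROM events_mat_cols
GROUP BY section
ORDER BY views DESC
```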

With these changes, we turned a query that originally took ~2s into one that takes ~10-20ms. That's up to a 200x performance improvement from just a few changes.