Deduplication Strategies

OLAP databases like ClickHouse are optimized for fast ingestion and, for that to work, some trade-offs are made. One of these is the lack of unique data constraints: enforcing them would add significant overhead and make ingestion too slow for what's expected from this kind of database.

However, there will be times when you either want to get rid of duplicated data or just access the most recent data point available. There are several options, but the general workflow is to ingest all of the data first (including duplicates) and deal with the duplicates after ingestion.

The problem

Typically, there are two use cases that can create duplicated data:

  • Upserts. An upsert is an operation that inserts new rows into a database table if they do not already exist, or updates them if they do. With databases like Postgres, this can be accomplished with an INSERT ... ON CONFLICT DO UPDATE clause (see the sketch after this list). Because ClickHouse does not enforce uniqueness of primary keys, it doesn't support clauses like this. The way to do upserts in ClickHouse is with a ReplacingMergeTree engine. More on this later.
  • Constant ingestion and historical data. Imagine you are periodically dumping data from your transactional database to ClickHouse to run analytical queries on it in real time, as described in this Tinybird blog post. You would end up with lots of rows inserted at different times with the same primary key. You may want to get only the latest data point - in this case you can get rid of the rest and treat them like upserts. But there are also use cases where you want to keep all the data to have a historical record of how an object's attributes evolve over time.
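As a quick illustration of the Postgres-style upsert mentioned above (the table and column names here are hypothetical), a single statement inserts a row or updates it if the key already exists:

```sql
-- Hypothetical Postgres table: insert a daily views row,
-- or update it if (post_id, date) already exists
INSERT INTO posts_views (post_id, date, views)
VALUES (1, '2023-01-01', 100)
ON CONFLICT (post_id, date) DO UPDATE SET views = EXCLUDED.views;
```

ClickHouse has no equivalent clause, which is why the strategies below exist.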

A real-world example

Here we will work with a dataset that resembles what a social media analytics company might use to track page views of posts over time. It contains daily data for about 10,000 posts across a full year. It looks like this:
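As a rough, made-up illustration of its shape, each row holds the total views for one post on one day:

```
post_id | date       | views
--------+------------+------
      1 | 2023-01-01 |    12
      1 | 2023-01-02 |    37
      2 | 2023-01-01 |     5
```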

We will use this dataset to explain recommended ways to deduplicate data by post_id, for the two use cases we have described above: upserts and historical data.

First, we will define a Data Source like this to store the data, where the total views for each post are appended every day.
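A minimal sketch of that definition file, assuming the Data Source is named posts_views_historical and the columns are post_id, date, and views:

```
SCHEMA >
    `post_id` Int32,
    `date` Date,
    `views` UInt64

ENGINE "MergeTree"
ENGINE_SORTING_KEY "post_id, date"
```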

Selecting the right sorting keys:

If you're going to do lots of filtering by post_id, it's better to sort first by post_id and then by date to keep the scanned index range as small as possible. Take our 'Principles of Real Time Analytics' free course to learn more about how to define the sorting keys of your indexes.

This Data Source can be created by adding this definition file to your project.

If you are adding this Data Source to a project under version control, you would create this file and add/commit it to a repository branch. You can then populate this Data Source using the UI or with the CLI and the tb datasource append command (along with the tb branch use command to indicate the target Branch).

If you are not using version control, you have the option of using the tb push command to deploy this Data Source to your Workspace directly, and the tb datasource append command to populate it.
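For example (the Branch and file names here are illustrative), the two workflows look roughly like this:

```bash
# Version-controlled workflow: target a Branch, then append data
tb branch use my_dedup_branch
tb datasource append posts_views_historical posts_views.csv

# Without version control: push the definition file, then append data
tb push datasources/posts_views_historical.datasource
tb datasource append posts_views_historical posts_views.csv
```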

The solutions

There are four strategies to deduplicate data:

  • Doing it at query time.
  • Using a ReplacingMergeTree engine.
  • Using Materialized Views.
  • Using snapshots.

These solutions are ordered from simplest to most complex.

The more data you work with, the more likely it is that you will need to use a more complex solution in order to have maximum performance.

Deduplicating at query time

Imagine you are interested only in the latest value of views for each post. In that case, you can deduplicate data on post_id and get the latest value with these strategies:

  • Get the max date for each post in a subquery and then filter by its results.
  • Group data by post_id and use the argMax function.
  • Use the LIMIT BY clause.

Below are example queries for the subquery, argMax, and LIMIT BY approaches.
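These are sketches of the three approaches, assuming the Data Source and columns defined above (posts_views_historical with post_id, date, and views):

```sql
-- Subquery: get the max date per post, then filter on it
SELECT post_id, date, views
FROM posts_views_historical
WHERE (post_id, date) IN (
    SELECT post_id, max(date)
    FROM posts_views_historical
    GROUP BY post_id
)
```

```sql
-- argMax: group by post_id and pick the views value at the latest date
SELECT
    post_id,
    max(date) AS date,
    argMax(views, date) AS views
FROM posts_views_historical
GROUP BY post_id
```

```sql
-- LIMIT BY: sort by date descending and keep one row per post_id
SELECT post_id, date, views
FROM posts_views_historical
ORDER BY post_id, date DESC
LIMIT 1 BY post_id
```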

Depending on your data and how you define the sorting keys of your Data Sources, one approach will be faster than the others.

In general, deduplicating at query time will be fine if the size of your data is small. But if you have lots of data, the best option to make your query faster will be using a Materialized View to pre-compute the latest value of views for each post_id. We will show you how to do that in the 'Using Materialized Views' section below.

Doing upserts

If you have lots of data and you only care about the latest insertion for each unique key, you can use a ReplacingMergeTree engine. To use it, you need to set two engine options: engine_sorting_key and engine_ver.

  • Rows with the same engine_sorting_key will be deduplicated. You can select one or more columns.
  • engine_ver can be omitted and in that case the last inserted row for each unique engine_sorting_key will be kept. If you specify it (the type of engine_ver has to be UInt*, Date or DateTime), the row with the highest engine_ver for each unique engine_sorting_key will be kept.

Define a Data Source like this:
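A minimal sketch, assuming a new Data Source named posts_views_rmt (the name is illustrative) that keeps the row with the latest date for each post_id:

```
SCHEMA >
    `post_id` Int32,
    `date` Date,
    `views` UInt64

ENGINE "ReplacingMergeTree"
ENGINE_SORTING_KEY "post_id"
ENGINE_VER "date"
```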

You can create this new Data Source by adding this new definition file to your project, using the same methods described above.

An important note from the docs:

"Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can not plan for it. Some of the data may remain unprocessed. Thus, ReplacingMergeTree is suitable for clearing out duplicate data in the background in order to save space, but it does not guarantee the absence of duplicates."

Frequency of merges

Merging happens in the background, most likely every 9-10 minutes, but if ClickHouse considers that you do not have enough data it will not happen that quickly. So doing SELECT * FROM ... will most likely give you duplicated rows (even when using a ReplacingMergeTree engine) after an insertion, and you'll need to use some other strategy to select only the last value at query time.

Note also the query below, which uses the FINAL modifier. It merges the data at query time and is not generally recommended because it usually makes queries slower.
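For example, against the hypothetical posts_views_rmt Data Source sketched above:

```sql
-- FINAL forces ClickHouse to merge (and therefore deduplicate) at query time
SELECT post_id, date, views
FROM posts_views_rmt FINAL
```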

By the time you read this, most likely the engine will have already performed the deduplication and the above queries will be very fast. Until that happens, their performance will be the same as if you were using a MergeTree engine with all the duplicated data, as in the first queries. After inserting data, the performance of the query using FINAL will be similar to the one using LIMIT BY, and the one using the subquery will be 5-10X faster, based on our tests.

Using Materialized Views

Materialized Views (MVs) are a powerful tool for managing duplicate data. Beyond deduplication use cases, MVs are recommended for the following situations:

  • You are working with high volumes and velocities of data and your data consumers often need data aggregated into daily, hourly, or other time-series resolutions.
  • You have queries that are making full-table scans. Materialized Views process and transform data as it arrives, not at query time. This results in much less data getting processed per query, and helps deliver millisecond query performance.

To create an MV you need:

  • A 'landing' Data Source, where you ingest the data with duplicates: posts_views_historical.datasource, in this case.
  • A transformation Pipe that reads from the 'landing' Data Source and writes to a new Data Source. This Pipe applies filtering and aggregation as data arrives.
  • A destination Data Source that stores the transformed data.

Here is the definition file for the transformation Pipe that 'connects' both Data Sources:
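Here is a sketch of what that Pipe could look like, assuming the landing Data Source and columns used earlier (the node name is illustrative). It stores intermediate aggregate states with the -State combinators:

```
NODE latest_views
SQL >
    SELECT
        post_id,
        maxState(date) AS date,
        argMaxState(views, date) AS views
    FROM posts_views_historical
    GROUP BY post_id

TYPE materialized
DATASOURCE posts_views_latest_agg
```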

Here is the posts_views_latest_agg.datasource definition file, the destination Data Source:
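A matching sketch for the destination Data Source, storing the aggregate states so they can be merged at query time:

```
SCHEMA >
    `post_id` Int32,
    `date` AggregateFunction(max, Date),
    `views` AggregateFunction(argMax, UInt64, Date)

ENGINE "AggregatingMergeTree"
ENGINE_SORTING_KEY "post_id"
```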

As above, you can create these new MV resources in the UI or with definition files. To learn more about defining MVs with Tinybird, check out our docs.

Finally, this query computes the latest results by selecting from the MV-based posts_views_latest_agg Data Source instead of the landing Data Source, which has duplicate data.
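With the sketch above, that query would use the -Merge combinators to collapse the aggregate states:

```sql
SELECT
    post_id,
    maxMerge(date) AS date,
    argMaxMerge(views) AS views
FROM posts_views_latest_agg
GROUP BY post_id
```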

Snapshots

Using an AggregatingMergeTree works fine when you can filter the data and therefore don't need to read much information. But this is not always possible: there are situations where you need to run queries over large amounts of data that need to be deduplicated.

Imagine that we want to sync our MySQL database with Tinybird in order to speed up the analytical queries and generate some endpoints for our back office. We want to be as real-time as possible, so we follow a change data capture (CDC) approach. To simplify, the 'syncer' displayed below will generate CSVs and push them to Tinybird, but we could be using a Debezium + Kafka approach as well.

Data flow from MySQL to Tinybird

Once our syncer is integrated, let's take a look at one of the tables we want to sync: the users table.
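As an illustration (the actual columns of the users table aren't shown here, so these are assumptions):

```sql
-- Hypothetical MySQL table to be synced into Tinybird
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255),
    updated_at DATETIME
);
```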

In order to generate the same information in Tinybird, we are going to need to create two Data Sources: users.datasource and users_landing.datasource.

  • users_landing.datasource, where we append all the changes that happen in the MySQL table. We will create a row for each insert, update, or delete that our MySQL table generates.
  • users.datasource, where we will store every snapshot we take. We will generate a new snapshot by composing all the changes and merging them with the previous snapshot.

The users_landing.datasource will have the same fields that we have in our MySQL table with two extra columns:

  • operation will indicate what kind of operation happened to the row (INSERT, UPDATE, DELETE).
  • inserted_at will indicate when the row was inserted into the Data Source. We will use this column to know whether it is already inside a snapshot or not.
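A sketch of users_landing.datasource, assuming the hypothetical MySQL columns above plus the two extra columns:

```
SCHEMA >
    `id` Int32,
    `name` String,
    `email` String,
    `updated_at` DateTime,
    `operation` LowCardinality(String),
    `inserted_at` DateTime

ENGINE "MergeTree"
ENGINE_SORTING_KEY "inserted_at, id"
```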

Then, our users.datasource will have the same fields that we have in our MySQL table with an extra column, snapshot_time. In this Data Source we will store every snapshot, so this column will help us identify the latest snapshot we want to use.
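And a sketch of users.datasource, again assuming the same hypothetical columns:

```
SCHEMA >
    `id` Int32,
    `name` String,
    `email` String,
    `updated_at` DateTime,
    `snapshot_time` DateTime

ENGINE "MergeTree"
ENGINE_SORTING_KEY "snapshot_time, id"
```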

Now that we have both Data Sources created, we need a Materialized View (MV) to connect them.
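The core of that transformation is composing the latest state of each user from the change log. A sketch of that logic (the full Pipe would also merge it with the previous snapshot) could be:

```sql
-- Latest state per user, dropping users whose last operation was a DELETE
SELECT
    id,
    argMax(name, inserted_at) AS name,
    argMax(email, inserted_at) AS email,
    argMax(updated_at, inserted_at) AS updated_at,
    now() AS snapshot_time
FROM users_landing
GROUP BY id
HAVING argMax(operation, inserted_at) != 'DELETE'
```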

To learn more about defining MVs with Tinybird, check out our docs.
