Backfill strategies

Backfilling is the process of copying historical data into a Data Source that doesn't have it yet. Whether you're changing data types, changing the sorting key, or redefining whole views, at some point you may need to run a backfill from the previous version of your Data Source or Materialized View to the new one.

In this guide, you'll learn about the different strategies for running a backfill when iterating a Data Source or Materialized View, and the challenges of doing it with real-time data.

Before you start iterating and making critical changes to your Data Sources, Materialized Views, and Pipes, it's crucial to read the Deployment Strategies docs. You should also be familiar with the concept of a Tinybird Release.

Before covering the different strategies to backfill data in Tinybird, let's first understand why this task can be challenging.

The challenge of backfilling real-time data

Iterating Data Sources or Materialized Views often requires a careful approach to backfilling. This becomes especially critical under the Fork Downstream strategy, which creates a new, empty Data Source or Materialized View.

The main challenge lies in migrating historical data while continuously ingesting new real-time data. See the detailed explanation in the "Master Materialized Views" guide.

Use Case

Imagine we have the following Data Source deployed in our Live Release 0.0.2:

Analytics events schema
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"

We want to modify the sorting key from timestamp to action, timestamp. This change requires us to bump the release to 1.0.0 in .tinyenv, as it's a breaking change. You can read more about it in our Deployment Strategies docs.

When we deploy the new Release, as we have modified this Data Source, we will have a new Data Source in the new Preview Release 1.0.0. This means that we will have two Data Sources with the same name, one in the Live Release 0.0.2 and one in the Preview Release 1.0.0.

So, how can we sync the data between the two Data Sources? This is where a backfill comes in.

How to move data in Tinybird

Reminder: "Running a backfill" just means copying all the data from one Data Source to another Data Source. There are different ways to move data in Tinybird:

Using Copy Pipes

A Copy Pipe is a Pipe that copies data from one Data Source to another. This method is useful for one-off moves or scheduled executions (for example, every day at 00:00), but it's not recommended if you want to keep the data in sync between two Data Sources.
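If you wanted a scheduled copy rather than a one-off run, the schedule can live in the Pipe definition itself. A minimal sketch, assuming the cron-style COPY_SCHEDULE setting from the Copy Pipes docs (the Pipe and target names here are hypothetical):

scheduled_copy.pipe file
NODE node
SQL >
   SELECT *
   FROM analytics_events

TYPE COPY
TARGET_DATASOURCE analytics_events_daily
COPY_SCHEDULE 0 0 * * *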

In the context of a backfill, we could use the following Pipe to copy the data from the Live Release 0.0.2 to the Preview Release 1.0.0 Data Source. (Later, we will explain why we need the timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}} condition).

backfill_data.pipe file
NODE node
SQL > 
   %
   SELECT * 
   FROM v0_0_2.analytics_events
   WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}

TYPE COPY
TARGET_DATASOURCE analytics_events

To read data from other Releases, you must use the v<release> prefix. In this case, we are reading data from Release 0.0.2, so we use v0_0_2. If we had a Release 0.0.2-1, we would use v0_0_2_1.

Once deployed, you would need to run the following command to execute the copy:

Command to run the copy pipe with the backfill_timestamp parameter
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-01-31 00:00:00'

You can read more about it in our Copy Pipes docs.

Using Materialized Views

A Materialized View is a Pipe that materializes data from one Data Source into another as new data is ingested.

This method is useful for keeping the data in sync between two Data Sources.

sync_data.pipe file
NODE node
SQL > 
   %
   SELECT * 
   FROM v0_0_2.analytics_events
   WHERE timestamp > '2024-01-31 00:00:00'

TYPE materialized
DATASOURCE analytics_events

By default, a Materialized View will only materialize the new incoming data; it won't process the old data.

You can force it with the tb pipe populate CLI command, but be careful: as explained in the previous section, this can lead to duplicated or lost data.
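For reference, a minimal populate invocation for the Pipe above could look like this:

Force a populate from the CLI
tb pipe populate sync_data --node node --wait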

By combining both methods, we can keep the Preview Release 1.0.0 Data Source in sync with the Live Release 0.0.2 Data Source while backfilling the historical data.

Scenarios for backfill strategies

Depending on your use case and ingestion pattern, we recommend different strategies to backfill data in Tinybird. The complexity of this migration depends on several factors, notably the presence of streaming ingestion. The most common scenarios are:

We are actively improving this workflow in Tinybird. Reach out to Tinybird support (support@tinybird.co) if you have any questions.

  • Scenario 1: I'm not in Production.
  • Scenario 2: Full replacement every few hours.
  • Scenario 3: Streaming ingestion WITH an incremental timestamp column.
  • Scenario 4: Streaming ingestion WITHOUT an incremental timestamp column.

Scenario 1: I'm not in Production

If you are not in production or the data from that Data Source is not being used and you can accept losing data, you can simply promote the Preview Release to Live Release.
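For example, using the CLI:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0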

Once you start appending data, you will see it arrive in the new Data Source.

Scenario 2: Full replacement every few hours

If you are running a full replacement every few hours, you can create a Materialized View between the Live Release and the Preview Release Data Source.

To sync the data between the two releases, we will use a Materialized Pipe (MV) that will materialize the data from the previous release to the new release. Something like this:

Materialize data from previous release to new release
NODE migration_node
SQL > 
    SELECT * 
    FROM v0_0_2.analytics_events

TYPE materialized
DATASOURCE analytics_events

You would deploy this new Pipe along with the modified Data Source. Once the new release is deployed, we will have a Preview Release 1.0.0 with the new Data Source and a Pipe that materializes the data from the Live Release 0.0.2.

At this point, you would just need to wait until the next full replacement is executed; the new Data Source will then have all the data, as it's connected to the Live Release.

Now, you can promote the Preview Release to Live Release.
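As elsewhere in this guide, you can do it from the CLI:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0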

Scenario 3: Streaming ingestion WITH an incremental timestamp column

If you have streaming ingestion through the Events API at a high ingest rate, you can use the following strategy to avoid the backfilling challenge with real-time data.

To use this strategy successfully, your Data Source must have an incremental timestamp column with a consistent timezone. In our example, that's the timestamp column.

First, we will create a new Pipe that will materialize the data from the Live Release 0.0.2 to the Preview Release 1.0.0 Data Source, but we will filter by a future timestamp. For example, if we are deploying the new release on 2024-02-02 13:00:00, we can use timestamp > '2024-02-02 13:30:00'.

sync_data.pipe file
NODE node
SQL >
   SELECT * 
   FROM v0_0_2.analytics_events
   WHERE timestamp > '2024-02-02 13:30:00'

TYPE materialized
DATASOURCE analytics_events

We are using the timestamp > '2024-02-02 13:30:00' condition so that only data newer than that timestamp is materialized.

Then, we will create a Copy Pipe with the same SQL statement, but instead of filtering by a specific future timestamp, we will use two parameters to filter by a timestamp range.

This allows us to have better control of the backfilling process. For example, if we are moving very large amounts of data, we can split the backfilling process into different batches to avoid overloading the system.

backfill_data.pipe file
NODE node
SQL > 
   %
   SELECT * 
   FROM v0_0_2.analytics_events
   WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}

TYPE COPY
TARGET_DATASOURCE analytics_events

Once we have these changes in our code, we will create a Pull Request and the CI Workflow will generate a new Branch with the new Preview Release 1.0.0.

CI Workflow

Once the CI Workflow has finished successfully, a new Branch will be created with the new Preview Release 1.0.0.

If you go to the Branch in the UI, you won't see these changes. This is because you are looking at the Live Release of the Branch, not the Preview Release we have just created.

This is a current UI limitation that affects only Branches, and we expect to fix it soon 🙏

For the following steps, we will use the CLI. If you don't have it installed, you can follow the CLI installation docs.

First, authenticate in the Branch, either by copying the token from the Branch or by switching to the Branch from the Live Release.

Authenticate in the Branch
# You can use `tb auth -i` to authenticate in the branch
tb auth -i 

# Or you can switch to the branch if you are already authenticated
tb branch ls 

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH> 

You will be able to list all the Releases deployed to the Branch by running the following command:

List all the releases
tb release ls

Also, you can query the new Preview Release of your Branch by running the following:

Query the preview release
tb --semver 1.0.0 sql "SELECT count() FROM analytics_events"

At the moment, we need to run the Copy Pipe outside of the CI Workflow. We are working on a new feature that will allow you to run the backfill process automatically from the CI/CD Workflow, but for now, you need to run it manually.

As we do not have continuous ingestion in the Branch, we don't need to wait for the future filter timestamp as we will in the Live Release. Instead, we can run the Copy Pipe directly to backfill the data:

Run the copy pipe
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00'

Once the Copy Pipe has finished, we will have all the data in the Preview Release 1.0.0 Data Source. We can compare the number of rows in the Live Release 0.0.2 Data Source and the Preview Release 1.0.0 Data Source by running the following command:

Compare the number of rows in the Live Release 0.0.2 Data Source and the Preview Release 1.0.0 Data Source
tb --semver 1.0.0 sql "SELECT count() FROM analytics_events"
tb --semver 0.0.2 sql "SELECT count() FROM analytics_events"

At this point, we should have the same number of rows in both places, and we can promote the new Release on the Branch, as we would in the Live Release, by running the following command:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0

CD Workflow

Now that we have tested the backfilling process in the Branch, we can merge the Pull Request and the CD Workflow will generate a new Preview Release 1.0.0 in the Main Branch. Once the CD Workflow has finished successfully, a new Preview Release 1.0.0 will be created with the new Data Source and the Materialized Pipe that will materialize the data from the Live Release 0.0.2.

You should verify that you have deployed the new Release before the timestamp used in the Materialized Pipe. Otherwise, you will be missing data in the new Data Source.

For example, if you have used timestamp > '2024-02-02 13:30:00' in the Materialized Pipe, you should verify that you have deployed the new release before 2024-02-02 13:30:00.

If you have deployed the new release after 2024-02-02 13:30:00, you will need to remove the Preview Release and start the process again with a later timestamp.

At 2024-02-02 13:30:00, we will start seeing data in the new Data Source. That's when we execute the same backfill process we ran in the CI Workflow.
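A quick sanity check once the cutoff has passed is to confirm that rows are arriving in the new Data Source and that nothing older than the cutoff slipped in, for example:

Check the earliest timestamp in the new Data Source
tb --semver 1.0.0 sql "SELECT min(timestamp), count() FROM analytics_events"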

First, you will need to authenticate in the Workspace by running the following command:

Authenticate in the Workspace
tb auth -i

Then, run the Copy Pipe to backfill the data with the following command:

Run the copy pipe
tb --semver 1.0.0 pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00'

If the Copy Pipe fails, you can re-run the same command without duplicating data: the Copy Pipe only commits the data if the whole process succeeds.
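To monitor an execution, you can inspect the job that each copy run creates. A sketch using the CLI job commands (check tb job --help for the exact subcommands and flags):

Inspect copy jobs
# List recent jobs and their status
tb job ls

# Show the details of a specific job, using an id from the list above
tb job details <job_id>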

If you get an error like MEMORY LIMIT, you can also run the Copy Pipe in batches. For example, you could run the Copy Pipe with a timestamp range of 1 hour, 1 day, or 1 week, depending on the amount of data you are moving.
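As a sketch of day-sized batches, assuming GNU date and that your earliest data is from 2024-01-01: each range ends at 23:59:59 because BETWEEN is inclusive on both ends.

Run the copy pipe in daily batches
#!/bin/bash
# Hypothetical helper: backfill one day per copy to keep each run small
current='2024-01-01'
until [ "$current" = '2024-02-02' ]; do
  tb --semver 1.0.0 pipe copy run backfill_data \
    --param start_backfill_timestamp="$current 00:00:00" \
    --param end_backfill_timestamp="$current 23:59:59" \
    --wait
  current=$(date -d "$current + 1 day" +%F)
done

# Final partial batch, capped at the Materialized Pipe cutoff
tb --semver 1.0.0 pipe copy run backfill_data \
  --param start_backfill_timestamp='2024-02-02 00:00:00' \
  --param end_backfill_timestamp='2024-02-02 13:30:00' \
  --wait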

Once the Copy Pipe has finished, we will have all the data in the Preview Release 1.0.0 Data Source. We can compare the number of rows in the Live Release 0.0.2 Data Source and the Preview Release 1.0.0 Data Source by running the following command:

Compare the number of rows in the Live Release 0.0.2 Data Source and the Preview Release 1.0.0 Data Source
tb --semver 1.0.0 sql "SELECT count() FROM analytics_events"
tb --semver 0.0.2 sql "SELECT count() FROM analytics_events"

At this point, we should have the same number of rows in both places and we can promote the new release.

To promote the Preview Release to the Live Release, you can either use the UI, execute the promote action of our GitHub or GitLab workflow templates, or run the following command:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0

Finally, we have our Data Source with the new schema and all the data migrated from the previous Release. The Data Source is receiving real-time data directly, and the next step is to remove the Materialized Pipe and Copy Pipe we used to backfill the data.

To do that, create a new Pull Request where you bump the release to 1.0.1 and remove the Materialized Pipe and Copy Pipe used for the backfill. Once the Pull Request is merged, the CD Workflow will generate a new Release 1.0.1 that will be auto-promoted to Live Release following semver versioning.

Scenario 4: Streaming ingestion WITHOUT an incremental timestamp column

If you have streaming ingestion, but you do not have an incremental timestamp column, you can use one of the following strategies to backfill data in Tinybird.

Reach out to Tinybird support (support@tinybird.co) if you have any questions or you aren't sure how to proceed.

Strategy 1: Run a populate

Before following this strategy, you should be aware of the backfilling challenge with real-time data.

Let's consider that the use case is the same as the previous one, but we do not have an incremental timestamp column. We cannot rely on the timestamp column to filter the data, as it's not incremental.

First, we will create a Materialized Pipe that will materialize the data from the Live Release 0.0.2 to the Preview Release 1.0.0 Data Source.

backfill_data.pipe file
NODE migrating_node
SQL > 
   SELECT * 
   FROM v0_0_2.analytics_events

TYPE materialized
DATASOURCE analytics_events

To run the backfill, we will use the tb pipe populate command. This command materializes the data from the Live Release 0.0.2 into the Preview Release 1.0.0 Data Source, and as we don't need to wait for a future timestamp, we can run it inside the CI/CD Workflow.

To do that, we need to create a postdeploy.sh script inside the deploy/1.0.0 folder. This script runs after the deployment. To simplify the process, we have created a tb release generate command that generates the script for you. You can read more about customizing the deployment in the Deployment Strategies docs.

Generate the post-deploy.sh script
tb release generate --semver 1.0.0

This will generate two scripts inside the deploy/1.0.0 folder:

Scripts generated inside the deploy folder
deploy/1.0.0
├── deploy.sh     ## This is the script that will be executed during the deployment
└── postdeploy.sh ## This is the script that will be executed after the deployment

In this case, we will remove the deploy.sh script, as we do not need to customize how we deploy, and modify the postdeploy.sh script to run the tb pipe populate command by adding the following lines:

postdeploy.sh script
#!/bin/bash

# This script will be executed after the deployment
# You can use it to run any command after the deployment

# Run the populate pipe
tb --semver 1.0.0 pipe populate backfill_data --node migrating_node --wait

Once we have these changes in our code, we will create a Pull Request and the CI Workflow will generate a new Branch with the new Preview Release 1.0.0. Now, we should verify that everything works as expected, as we did in the previous section.

# You can use `tb auth -i` to authenticate in the branch
tb auth -i 

# Or you can switch to the branch if you are already authenticated
tb branch ls 

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH> 

# You will be able to list all the releases we have deployed to the branch by running the following command:
tb release ls

# Also, you could compare the number of rows in the Live Release 0.0.2 against the Preview Release 1.0.0 by running the following command:
tb --semver 1.0.0 sql "SELECT count() FROM analytics_events"
tb --semver 0.0.2 sql "SELECT count() FROM analytics_events"

Once we have verified that everything is working as expected, we can merge the Pull Request, and the CD Workflow will generate a new Preview Release 1.0.0 in the Main Branch. Once the CD Workflow has finished successfully, a new Preview Release 1.0.0 will be created with the new Data Source and the Materialized Pipe that materializes the data from the Live Release 0.0.2. Now, verify it the same way you did in the Branch.

Finally, if everything is working as expected, you can promote the new Release by running the following command:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0

Strategy 2: Move ingestion to Preview Release

Let's consider that the use case is the same as the previous one. We do not have an incremental timestamp column, so we cannot rely on it to filter the data, and we do not want to run a populate, as we might be impacted by the backfilling challenge with real-time data.

In this case, we can move the ingestion to the new Preview Release 1.0.0 Data Source until we finish backfilling data.

First, we will need to create a Copy Pipe that will copy the data from the Live Release 0.0.2 to the Preview Release 1.0.0 Data Source.

backfill_data.pipe file
NODE migrate_data
SQL > 
   SELECT * 
   FROM v0_0_2.analytics_events

TYPE COPY
TARGET_DATASOURCE analytics_events

We could also parametrize the Copy Pipe to filter the rows it copies, giving us better control of the backfilling process.
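For instance, without an incremental timestamp you could batch on a hash of a stable column. A sketch, assuming session_id distributes evenly (the batch and total_batches parameters are hypothetical):

backfill_data.pipe file with hash-based batches
NODE migrate_data
SQL >
   %
   SELECT *
   FROM v0_0_2.analytics_events
   WHERE cityHash64(session_id) % {{UInt8(total_batches, 10)}} = {{UInt8(batch, 0)}}

TYPE COPY
TARGET_DATASOURCE analytics_events

You would then run the Copy Pipe once per batch, passing --param batch=0 up to --param batch=9.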

Once we have these changes in our code, we will create a Pull Request and the CI Workflow will generate a new Branch with the new Preview Release 1.0.0 that contains the new Data Source and the Copy Pipe.

Now, we should run the Copy Pipe to backfill the data. To do that, authenticate in the branch, either by copying the token from the branch or by switching to the branch from the Live Release.

Authenticate in the branch
# You can use `tb auth -i` to authenticate in the branch
tb auth -i 

# Or you can switch to the branch if you are already authenticated
tb branch ls 

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>

# Once we have authenticated in the branch, we can run the Copy Pipe by running the following command:
tb --semver 1.0.0 pipe copy run backfill_data --node migrate_data --wait

Once the Copy Pipe has finished, we will have all the data in the Preview Release 1.0.0 Data Source. As you are likely not ingesting data into your Branch, both numbers should match.

Compare the number of rows in the Live Release 0.0.2 Data Source and the Preview Release 1.0.0 Data Source
tb --semver 1.0.0 sql "SELECT count() FROM analytics_events"
tb --semver 0.0.2 sql "SELECT count() FROM analytics_events"

At this point, we should have the same number of rows in both places and we can promote the new release by running the following command:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0

Now, we would merge the Pull Request and the CD Workflow will generate a new Preview Release 1.0.0 in the Main Branch. Once the CD Workflow has finished successfully, a new Preview Release 1.0.0 will be created with the new Data Source and the Copy Pipe that will copy the data from the Live Release 0.0.2.

At this point, you should modify the ingestion to start ingesting data into the new Preview Release 1.0.0 Data Source. Keep in mind that while you are ingesting data into the new Preview Release 1.0.0 Data Source, you will stop ingesting data into the Live Release 0.0.2 Data Source.

If you are ingesting using the API endpoints directly, you can pass the __tb__semver=1.0.0 parameter in your requests.
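For example, with the Events API (a sketch: adjust the host to your region and use a Token with append scope):

Send an event to the Preview Release Data Source
curl -X POST "https://api.tinybird.co/v0/events?name=analytics_events&__tb__semver=1.0.0" \
  -H "Authorization: Bearer $TB_TOKEN" \
  -d '{"timestamp": "2024-02-02 14:00:00", "session_id": "abc123", "action": "click", "version": "1.0.0", "payload": "{}"}'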

If you are using the CLI, you can read more about the tb --semver 1.0.0 parameter in our CLI Reference docs.

Once all the ingestion is pointing to the Preview Release, you should verify that new data is being ingested into the new Preview Release 1.0.0 Data Source and nothing is being ingested into the Live Release 0.0.2 Data Source. To do that, you can query the Data Source directly or use the Service Data Source tinybird.datasources_ops_log.
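For example, a quick look at the latest operations on the Data Source (a sketch, assuming the columns documented for datasources_ops_log):

Check recent operations on the Data Source
tb --semver 1.0.0 sql "SELECT timestamp, event_type, result, rows FROM tinybird.datasources_ops_log WHERE datasource_name = 'analytics_events' ORDER BY timestamp DESC LIMIT 10"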

At this point, you can start backfilling data by running the Copy Pipe. To do that, you would need to run the following command:

Run the copy pipe
tb --semver 1.0.0 pipe copy run backfill_data --node migrate_data --wait

Once the Copy Pipe has finished, we will have all the data in the Preview Release 1.0.0 Data Source and now we can promote the new release by running the following command:

Promote the Preview Release to Live Release
tb release promote --semver 1.0.0

Finally, we have our Data Source with the new schema and all the data migrated from the previous Release. The Data Source is receiving real-time data directly, and the next step is to remove the Copy Pipe we used to backfill the data.

Next steps

Now that you're familiar with backfilling strategies, we recommend you check out the Deployment Strategies docs.