Backfill strategies¶
Backfilling is the process of filling a Data Source with historical data that it doesn't yet contain. Whether you're changing data types, changing the sorting key, or redefining whole views, at some point you may need to run a backfill from the previous version of your Data Source or Materialized View to the new one.
This page introduces the key challenges of backfilling real-time data, and covers the different strategies to run a backfill when you are iterating a Data Source or Materialized View.
Before you start iterating and making critical changes to your Data Sources, Materialized Views, and Pipes, it's crucial to read the Deployment Strategies docs.
The challenge of backfilling real-time data¶
Iterating a Data Source or Materialized View often requires a careful approach to backfilling data. This becomes especially critical when you create a new version of a Data Source or Materialized View, because the new version starts out as a new, empty resource.
The main challenge lies in migrating historical data while continuously ingesting new real-time data. See the detailed explanation in Best practices for Materialized Views.
Use case¶
Imagine you have the following Data Source deployed in your main Workspace:
analytics_events.datasource
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
You want to modify the sorting key from timestamp to action, timestamp. This change requires creating a new Data Source (for example, analytics_events_1.datasource).
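As a sketch, the new Data Source file keeps the same schema as the one above and only changes the sorting key:
analytics_events_1.datasource file (sketch)
# Same schema as analytics_events; only the sorting key changes
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "action, timestamp"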
When you merge the Pull Request, you will have analytics_events and analytics_events_1 in your main Workspace (and also in the Branch while in CI). This means you will have two Data Sources: the new one represents the version you want to deploy and, since it has just been created, it is empty.
So, how can you sync the data between the two Data Sources? This is when you use a backfill.
How to move data in Tinybird¶
Reminder: "Running a backfill" just means copying all the data from one Data Source to another Data Source. There are different ways to move data in Tinybird:
Using Copy Pipes¶
A Copy Pipe is a Pipe used to copy data from one Data Source to another Data Source. This method is useful for one-time moves of data or scheduled executions (for example, every day at 00:00), but it's not recommended if you want to keep the data in sync between two Data Sources.
In the context of a backfill, you could use the following Pipe to copy the data from one Data Source to another. (Later, we explain why you need the timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}} condition.)
backfill_data.pipe file
NODE node
SQL >
    %
    SELECT *
    FROM analytics_events
    WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}

TYPE COPY
TARGET_DATASOURCE analytics_events_1
Once deployed, you would need to run the following command to execute the copy:
Command to run the Copy Pipe with the backfill timestamp parameters
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-01-31 00:00:00' --wait --yes
You can read more about it in our Copy Pipes docs.
Using Materialized Views¶
A Materialized View is a Pipe that materializes data from one Data Source into another as new data is ingested.
This method is useful to keep the data in sync between two Data Sources.
sync_data.pipe file
NODE node
SQL >
    SELECT *
    FROM analytics_events
    WHERE timestamp > '2024-01-31 00:00:00'

TYPE materialized
DATASOURCE analytics_events_1
By default, a Materialized View only materializes new incoming data; it doesn't process the old data.
You can force it to process the old data with the tb pipe populate CLI command, but be careful: as explained above, this can lead to duplicated or lost data.
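For reference, a populate over the sync_data Pipe above would look like the following (illustrative only; the materializing node in that Pipe is called node):
Run a populate over the Materialized Pipe
# Reprocess the old data through the Materialized Pipe (use with care, see above)
tb pipe populate sync_data --node node --wait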
By combining both methods, you can keep both Data Sources in sync while backfilling the historical data.
Scenarios for backfill strategies¶
Depending on your use case and ingestion pattern, there are different recommended strategies for backfilling data in Tinybird. The complexity of this migration depends on several factors, notably whether you have streaming ingestion.
We are actively improving this workflow in Tinybird. Reach out to Tinybird support (support@tinybird.co) if you have any questions.
The most common scenarios are:
- Scenario 1: I'm not in Production.
- Scenario 2: Full replacement every few hours.
- Scenario 3: Streaming ingestion WITH an incremental timestamp column.
- Scenario 4: Streaming ingestion WITHOUT an incremental timestamp column.
Scenario 1: I'm not in Production¶
If you are not in production, or the data in that Data Source isn't being used and you can afford to lose it, you can simply create the new Data Source and start using it right away. Alternatively, you can remove and re-create the original Data Source using a custom deployment.
Once you start appending data, you will start seeing it in the new Data Source.
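As a minimal sketch of the second option, a custom deployment script could drop and re-create the Data Source. The commands below assume the classic CLI and a datasources/ folder; adapt them to your project layout:
deploy.sh script (sketch)
#!/bin/bash

# WARNING: this deletes the Data Source and all its data
tb datasource rm analytics_events --yes

# Re-create it from the updated .datasource file
tb push datasources/analytics_events.datasource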
Scenario 2: Full replacement every few hours¶
If you are running a full replacement every few hours, you can create a Materialized View between the two Data Sources.
To sync the data between the two Data Sources, you will use a Materialized Pipe (MV) that will materialize the data from the old Data Source to the new one. Something like this:
Materialize data from old to new Data Source
NODE migration_node
SQL >
    SELECT *
    FROM analytics_events

TYPE materialized
DATASOURCE analytics_events_1
You would deploy this new Pipe along with the modified Data Source. Once you deploy the Pull Request, you will have this Materialized Pipe along with the new Data Source, and the Pipe will materialize into it any new data arriving in the old Data Source.
At this point, you just need to wait until the next full replacement runs. Once it has executed, the new Data Source will have all the data, and you can create a new Pull Request to connect the new Data Source to the rest of your Pipe Endpoints.
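Before connecting the new Data Source, you can check that both Data Sources contain the same number of rows, the same check used later in this guide:
Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"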
Scenario 3: Streaming ingestion WITH an incremental timestamp column¶
If you have streaming ingestion (for example, using the Events API) with a high ingest rate, you can use the following strategy to avoid being impacted by the backfilling challenge with real-time data.
To use this strategy successfully, you must have an incremental timestamp column, in a consistent time zone, in your Data Source. In this example, that's the timestamp column.
First, create a new Pipe that materializes the data from the old Data Source to the new one, but filters by a future timestamp. For example, if you are deploying the Pull Request on 2024-02-02 13:00:00, you can use timestamp > '2024-02-02 13:30:00'.
sync_data.pipe file
NODE node
SQL >
    SELECT *
    FROM analytics_events
    WHERE timestamp > '2024-02-02 13:30:00'

TYPE materialized
DATASOURCE analytics_events_1
The timestamp > '2024-02-02 13:30:00' condition ensures that only data newer than 2024-02-02 13:30:00 is materialized.
Then, create a Copy Pipe with the same SQL statement, but instead of filtering by a specific future timestamp, you will use two parameters to filter by a timestamp range.
This gives you better control of the backfilling process. For example, if you are moving very large amounts of data, you can split the backfill into batches to avoid overloading the system.
backfill_data.pipe file
NODE node
SQL >
    %
    SELECT *
    FROM analytics_events
    WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}

TYPE COPY
TARGET_DATASOURCE analytics_events_1
Once you have these changes in code, create a Pull Request and the CI Workflow will generate a new Branch.
CI workflow¶
Once the CI Workflow has finished successfully, a new Branch will be created.
For the following steps, use the CLI. If you don't have it installed, you can follow the CLI installation docs.
First, authenticate in the Branch, either by copying the Token from the Branch or by using these commands:
Authenticate in the Branch
# You can use `tb auth -i` to authenticate in the branch
tb auth -i

# Or you can switch to the branch if you are already authenticated
tb branch ls

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>
We recommend running the Copy Pipe outside of the CI Workflow. You can automate it using a custom deployment, but most of the time it's not worth it.
As you do not have continuous ingestion in the Branch, there's no need to wait for the future filter timestamp. Instead, run the Copy Pipe directly to backfill the data:
Run the Copy Pipe
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes
Once the Copy Pipe has finished, you will have all the data in the new Data Source. You can compare the number of rows in both Data Sources with the following commands:
Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"
CD workflow¶
Now that you have tested the backfilling process in the Branch, you can merge the Pull Request and let the CD Workflow run. The operation is exactly the same as in the Branch: first deploy the resources, then run the data operations, either manually (recommended) or automated with a custom deployment.
You should verify that you deployed the new Data Source before the timestamp you used in the Materialized Pipe. Otherwise, you will be missing data in the new Data Source.
For example, if you used timestamp > '2024-02-02 13:30:00' in the Materialized Pipe, you should verify that you deployed before 2024-02-02 13:30:00.
If you deployed after 2024-02-02 13:30:00, you will need to remove the Data Source and start the process again using a different timestamp.
At 2024-02-02 13:30:00, you will start seeing data in the new Data Source. That's when you execute the same process you followed in the CI Workflow to backfill the data.
First, you will need to authenticate in the Workspace by running the following command:
Authenticate in the Workspace
tb auth -i
Then, you will need to run the Copy Pipe to backfill the data by running the following command:
Run the Copy Pipe
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes
If the Copy Pipe fails, you can re-run the same command without duplicating data. The Copy Pipe will only copy the data if the process is successful.
If you get an error like MEMORY LIMIT, you can also run the Copy Pipe in batches. For example, you could run the Copy Pipe with a timestamp range of 1 hour, 1 day, or 1 week, depending on the amount of data you are moving.
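As an illustrative sketch, you could script those batches with a small shell loop; the ranges below are examples and should be adapted to your data volume:
Run the Copy Pipe in batches (sketch)
#!/bin/bash

# Non-overlapping ranges (BETWEEN is inclusive and DateTime has second precision)
ranges=(
  "1970-01-01 00:00:00|2024-02-01 23:59:59"
  "2024-02-02 00:00:00|2024-02-02 13:30:00"
)

for range in "${ranges[@]}"; do
  start="${range%%|*}"
  end="${range##*|}"
  tb pipe copy run backfill_data \
    --param start_backfill_timestamp="$start" \
    --param end_backfill_timestamp="$end" \
    --wait --yes
done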
Once the Copy Pipe has finished, you will have all the data in the new Data Source. You can compare the number of rows by running the following commands:
Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"
At this point, you should have the same number of rows in both Data Sources, and you can connect the new Data Source to the rest of the Dataflow.
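Connecting the new Data Source usually just means updating the FROM clause of the downstream Pipes. A hypothetical endpoint Pipe (its name and node are illustrative) would change like this:
Example downstream Pipe node pointing to the new Data Source (sketch)
NODE endpoint_node
SQL >
    SELECT action, count() AS events
    FROM analytics_events_1 -- previously: FROM analytics_events
    GROUP BY action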
Finally, you have the Data Source with the new schema and all the data migrated from the previous one. The Data Source is receiving real-time data directly, and the next step is to remove the Materialized Pipe and the Copy Pipe you used to backfill the data.
To do that, you just need to create a new Pull Request and remove (git rm) the Materialized Pipe and the Copy Pipe you used to backfill the data. Once the Pull Request is merged, the resources will be automatically removed; you can double-check this operation while in CI.
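For example, assuming the Pipes live in a pipes/ folder in your project (adjust the paths to your repository layout):
Remove the backfill Pipes
git rm pipes/sync_data.pipe pipes/backfill_data.pipe
git commit -m "Remove backfill Pipes for analytics_events_1"
git push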
Scenario 4: Streaming ingestion WITHOUT an incremental timestamp column¶
If you have streaming ingestion, but you do not have an incremental timestamp column, you can use one of the following strategies to backfill data in Tinybird.
Reach out to Tinybird support (support@tinybird.co) if you have any questions or you aren't sure how to proceed.
- Strategy 1: Run a populate, but be aware that you may be impacted by the previously-mentioned challenges of backfilling real-time data.
- Strategy 2: Move ingestion to the new Data Source until you finish backfilling data, but the data in your old Data Source will be outdated until the new Data Source is fully in sync.
Strategy 1: Run a populate¶
Before following this strategy, you should be aware of the backfilling challenge with real-time data.
Let's consider that the use case is the same as the previous one, but you do not have an incremental timestamp column. You can't rely on the timestamp column to filter the data, as it's not incremental.
First, create a Materialized Pipe that will materialize the data from the old Data Source to the new one.
backfill_data.pipe file
NODE migrating_node
SQL >
    SELECT *
    FROM analytics_events

TYPE materialized
DATASOURCE analytics_events_1
To run the backfill, you will use the tb pipe populate command. This command materializes the data from the old Data Source to the new one and, as you don't need to wait until a future timestamp, you can run it inside the CI/CD Workflow.
You can create a custom deployment by setting VERSION=1.0.0 in the .tinyenv file and placing the custom deployment script in the deploy/1.0.0 folder:
Scripts generated inside the deploy folder
deploy/1.0.0
├── deploy.sh ## This is the script that will be executed during the deployment
You will need to modify the deploy.sh script to run the tb pipe populate command:
deploy.sh script
#!/bin/bash

# This script will be executed after the deployment
# You can use it to run any command after the deployment

# Run the populate Pipe
tb pipe populate backfill_data --node migrating_node --wait
Once you have these changes in the code, create a Pull Request, and the CI Workflow will generate a new Branch with the new Data Source. Now, verify that everything works as expected, as you did in the previous section.
# You can use `tb auth -i` to authenticate in the branch
tb auth -i

# Or you can switch to the branch if you are already authenticated
tb branch ls

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>

# Also, you can compare the number of rows by running the following commands:
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"
Once you have verified that everything works as expected, merge the Pull Request, and the CD Workflow will create the new Data Source in the main Branch. Once the CD Workflow has finished successfully, verify it the same way you did in the Branch.
Strategy 2: Move ingestion to the new Data Source¶
Let's consider that the use case is the same as the previous one: you do not have an incremental timestamp column, you can't rely on the timestamp column to filter the data as it's not incremental, and you don't want to run a populate because you might be impacted by the backfilling challenge with real-time data.
In this case, you can move the ingestion to the new Data Source until you finish backfilling data.
First, create a Copy Pipe that will copy the data from the old Data Source to the new one.
backfill_data.pipe file
NODE migrate_data
SQL >
    SELECT *
    FROM analytics_events

TYPE COPY
TARGET_DATASOURCE analytics_events_1
You could also parametrize the Copy Pipe to filter by a parameter, which gives you better control of the backfilling process, as shown in the sketch below.
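The sketch reuses the existing timestamp column purely to split the copy into batches. By the time you run it, ingestion will have moved to the new Data Source, so the old one is static and the column doesn't need to be incremental. The parameter names are illustrative:
backfill_data.pipe file with parameters (sketch)
# Sketch: same Copy Pipe, parametrized to run the backfill in batches
NODE migrate_data
SQL >
    %
    SELECT *
    FROM analytics_events
    WHERE timestamp BETWEEN {{DateTime(start_ts)}} AND {{DateTime(end_ts)}}

TYPE COPY
TARGET_DATASOURCE analytics_events_1
You would then pass the parameters when running it, for example with --param start_ts and --param end_ts on tb pipe copy run, as in Scenario 3.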
Once you have these changes in your code, create a Pull Request, and the CI Workflow will generate a new Branch with the new Data Source and the Copy Pipe.
Now, run the Copy Pipe to backfill the data. To do that, authenticate in the Branch, either by copying the Token from the Branch or by using the following commands:
Authenticate in the branch
# You can use `tb auth -i` to authenticate in the branch
tb auth -i

# Or you can switch to the branch if you are already authenticated
tb branch ls

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>

# Once you have authenticated in the branch, you can run the Copy Pipe:
tb pipe copy run backfill_data --node migrate_data --wait --yes
Once the Copy Pipe has finished, you will have all the data in the new Data Source. As you are likely not ingesting data into your Branch, both numbers should match.
Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"
Now, merge the Pull Request and the CD Workflow will generate the new resources in the main Branch.
At this point, modify your ingestion so that it starts sending data to the new Data Source. Keep in mind that once you're ingesting into the new Data Source, you stop ingesting into the old one, so the old Data Source will be outdated from that moment on.
Once all the ingestion is pointing to the new Data Source, verify that new data is arriving in the new Data Source and nothing is being ingested into the old one. To do that, you can query the Data Sources directly or use the Service Data Source tinybird.datasources_ops_log.
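For example, a query along these lines shows recent ingestion per Data Source; check the exact columns of datasources_ops_log in the Service Data Sources docs before relying on it:
Check recent ingestion per Data Source (sketch)
tb sql "SELECT datasource_name, count() AS operations, sum(rows) AS rows_written
        FROM tinybird.datasources_ops_log
        WHERE timestamp > now() - INTERVAL 1 HOUR
          AND datasource_name IN ('analytics_events', 'analytics_events_1')
        GROUP BY datasource_name"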
At this point, you can start backfilling data by running the Copy Pipe. To do that, you would need to run the following command:
Run the Copy Pipe
tb pipe copy run backfill_data --node migrate_data --wait --yes
Sometimes the Data Source you are modifying has downstream dependencies. In that case, when you create the new version of the Data Source, make sure you also create new versions of all downstream dependencies. Otherwise, you would connect two different Data Sources that are both receiving data to the same part of the Dataflow, duplicating data.
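For example, if a downstream Materialized Pipe feeds another Data Source from analytics_events, create a new version of both that reads from analytics_events_1. The names and the filter below are hypothetical:
New version of a downstream Materialized Pipe (sketch)
# Hypothetical downstream Materialized Pipe, versioned alongside the new Data Source
NODE filter_node
SQL >
    SELECT timestamp, session_id, action
    FROM analytics_events_1
    WHERE action = 'page_view'

TYPE materialized
DATASOURCE page_views_1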
Next steps¶
If you're familiar with backfilling strategies, check out the Deployment Strategies docs.