Replace and delete data in your Tinybird Data Sources

Intermediate

Update and delete operations are common over operational data in transactional databases, but sometimes you also need to make these changes to your analytical data in Tinybird.

Perhaps there was a bug in your application, a transient error in your operational database, or simply an evolution of requirements due to product or regulatory changes. Either way, you might need the capability to delete or replace some data in Tinybird.

While real-time analytical databases like Tinybird are optimized for SELECTs and INSERTs, Tinybird fully supports replacing and deleting data. The complexities of data replication, partition management, and mutation rewrites are hidden, allowing you to focus on your data engineering flows rather than the internals of a real-time analytical database.

This guide will show you how to selectively delete or update data in Tinybird using the REST API.

It is not safe to replace data in the partitions where you are actively ingesting data. You may lose the data inserted during the process.

Delete data selectively

To delete data that matches a condition, you can send a POST request to the Data Sources /delete API, providing the name of one of your Data Sources in Tinybird and a delete_condition parameter, which is an SQL filter expression.

Imagine you have a Data Source called events and you want to remove all the transactions for November 2019. You'd send a POST request like this:
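For example, a minimal sketch with curl, assuming the events Data Source has a date column and that your Workspace lives in the api.tinybird.co region:

curl \
    -H "Authorization: Bearer <your_token>" \
    -X POST "https://api.tinybird.co/v0/datasources/events/delete" \
    --data-urlencode "delete_condition=(toDate(date) >= '2019-11-01' and toDate(date) <= '2019-11-30')"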

Use a token with the right scope:

Replace <your_token> with a token whose scope is DATASOURCES:CREATE or ADMIN.

The POST request to the delete API endpoint is asynchronous: it returns a Job response with an ID for the job, the status of the job, the delete_condition, and some other metadata. Although the delete operation runs asynchronously (hence the job response), the job waits until the underlying mutations have been rewritten and the data has been deleted on all replicas.

{
   "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b",
   "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b",
   "job_url": "https://api.tinybird.co/v0/jobs/64e5f541-xxxx-xxxx-xxxx-00524051861b",
   "job": {
      "kind": "delete_data",
      "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b",
      "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b",
      "status": "waiting",
      "created_at": "2023-04-11 13:52:32.423207",
      "updated_at": "2023-04-11 13:52:32.423213",
      "started_at": null,
      "is_cancellable": true,
      "datasource": {
         "id": "t_c45d5ae6781b41278fcee365f5bxxxxx",
         "name": "shopping_data"
      },
      "delete_condition": "event = 'search'"
   },
   "status": "waiting",
   "delete_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b"
}  

You can periodically poll the job_url with the given ID to check the status of the deletion process. Once the job is done, the data matching the SQL filter expression has been removed, and all your Pipes and API endpoints will keep working with the remaining data in the Data Source.
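For example, using the job_url returned in the response above:

curl \
    -H "Authorization: Bearer <your_token>" \
    "https://api.tinybird.co/v0/jobs/64e5f541-xxxx-xxxx-xxxx-00524051861b"

The status field moves from waiting to working and, eventually, to done (or error).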

Truncate a Data Source

Sometimes you just want to delete all the data contained in a Data Source, usually to start over from zero. You can perform this action from both the UI and the API.

Using the API, you can call the truncate endpoint to delete all the rows in a Data Source, as follows:
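For example, a minimal curl sketch, reusing the events Data Source from before:

curl \
    -H "Authorization: Bearer <your_token>" \
    -X POST "https://api.tinybird.co/v0/datasources/events/truncate"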

Use a token with the right scope:

Replace <your_token> with a token whose scope is DATASOURCES:CREATE or ADMIN.

If you wish, you can also truncate a Data Source directly from the UI:

Deleting data selectively is only available via the API, but truncating a Data Source to delete all of its data can be done via the UI.

Replace data selectively

The ability to update data is often not the top priority when designing analytical databases, but there are always scenarios where you need to update or replace your analytical data. For example, you might have reconciliation processes over your transactions that affect your original data. Or maybe your ingestion process was simply faulty, and you ingested inaccurate data for a period of time.

In Tinybird, you can specify a condition so that only part of the data is replaced during the ingestion process. For instance, let's say you want to reingest a CSV with the data for November 2019 and update your Data Source accordingly. To update the data, you just need to pass the replace_condition parameter with the condition toDate(date) >= '2019-11-01' and toDate(date) <= '2019-11-30'.
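A minimal sketch of that call with curl; the CSV URL is a placeholder for wherever your corrected November 2019 data lives:

curl \
    -H "Authorization: Bearer <your_token>" \
    -X POST "https://api.tinybird.co/v0/datasources" \
    -d "mode=replace" \
    -d "name=events" \
    --data-urlencode "replace_condition=(toDate(date) >= '2019-11-01' and toDate(date) <= '2019-11-30')" \
    --data-urlencode "url=https://example.com/events_2019_11.csv"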

Use a token with the right scope:

Replace <your_token> with a token whose scope is DATASOURCES:CREATE or ADMIN.

The response to this call is a Job, similar to the one returned by the delete API. As in the case of selective deletion, a selective replacement runs as an asynchronous job, so we recommend checking its status periodically by polling the job_url returned in the response.

About the replace condition

Conditional replaces are applied over partitions. Partitions are selected for the replace based on the rows in the new data that match the condition: the partitions involved are the ones where those matching rows would be stored.

The replace condition is applied to filter the new data that's going to be appended, meaning rows not matching the condition won't be inserted.

The condition is also applied to the selected partitions in the Data Source, so rows in those partitions that don't match the condition will be removed. However, rows that don't match the condition but are present in other partitions won't be deleted.

If you are trying to delete rows in the target data source in your workflow, we recommend that you have a look at the "Replace data removing non-matching rows from the data source" example below.

Linked materialized views

If you have several connected materialized views, then selective replaces are done in cascade. For example, if Data Source A materializes data in a cascade to Data Source B and from there to Data Source C, then when you replace data in Data Source A, Data Sources B and C will automatically be updated accordingly. All three Data Sources need to have compatible partition keys since replaces are done by partition.

The command tb dependencies --datasource the_data_source --check-for-partial-replace returns the dependencies that would be recalculated, both for Data Sources and Materialized Views, and raises an error if any of the dependencies have incompatible partition keys.
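For instance, for a hypothetical events Data Source:

tb dependencies --datasource events --check-for-partial-replace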

Note that the provided Auth token must have the DATASOURCES:CREATE scope.

Example

For this example, consider this Data Source:
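A minimal sketch of such a .datasource schema (the column types are assumed for illustration):

SCHEMA >
    `name` String,
    `profession` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "profession"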

Its partition key is ENGINE_PARTITION_KEY "profession". If you want to replace the last two rows with new data, you can send this request with the replace condition replace_condition=(profession='Jedi'):
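A sketch of that request with curl; the Data Source name characters and the CSV URL are placeholders:

curl \
    -H "Authorization: Bearer <your_token>" \
    -X POST "https://api.tinybird.co/v0/datasources" \
    -d "mode=replace" \
    -d "name=characters" \
    --data-urlencode "replace_condition=(profession='Jedi')" \
    --data-urlencode "url=https://example.com/new_jedi_rows.csv"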

Since the replace condition column matches the partition key, the result is as expected: the existing "Jedi" rows are replaced by the new data.

However, consider what happens if you create the same Data Source with ENGINE_PARTITION_KEY "name" instead, and then run the same replace request. Now the result probably doesn't make sense: the new rows are inserted, but the existing "Jedi" rows are still there.

Why weren't the existing rows removed? Because the replace process used the payload rows to identify which partitions to work on. The Data Source is now partitioned by name, so the process didn't delete the other "Jedi" rows: they're in different partitions because they have different names.

The rule of thumb is this: Always make sure the replace condition uses the partition key as the filter field. This way you'll always get the results you expect.

Replace a Data Source completely

To replace a complete Data Source, you'd make an API call similar to the previous one, without providing a replace_condition:
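For example, with curl; note that there's no replace_condition this time, and the CSV URL is again a placeholder:

curl \
    -H "Authorization: Bearer <your_token>" \
    -X POST "https://api.tinybird.co/v0/datasources" \
    -d "mode=replace" \
    -d "name=events" \
    --data-urlencode "url=https://example.com/full_events.csv"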

Use a token with the right scope:

Replace <your_token> with a token whose scope is DATASOURCES:CREATE or ADMIN.

The request above replaces a Data Source with the data found at a given URL pointing to a CSV file, but you can also do it through our User Interface as follows:

Replacing a Data Source completely can also be done through the User Interface

Schemas must be the same:

When replacing data (either selectively or a whole Data Source), you need to ensure the schema of the new data is the same as that of the original Data Source. Rows that don't match the schema will go to the quarantine Data Source.