Iterate a Data Source

Advanced

Creating a Data Source for the first time is really straightforward. However, when iterating data projects, sometimes you need to edit the Data Source schema. This can be challenging when the data is already in production, and there are a few different scenarios. With Tinybird you can easily add more columns, but other operations (such as changing the sorting key or changing a column type) require you to fully recreate the Data Source.

This guide is for Workspaces that aren't using version control. If your Workspace is linked using the Git<>Tinybird integration, see the repo of common use cases for iterating when using version control.

Context

This guide walks through the process for 4 different scenarios. Pick the one that's most relevant for you:

  • Scenario 1: I'm not in production
  • Scenario 2: I can stop/pause data ingestion
  • Scenario 3: I need to change a Materialized View & I can't interrupt service
  • Scenario 4: It's too complex and I can't figure it out

You'll need to be familiar with the Tinybird CLI to follow along with this guide. If you've never used it before, read the CLI docs first.

All of the guide examples share the same setup: a Data Source with a Nullable(Int64) column that we want to change to an Int64 for performance reasons. This requires editing the schema and, to keep the existing data, replacing any occurrences of NULL with a number, such as 0.
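For reference, the starting Data Source file for these examples might look like the following sketch. Only my_nullable_column and the name original_ds come from the examples below; the other column, the engine, and the sorting key are placeholders:

SCHEMA >
    `my_timestamp` DateTime,
    `my_nullable_column` Nullable(Int64)

ENGINE "MergeTree"
ENGINE_SORTING_KEY "my_timestamp"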

Scenario 1: I'm not in production

This scenario assumes that you are not in production and can accept losing any data you have already ingested.

In that case, use the Tinybird CLI to pull your Data Source down to a file, modify it, and push it back into Tinybird.

Begin with tb pull to pull your Tinybird resources down to files. Then, modify the .datasource file for the Data Source you want to change. When you're finished modifying the Data Source, delete the existing Data Source from Tinybird, either in the CLI with tb datasource rm or through the UI. Finally, push the new Data Source to Tinybird with tb push.
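For example, assuming a Data Source named original_ds (as in the rest of this guide), the full cycle looks roughly like this:

tb pull                          # pull your Tinybird resources down as files
# edit original_ds.datasource, e.g. Nullable(Int64) -> Int64
tb datasource rm original_ds     # delete the existing Data Source
tb push original_ds.datasource   # push the modified Data Source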

See a screencast example: https://www.youtube.com/watch?v=gzpuQfk3Byg

Scenario 2: I can stop data ingestion

This scenario assumes that you have stopped all ingestion into the affected Data Sources.

1. Use the CLI to pull your Tinybird resources down into files

Use tb pull --auto to pull your Tinybird resources down into files. The --auto flag organizes the resources into directories, placing your Data Sources into a datasources directory.
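For example, after running the command your local project might look something like this (the file names here are placeholders):

tb pull --auto

datasources/
    original_ds.datasource
pipes/
    my_endpoint.pipe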

2. Create the new Data Source

Create a copy of the Data Source file that you want to modify and rename it.

For example, datasources/original_ds.datasource -> datasources/new_ds.datasource.

Modify the new Data Source schema in the file to make the changes you need.
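Continuing the running example, new_ds.datasource would be identical to the original file except for the changed column, here also renamed to my_column to match the backfill queries below (the other column and the sorting key remain placeholders):

SCHEMA >
    `my_timestamp` DateTime,
    `my_column` Int64

ENGINE "MergeTree"
ENGINE_SORTING_KEY "my_timestamp"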

Now push the new Data Source to Tinybird with tb push datasources/new_ds.datasource.

3. Backfill the new Data Source with existing data

If you want to move the existing data from the original Data Source to the new Data Source, use a Copy Pipe or a Pipe that materializes data into the new Data Source.

A Copy Pipe is a Pipe that copies data from one Data Source to another. This method is useful for one-off data moves or scheduled executions.

Move your data using the following Copy Pipe, paying particular attention to the TYPE, TARGET_DATASOURCE and COPY_SCHEDULE configs at the end:

NODE copy_node
SQL > 
    SELECT 
        * EXCEPT (my_nullable_column), 
        toInt64(coalesce(my_nullable_column,0)) as my_column -- adjust query to your changes
    FROM original_ds

TYPE COPY
TARGET_DATASOURCE new_ds
COPY_SCHEDULE @on-demand

Push it to the Workspace:

tb push pipes/temp_copy.pipe

And run the Copy:

tb pipe copy run temp_copy

When it's done, remove the Pipe:

tb pipe rm temp_copy

Alternative option: use a Populate

Alternatively, you can create a Materialized View Pipe and run a Populate to transform data from the original schema into the modified schema of the new Data Source.

Do this using the following Pipe, paying particular attention to the TYPE and DATASOURCE configs at the end:

NODE temp_populate
SQL > 
    SELECT 
        * EXCEPT (my_nullable_column), 
        toInt64(coalesce(my_nullable_column,0)) as my_column
    FROM original_ds

TYPE materialized
DATASOURCE new_ds

Then push the Pipe to Tinybird, passing the --populate flag to force it to immediately start processing data:

tb push pipes/temp.pipe --populate --wait

When it's done, remove the Pipe:

tb pipe rm temp

At this point, review your new Data Source and ensure that everything is as expected.
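A quick sanity check is to compare row counts between the two Data Sources with tb sql, for example:

tb sql "SELECT count() FROM original_ds"
tb sql "SELECT count() FROM new_ds"

Both counts should match before you move on.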

4. Delete the original Data Source and rename the new Data Source

You can now go to the UI, delete the original Data Source, and rename the new Data Source to use the name of the original Data Source.

By renaming the new Data Source to use the same name as the original Data Source, any SQL in your Pipes or Endpoints that referred to the original Data Source will continue to work.

If you have a Materialized View that depends on the Data Source, you must unlink the Pipe that is materializing data before removing the Data Source. You can modify and reconnect your Pipe after completing the steps above.

Scenario 3: I need to change a Materialized View & I can't interrupt service

This scenario assumes you want to modify a Materialized View that is actively receiving data and serving API Endpoints, and you want to avoid service downtime.

Before you begin

Because this is a complex scenario, we'll introduce some names for the example resources to make it a bit easier to follow along.

Read The challenge with populates and streaming ingest to understand why this scenario is challenging.

Let's assume that you have a Data Source that is actively receiving data; let's call this the Landing Data Source. From the Landing Data Source, you have a Pipe that is writing to a Materialized View; let's call these the Materializing Pipe and Materialized View Data Source respectively.
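As a sketch, the Materializing Pipe might look something like this, where landing_ds and the column names are illustrative placeholders:

NODE materialize_node
SQL >
    SELECT
        my_timestamp,
        my_nullable_column
    FROM landing_ds

TYPE materialized
DATASOURCE original_materialized_view_data_source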

1. Use the CLI to pull your Tinybird resources down into files

Use tb pull --auto to pull your Tinybird resources down into files. The --auto flag organizes the resources into directories, placing your Data Sources into a datasources directory.

2. Duplicate the Materializing Pipe & Materialized View Data Source

Make a copy of each file. For example:

pipes/original_materializing_pipe.pipe -> pipes/new_materializing_pipe.pipe
datasources/original_materialized_view_data_source.datasource -> datasources/new_materialized_view_data_source.datasource

Modify the new files to change the schema as needed.

Lastly, you'll need to add a WHERE clause to the new Materializing Pipe. This clause filters out old rows, so that the Materializing Pipe only materializes rows newer than a specific time. For the purpose of this guide, let's call this the Future Timestamp. Do not use variable time functions for this timestamp (e.g. now()). Pick a static time in the near future; five to fifteen minutes ahead should be enough. The condition should be >, for example:

WHERE … AND my_timestamp > '2024-04-12 13:15:00'
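Putting it all together, the new Materializing Pipe might look like this sketch (landing_ds and the column names are placeholders, and the schema change is the same Nullable(Int64) -> Int64 example used throughout this guide):

NODE materialize_node
SQL >
    SELECT
        my_timestamp,
        toInt64(coalesce(my_nullable_column, 0)) as my_column
    FROM landing_ds
    WHERE my_timestamp > '2024-04-12 13:15:00' -- static Future Timestamp, not now()

TYPE materialized
DATASOURCE new_materialized_view_data_source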

3. Push the Materializing Pipe & Materialized View Data Source

Push the Materializing Pipe & Materialized View Data Source to Tinybird:

tb push datasources/new_materialized_view_data_source.datasource
tb push pipes/new_materializing_pipe.pipe

4. Create a new Pipe to transform & materialize the old schema to the new schema

You now have two Materialized Views: the one with the original schema, and the new one with the new schema. You need to take the data from the original Materialized View, transform it into the new schema, and write it into the new Materialized View.

To do this, create a new Pipe. In this guide, it's called the Transform Pipe. In your Transform Pipe, create the SQL SELECT logic that transforms the old schema into the new schema.

Lastly, your Transform Pipe should have a WHERE clause that only selects rows up to and including the Future Timestamp. The condition should be <=, using the same timestamp as in the Materializing Pipe, for example:

WHERE … AND my_timestamp <= '2024-04-12 13:15:00'
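For example, the Transform Pipe (saved here as pipes/transform_pipe.pipe, a file name chosen for this guide) could look like this sketch:

NODE transform_node
SQL >
    SELECT
        my_timestamp,
        toInt64(coalesce(my_nullable_column, 0)) as my_column
    FROM original_materialized_view_data_source
    WHERE my_timestamp <= '2024-04-12 13:15:00' -- the same Future Timestamp, inverted condition

TYPE materialized
DATASOURCE new_materialized_view_data_source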

5. Wait until after the Future Timestamp, then push & populate with the Transform Pipe

Now, to avoid creating duplicate or missing rows, wait until the Future Timestamp has passed. After that point, no new rows should arrive with a timestamp older than the Future Timestamp.

Then, push the Transform Pipe and force a populate:

tb push pipes/transform_pipe.pipe --populate --wait

6. Wait for the populate to finish, then change your API Endpoint to read from the new Materialized View Data Source

Wait until the previous command has completed to ensure that all data from the original Materialized View has been written to the new Materialized View Data Source.

When it is complete, modify the API Endpoint that is querying the old Materialized View to query from the new Materialized View Data Source.

For example:

SELECT * from original_materialized_view_data_source

Would become:

SELECT * from new_materialized_view_data_source

7. Test, then clean up old resources

Test that your API Endpoint is serving the correct data. If everything looks good, tidy up your Workspace by deleting the Transform Pipe, the original Materializing Pipe, and the original Materialized View Data Source (remember to unlink the Materializing Pipe before removing the Data Source it writes to).

Scenario 4: It's too complex and I can't figure it out

If you are dealing with a very complex scenario, don't worry! Contact Tinybird support (support@tinybird.co) and we can help guide you.