Common use cases for iterating Data Projects¶

Intermediate

In this guide, you’ll learn how to iterate Tinybird Data Projects. Whether you’re adding a column, changing data types, or redefining whole views, this guide helps you navigate the complexities in a real-time data environment.

Before you start iterating and making critical changes to your Data Sources, Materialized Views, and Pipes, it’s crucial to read our guidelines on How to Work with Data Projects.

Add column to a Landing Data Source¶

Adding a column to a Landing Data Source is a straightforward and common task in any Data Project. It lets you ingest new data, but be aware that the order in which you add the column and start sending data for it matters.

A Landing Data Source is the initial data repository that collects information from multiple channels enabled by Tinybird for data ingestion. It serves as the first point of entry for new data, where it is gathered before undergoing any subsequent processing or transformation.

Before you start pushing data for the new property or column to Tinybird, make sure to add the column to your Data Source first. Otherwise you may run into issues: for example, when using the Events API, the values for the new property won’t be stored, yet the requests sending them won’t fail.

To begin, add the new column to the file containing your Data Source definition. There’s no need to update the version or clone the Data Source; simply incorporate the changes directly into the Data Source file:
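A minimal sketch of what this could look like, assuming a hypothetical landing Data Source named analytics_events where environment is the newly added column (your schema and names will differ):

analytics_events.datasource (sketch)¶
SCHEMA >
   `timestamp` DateTime,
   `session_id` String,
   `action` String,
   `payload` String,
   `environment` LowCardinality(String)

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"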

The next step involves executing a Custom Deployment using the following command.

Custom Deployment¶
tb deploy --yes

The --yes option confirms that you want to alter the Data Source.

For more information, see the documentation on how to make Custom Deployments.

All the changes implemented to add the new column: PR on GitHub

Add column to a Materialized View¶

Why and How the Data Migration Mechanism Works

When dealing with real-time or near-real-time data streams, stopping the flow of data for system modifications—like adding a new column—is often not an option.

To circumvent this limitation, you should employ a multi-step data migration mechanism:

  • Sync data streams: Create a new Data Source that includes the additional column, and synchronize the data streams between both the original and the new Data Sources using a Materialized View. This operation will run in parallel with the existing data ingestion process for the original Data Source, ensuring an uninterrupted flow of data.

  • Time Filtering: Use time-based filters to avoid data duplication and to coordinate the migration. New data is directed to the target Data Source only after a specified filter time; once that time is reached, the target Data Source is backfilled as described in the next step.

  • Backfill: Once the filter time is reached, run a data population operation from the original Data Source to the new one. Applying the time filter correctly prevents duplication.

This ensures that the existing system remains operational during the migration and the data is consistent before, during, and after the operation.

Example Step by Step

Imagine you want to add a new column to a Materialized View; in this example, the column environment is added to the Data Source analytics_sessions_mv.

Initial flow schema

Step 1: Duplicate the Materialized View

Create a duplicate of the original Data Source, adding the new column.

Pull Request: PR #1 on GitHub

Filter: Use a time filter to avoid duplicating records already present in the original Data Source. Setting the filter around 15 minutes into the future is typically sufficient, although this may vary.

WHERE timestamp > '2023-11-07 12:30:00'

New target Data Source being populated with new traffic

Once the filter date is reached, new rows start to be ingested into the target Data Source.
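A rough sketch of what the new Materialized Pipe could look like, assuming a simplified sessions schema, a hypothetical analytics_events landing Data Source, and analytics_sessions as the new target Data Source (the exact files are in the linked pull requests):

analytics_sessions_with_environment.pipe (sketch)¶
NODE materialize_sessions
SQL >
   SELECT
       toDate(timestamp) AS date,
       session_id,
       environment,
       countState() AS hits
   FROM analytics_events
   WHERE timestamp > '2023-11-07 12:30:00'
   GROUP BY date, session_id, environment

TYPE materialized
DATASOURCE analytics_sessions

The target analytics_sessions Data Source would be defined alongside it with an AggregatingMergeTree engine, mirroring the original schema plus the environment column.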

Step 2: Backfilling

  • Wait until the filter date is reached.

  • Backfill data from the original to the target Data Source. Remember to backfill only the data prior to the filter date.

    WHERE timestamp <= '2023-11-07 12:30:00'

  • Make a custom deployment with the following commands to populate the target Data Source using a new Materialized Pipe.

ci-deploy.sh - integration: Command to populate the Data Source using a pipe and add fixtures for testing¶
  tb deploy --populate --fixtures --wait
cd-deploy.sh - deployment: Same without fixtures to avoid polluting production¶
  tb deploy --populate --wait
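A sketch of what the backfill Materialized Pipe could look like, reusing the hypothetical names from the previous sketch. Note the complementary filter, which guarantees no row is counted twice; rows ingested before the column existed simply get an empty environment value:

backfill_analytics_sessions.pipe (sketch)¶
NODE backfill_sessions
SQL >
   SELECT
       toDate(timestamp) AS date,
       session_id,
       environment,
       countState() AS hits
   FROM analytics_events
   WHERE timestamp <= '2023-11-07 12:30:00'
   GROUP BY date, session_id, environment

TYPE materialized
DATASOURCE analytics_sessions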

Add a quality test to check that the backfill worked as expected before going to production. We can compare the number of hits in the new and legacy Data Sources (there should be no difference):

WITH
   (SELECT countMerge(hits) FROM analytics_sessions_mv) AS legacy_hits,
   (SELECT countMerge(hits) FROM analytics_sessions) AS new_hits
SELECT
   legacy_hits - new_hits AS diff
WHERE
   diff != 0 -- Quality tests expect that no rows are returned.

It’s highly recommended to add automated testing to your Data Project. As outlined in the article Implementing Test Strategies, Versions offers a comprehensive suite of tools to automate testing within CI/CD pipelines: regression tests to catch new bugs or performance degradation, data quality tests to detect cases that should not exist in production, and fixture tests to validate specific scenarios with fine-grained control.

Pull Request: PR #2 on GitHub

Step 3: Update Dependencies

Finally, depending on your requirements, update all dependent resources, such as API endpoints, to use the new Materialized View with the added column.

  • Create a new git branch and a pull request with the changes.

  • Modify the dependencies to point to the new Data Source (see the sketch after this list).

  • Validate that the transition is seamless in the temporary Environment created by the Continuous Integration (CI) workflow.

  • Once confirmed, merge the Pull Request to deploy the changes to the main Environment.
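For example, an endpoint Pipe that aggregates hits from the legacy Materialized View could change roughly like this, again using the hypothetical names from the sketches above; the essential change is pointing FROM at the new Data Source, while exposing the new column is optional:

sessions_endpoint.pipe (sketch)¶
NODE endpoint
SQL >
   SELECT
       date,
       environment,
       countMerge(hits) AS hits
   FROM analytics_sessions
   GROUP BY date, environment
   ORDER BY date DESC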

You can see all the resources for this example, including the pull requests and detailed step-by-step documentation:

Add column to a Materialized View use case sources

Recover data from quarantine¶

See this specific guide

Sharing Data Sources¶

A typical way to organize Data Projects is as follows:

Shared Workspaces organization

Some data teams have a Data Project containing the source of truth for their Data Sources and some basic services so that other teams can build data products from them.

The way to do that is by sharing Data Sources from one Data Project to another. You can share Data Sources in two ways:

  • From the origin Workspace in the UI, using the Share button in the Data Source modal

  • From the CLI, using tb datasource share <datasource_name> <destination_workspace_name> --user_token <user_token>, where user_token is the token labelled RW user token in the UI.

Alternatively, there’s an API that isn’t meant to be public; contact us in case you need access to it.

If you try to use a shared Data Source in a Data Project before it has been shared as described above, you’ll see an error like this in the Continuous Integration pipeline:

Error when trying to use a shared Data Source not shared yet¶
** Failed running ./pipes/ingestion_snowflake_error.pipe:
** Failed pushing pipe ingestion_snowflake_error: Resource 'Internal.external_datasource_connector_ops_log' not found

In order to work with Internal.external_datasource_connector_ops_log from your Data Project you have to:

  • Go to the Internal Workspace and share external_datasource_connector_ops_log with your destination Workspace.

  • In your destination Data Project, run tb pull. A new vendor folder will be created containing external_datasource_connector_ops_log.datasource. The vendor folder follows the default Data Project folder structure.

  • To use external_datasource_connector_ops_log in your Pipes, refer to it as Internal.external_datasource_connector_ops_log in your queries, for instance:

Use a shared Data Source in a .pipe file¶
NODE error_report
SQL >
   SELECT count() FROM Internal.external_datasource_connector_ops_log
   WHERE status == 'error'

BigQuery Data Sources¶

If it’s the first time you create a BigQuery connector, you need to grant access to the Tinybird principal in the IAM & Admin console in GCP. You can follow this guide in your main Environment to get the principal and grant access.

Alternatively you can run this command in your main Environment:

tb connection create bigquery
** Using CLI in development mode

** Log into your Google Cloud Platform Console as a project editor and go to https://console.cloud.google.com/iam-admin/iam
** Grant access to this principal: <principal_name>
** Assign it the role "BigQuery Data Viewer"
Ready?  (y, N) [N]: y
** Connection bigquery created successfully!

To create a new BigQuery Data Source, follow this Pull Request example.

bq_pypi_data.datasource¶
DESCRIPTION >
 Downloads from pypi

SCHEMA >
   `timestamp` DateTime,
   `cli_version` LowCardinality(String),
   `python_version` LowCardinality(String)

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toDate(timestamp)"
ENGINE_SORTING_KEY "timestamp"

IMPORT_SERVICE bigquery
IMPORT_SCHEDULE 30 3 * * *
IMPORT_EXTERNAL_DATASOURCE bigquery-public-data.pypi.file_downloads
IMPORT_STRATEGY REPLACE
IMPORT_QUERY "SELECT `timestamp`, file.version as cli_version, details.python as python_version FROM `bigquery-public-data.pypi.file_downloads` WHERE DATE(timestamp) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY) AND project = 'tinybird-cli'"

You can also create a new Environment and the BigQuery Data Source from the UI, then run tb pull to sync them to your local Data Project and commit them to your Git repository.

Once the Datafile is pushed in a Pull Request, check that the Continuous Integration job passes. If you created the .datasource Datafile directly in Git, you can verify the BigQuery connection by getting the CI Environment name from the Create new test Environment with data step in the CI pipeline:

Run tb \
** Environment 'tmp_ci__3' from 'st_wask_versions_v2' has been created
** Data Branch job url ***/v0/jobs/d58c8038-5207-4ae8-b856-be5bf5ee4b86
Data Branching
** Now using tmp_ci__3 (8f40b30a-e28e-4ce5-bb31-2c5b9982cb80)
-------------------------------------------------------
| Data Source            | Partition | Status | Error |
-------------------------------------------------------
| analytics_sources_mv   |    202310 | Done   |       |
| analytics_sessions_mv  |    202310 | Done   |       |
| analytics_events       |    202310 | Done   |       |
| analytics_pages_mv__v1 |    202310 | Done   |       |
| analytics_pages_mv     |    202310 | Done   |       |
-------------------------------------------------------

In this case the Environment is tmp_ci__3: log in to the Tinybird UI, select the tmp_ci__3 Environment from the Environments drop-down, go to the bq_pypi_data Data Source, and click Sync now in the actions menu.

Alternatively you can run these CLI commands from your terminal:

tb use env tmp_ci__3
tb datasource sync bq_pypi_data

Once the connection is validated you can just merge the Pull Request and the Data Source will be created in the main Environment.

To iterate an existing BigQuery Data Source, the recommended workflow is to create a new .datasource file with the updated BigQuery Data Source, as explained above.

For instance, in this second Pull Request we add bq_pypi_data_new.datasource, which adds a new column to the previous BigQuery Data Source, and use a copy Pipe to copy the old data into the new Data Source.

bq_pypi_data_new.datasource¶
DESCRIPTION >
 Downloads from pypi

SCHEMA >
   `timestamp` DateTime,
   `cli_version` LowCardinality(String),
   `python_version` LowCardinality(String),
   `country_code` Nullable(String)

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toDate(timestamp)"
ENGINE_SORTING_KEY "timestamp"

IMPORT_SERVICE bigquery
IMPORT_SCHEDULE 30 3 * * *
IMPORT_EXTERNAL_DATASOURCE bigquery-public-data.pypi.file_downloads
IMPORT_STRATEGY REPLACE
IMPORT_QUERY "SELECT `timestamp`, file.version as cli_version, details.python as python_version, `country_code` FROM `bigquery-public-data.pypi.file_downloads` WHERE DATE(timestamp) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY) AND project = 'tinybird-cli'"
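A rough sketch of what such a copy Pipe could look like (not the exact file from the Pull Request): it reads the legacy rows and fills the new country_code column with NULL, since those rows predate it. You would run it once to move the historical rows:

copy_bq_pypi_data.pipe (sketch)¶
NODE copy_old_data
SQL >
   SELECT
       timestamp,
       cli_version,
       python_version,
       CAST(NULL AS Nullable(String)) AS country_code
   FROM bq_pypi_data

TYPE COPY
TARGET_DATASOURCE bq_pypi_data_new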

Once you deploy this new Data Source to the main Environment and data has been synchronized, you can deploy this change to Production like in this Pull Request.

Dropping resources¶

When you delete a Datafile from the Git repository, the corresponding resource isn’t automatically deleted from the main Environment. Instead, you have to create a custom deployment and run the required CLI command to drop the desired resources.

For instance, in the BigQuery use case explained above, say that after you have created bq_pypi_data_new.datasource and it’s being used by the installations API endpoint, you want to remove the previous bq_pypi_data Data Source from Git and from the main Environment.

Run a custom deployment with tb release generate --semver <semver> and include the tb datasource rm bq_pypi_data --yes command in deploy/<semver>/ci-deploy.sh and deploy/<semver>/cd-deploy.sh, so the bq_pypi_data Data Source is removed both in CI and when the Pull Request is merged.
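As a sketch, assuming a hypothetical 0.0.2 release (the exact deploy flags may differ in your project), both scripts could contain:

deploy/0.0.2/ci-deploy.sh and cd-deploy.sh (sketch): drop the legacy Data Source¶
tb deploy --yes
tb datasource rm bq_pypi_data --yes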