Common use cases iterating Data Projects¶
Before you start iterating and making critical changes to your Data Sources, Materialized Views, and Pipes, it’s crucial to read our guidelines on How to Work with Data Projects.
Add column to a Landing Data Source¶
Adding a column to a Landing Data Source is a straightforward and commonly executed task in any Data Project. It allows for the consumption of new data, but it’s important to be aware that the moment in which columns are added is significant.
A Landing Data Source is the initial data repository that collects information from multiple channels enabled by Tinybird for data ingestion. It serves as the first point of entry for new data, where it is gathered before undergoing any subsequent processing or transformation.
Before you start pushing data for the new property or column to Tinybird, make sure to first add this column to your Data Source. Otherwise, you may encounter issues. For example, when using Events API, the data in the new property won’t be stored but the sending data requests won’t fail.
To begin, add the new column to the file containing your Data Source definition. There’s no need to update the version or clone the Data Source; simply incorporate the changes directly into the Data Source file:
The next step involves executing a Custom Deployment using the following command.
tb deploy --yes
--yes option confirms that you want to alter the Data Source.
For more information on how to make Custom Deployments, click here.
All the changes implemented to add the new column: PR on GitHub
Add column to a Materialized View¶
Why and How the Data Migration Mechanism Works
When dealing with real-time or near-real-time data streams, stopping the flow of data for system modifications—like adding a new column—is often not an option.
To circumvent this limitation, you should employ a multi-step data migration mechanism:
Sync data streams: Create a new Data Source that includes the additional column, and synchronize the data streams between both the original and the new Data Sources using a Materialized View. This operation will run in parallel with the existing data ingestion process for the original Data Source, ensuring an uninterrupted flow of data.
Time Filtering: Use time-based filters to avoid data duplication and to coordinate data migration. The new data is directed to the target Data Source after a specified filter time. At that moment, we’ll backfill the target Data Source as described in the next step.
Backfill: Once the filter time is reached, execute a data population operation from the original data source to the new one. Applying the time filter correctly will prevent duplication.
This ensures that the existing system remains operational during the migration and the data is consistent before, during, and after the operation.
Example Step by Step
Imagine you want to add a new column to a Materialized View, in this example to add the column
environment to the Data Source
Step 1: Duplicate the Materialized View
Create a duplicate of the original Data Source adding the new desired column.
Pull Request: PR #1 on GitHub
Filter: Use a filter to avoid duplicate records with the original Data Source. A 15-minute window is typically sufficient, although this may vary.
WHERE timestamp > '2023-11-07 12:30:00'
Once the filter date is reached new rows start to be ingested in the target Data Source.
Step 2: Backfilling
Wait until the filter date is reached.
Backfill data from the original to the target Data Source. Remember to backfill only the data previous to the filter date.
WHERE timestamp <= '2023-11-07 12:30:00'
Make a custom deployment with the following commands to populate the target Data Source using a new Materialized Pipe.
tb deploy --populate --fixtures --wait
tb deploy --populate --wait
Add a quality test to check that the backfilling is working as expected before going to production. We can compare the difference of number of hits in the new and legacy Data Sources (it shouldn’t be any difference):
WITH (SELECT countMerge(hits) FROM analytics_sessions_mv) AS legacy_hits, (SELECT countMerge(hits) FROM analytics_sessions) AS new_hits SELECT legacy_hits - new_hits AS diff WHERE diff != 0 -- Quality tests expect that no rows are returned.
It’s highly recommended to add automated testing to your Data Project. As outlined in the article Implementing Test Strategies, Versions offers a comprehensive suite of tools designed to automate various types of testing within our CI/CD pipelines. These include regression tests to identify any new bugs or performance degradation, data quality tests to detect cases that should not exist in production, as well as fixture tests to validate specific scenarios with a fine-grained level of control.
Pull Request: PR #2 on GitHub
Step 3: Update Dependencies
Finally, following your requirements, update all dependent elements, such as API endpoints, to use the new Materialized View with the added column.
Create a new git branch and a pull request with the changes.
Modify the dependencies to point to the new Data Source.
Validate that the transition is seamless in the temporal environment created within the Continuous Integration (CI) workflow.
Once confirmed, merge the changes to deploy the changes in the main environment.
You can see all the sources of this example with the pull requests and detailed step by step documentation:
Recover data from quarantine¶
BigQuery Data Sources¶
If it’s the first time you create a BigQuery connector you need grant access to the Tinybird principal in the
IAM & Admin console in GCP. You can follow along this guide in your main Environment to get the principal and grant access.
Alternatively you can run this command in your main Environment:
tb connection create bigquery ** Using CLI in development mode ** Log into your Google Cloud Platform Console as a project editor and go to https://console.cloud.google.com/iam-admin/iam ** Grant access to this principal: <principal_name> ** Assign it the role "BigQuery Data Viewer" Ready? (y, N) [N]: y ** Connection bigquery created successfully!
To create a new BigQuery Data Source follow this Pull Request example.
DESCRIPTION > Downloads from pypi SCHEMA > `timestamp` DateTime, `cli_version` LowCardinality(String), `python_version` LowCardinality(String) ENGINE "MergeTree" ENGINE_PARTITION_KEY "toDate(timestamp)" ENGINE_SORTING_KEY "timestamp" IMPORT_SERVICE bigquery IMPORT_SCHEDULE 30 3 * * * IMPORT_EXTERNAL_DATASOURCE bigquery-public-data.pypi.file_downloads IMPORT_STRATEGY REPLACE IMPORT_QUERY "SELECT `timestamp`, file.version as cli_version, details.python as python_version FROM `bigquery-public-data.pypi.file_downloads` WHERE DATE(timestamp) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY) AND project = 'tinybird-cli'"
You can create a new Environment and the BigQuery Data Source from the UI and then
tb pull to sync to your local Data Project and commit to your Git repository.
Once the Datafile is pushed to a Pull Request, check the Continuous Integration job is passing. If you created the
.datasource Datafile directly in Git, you can check the BigQuery connection by getting the CI Environment from the
Create new test Environment with data in the CI pipeline:
Run tb \ ** Environment 'tmp_ci__3' from 'st_wask_versions_v2' has been created ** Data Branch job url ***/v0/jobs/d58c8038-5207-4ae8-b856-be5bf5ee4b86 Data Branching ** Now using tmp_ci__3 (8f40b30a-e28e-4ce5-bb31-2c5b9982cb80) ------------------------------------------------------- | Data Source | Partition | Status | Error | ------------------------------------------------------- | analytics_sources_mv | 202310 | Done | | | analytics_sessions_mv | 202310 | Done | | | analytics_events | 202310 | Done | | | analytics_pages_mv__v1 | 202310 | Done | | | analytics_pages_mv | 202310 | Done | | -------------------------------------------------------
In this case
tmp_ci__3, log in to the Tinybird UI, select the
tmp_ci__3 Environmnent from the environments drop down, go to the
by_pypi Data Source and click on
Sync now in the actions menu.
Alternatively you can run these CLI commands from your terminal:
tb use env tmp_ci__3 tb datasource sync by_pypi
Once the connection is validated you can just merge the Pull Request and the Data Source will be created in the main Environment.
To iterate an existing BigQuery Data Source, the recommended workflow is create a new
.datasource with the new BigQuery Data Source as explained above.
For instance, in this second Pull Request we are adding the
bq_pypi_data_new.datasource which adds a new column to the previous BigQuery Data Source and, using a copy Pipe, we copy the old data into the new Data Source.
DESCRIPTION > Downloads from pypi SCHEMA > `timestamp` DateTime, `cli_version` LowCardinality(String), `python_version` LowCardinality(String), `country_code` Nullable(String) ENGINE "MergeTree" ENGINE_PARTITION_KEY "toDate(timestamp)" ENGINE_SORTING_KEY "timestamp" IMPORT_SERVICE bigquery IMPORT_SCHEDULE 30 3 * * * IMPORT_EXTERNAL_DATASOURCE bigquery-public-data.pypi.file_downloads IMPORT_STRATEGY REPLACE IMPORT_QUERY "SELECT `timestamp`, file.version as cli_version, details.python as python_version, `country_code` FROM `bigquery-public-data.pypi.file_downloads` WHERE DATE(timestamp) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY) AND project = 'tinybird-cli'"
Once you deploy this new Data Source to the main Environment and data has been synchronized, you can deploy this change to Production like in this Pull Request.
When you drop a Datafile from the Git repository it’s not automatically deleted from the main Environment. Instead you have to create a custom deployment and run the required CLI command to drop the desired resources.
For instance, let’s say in the BigQuery use case explained above after you have created
bq_pypi_new.datasource and it’s being used by the
installations API endpoint, you want to remove the previous
bq_pypi Data Source from Git and from the main Environment.
You need to run a custom deployment by running
tb release generate --semver <semver> and include the
tb datasource rm bq_pypi --yes command in the
deploy/<semver>/cd-deploy.sh so the
bq_pypi Data Source is removed in CI and when merging the Pull Request.