Deployment strategies¶
Advanced
So, you’ve made your Workspace production-ready and started working with the Data Project following the Git workflow. You’ve configured your Continuous Integration (CI) pipeline and everything is green. Now, how do you bring those changes to your main Environment?
Deploying a Data Project can be complicated: you have to handle new, updated, or deleted resources, keep streaming ingestion, API requests, and data operations running, and be careful about the health and lifecycle of your Data Product.
In this guide you’ll learn about the default method for implementing Continuous Deployment (CD), how to bypass the default deployment strategy and create your own custom deployments, and finally which strategies to take into account when migrating data.
How deployment works¶
Before Workspaces could be integrated with Git, you had to carefully use the tb push command to deploy your local changes to your Workspace. Git integration enables a better workflow.
With the Git integration:
The Data Project is the real source of truth.
The remote Workspace saves a reference to the Git commit deployed.
This way we can make deployments a little bit smarter and easier to execute.
There are two steps in the CI/CD pipeline where you need to deploy changes to the remote Workspace or Environment:
With CI pipelines, a new Environment is created from the main one and deployment is done with tb deploy --populate --fixtures.
With CD pipelines, the default deployment strategy just runs tb deploy, which means Data Operations (such as populations) are left out.
The new tb deploy command is just a smarter version of tb push that does the following:
Checks the current commit in the Workspace and validates that it is an ancestor of the commit in the Pull Request being deployed. If not, you usually have to git rebase your branch.
Performs a git diff from the current branch to the main branch to get the list of Datafiles that changed.
Deploys them in order (in the case of CI, to the remote Environment), also deploying downstream dependent endpoints. For instance, if you change a Materialized View or create a new version, the Pipes and their API Endpoints depending on it are also deployed.
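Conceptually, the ancestor check and the diff are similar to the following git commands. This is a simplified sketch, not the exact implementation, and <deployed_commit> is a placeholder for the commit reference stored in the Workspace:
# check that the commit deployed in the Workspace is an ancestor of the branch being deployed
git merge-base --is-ancestor <deployed_commit> HEAD || echo "you need to git rebase your branch"
# list the Datafiles that changed between the branch and main
git diff --name-only origin/main...HEAD -- '*.datasource' '*.pipe'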
At this point, if you run a tb diff between the Git branch and the remote Environment, there should not be any changes.
When used with the --populate --fixtures flags, once resources have been deployed, it populates the Materialized Views if needed and appends data fixtures, so API Endpoints are ready to be tested. These flags are only recommended in CI pipelines and not in the main Environment.
The strategy to deploy to the main Environment in the CD pipeline is the same, with the caveat that the user deploying the branch to the main Environment is responsible for running the Data Operations.
Guide preparation¶
You can follow along using the ecommerce_data_project.
Download the project by running:
git clone https://github.com/tinybirdco/ecommerce_data_project
cd ecommerce_data_project
Then, create a new Workspace and authenticate using your user admin Auth Token. If you don’t know how to authenticate or use the CLI, check out the CLI Quick Start.
tb auth -i
** List of available regions:
[1] us-east (https://ui.us-east.tinybird.co)
[2] eu (https://ui.tinybird.co)
[0] Cancel
Use region [1]: 2
Copy the admin token from https://ui.tinybird.co/tokens and paste it here :
Finally, push the Data Project to Tinybird:
tb push --push-deps --fixtures
** Processing ./datasources/events.datasource
** Processing ./datasources/top_products_view.datasource
** Processing ./datasources/products.datasource
** Processing ./datasources/current_events.datasource
** Processing ./pipes/events_current_date_pipe.pipe
** Processing ./pipes/top_product_per_day.pipe
** Processing ./endpoints/top_products.pipe
** Processing ./endpoints/sales.pipe
** Processing ./endpoints/top_products_params.pipe
** Processing ./endpoints/top_products_agg.pipe
** Building dependencies
** Running products_join_by_id
** 'products_join_by_id' created
** Running current_events
** 'current_events' created
** Running events
** 'events' created
** Running products
** 'products' created
** Running top_products_view
** 'top_products_view' created
** Running products_join_by_id_pipe
** Materialized pipe 'products_join_by_id_pipe' using the Data Source 'products_join_by_id'
** 'products_join_by_id_pipe' created
** Running top_product_per_day
** Materialized pipe 'top_product_per_day' using the Data Source 'top_products_view'
** 'top_product_per_day' created
** Running events_current_date_pipe
** Materialized pipe 'events_current_date_pipe' using the Data Source 'current_events'
** 'events_current_date_pipe' created
** Running sales
** => Test endpoint at https://api.tinybird.co/v0/pipes/sales.json
** 'sales' created
** Running top_products_agg
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products_agg.json
** 'top_products_agg' created
** Running top_products_params
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products_params.json
** 'top_products_params' created
** Running top_products
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products.json
** 'top_products' created
** Pushing fixtures
** Warning: datasources/fixtures/products_join_by_id.ndjson file not found
** Warning: datasources/fixtures/current_events.ndjson file not found
** Checking ./datasources/events.datasource (appending 544.0 b)
** OK
** Checking ./datasources/products.datasource (appending 134.0 b)
** OK
** Warning: datasources/fixtures/top_products_view.ndjson file not found
Once you have the Data Project deployed to a Workspace, make sure you connect it to Git and push the CI/CD pipelines to the repository.
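For reference, connecting the Workspace to Git is done from the Data Project folder. A minimal sketch, assuming the interactive tb init --git flow is available in your CLI version:
# connect the Workspace to the Git repository and generate the CI/CD workflow templates
tb init --git

# commit the generated workflow files and push them to the repository
git add .
git commit -m "Add Tinybird CI/CD pipelines"
git push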
Custom deployments¶
Think of tb deploy as a helper that makes it possible to forget about deployments in the vast majority of cases.
Having said that, the complexity of data pipelines varies across Data Projects and certain changes in a branch are not that “simple” to deploy. For those cases, the owner of the Git branch being merged can perform a custom deployment.
When to do a custom deployment?
You need to have full control of the sequence of commands required to deploy changes to the ephemeral Environments or the main one.
The default tb deploy reports an error and is not capable of doing the default deployment.
You need to perform some Data Operations before or after resources have been deployed.
To do a custom deployment, follow these steps:
Edit the .tinyenv file at the root of your Data Project and increase the VERSION environment variable, following semver notation. Let’s say you bump it from 0.0.0 to 0.0.1.
Create these files in the Data Project folder: deploy/0.0.1/ci-deploy.sh and deploy/0.0.1/cd-deploy.sh, and make sure they have execution permissions: chmod +x -R deploy/0.0.1/
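For example, setting up the 0.0.1 custom deployment described in the steps above boils down to something like this:
# .tinyenv at the root of the Data Project
VERSION=0.0.1

# create the (empty) custom deployment scripts and make them executable
mkdir -p deploy/0.0.1
touch deploy/0.0.1/ci-deploy.sh deploy/0.0.1/cd-deploy.sh
chmod +x -R deploy/0.0.1/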
Performing the custom deployment is as simple as writing the CLI commands you would run in your terminal to deploy the changes to the Environment or Workspace.
The CI and CD pipelines will find the ci-deploy.sh and cd-deploy.sh files and run them in CI and CD respectively.
That way you have full control over the deployment commands. At the same time, you are contributing to the shared knowledge of your Data Project, since that custom deployment will be part of the Git repository.
Once the branch has been merged, on the next Pull Request remember to bump the VERSION in the .tinyenv file, so the custom deployment for the previous changes is not executed with the changes in the new branch.
Find below some examples of how and when to use custom deployments, especially when a data migration is required.
Data migration paths¶
There are several cases in which you have to migrate data from one Data Source to another. The complexity of the migration varies depending on several factors, mainly whether there’s streaming ingestion or not.
There are mainly three scenarios covered by the Iterating Data Sources guide:
I’m not in production
I’m in production but I can stop data ingestion
I’m in production and I cannot stop data ingestion
Let’s see how to cover some of the most common scenarios with custom deployments.
When working on custom deployments, you might find the staging and production Workspaces deployment pattern described in this guide useful.
Practical examples¶
Example 1: Overwrite a Data Source¶
By default, tb deploy does not overwrite Data Sources to avoid unintended deployments.
Certain changes to a Data Source that are not breaking changes are supported by the Tinybird APIs, for instance adding a new column. Let’s see an example.
Edit the events.datasource Datafile to add a new new_column String column like this:
DESCRIPTION >
this contains all the events produced by Kafka, there are 4 fixed columns
plus a `json` column which contains the rest of the data for that event.
See [documentation](url_for_docs) for the different events.
SCHEMA >
`timestamp` DateTime,
`product` String,
`user_id` String,
`action` String,
`json` String,
`new_column` String
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(timestamp)"
ENGINE_SORTING_KEY "timestamp"
Now commit the change to a new git branch and create a Pull Request like this one.
The CI pipeline fails in the deployment step with this error:
** Running events
** The description or schema of 'events' has changed.
** - ADD COLUMN `new_column` String
Error:
** Failed running ./datasources/events.datasource:
** Please confirm you want to apply the changes above y/N:
Error: Process completed with exit code 1.
As described in the Custom Deployments section, we need to provide the commands to run the deployment both in CI and CD. In this case, it is as simple as following the steps described in this git commit: you just need to use the --yes flag to overwrite the events.datasource.
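A minimal version of those custom deployment scripts could look like the following sketch, under the assumption that tb deploy accepts the --yes flag mentioned above (combined in CI with the usual --populate --fixtures flags):
# deploy/0.0.1/ci-deploy.sh
tb deploy --populate --fixtures --yes

# deploy/0.0.1/cd-deploy.sh
tb deploy --yes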
Example 2: Overwrite a Materialized View¶
You want to overwrite a Materialized View when you are not changing the resulting schema but just the query used to materialize the data.
Overwriting a Materialized Pipe is supported by the tb deploy command. If you don’t have to perform any further data migration, then that’s the way to go. For instance, let’s add a new filter to top_product_per_day.pipe:
VERSION 1
NODE only_buy_events
DESCRIPTION >
filters all the buy events
SQL >
SELECT
toDate(timestamp) date,
product,
JSONExtractFloat(json, 'price') as price,
action
FROM events
NODE top_per_day
SQL >
SELECT
date,
action,
topKState(10)(product) top_10,
sumState(price) total_sales
from only_buy_events
where date > now() - interval 30 day -- <- THIS IS THE CHANGE
group by date, action
TYPE materialized
DATASOURCE top_products_view
Take a look at the commit with the new filter. In this case the default tb deploy command will overwrite top_product_per_day.pipe, so new rows ingested in the events Data Source will be materialized only if they are less than 30 days old.
Now, depending on whether we want to perform some Data Operation or not, we could go with a custom deployment like this one. For instance, imagine you want to apply the new filter and get rid of data older than 30 days. You would first run tb deploy and then perform the delete operation.
Of course the commands required to perform the data operation might vary depending on the nature of the change in the Materialized View.
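As a rough sketch, a cd-deploy.sh for that case could first overwrite the Materialized Pipe and then run the delete. The delete below uses the Data Sources API delete endpoint; the host, the TB_ADMIN_TOKEN variable, and the delete condition are assumptions you should adapt to your case:
# deploy/0.0.1/cd-deploy.sh

# overwrite the Materialized Pipe with the new filter
tb deploy --yes

# delete rows older than 30 days from the Materialized View's Data Source
# TB_ADMIN_TOKEN is a placeholder for a Token with permissions to delete data
curl -X POST "https://api.tinybird.co/v0/datasources/top_products_view/delete" \
  -H "Authorization: Bearer $TB_ADMIN_TOKEN" \
  --data-urlencode "delete_condition=date < now() - INTERVAL 30 DAY"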
Example 3: Version a Materialized View with data migration¶
You want to version a Materialized View when there are some breaking changes that affect API Endpoints or when you made a change in a Materialized View that requires some complex data migration.
As an example, let’s modify the top_product_per_day.pipe Materialized View to aggregate by user_id.
Let’s start by versioning both top_product_per_day.pipe and top_products_view.datasource. The new top_product_per_day.pipe looks like this:
VERSION 2
NODE only_buy_events
DESCRIPTION >
filters all the buy events
SQL >
SELECT
toDate(timestamp) date,
product,
JSONExtractFloat(json, 'price') as price,
action,
user_id
FROM events
NODE top_per_day
SQL >
SELECT
date,
action,
topKState(10)(product) top_10,
sumState(price) total_sales,
user_id
from only_buy_events
group by date, action, user_id
TYPE materialized
DATASOURCE top_products_view
And the new top_products_view.datasource:
VERSION 2
SCHEMA >
`date` Date,
`action` String,
`user_id` String,
`top_10` AggregateFunction(topK(10), String),
`total_sales` AggregateFunction(sum, Float64)
ENGINE "AggregatingMergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, action, user_id"
When you commit those changes to a Git branch and create a Pull Request, there are two interesting things that happen in the CI pipeline.
First, two Datafiles are detected to have changed:
changed: top_product_per_day
changed: top_products_view
Also, the Pipes using those resources are pushed to the Environment, so they make use of the new version of the Materialized View and can be tested for regressions.
** Building dependencies
** Running top_products_view => v2 (remote latest version: v1)
** 'top_products_view__v2' created
** Running top_product_per_day => v2 (remote latest version: v1)
** Materialized pipe 'top_product_per_day__v2' using the Data Source 'top_products_view__v2'
** Populating job url ***/v0/jobs/d7d8b5aa-306f-4cfd-a9f8-fac0d2b8ea48
Populating
** 'top_product_per_day__v2' created
** Running top_products_params
** Token read_token found, adding permissions
** => Test endpoint with:
** $ curl ***/v0/pipes/top_products_params.json?token=p.eyJ1IjogIjViNjdmNjg4LWZmYjktNDk2Mi1hNTczLTAwNjM5MTYxNDlmYiIsICJpZCI6ICIwYzNiMWU3Zi03NWFiLTQ4OTUtODBjOC1lMDEyOTA2NmJhNWYiLCAiaG9zdCI6ICJldV9zaGFyZWQifQ.nXG6hCJVo9fJOaTjM0cn5VttWNakBnxtmjEAypTO0ik
** 'top_products_params' created
** Running top_products_agg
** Token read_token found, adding permissions
** => Test endpoint with:
** $ curl ***/v0/pipes/top_products_agg.json?token=p.eyJ1IjogIjViNjdmNjg4LWZmYjktNDk2Mi1hNTczLTAwNjM5MTYxNDlmYiIsICJpZCI6ICIwYzNiMWU3Zi03NWFiLTQ4OTUtODBjOC1lMDEyOTA2NmJhNWYiLCAiaG9zdCI6ICJldV9zaGFyZWQifQ.nXG6hCJVo9fJOaTjM0cn5VttWNakBnxtmjEAypTO0ik
** 'top_products_agg' created
** Running top_products
** Token read_token found, adding permissions
** => Test endpoint with:
** $ curl ***/v0/pipes/top_products.json?token=p.eyJ1IjogIjViNjdmNjg4LWZmYjktNDk2Mi1hNTczLTAwNjM5MTYxNDlmYiIsICJpZCI6ICIwYzNiMWU3Zi03NWFiLTQ4OTUtODBjOC1lMDEyOTA2NmJhNWYiLCAiaG9zdCI6ICJldV9zaGFyZWQifQ.nXG6hCJVo9fJOaTjM0cn5VttWNakBnxtmjEAypTO0ik
** 'top_products' created
New release deployed: '78882650bbaefda891a7d41a2197a56d9dfddb79'
After that, the CI pipeline complains about regression tests failing:
==== Failures Detail ====
❌ top_products(coverage) - ***/v0/pipes/top_products.json?date_start=2020-04-24&date_end=2020-04-25&q=SELECT+%0A++date%2C%0A++count%28%29+total%0AFROM+top_products%0AGROUP+BY+date%0AHAVING+total+%3C+0%0A&cli_version=1.0.0b410+%28rev+145e3d7%29&pipe_checker=true
** 32.0 not less than 25 : Processed bytes has increased 32.0%
💡 Hint: Use `--assert-bytes-read-increase-percentage -1` if it's expected and want to skip the assert.
That’s good, since you are changing the aggregation of the Materialized View used by the top_products endpoint and regression testing warns you about the API Endpoint processing more data than the previous version.
At this point you have two options:
Increase the VERSION number in the related Pipes, so regression testing does not run over them. Then run a custom deployment to deploy just the changed files and not the related Pipes.
Ignore the regression with an --assert-bytes-read-increase-percentage -1 label, as suggested in the 💡 Hint above.
For this example, let’s ignore the regression since the API Endpoint interface did not change and there’s no need to create a new VERSION.
Once CI is green, we need to think how to bring these changes to the main Environment where data is being ingested and API endpoints are receiving requests. A typical approach is as follows:
Deploy the versioned resources first, in this case top_product_per_day and top_products_view. Once deployed, they are connected to the ingestion but disconnected from the API Endpoints, which is the scenario we want to achieve.
Once the Materialized View is deployed, it automatically starts materializing the data being ingested. Before connecting it to the API Endpoints, data needs to be backfilled.
Optionally, once data is backfilled, you may want to perform some data quality check between the current and previous versions.
Finally, you need to deploy the rest of the API Endpoints depending on the changed resources, so they start using the new versions.
Let’s go through this custom deployment using the .tinyenv file and custom script described above. Bump the VERSION in .tinyenv to 0.0.1 and create deploy/0.0.1/cd-deploy.sh:
# deploy the versioned resources alone
tb push datasources/top_products_view.datasource
BACKFILL_TIME=$(date +"%Y-%m-%d %H:%M:%S")
tb push pipes/top_product_per_day.pipe
# backfill old data with a populate
tb pipe populate top_product_per_day__v2 --node top_per_day --sql-condition "timestamp < '$BACKFILL_TIME'" --wait
# do the data quality check, checking that a sum in top_products_view__v1 and top_products_view__v2 return the same value
diff=$(tb --no-version-warning sql "with (select sumMerge(total_sales) from top_products_view__v2) as new, (select sumMerge(total_sales) from top_products_view__v1) as old select old - new as diff" --format json | python -c "import sys, json; print(json.load(sys.stdin)['data'][0]['diff'])")
echo "Diff: $diff"
# abort the deployment if the quality check fails; otherwise continue and deploy the API Endpoints
# (numeric comparison via awk so a float result such as 0.0 is handled)
if awk "BEGIN { exit !($diff == 0) }"; then
echo "Data quality check passed: both versions return the same value."
else
echo "Data quality check failed (diff: $diff), aborting deployment."
exit 1
fi
# deploy the depending API endpoints
tb deploy
To test this custom deployment, you can go through a staging-production Workspace setup or test it manually in a dedicated Environment for that purpose.
Once deployment is validated, you can just merge the Pull Request and the script will run in the main Environment.
What should you do in case of a failure? Since we are versioning resources, you can “roll back” the deployment by removing the newly created resources top_products_view__v2 and top_product_per_day__v2.
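A hedged sketch of that rollback, assuming the tb pipe rm and tb datasource rm commands are available with a --yes flag to skip confirmation:
# remove the newly created versioned resources; the Pipe goes first since it materializes into the Data Source
tb pipe rm top_product_per_day__v2 --yes
tb datasource rm top_products_view__v2 --yes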
What to do in other cases?¶
Please read the Iterating Data Sources guide for other common use cases and scenarios, or reach out to us at support@tinybird.co and we’ll help you find the best deployment path for your use case.
What’s coming next¶
Reading the above guide, you may have realized that to deploy changes to your main Environment, especially those that involve data migrations, you have to:
Use Versioning when there are breaking changes.
Carefully craft your Data Sources iterations especially when there’s streaming ingestion.
Perform a series of controlled steps which become part of the “tribal knowledge” of your Data Project.
This guide does not cover all possible deployment cases yet; reach out to us at support@tinybird.co if you need help running a custom deployment.
We are working on a better way to deploy changes to your Data Projects that will enable preview and rollback releases. These abilities will make it easier to control the life cycle of your Data Products and provide a clear path to iterate any resource. Stay tuned for more info about this.