Implementing test strategies

Intermediate

In the versioning your Pipes guide, you learned how to use versions as part of the usual development workflow for your API Endpoints.

In this guide you’ll learn about different strategies for testing your data project.

Guide preparation

You can follow along using the ecommerce_data_project.

Download the project by running:

Git clone the project
git clone https://github.com/tinybirdco/ecommerce_data_project
cd ecommerce_data_project

Then, create a new workspace and authenticate using your Auth Token. If you don’t know how to authenticate or use the CLI, check out the CLI Quick Start.

Authenticating to EU
tb auth -i

** List of available regions:
   [1] us-east (https://ui.us-east.tinybird.co)
   [2] eu (https://ui.tinybird.co)
   [0] Cancel

Use region [1]: 2

Copy the admin token from https://ui.tinybird.co/tokens and paste it here :

Finally, push the data project to Tinybird:

Recreating the project
tb push --push-deps --fixtures

** Processing ./datasources/events.datasource
** Processing ./datasources/top_products_view.datasource
** Processing ./datasources/products.datasource
** Processing ./datasources/current_events.datasource
** Processing ./pipes/events_current_date_pipe.pipe
** Processing ./pipes/top_product_per_day.pipe
** Processing ./endpoints/top_products.pipe
** Processing ./endpoints/sales.pipe
** Processing ./endpoints/top_products_params.pipe
** Processing ./endpoints/top_products_agg.pipe
** Building dependencies
** Running products_join_by_id
** 'products_join_by_id' created
** Running current_events
** 'current_events' created
** Running events
** 'events' created
** Running products
** 'products' created
** Running top_products_view
** 'top_products_view' created
** Running products_join_by_id_pipe
** Materialized pipe 'products_join_by_id_pipe' using the Data Source 'products_join_by_id'
** 'products_join_by_id_pipe' created
** Running top_product_per_day
** Materialized pipe 'top_product_per_day' using the Data Source 'top_products_view'
** 'top_product_per_day' created
** Running events_current_date_pipe
** Materialized pipe 'events_current_date_pipe' using the Data Source 'current_events'
** 'events_current_date_pipe' created
** Running sales
** => Test endpoint at https://api.tinybird.co/v0/pipes/sales.json
** 'sales' created
** Running top_products_agg
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products_agg.json
** 'top_products_agg' created
** Running top_products_params
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products_params.json
** 'top_products_params' created
** Running top_products
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products.json
** 'top_products' created
** Pushing fixtures
** Warning: datasources/fixtures/products_join_by_id.ndjson file not found
** Warning: datasources/fixtures/current_events.ndjson file not found
** Checking ./datasources/events.datasource (appending 544.0 b)
**  OK
** Checking ./datasources/products.datasource (appending 134.0 b)
**  OK
** Warning: datasources/fixtures/top_products_view.ndjson file not found

Regression tests

When one of your API Endpoints is integrated into a production environment (a web or mobile application, a dashboard, etc.), you want to make sure that changes to the Pipe don’t change the output of the endpoint.

In other words, you want the same version of an API Endpoint to return the same data for the same requests.

The CLI provides you with automatic regression tests any time you try to push the same version of a Pipe. Let’s see it with an example:

Imagine you have this version of the top_products Pipe:

Definition of the top_products.pipe file
NODE endpoint
DESCRIPTION >
   returns top 10 products for the last week
SQL >
   select
      date,
      topKMerge(10)(top_10) as top_10
   from top_product_per_day
   where date > today() - interval 7 day
   group by date

And you want to parameterize the date filter to this:

Adding a new day parameter to the top_products pipe definition
NODE endpoint
DESCRIPTION >
   returns top 10 products for the last week
SQL >
   %
   select
      date,
      topKMerge(10)(top_10) as top_10
   from top_product_per_day
   where date > today() - interval {{Int(day, 7)}} day
   group by date

The new day parameter has a default value of 7, so by default the behaviour of the endpoint stays the same.

To illustrate the example, send a couple of requests to the API Endpoint, so there is recent traffic for the regression tests to replay:

Doing a request to the endpoint
curl https://api.tinybird.co/v0/pipes/top_products.json?token={TOKEN}

Now, try to override the endpoint:

Overriding the api endpoint
tb push endpoints/top_products.pipe --force

** Processing endpoints/top_products.pipe
** Building dependencies
** Running top_products
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products__checker.json
current https://api.tinybird.co/v0/pipes/top_products.json?&pipe_checker=true
   new https://api.tinybird.co/v0/pipes/top_products__checker.json?&pipe_checker=true ... ok

==== Test Metrics ====

------------------------------------------------------------------------
| Test Run | Test Passed | Test Failed | % Test Passed | % Test Failed |
------------------------------------------------------------------------
|        1 |           1 |           0 |         100.0 |           0.0 |
------------------------------------------------------------------------

==== Response Time Metrics ====

----------------------------------------------
| Timing Metric (s)    | Current  | New      |
----------------------------------------------
| min response time    | 0.255429 | 0.254966 |
| max response time    | 0.255429 | 0.254966 |
| mean response time   | 0.255429 | 0.254966 |
| median response time | 0.255429 | 0.254966 |
| p90 response time    | 0.255429 | 0.254966 |
| min read bytes       | 4.11 KB  | 4.11 KB  |
| max read bytes       | 4.11 KB  | 4.11 KB  |
| mean read bytes      | 4.11 KB  | 4.11 KB  |
| median read bytes    | 4.11 KB  | 4.11 KB  |
| p90 read bytes       | 4.11 KB  | 4.11 KB  |
----------------------------------------------
** 'top_products' created
** Not pushing fixtures
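
Once the push succeeds, the new day parameter is available on the live API Endpoint. As a quick check (a sketch: {TOKEN} is the same placeholder as before, and day is the parameter added in the Pipe definition above), you can request a wider window:

Requesting the endpoint with the new day parameter
curl "https://api.tinybird.co/v0/pipes/top_products.json?token={TOKEN}&day=30"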

The CLI tests all combinations of parameters by running at least one request for each combination and comparing the results of the new and old versions of the Pipe. The regression test also displays statistics for the new Pipe versus the old one, so you can detect whether the new endpoint improves or degrades performance. If you only want to validate requests that contain one specific parameter, you can filter them using --match <PARAMETER_NAME>.
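
For example, to replay only the requests that include the day parameter, you could run the forced push with that filter. This is a sketch based on the option mentioned above; depending on your CLI version the flag may belong to tb push or to tb pipe regression-test, so check --help for the exact syntax:

Filtering the regression requests by the day parameter
tb push endpoints/top_products.pipe --force --match day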

As a test, change the default date range to the last day:

Changing the default date range for the top_products.pipe api endpoint
NODE endpoint
DESCRIPTION >
   returns top 10 products for the last week
SQL >
   %
   select
      date,
      topKMerge(10)(top_10) as top_10
   from top_product_per_day
   where date > today() - interval {{Int(day, 1)}} day
   group by date

And try to override it:

Overriding the api endpoint
tb push endpoints/top_products.pipe --force

** Processing endpoints/top_products.pipe
** Building dependencies
** Running top_products
** => Test endpoint at https://api.tinybird.co/v0/pipes/top_products__checker.json
current https://api.tinybird.co/v0/pipes/top_products.json?&pipe_checker=true
   new https://api.tinybird.co/v0/pipes/top_products__checker.json?&pipe_checker=true ... FAIL
==== Test FAILED ====

current https://api.tinybird.co/v0/pipes/top_products.json?&pipe_checker=true
   new https://api.tinybird.co/v0/pipes/top_products__checker.json?&pipe_checker=true

** check error: 1 != 0 : Number of elements does not match

=====================

Error:
** Failed running endpoints/top_products.pipe: Invalid results, you can bypass checks by running push with the --no-check flag

Since the default period changed, the response changed for the default request, so the Pipe is not overridden. The CLI has prevented a possible regression.

If you are sure the new response is correct, and don’t consider this change a regression, you can force the change through like this:

Forcing the override of the top_products.pipe api endpoint
tb push endpoints/top_products.pipe --force --no-check

** Processing endpoints/top_products.pipe
** Building dependencies
** Running top_products
** 'top_products' created
** Not pushing fixtures

In this case, the regression tests won’t be executed. Of course, you do this at your own risk!

How the regression tests work

When you run tb pipe regression-test to check the changes in a Pipe against the existing one, or you run tb push endpoints/ -f to override the existing one, the CLI runs regression tests to validate that you are not breaking backwards compatibility without realizing it.

The regression test functionality is powered by tinybird.pipe_stats_rt, one of the Service Data Sources available to you out of the box. You can find more information about these Service Data Sources here.

In this case, a query is run against tinybird.pipe_stats_rt to gather all the combinations of parameters used in an API Endpoint. This way, every possible combination is validated at least once.

Query to gather all the possible combinations of queries done in the last 7 days for one endpoint
SELECT
   ## This function extracts all the parameters used in each request
   extractURLParameterNames(assumeNotNull(url)) as params,
   ## According to the `--sample-by-params` option, one or more requests are run for each combination of parameters
   groupArraySample({sample_by_params if sample_by_params > 0 else 1})(url) as endpoint_url
FROM tinybird.pipe_stats_rt
WHERE
   pipe_name = '{pipe_name}'
   ## According to the `--match` option, only the requests that contain that parameter are used
   ## This is especially useful when you want to validate a new parameter or you have optimized the endpoint for that specific case
   { " AND " + " AND ".join([f"has(params, '{match}')" for match in matches])  if matches and len(matches) > 0 else ''}
GROUP BY params
FORMAT JSON
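
You can run a similar query yourself to see which parameter combinations your API Endpoint actually receives. The sketch below assumes the tb sql CLI command is available in your workspace and reuses the same Service Data Source and functions as the query above:

Listing the parameter combinations requested for the top_products endpoint
tb sql "SELECT extractURLParameterNames(assumeNotNull(url)) AS params, count() AS requests FROM tinybird.pipe_stats_rt WHERE pipe_name = 'top_products' GROUP BY params"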

If you have an endpoint with millions of requests per day, the CLI can fall back to a limited list of requests:

Fallback query to gather a limited list of the requests done in the last 7 days for one endpoint
WITH
   ## This function extracts all the parameters used in each request
   extractURLParameterNames(assumeNotNull(url)) as params
SELECT url
FROM tinybird.pipe_stats_rt
WHERE
   pipe_name = '{pipe_name}'
   ## According to the `--match` option, only the requests that contain that parameter are used
   ## This is especially useful when you want to validate a new parameter or you have optimized the endpoint for that specific case
   { " AND " + " AND ".join([f"has(params, '{match}')" for match in matches])  if matches and len(matches) > 0 else ''}

## According to the `--limit` option (100 by default)
LIMIT {limit}
FORMAT JSON
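
These options map onto the regression test command itself. As a hedged example (the option names come from the comments above, but the exact argument syntax may vary by CLI version, so check tb pipe regression-test --help), you could sample several requests per parameter combination like this:

Running regression tests with a larger sample per combination of parameters
tb pipe regression-test endpoints/top_products.pipe --sample-by-params 5 --limit 100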

Data quality tests

Data quality tests are meant to cover scenarios that shouldn’t happen in your production data. For example, you can check that the data is not empty, or that it is not duplicated.

Data quality tests are run with the tb test command. Include as many YAML files as you need in the tests directory of your data project.

For instance, given the ecommerce_data_project, let’s say we want to validate that:

  • There are no negative sales.

  • There are products sold every day.

We’ll create a tests/default.yaml file, as in this Pull Request:

- no_negative_sales:
   max_bytes_read: null
   max_time: null
   sql: |
      SELECT
         date,
         sumMerge(total_sales) total_sales
      FROM top_products_view
      GROUP by date
      HAVING total_sales < 0
- products_by_date:
   max_bytes_read: null
   max_time: null
   sql: |
      SELECT count(), date
      FROM top_products
      GROUP BY date
      HAVING count() < 0

Then run the tests with tb test run -v:

tb test run -v

----------------------------------------------------------------------
| file                 | test              | status | elapsed        |
----------------------------------------------------------------------
| ./tests/default.yaml | no_negative_sales | Pass   | 0.001300466 ms |
| ./tests/default.yaml | products_by_date  | Pass   | 0.000197256 ms |
----------------------------------------------------------------------

Configure the Continuous Integration tests

Contact us at support@tinybird.co if you need help configuring Continuous Integration for your data project.

Add this step to your CI workflow:

.github/workflows/ci.yml
## Execute the exec_test script, which compares the expected results with the actual results
- name: Running tests
  run: ./scripts/exec_test.sh
  env:
    TB_TOKEN: ${{ env.ADMIN_TOKEN }}

See a working example in this repository.

The GitHub Action will run a set of tests, each configured with two files.

Let’s see an example for the top_products API Endpoint with the date_start and date_end parameters.

The top_products.test file is as follows:

Consuming the top_products api endpoint filtering by one specific day
tb --token $TB_TOKEN pipe data top_products --date_start 2020-04-24 --date_end 2020-04-24 --format CSV

It calls the top_products Pipe, filtering by one specific day, and returns the data in CSV format.

The top_products.test.result file contains the expected result for the previous API Endpoint request:

Results of the api endpoint request
"date","top_10"
"2020-04-24","['sku_0001','sku_0002','sku_0003','sku_0004']"

With this approach, the tests for your data project are integrated into your development process. Any time you create a new Environment, all you have to do, besides making the proper changes to your .datasource and .pipe files, is update your test files accordingly.

GitHub Actions running