Calculating data on ingestion with Materialized Views


There are times when you want to calculate new columns from other columns of a Data Source. Doing it on query time can be expensive. In this guide, you’ll learn how to do it with Materialized Views, so that expensive calculations are done at ingestion time and then querying your data is as fast as it gets.

Let’s see an example.

Calculating lots of data at query time can be slow

Image you have a Data Source with this schema:

Then, we have a Pipe where we do some transformations to the original data. For example, here we extract a couple of fields from a JSON column in the events Data Source and extract the Date from a DateTime field:

This operation alone takes almost 1.5 seconds. If this is done at query time, it’ll add that time whenever the endpoint is called and that’s not acceptable if this endpoint is going to be called many times or we need a low latency - which we’ll assume is the case for this guide.

To make our queries as fast as possible, we should avoid doing expensive calculations at query time. A better option is to have those fields pre-calculated and store them along with the rest of the data in disk. Our data will take up more space in disk, but if we’re building a real-time application that needs to return data with low latencies, this is a necessary trade-off to be made. And this is exactly what materialized views allow us to do.

The solution

We’ll create a Materialized View that contains the new columns. The values of these new columns will be calculated only once, on ingestion time. The way to create Materialized Views (MVs from now on) is:

  1. Create a destination Data Source whose schema contains the new columns. When we refer to querying the MV we refer to querying this destination Data Source.

  2. Create a pipe that reads from the original Data Source, calculates the new columns and writes the results to the destination Data Source

  3. Push the pipe and populate the destination Data Source.

Creating a new destination Data Source

This would be the schema for it:

Create a pipe that writes to it

There are two things you have to consider here:

  1. The names and types of the columns from the pipe have to match those of the destination Data Source. If not, the CLI will output an error.

  2. Also, if some of the columns in the pipe are nullable, the types of the matching columns in the destination Data Source have to also be nullable. In general, the performance will be better if you don’t use Nullable columns (replacing null values by 0 or by an empty string, for example).

Knowing this, define the pipe that will read data from events, calculate some extra columns and write the result to events_mat, like this:

Push it to Tinybird

The last step is pushing the data files you’ve created to Tinybird, so that the MV is created properly. Do so by running tb push like this:

The --push-deps flag pushes all the dependencies for the file you’re pushing, so that you don’t have to run tb push twice. This is especially helpful for big projects with lots of dependencies.

The --populate flag makes sure that the MV is populated with all the data available, passing all the data from the origin Data Source through the pipe.

Check the results

Now, you can select data from the MV you’ve created:

MVs allow for as many transformations as you can do in a query. See the built-in functions for a complete list of them, but the most common ones are:

  • Casting to a different data type (toFloat, toInt, toDate, …)

  • Concat several fields

  • Substring a field

  • Parsing a JSON object, a date or a wrongly formatted column

  • Indexing some data using a geospatial function or other type of hash

See the API reference to learn how to create a Data Source with schema from the Data Sources API.

Alternatively, check the CLI documentation to learn how to define a Data Source schema and push it to your Tinybird account.