Calculating data on ingestion with Materialized Views
Let’s see an example.
Calculating lots of data at query time can be slow
Imagine you have a Data Source with this schema:
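The exact columns depend on your data. As a minimal sketch, assume an events.datasource file like this one, where extra_data holds a raw JSON payload and the other column names are only illustrative:

```
SCHEMA >
    `timestamp` DateTime,
    `product` String,
    `user_id` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"
```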
Then, we have a Pipe where we do some transformations to the original data. For example, here we extract a couple of fields from a JSON column in the events Data Source and extract the Date from a DateTime field:
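A transformation node along these lines would do it. toDate, JSONExtractString and JSONExtractFloat are standard ClickHouse functions; the city and price keys inside extra_data are assumptions for this sketch:

```
SELECT
    toDate(timestamp) AS date,                                 -- Date extracted from the DateTime field
    product,
    user_id,
    JSONExtractString(extra_data, 'city') AS city,             -- field pulled out of the JSON column
    toFloat32(JSONExtractFloat(extra_data, 'price')) AS price  -- another JSON field, cast to Float32
FROM events
```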
This operation alone takes almost 1.5 seconds. If it’s done at query time, that time is added on every call to the endpoint, which isn’t acceptable if the endpoint is going to be called frequently or has to respond with low latency. We’ll assume that’s the case for this guide.
To make our queries as fast as possible, we should avoid doing expensive calculations at query time. A better option is to pre-calculate those fields and store them on disk along with the rest of the data. The data will take up more space on disk, but if we’re building a real-time application that needs to return data with low latency, that’s a necessary trade-off. This is exactly what Materialized Views allow us to do.
The solution
We’ll create a Materialized View that contains the new columns. The values of those columns are calculated only once, at ingestion time. The way to create Materialized Views (MVs from now on) is:
Create a destination Data Source whose schema contains the new columns. When we refer to querying the MV we refer to querying this destination Data Source.
Create a pipe that reads from the original Data Source, calculates the new columns and writes the results to the destination Data Source.
Push the pipe and populate the destination Data Source.
Creating a new destination Data Source
This would be the schema for it:
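Keeping the columns assumed above, an events_mat.datasource file could look like this; the sorting key is just one reasonable choice:

```
SCHEMA >
    `date` Date,
    `product` String,
    `user_id` String,
    `city` String,
    `price` Float32

ENGINE "MergeTree"
ENGINE_SORTING_KEY "date, user_id"
```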
Create a pipe that writes to it
There are two things you have to consider here:
The names and types of the columns from the pipe have to match those of the destination Data Source. If not, the CLI will output an error.
Also, if some of the columns in the pipe are nullable, the matching columns in the destination Data Source also have to be Nullable. In general, performance will be better if you don’t use Nullable columns (replacing null values with 0 or with an empty string, for example).
Knowing this, define the pipe that will read data from events, calculate some extra columns and write the result to events_mat, like this:
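Here is a sketch of such a pipe file (named events_to_mat.pipe here as an assumption), reusing the columns from the earlier sketches and assuming the datafile syntax for materialized nodes, where TYPE materialized and DATASOURCE point the node’s output at events_mat. The explicit toFloat32 cast keeps price identical to the type declared in the destination schema:

```
NODE materialize_events
SQL >
    SELECT
        toDate(timestamp) AS date,
        product,
        user_id,
        JSONExtractString(extra_data, 'city') AS city,
        toFloat32(JSONExtractFloat(extra_data, 'price')) AS price
    FROM events

TYPE materialized
DATASOURCE events_mat
```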
Push it to Tinybird
The last step is pushing the data files you’ve created to Tinybird, so that the MV is created properly. Do so by running tb push like this:
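For example, assuming the pipe file lives at pipes/events_to_mat.pipe:

```
tb push pipes/events_to_mat.pipe --push-deps --populate
```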
The --push-deps flag pushes all the dependencies of the file you’re pushing, so that you don’t have to run tb push twice. This is especially helpful for big projects with lots of dependencies.
The --populate flag makes sure that the MV is populated with all the data available, passing all the existing data from the origin Data Source through the pipe.
Check the results
Now, you can select data from the MV you’ve created:
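For instance, a quick sanity check against the destination Data Source, using the column names from the sketches above:

```
SELECT date, product, city, price
FROM events_mat
LIMIT 10
```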
MVs allow for as many transformations as you can do in a query. See the built-in functions for a complete list of them; the most common ones are listed below, with a combined example sketched after the list:
Casting to a different data type (toFloat, toInt, toDate, …)
Concatenating several fields
Taking a substring of a field
Parsing a JSON object, a date, or a wrongly formatted column
Indexing data using a geospatial function or another type of hash
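As a combined sketch of several of these, again assuming the hypothetical extra_data payload (including lon and lat keys for the geospatial example); the functions themselves are standard ClickHouse:

```
SELECT
    toDate(timestamp) AS date,                       -- cast / extract a Date
    concat(user_id, '-', product) AS user_product,   -- concatenate several fields
    substring(product, 1, 3) AS product_prefix,      -- substring of a field
    JSONExtractString(extra_data, 'city') AS city,   -- parse a JSON object
    geohashEncode(
        JSONExtractFloat(extra_data, 'lon'),
        JSONExtractFloat(extra_data, 'lat')
    ) AS geohash                                     -- geospatial indexing / hash
FROM events
```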
See the API reference to learn how to create a Data Source with a schema using the Data Sources API.
Alternatively, check the CLI documentation to learn how to define a Data Source schema and push it to your Tinybird account.