Master materialized views

Advanced

Before you start building Materialized Views in Tinybird, you need to know how they work. Tinybird approaches Materialized Views a bit differently than you may be used to, so instead of jumping right in, we recommend you read this guide first to understand how Materialized Views work in Tinybird.

What is a Materialized View in Tinybird?

In the simplest terms, a Materialized View is the continuous, streaming result of a Pipe saved as a new Data Source.

As new data is ingested into the origin Data Source, the transformed results from the Pipe will continually be inserted in the new Materialized View Data Source, and you can query it just like you would query any other Data Source in Tinybird.

How is data inserted into a Materialized View in Tinybird?

In Tinybird, data is ingested into Materialized Views in blocks. This process is explained in the diagram below. Every time new data is ingested into the origin Data Source, the materialization process is triggered, applying the transformation Pipe over the data ingested and saving the output of that Pipe, which is a partial result, in the Materialized View.

But what about the data that was present in the origin Data Source prior to the Materialized View creation? These rows can be inserted into the destination Materialized View by using a populate operation.

Materialization process

Regular Materializaton

Unlike other databases, Materialized Views in Tinybird are incremental and triggered upon ingestion. So from the moment it is created, the new rows will be inserted in the Materialized View. No extra action needed. If an insert is too big, it will be processed in blocks.

Taking into account how data is populated into Materialized Views, there are certain operations that shouldn't be included in the SQL for your transformation Pipe. Since materializations are performed only over the new data being ingested, and not over the whole data source, certain operations should not be used, such as: Window functions, ORDER BY, Neighbor, DISTINCT.

Populates

Populates are used for moving historical data from the origin Data Source into the Materialized View. There are 2 types: complete and partial.

Usually, a complete populate is the right choice. But if you are populating from a Data Source with hundreds of millions of rows and doing complex transformations, you may face some memory errors. In this type of situation, you should use partial populates.

If you’re using the CLI, populates are triggered with tb pipe populate. You can add conditions using the --sql-condition flag, for example, --sql-condition='date == toYYYYMM(now())'. Note that if your sql_condition includes any column present in the Data Source’s engine_sorting_key the populate job should process less data.

If you have constant ingest in your origin Data Source, read The challenge with populates and streaming ingest before proceeding.

Doing aggregations the right way with Materialized Views

There are two basic types of Materialized Views: those that aggregate and those that don’t. For aggregated Materialized Views, you’ll need to keep the following in mind.

Occasionally, a background process in Tinybird merges partial results saved in intermediate states during the ingestion process, compacting the results and reducing the number of rows. The diagram below explains this process in more detail through a simple example.

Let’s say we’re an ecommerce store and want to materialize the count of units sold by product. We are ingesting a JSON object every minute, with a product (represented by a capital letter e.g. “A” in the diagram) and the quantity sold during the last minute.

We could define in our Pipe some simple SQL to sum the count of units sold per minute as data is ingested, and this Pipe will be applied over each new block of appended data. The output will immediately be saved in intermediate states into the Materialized View.

Every 8-10 minutes, the background process we mentioned will merge the intermediate states, completing the aggregation across the entire Data Source.

Because we are working in real-time, we can’t always wait for this background process to take place. Fortunately, this is easily accounted for: When querying the Materialized View, you should use the proper merge combinator and GROUP BY in the query itself.

Understanding State and Merge combinators for Aggregates

As we mentioned, it’s not practical to track when the background process that merges aggregate results in a Materialized View has occurred. Because of this, you need to store intermediate states using the -State suffix. Note that if you’re creating from the UI, this will be done automatically.

Here’s an example of using -State when defining the transformation Pipe to calculate these intermediate states:

You also need to specifically define the appropriate schema for the Materialized View:

Finally, you’ll need to retrieve the data using the -Merge suffix in your endpoint node to make sure the merge process is completed for all data in the Materialized View:

Understanding the Materialized View parameters

When you create a Materialized View in the UI, Tinybird automatically recommends the best parameters for most use cases. It is useful to understand these parameters for more complex use cases.

Sorting Key

The Sorting Key defines how data is sorted and is critical for great performance when filtering. You should choose the order of your sorting keys depending on how you are going to query them. Here are a few examples for a simple Materialized View containing day, city, and avg columns:

  • You want to query the average for all cities on a particular day: The ``day`` **column should be the first sorting key.**‍
  • You wanted the average over the last month for a particular city: The ``city`` column should be the first sorting key.

Note: For Materialized Views containing aggregations, every column in the GROUP BY statement has to be in the sorting keys, and only those columns can be sorting keys.

For non-aggregated Materialized Views, you can select other columns if they fit better for your use case, but we don't recommend adding too many. You get only a negligible performance boost after the 4th sorting key column.

Partition By

A partition is a logical combination of records by a given criterion. In most cases you don't need a partition key, or a partition by month is enough.

Tinybird will guess the best partition key if your materialization query has a Date or DateTime column. If there aren't any Date columns, Tinybird doesn't set a partition key. As we recommend here, having no partition is better than having the wrong one. If you are comfortable with partitions and you really want to group records by another criteria, you can switch to the advanced tab and add your custom partition code.

TTL

If you have certain lifetime requirements (such as to satisfy GDPR regulations) on the data in your Materialized Views, you can specify a Time To Live (TTL) parameter when creating a Materialized View.

This can also be useful if you only intend to query a brief history of the data. For example, if you always and only query for data from within the last week, you can set a TTL of 7 days. When a TTL is set, all rows older than the TTL will be removed from the Materialized View.

Advanced Tab (UI)

Most of the time, the defaults recommended by Tinybird will be the best parameters to use. Occasionally, however, you may need to tweak these parameters. For this, you can use the Advanced tab. Here you can directly write code that will be passed to the View creation.

Warning: You'll need certain ClickHouse expertise to modify these parameters, so if you aren't sure, stick with the defaults.

The challenge with populates and streaming ingest

As explained before, a populate is the operation of moving data that was present in the origin Data Source before the creation of the Materialized View.

It is a challenging operation when having continuous ingest in the origin Data Source.

Let’s see the detailed process (see the diagram below):

At t1, the Materialized View is created, so new rows arriving in the origin Data Source are processed into the Materialized View.

To move the data from t0 to t1, we launch a populate —either manually or when defining the MV— at time t2.

All that data that arrives between t1 and t2 may be materialized twice: once due to the regular MV process, at ingest time, and the other one due to the populate process.

When you don’t have streaming ingest in the origin Data Source, it is usually a safe operation, as long as no new data arrives while the populate is running.

How to backfill, then?

Two Materialized View Pipes

Probably the easiest way. It will use a timestamp in the near future to split realtime ingest and backfill.

Create the regular MV with a WHERE clause specifying that materialized data will be newer than a certain timestamp in the future. E.g. WHERE timestamp >= '2024-01-31 00:00:00' in the example:

realtime materialized.pipe
    NODE node
    SQL > 
        %
        SELECT (...) 
        FROM origin_ds
        WHERE timestamp >= '2024-01-31 00:00:00'

    TYPE Materialized
    TARGET_DATASOURCE mv

Wait until the desired timestamp has passed, and create a the backfill MV Pipe with a WHERE clause for data before the specified timestamp. No new data will be processed, as the condition won't be met.

populate.pipe
    NODE node
    SQL > 
        %
        SELECT (...) 
        FROM origin_ds
        WHERE timestamp < '2024-01-31 00:00:00'

    TYPE Materialized
    TARGET_DATASOURCE mv

Finally, because it is now safe, run the --populate command.

Use Copy Pipes

Depending on the transformations, Copy Pipes can substitute the populate for historical data.

You can see more details in our Backfill strategies docs, which, despite being focused mostly on version control iterations, exposes different alternatives.

Remember these tips only apply for streaming ingest. With batch ingest, or being able to pause ingest, populates are totally safe.

Next steps

Create a Materialized View

Now that you understand how Materialized Views work in Tinybird, you can read our guide on how to create them in the Tinybird UI or CLI, and start building Materialized View Data Sources in your own Workspace!