Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

What are Materialized Views (and why do they matter for realtime)?

Use Materialized Views for speed, simplicity, and consistency in your data pipelines.
Cameron Archer
Content Marketing Lead
Jun 9, 2022
 ・ 
  min read

In 2020, humans created 2.5 exabytes of data every day.

By 2025, that number will hit over 450 exabytes, a 200x increase in data generation in a mere 5 years. Data creation is accelerating so rapidly that we might even run out of places to store all our data.

If you pair the “data hyperdrive” with the human bent towards understanding, problem solving, and (depending on how cynical you’re feeling today) profit, you’ll realize pretty quickly that businesses, governments, and enterprising individuals are going to want to do something with all that data.

So, as a developer, you need ways to do more with all those exabytes while keeping costs and latency low.

And if you’re frequently building data products on top of real-time data streams, then there’s no time like the present to get up to speed on Materialized Views, if you haven’t already.

What is a Materialized View?

A Materialized View is the result of a query on a table in your database, stored in memory or disk so that you can easily access - and query over - its results in the future. 

It’s a powerful tool that developers can employ across a wide range of applications to improve query performance and simplify the development and maintenance of data products.

If you’re in the process of or about to start building data products, you should become familiar with Materialized Views and when to use them.

Why should I use Materialized Views?

At Tinybird, we use Materialized Views when we need speed over large amounts of data, because they often provide a performance boost over regular database tables or views, especially when we start making more complex queries that involve filtered selections, aggregation, or projections. When customers come to us looking for help optimizing their endpoints, Materialized Views are a common way we can grab back performance by cutting down on the number of bytes or rows read when the endpoint is called.

In particular, Materialized Views give us an advantage over regular tables in 3 specific categories: 

  1. Speed,
  2. Simplicity, and 
  3. Consistency.

Here’s what we mean by each of those…

Materialized Views are fast

At Tinybird, we’re all about speed. We believe that data should be analyzed when it happens. Period.

And when data is “happening” at millisecond latency, we need queries that return results at millisecond (or faster) latency.

Typical database queries - especially those that involve complex aggregations or filters - require most of the computation work to be done at query time. Expensive queries like this scan the same rows over and over, and they continually degrade as data grows.

Materialized Views speed up queries by shifting the computational work to write time instead of query time. When we need answers ASAP, Materialized Views give us a performant “cache” to query; the heavy lifting has been done during the materialization process, giving our product-facing queries the speed boost they need to serve relevant data.

To get a sense of how much faster (and cheaper) Materialized Views can make your queries, take a look at this blog that my friend (and colleague at Tinybird) Alison wrote about using Materialized Views to create data rollups on Ethereum blockchain transactions. She even created a nice little UI tool to demonstrate how her MVs can make queries 50x faster (and scan up to 50,000X fewer rows!).

Materialized Views are simple

Ever tried to optimize performance in one of your applications and realize the performance bottleneck is behind a complex SQL query that gets executed over and over again? 

Say, for example, that your application displays a ranking with top revenue-producing products by category for the last year in the United States? So you spin up a quick query that scans and aggregates all the rows from your transactional data, something like this:

Since that query is aggregating mostly past data that is not going to change, it is very likely you are overloading your database when you run it. Think about it, if you’ve made a million sales every year since the time you set up your database 10 years ago, every day you run this query you’ll be scanning 10 million+ rows even though only the most recent day’s worth of data (~2,700 rows) influences a change. Not to mention you’re doing a JOIN to a static metadata table every time. It’s a total waste of resources.

Materialized Views can help you keep that aggregation always up-to-date and improve execution by speeding up not only compute time, but your time as well. By relying on Materialized Views for the intermediate results you use most frequently, you’ll save time and achieve more flexibility as you build new data products on top.

Materialized Views are consistent

We’ve already talked about how Materialized Views create speed, solving #1. And we’ve also shown how they simplify the scans you need to make on your database to get common aggregations. But Materialized Views also create consistency, especially when the results stored in Materialized Views involve heuristic filters determined at the time of development.

Using concepts from the same example as above, consider if we wanted to know how much revenue we generated yesterday by selling a certain product. If this is a global product, the definition of “yesterday” is going to depend on the timezone. For example, “yesterday” in London only overlaps “yesterday” in Los Angeles by 16 hours. Said differently, a sale made at 6 PM on one day in Los Angeles would show up as having been made at 2 AM in London the next day.

If we ask query writers to make the logical determination of what “yesterday” means, each one might choose differently. For example, a data analyst building a dashboard for the CRO might choose a common timezone (GMT) regardless of where the sale originated, but a developer building real-time personalization on the ecommerce site might instead choose the timezone of the sale.

Neither determination is inherently wrong, but the discrepancy can cause problems down the line as the different business users rely on insights gathered using different sets of logic.

By creating a Materialized View with a descriptive name, like {% code-line %}sales_yesterday_gmt_by_poduct_mv{% code-line-end %}, you cement within your data structures the business logic that you consistently reuse, avoiding application-level irregularities that cause process breakdowns later on.

When should I not use Materialized Views?

Materialized Views aren’t always a silver bullet, and the implementation details vary depending on the database you are using (more on that below). There are always trade offs. In particular, Materialized Views have a few drawbacks to consider:

  1. They take up additional storage space. Data storage costs money. Materialized Views will often contain orders of magnitude less rows than the original data tables, but this is still something to keep in mind.

In Tinybird, both the original data and Materialized Views can be created with a TTL definition, to only retain data as needed.

  1. Not all SQL queries can be materialized. Depending on the database you use, you may find problems materializing queries with {% code-line %}UNION{% code-line-end %} or {% code-line %}JOIN{% code-line-end %}, for example, and you’ll need to avoid clauses like {% code-line %}LIMIT{% code-line-end %} or {% code-line %}ORDER BY{% code-line-end %}.

Materialized Views in Tinybird do support JOINs and GROUP BY clauses, plus a few other clauses that some databases don’t allow.

  1. They are not (necessarily) kept up to date automatically. Again, depending on your database, you may need to explicitly refresh data in your Materialized View as new data is ingested into your base tables. This isn’t the case in Tinybird (more below), but if the results of the query that you’re materializing change often, Materialized Views might hold stale data.

A special note about realtime use cases

Of course, the database you use and the way it approaches Materialized Views will have a pretty significant impact on when it’s practical to use them. At Tinybird, we focus pretty heavily on realtime use cases, where we’re streaming large volumes of data, transforming that data in real-time using SQL, then publishing the results via low-latency API endpoints.

So for our applications, #3 above just won’t fly. If we use a traditional relational database like Postgres, for example, Materialized Views have to be forcefully updated when new data is ingested. If we’re streaming in GB of new data every second, like in many of our customers’ use cases, we wouldn’t keep up. Simply put, databases like Postgres (and others) aren’t meant for streaming data, because they require batch processes to keep the Materialized View current.

But with Tinybird, that’s a different story. Tinybird is built on ClickHouse, which is not only exceptionally fast, but also does materialization incrementally. As new data is ingested, transformations and subsequent materializations happen on the fly, so our endpoints can always serve fresh data. For a little more on how that process works, you can check out our guide to mastering MVs in Tinybird.

Wrapping up

Materialized Views are a tried and true way to grab performance benefits when query logic is frequently being applied and recycled at the application level, or when you need your endpoints to scan fewer rows. Use Materialized Views in most cases where you want to reduce application response time, simplify your codebase, and ensure consistent business logic across multiple data-consuming applications. And remember, if you’re going realtime, choose a database that can keep up.

Want to learn how Materialized Views are created in Tinybird? Check out our guide:

Create Materialized Views with Transformation Pipes

Become a better data developer

Subscribe to the tinytales newsletter for monthly tips on building better data products.