Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

When you should use columnar databases and not Postgres, MySQL, or MongoDB

Row-oriented, OLTP databases aren't ideal application DBs when you know you'll need to run analytics on lots of data. Choose OLAP instead.
Javier Santana
Oct 17, 2022
  min read

When you develop an application, your first choice for a database is often one of Postgres, MySQL, or - if NoSQL is more your thing - MongoDB. You can’t go wrong with any of these; they’re great general-purpose databases with huge communities and some excellent features (e.g. transactions) that make developers’ lives easier.

But there comes a point when things start to slow down. So you read a few blog posts, scan some docs, browse some dev forums, and spend hours tuning and improving queries, database config, etc. This does improve things temporarily, but eventually, you hit a wall.

If you cut your database teeth on the Postgres/MySQL zeitgeist, it’d be understandable if you thought that most applications should be built on these or similar databases; the row-oriented, OLTP approach lends itself well to most of the app development that’s happened over the last decade.

But then you realize there are other databases out there focused specifically on analytical use cases with lots of data and complex queries. Newcomers like ClickHouse, Pinot, and Druid (all open source) respond to a new class of problem: The need to develop applications using endpoints published on analytical queries that were previously confined only to the data warehouse and BI tools.

To use a metaphor: if you want to find 3 specific trees in a massive forest, OLTP is great. But if you want to count all the trees? That’s where OLAP comes in.

When you discover these new analytical databases, all the sudden your general-purpose database doesn’t feel so “general-purpose” anymore. And you realize that maybe Postgres, MySQL, and MongoDB aren’t the databases you’re looking for to tackle your next project.

these aren't the databases you're looking for meme

But why?

Why can't every database be good for both transactional and analytical workloads? The answer is basic physics. Hardware has hard limits, and you have to configure it one way for transactional use cases, and another way for analytical ones.

More specifically, it has to do with how data is physically stored and processed. Analytical databases like ClickHouse, for example, store data in a columnar fashion; all the values in a column in a table are stored together in disk. In OLTP databases, the data is stored by row, so all the values in a row stay together.

And this is very important.

Analytical databases store data in columns. Traditional OLTP databases store it in rows. This makes a big difference.

Data locality matters.

If you go to the grocery store to buy 100 cans of soda, you hope to find them packaged in 12-packs all on the same shelf. You go to one place, grab the packs you need, check out, and you’re on to your next errand in 5 minutes tops.

But if the cans are spread out all over the store, behind the bananas and beside the corn starch, you'd need to push your shopping cart from end to end to collect them all. You’d be lucky to get out of there before closing time.

The same thing is true with data: If data sits together in disk or memory, reading and processing it is way faster. 

This is because disks and memory work 100 times faster if access is sequential. And CPUs process much faster if they don't need to jump between different tasks.

Disks, memory, and CPU work way faster when data is close together.

But of course, cans of soda aren’t data and it’s not fair to compare them. Cans of soda are physical objects, data is not. If the goal is a “faster checkout time”, then there are many things you can do with data to speed things up:

Compress it

We all use compression to save data in our disks. If you are old like me you probably used winzip. In general, we all understand that if you compress data, it gets smaller. Typically speaking, smaller is better when storing things. It’s why fast food chains buy soda syrup and carbonated water in bulk instead of stacking hundreds of 12-packs next to the burger patties.

Compression is a general term, and there are many ways to compress data. But there’s one thing common to all compression methods: compression works much better if similar values are together.

Vectorize it

Many years ago, video games started to use 3D graphics that required complex vector calculations to move things in a 3D space. Intel (the market leader) created MMX technology and later SSE to enable CPUs to do vector math quickly. So in one CPU cycle you do 3 operations instead of one. Vectorization lets CPUs process many values at the same time, but it's only possible if those values are stored together.

Cache it

In a somewhat famous slide titled “Numbers Everyone Should Know” (slide 13 in this presentation), Jeff Dean lists retrieval times by data location. As you can see, L1 cache access is 200x faster than main memory access. If you cache, you go fast.

And it turns out that caching really likes when the data is close together. Add this along with compression so you can fit more data into the cache, and you can feed the CPU that much faster.

Sort it

Sorting is another huge performance factor, especially for columnar databases. If the data is sorted by the columns you’ll be filtering on, everything speeds up. It compresses better, access is faster, and the data locality is much much better. Sorting also helps to improve algorithms like joins, order by, limit, and group by. Proper sorting can speed up queries by multiple orders of magnitude. 

Parallelize it

One more thing: parallelization. It’s not unique to analytical databases, but it helps a lot when you have a lot of data, which is when you typically consider using an analytical database. There are several types of parallelization: inside the CPU (aka vectorization), across multiple CPUs, and even across multiple machines. I’ll talk about it more in a bit.

Because analytical use cases almost always involve aggregating and filtering on columns, running against data stored by column is much faster.

All these factors combined, it’s pretty simple: Things go faster when the data you want to access is stored together. And because analytical use cases pretty much always involve aggregating and filtering on specific columns, running these queries against data stored by column is just much faster. The hardware is optimized to count all the “trees” in as few cycles as possible. This is the main reason why column-oriented analytical bases are better for analytics.

But it doesn’t stop at data locality

The low-level benefits of columnar data stores for analytics should be pretty clear at this point: Data locality to extract 100% of the hardware is huge for speeding up queries on large amounts of data involving aggregations and filters typical in analytical use cases. But it doesn’t stop there. Analytical databases have other properties that make them even more appealing for handling large amounts of data.

Probabilistic data structures

When you run analytics, you often don’t need many things that OLTP databases offer. One of those things is exactness, especially on statistical calculations. That might not seem like a big deal, but if you’re building on top of an analytical databases, it has big implications.

If you are allowed to have a small error in your statistics, say +/- 1%, it can mean much, much faster queries. Calculating unique values, for example, is very compute-intensive and requires a lot of memory. If you can get by with some error, you can use probabilistic data structures like HyperLogLog that estimate unique values with less memory and less CPU.

Eventual consistency is also important here; it’s not usually achievable on OLTP workloads, but it’s not a problem in analytics. When you have a lot of data, a single machine often isn’t enough to run your analytical workloads. Of course, you should always try to scale vertically if you can, but eventually you’ll need to put data on several machines (sharding and replication are the terms for this). 

This is a well-trodden path: Coordination in a distributed system is not hard and there are numerous books written about it. So you can pretty easily set up a cluster with several machines to scale your reads and writes. But the benefits go beyond basic horizontal scaling.

Probabilistic data structures improve query performance when exactness isn't required, and they are highly parallelizable.

I talked earlier about parallelization at different levels. It turns out that parallelization, distributed systems, and probabilistic data structures get along quite well.

The same algorithms and methods used to parallelize a workload on many cores work well when you parallelize it on many machines. Probabilistic data structures themselves happen to be highly parallelizable as well.

Analytical databases take advantage of this. ClickHouse, for example, has a number of non-exact functions like ``uniq()`` and ``quantileDeterministic()`` that deterministically estimate their respective statistics. As data volumes get bigger and bigger, this has a meaningful impact on query latency.

Faster writes with LSM tree

An important component of analytical architecture is the log-structured merge-tree (LSMT), a data structure that many new databases (and some “old”) use because it aligns well with how hardware works. 

ClickHouse, for example, uses an LSMT-like structure that lets you insert millions of rows per second without any problems.

In a lot of analytical use cases, you don’t just need your queries to be fast, you also need them to query the freshest data. This is especially important for realtime use cases where you need to serve low-latency analytics on streaming data. Every millisecond counts. Just ask the day traders.

Incremental rollups and materializations

Almost all databases support some form of materialized views. But in most databases, including Postgres, MySQL, and MongoDB, the materialized views need to be periodically and manually refreshed. Analytical databases, on the other hand, usually have special tables and structures to enable incremental materializations, rollups, and other kinds of aggregations. The result is faster queries over aggregations even on datasets with high-frequency inserts.

Specialized functions for statistics and time series

It’s hard to find an analytical use case that doesn’t involve time series data, statistical functions, or both. Most OLAP database designers understood this, so they designed their DBMS with specialized functions for time series data and statistics. ClickHouse has a host of specialized functions for dealing with dates and times that you mostly don’t get with Postgres, for example.

OLAP databases like ClickHouse come with functions and structures that are optimized for analytical use cases.

They aren’t perfect, but you still need them.

Analytical databases aren’t perfect by any means. They’re often a huge pain, not necessarily because they are harder to manage, but, because they let you store and process way more data, things just get harder in general. More data, more problems.

But maybe, as you approach your next project, you should stop thinking in terms of transactions, linearizability, fast point queries, super advanced search indices, and the other trappings of OLTP databases.

Instead, think about what the types of queries you’ll need to run and how much data you’ll have. If you need to do sums on billions of rows, you’re gonna want to go with a columnar, OLAP database.

Become a better data developer

Subscribe to the tinytales newsletter for monthly tips on building better data products.