When you develop an application, your first choice for a database is often a relational database like Postgres, MySQL, or NoSQL databases like MongoDB. You can’t go wrong with any of these traditional databases for software development; they’re great general-purpose relational databases with huge communities and some excellent features (e.g., transactions) that make developers’ lives easier.
But there comes a point when things start to slow down. So you read a few blog posts, scan some docs, browse some dev forums, and spend hours tuning and improving queries, database config, etc. This does improve things temporarily, but eventually, you hit a wall. What you really need is a columnar database.
If you cut your backend teeth on row-oriented databases like Postgres or MySQL, it’d be understandable if you thought that most applications should be built on these or similar databases; the row-oriented, online transactional processing (OLTP) approach lends itself well to most app development that’s happened over the last decade.
But then you realize there are other databases out there focused specifically on real-time analytical use cases with lots of data and complex queries. Newcomers like ClickHouse, Pinot, and Druid (all open source RDBMS) respond to a new class of problem: The need to develop applications over big data with analytical queries that were previously confined only to data warehouses like Snowflake, BigQuery, or Amazon Redshift and the business intelligence tools they often feed. These online analytical processing (OLAP) data stores are optimized for fast writes and fast reads over big data.
To use a metaphor: if you want to find 3 specific trees in a massive forest, row-oriented OLTP is great. But if you want to count all the trees? That’s where column-oriented OLAP comes in.
When you discover column-oriented databases, all of a sudden your general-purpose database doesn’t feel so “general-purpose” anymore. And you realize that maybe row-based databases like Postgres and MySQL or document databases like MongoDB aren’t the databases you’re looking for to tackle your next project.
Tinybird is a real-time, columnar database with a built-in API layer that’s perfect for building applications over analytical queries: You can ingest lots of data into a columnar database, query it with SQL, and instantly publish your analytical queries as APIs with which to build your applications.
Why choose a columnar database?
Why can't every database management system (DBMS) be good for both transactional and analytical workloads? The answer is basic physics. Hardware has hard limits, and you have to configure it one way for transactional use cases, and another way for analytical ones.
More specifically, it has to do with how data is physically stored and processed. Column-oriented databases like ClickHouse and Tinybird, for example, store data in a columnar fashion; all the values in a column store live together on disk. In row-oriented, OLTP databases, the data is stored by row, so all the values in a row stay together.
And this is very important.
Columnar databases store data in columns of data together on disk. Traditional row-oriented OLTP databases store data in rows of data together on disk. This makes a big difference for analytical data processing.
Data locality matters.
If you go to the grocery store to buy 100 cans of Coca-Cola, you hope to find them packaged in 24-packs all on the same shelf. You go to one place, grab the packs you need, check out, and you’re on to your next errand in 5 minutes tops.
But if the cans are spread out all over the store, behind the bananas and beside the corn starch, you'd need to push your shopping cart from end to end to collect them all. You’d be lucky to get out of there before closing time.
Now consider what it might be like if your cans of soda are all stored in different stores! Now we’re getting into the realm of distributed systems, Apache Cassandra, HBase, and cloud data warehousing.
The same thing is true with data: If data sits together in disk or memory, reading and processing it is way faster.
This is because disks and memory work 100 times faster if access is sequential. And CPUs process much faster if they don't need to jump between different tasks. If you are going to access data quickly, you don’t want it distributed over many disks on many different machines. You want it on one disk, stored close together.
Disks, memory, and CPU work way faster when data is close together.
But of course, cans of soda aren’t data and it’s not fair to compare them. Cans of soda are physical objects, data is not. If the goal is to optimize your “checkout time”, then there are many things you can do with data storage to speed things up:
We all use compression to save data on our disks. If you are old like me you probably used winzip. In general, we all understand that if you compress data, it gets smaller. Typically speaking, smaller is better when storing anything. It’s why fast food chains buy soda syrup and carbonated water in bulk instead of stacking hundreds of 12-packs next to the burger patties.
Compression is a general term, and there are many ways to compress data. But there’s one thing common to all compression methods: compression works much better if similar values are together.
Many years ago, video games started to use 3D graphics that required complex vector calculations to move things in a 3D space. Intel (the market leader) created MMX technology and later SSE to enable CPUs to do vector math quickly. So in one CPU cycle, you do 3 operations instead of one. Vectorization lets CPUs process many values at the same time, but it's only possible if those values are stored together.
In a somewhat famous slide titled “Numbers Everyone Should Know” (slide 13 in this presentation), Jeff Dean lists retrieval times by data location. As you can see, L1 cache access is 200x faster than main memory access. If you cache, you go fast.
And it turns out that caching really likes when the data is close together. Add this along with compression so you can fit more data into the cache, and you can feed the CPU that much faster.
Sorting is another huge performance factor, especially for columnar databases. If the data is sorted by the columns you’ll be filtering on, everything speeds up. It compresses better, access is faster, and the data locality is much much better. Sorting also helps to improve algorithms like joins, order by, limit, and group by. Proper sorting can speed up queries by multiple orders of magnitude.
One more thing: parallelization. It’s not unique to analytical databases, but it helps a lot when you have a lot of data, which is when you typically consider using an analytical database. There are several types of parallelization: inside the CPU (aka vectorization), across multiple CPUs, and even across multiple machines. I’ll talk about it more in a bit.
Because analytical use cases almost always involve aggregating and filtering on data stored columns, running analytical queries against columnar storage is much faster.
All these factors combined, it’s pretty simple: Things go faster when the data you want to access is stored together. And because analytical use cases pretty much always involve aggregating and filtering on data in columns, running these analytical queries over columns of data is just much faster. The hardware is optimized to count all the “trees” in as few cycles as possible. This is the main reason why column-oriented storage is better for analytics.
Other use cases for columnar databases
The low-level benefits of column-oriented databases for analytics should be pretty clear at this point: Data locality to extract 100% of the hardware is huge for speeding up queries on large amounts of data involving aggregations and filters typical in analytical use cases. But it doesn’t stop there. Analytical databases have other properties that make them even more appealing for handling big data.
Probabilistic data structures
When you run analytics, you often don’t need many things that OLTP databases offer. One of those things is exactness, especially in statistical calculations. That might not seem like a big deal, but if you’re building on top of analytical databases, it has big implications.
If you are allowed to have a small error in your statistics, say +/- 1%, it can mean much, much faster queries. Calculating unique values, for example, is very compute-intensive and requires a lot of memory. If you can get by with some error, you can use probabilistic data structures like HyperLogLog that estimate unique values with less memory and less CPU.
Eventual consistency is also important here; it’s not usually achievable on OLTP workloads, but it’s not a problem in analytics. When you have a lot of data, a single machine often isn’t enough to run your analytical workloads. Of course, you should always try to scale vertically if you can, but eventually, you’ll need to put data on several machines (sharding and replication are the terms for this).
This is a well-trodden path: Coordination in a distributed system is not hard and there are numerous books written about it. So you can pretty easily set up a cluster with several machines to scale your reads and writes. But the benefits go beyond basic horizontal scaling.
Probabilistic data structures improve query performance when exactness isn't required, and they are highly parallelizable.
I talked earlier about parallelization at different levels. It turns out that parallelization, distributed systems, and probabilistic data structures get along quite well.
The same algorithms and methods used to parallelize a workload on many cores work well when you parallelize it on many machines. Probabilistic data structures themselves happen to be highly parallelizable as well.
Analytical databases take advantage of this. Tinybird, for example, supports a number of non-exact SQL functions like
quantileDeterministic() that deterministically estimate their respective statistics. As data volumes get bigger and bigger, this has a meaningful impact on query latency.
Faster writes with LSM tree
An important component of analytical architectures is the log-structured merge-tree (LSMT), a data structure that many new databases (and some “old”) use because it aligns well with how hardware works.
ClickHouse, for example, uses an LSMT-like structure that lets you insert millions of rows per second without any problems.
In a lot of analytical use cases, you don’t just need your queries to be fast, you also need them to query the freshest data. This is especially important for real-time use cases where you need to serve low-latency analytics on streaming data. Every millisecond counts. Just ask the day traders.
Incremental rollups and materializations
Almost all databases support some form of materialized views. But in most databases, including Postgres, MySQL, and MongoDB, the materialized views need to be periodically and manually refreshed. Analytical databases, on the other hand, usually have special tables and structures to enable incremental materializations, rollups, and other kinds of aggregations. The result is faster queries over aggregations even on datasets with high-frequency inserts.
Specialized functions for statistics and time series
It’s hard to find an analytical use case that doesn’t involve time series data, statistical functions, or both. Most columnar database designers understood this, so they designed their DBMS with specialized functions for time series data and statistics. Tinybird has a host of specialized functions for dealing with dates and times that you mostly don’t get with Postgres, for example.
Columnar databases like ClickHouse or Tinybird come with functions and structures that are optimized for analytical use cases.
Columnar databases aren't perfect, but they are useful.
Column-oriented databases aren’t perfect by any means. They’re often a huge pain, not necessarily because they are harder to manage, but, because they let you store and process way more data, things just get harder in general.
But maybe, as you approach your next project, you should stop thinking in terms of transactions, linearizability, fast point queries, super advanced search indices, and the other trappings of row-oriented databases.
Instead, think about what the types of queries you’ll need to run and how much data you’ll have. If you need to do sums on billions of rows, you’re gonna want to go with a columnar database.
Fortunately, Tinybird can make working with a columnar database feel more akin to working with something traditional like Postgres. If you're developing software that needs to handle analytical workloads, you should give Tinybird a try. The Build Plan is free forever with no time restrictions and generous free limits.