Are you new to columnar databases? Trying to understand when to use a columnar database or what its limitations might be? Well, you're in the right place. In this article, you'll learn what a columnar database is, understand when (and when not) to use columnar databases, and find a big list of open source and proprietary column-oriented databases to choose from.
What is a columnar database?
A columnar database, also known as a column-oriented database, is a type of database management system (DBMS) that stores the values of each column together on disk. This enables faster queries for data analytics, which generally involves filtering and aggregating table columns.
Compare this to row-oriented databases, in which the contents of a single row are stored together on disk. Row-oriented databases are optimized for transactional, single-entity lookups rather than analytics over many entities.
Columnar databases are particularly advantageous for analytics queries that frequently scan or aggregate over large datasets but only need a few columns. Because only the necessary columns are read from storage, I/O costs and time are minimized, providing an edge over traditional row-based databases in analytics scenarios. For these use cases, columnar databases use system resources more efficiently and yield quicker insights.
Columnar databases are great for data analytics. They read only the columns a query needs, so they process less data than row-oriented databases.
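To make the storage difference concrete, here is a minimal Python sketch. The table, its values, and the "values touched" counters are all invented for illustration; real engines work with disk pages and compressed blocks rather than Python lists, so treat this as a mental model and not as how any particular database is implemented.

```python
# Hypothetical three-column table; names and values are illustrative only.
rows = [
    (1, "alice", 120.0),
    (2, "bob", 75.5),
    (3, "carol", 310.25),
]

# Row-oriented layout: the fields of each record sit next to each other.
row_store = [value for row in rows for value in row]

# Column-oriented layout: the values of each column sit next to each other.
column_store = {
    "user_id": [row[0] for row in rows],
    "name": [row[1] for row in rows],
    "current_balance": [row[2] for row in rows],
}


def sum_balance_row_store(store, n_columns=3, balance_index=2):
    """Sum one field by walking every value of every record."""
    total, values_touched = 0.0, 0
    for position, value in enumerate(store):
        values_touched += 1
        if position % n_columns == balance_index:
            total += value
    return total, values_touched


def sum_balance_column_store(store):
    """Sum one field by reading only that column."""
    column = store["current_balance"]
    return sum(column), len(column)


row_total, row_touched = sum_balance_row_store(row_store)
col_total, col_touched = sum_balance_column_store(column_store)
print(row_total, row_touched)  # 505.75 9
print(col_total, col_touched)  # 505.75 3
```

Both layouts return the same answer, but the row layout has to step over every field of every record to get there, while the column layout reads only the three balance values.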
Beyond just the way they store data, columnar databases also provide some features that can be quite useful for real-time analytics or time series data, such as…
- Probabilistic data structures
- Fast write throughput with log-structured merge (LSM) trees
- Incremental rollups and materializations
- Specialized SQL functions for statistics, analytics, and time series data
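As one illustration of the features listed above, an incremental rollup keeps a small pre-aggregated table up to date as new rows arrive, so queries do not have to re-scan the raw data. The sketch below is a toy version in Python; the `MinuteRollup` class and its event fields are hypothetical and do not correspond to any particular database's API.

```python
from collections import defaultdict


class MinuteRollup:
    """Toy incremental rollup: per-minute count and sum, updated on ingest."""

    def __init__(self):
        # Minute bucket (epoch seconds floored to 60) -> [count, total].
        self._buckets = defaultdict(lambda: [0, 0.0])

    def ingest(self, timestamp, value):
        """Fold one raw event into its minute bucket."""
        bucket = timestamp - (timestamp % 60)
        state = self._buckets[bucket]
        state[0] += 1
        state[1] += value

    def average(self, bucket):
        """Answer from the rollup alone, without touching raw events."""
        count, total = self._buckets.get(bucket, (0, 0.0))
        return total / count if count else None


rollup = MinuteRollup()
for ts, value in [(0, 10.0), (30, 20.0), (61, 5.0)]:
    rollup.ingest(ts, value)

print(rollup.average(0))   # 15.0
print(rollup.average(60))  # 5.0
```

The rollup stores a count and a sum rather than the average itself, because those two values can be updated one event at a time and combined later.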
You can read more about columnar databases and the ways they are optimized for big data analytics in this article.
When should I use a columnar database?
You should use columnar databases when you intend to filter and aggregate data logically stored in columns. For example, consider a generic data table that looks like this:
A query like…
… would benefit from columnar data storage, as it needs to access only data in the current_balance column and can utilize indexing to filter by the user_id column and minimize rows read.
Conversely, a query like…
…would benefit from row-oriented data storage, as it seeks to access and look up a single row of data indexed by a primary key.
Of course, this is an oversimplification. Even a minimally resourced row-oriented database could easily handle a column aggregation on a table with only five rows. The amount of data stored has a huge influence on whether a column-oriented database will be more performant than a row-oriented database. With big data, column-oriented databases become especially useful.
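The two query shapes contrasted above can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the `accounts` table name, the columns other than `user_id` and `current_balance`, the five sample rows, and the exact queries. SQLite is itself row-oriented and is used only because it ships with Python, so the sketch shows the shape of each query, not the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " user_id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " signup_date TEXT,"
    " current_balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2023-01-04", 120.0),
        (2, "bob", "2023-02-11", 75.5),
        (3, "carol", "2023-02-19", 310.25),
        (4, "dave", "2023-03-02", 0.0),
        (5, "erin", "2023-03-15", 42.0),
    ],
)

# Analytical shape: filter on one column, aggregate another.
# A columnar engine would only need user_id and current_balance.
analytical = conn.execute(
    "SELECT AVG(current_balance) FROM accounts WHERE user_id > ?", (2,)
).fetchone()[0]

# Transactional shape: fetch every field of one row by primary key.
# A row-oriented engine finds the row once and reads it in one piece.
lookup = conn.execute(
    "SELECT * FROM accounts WHERE user_id = ?", (3,)
).fetchone()

print(analytical)
print(lookup)  # (3, 'carol', '2023-02-19', 310.25)
```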
The amount of data stored has a huge influence on the performance gap between columnar databases and row-oriented databases.
Use columnar databases for real-time analytics
Columnar databases generally excel at real-time analytics, where high write throughput and low latency on complex, analytical queries are required.
Or consider event-driven architectures or event sourcing approaches, where state is maintained by aggregating a long history of timestamped events, rather than in a table using upserts or replaces. Storing and aggregating time series data can be very difficult for row-oriented, relational databases, but columnar databases handle time-series analytics very well.
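As a rough sketch of the event sourcing idea, the Python below derives state by aggregating an append-only log instead of updating a row in place. The event fields, user IDs, and the `balances_as_of` helper are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Hypothetical append-only event log: (timestamp, user_id, amount).
# Events are never updated or deleted; corrections are new events.
events = [
    ("2024-01-01T09:00:00", "u1", 100.0),
    ("2024-01-01T09:05:00", "u2", 50.0),
    ("2024-01-02T14:30:00", "u1", -30.0),
    ("2024-01-03T08:15:00", "u1", 12.5),
]


def balances_as_of(log, cutoff):
    """Derive per-user balances by aggregating events up to a cutoff."""
    totals = defaultdict(float)
    for timestamp, user_id, amount in log:
        # ISO 8601 strings in the same format sort chronologically.
        if timestamp <= cutoff:
            totals[user_id] += amount
    return dict(totals)


print(balances_as_of(events, "2024-01-02T23:59:59"))  # {'u1': 70.0, 'u2': 50.0}
print(balances_as_of(events, "2024-01-31T00:00:00"))  # {'u1': 82.5, 'u2': 50.0}
```

Because the full history is kept, the same log can answer "what was the balance on a given day?" as easily as "what is it now?". The cost is that every answer is an aggregation over a timestamp column and a value column, which is the access pattern columnar storage is built for.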
Ultimately, it comes down to basic physics. Data in most databases is stored on disk and accessed from disk, and column-oriented databases store data differently than row-oriented databases. An analytical, columnar database provides distinct advantages when you want to access data in columns very quickly while eschewing some of the benefits that make transactional, row-oriented databases useful.
When should I avoid columnar databases?
You should avoid columnar databases when you don't intend to do complex analytics and you want to retain the benefits of transactional, row-oriented databases. For example, columnar databases are generally not optimized for frequent single-row updates or deletes, both of which are important functions of traditional databases used for online transaction processing.
The benefits of columnar databases become much more pronounced when working with large amounts of data, but you might want a columnar database even for small datasets (especially if you expect that data to grow).
Similarly, you may not need to use a columnar database if you are working with smaller amounts of data. While there's nothing technically "wrong" with using a columnar database with smaller data sets, these databases generally have a steeper learning curve, and you can likely achieve what you need with more comfortable and broadly supported relational databases like Postgres or MySQL.
| Use columnar databases when... | Use row-oriented databases when... |
| --- | --- |
| Your primary use case is analytics | Your primary use case is transactions |
| You need low query latency | You don't need low query latency (for analytics) |
| You have large amounts of data | You're working with smaller data (for analytics) |
| You don't need strict ACID compliance | You need strict ACID compliance |
| You're using event sourcing principles | You need to do frequent, small replaces and deletes |
| You need to store and analyze lots of time series data | You need to store and access records with unique IDs |
What are some examples of columnar databases?
Popular examples of columnar databases include:
- Amazon Redshift
- Google BigQuery
- Apache Druid
- Apache Pinot
Below you'll find a larger list of various columnar databases, including open source columnar databases (plus hosted or serverless implementations thereof) and proprietary columnar databases.
A list of columnar databases to choose from
Here is a list of 35 different open source, hosted, and proprietary databases that either fully or partially support column-oriented storage, roughly listed in order of popularity:
- Amazon Redshift
- Google BigQuery
- Apache Kylin✝
- SAP HANA
- Apache Druid✝
- MariaDB ColumnStore✝
- Oracle Database In-Memory
- Apache Kudu✝
- Apache Pinot✝
- Greenplum Database✝
- IBM Informix
- SAP IQ (formerly Sybase IQ)
- Yellowbrick Data
- Apache Doris✝
- Apache Cassandra✝*
- Apache HBase✝*
- Teradata Columnar*
- DataStax Enterprise*
✝These are open source.
*These are technically classified as "column-family" databases. They store data in rows based on the primary key of each record, but they group similar columns in "families" such that they can minimize which columns are accessed during reads. They provide some of the benefits of columnar databases but are not technically classified as such.
**These are hybrid databases that offer row- and column-oriented storage, providing certain benefits of both with some tradeoffs.
What's the best columnar database?
You know the answer: It depends. The choice of a columnar database depends on your goals, use case, and budget.
Cloud data warehouses like Amazon Redshift, Google BigQuery, and Snowflake are popular and powerful, but often more expensive than other options. They're good for business intelligence (BI) but not optimized for real-time analytics.
Managed services built on open source databases, like Tinybird (ClickHouse) or DataStax (Cassandra), can be good options when your goal is to leverage the performance benefits of column-oriented or column-family storage without handling infrastructure.
Beyond performance and pricing, you should consider ease of use. Many different databases have features that are optimized for certain use cases. ClickHouse, for example, includes many custom SQL functions for working with web clickstream data, such as specialized datetime functions and URL parsing.
Tinybird makes it possible to publish SQL queries as real-time, scalable APIs. Some in-memory databases, like Exasol, can improve query speed where long-running data persistence isn't required. NoSQL databases like Apache Cassandra or ScyllaDB that use column-family storage can be beneficial for scenarios where you have unstructured data and require schema flexibility.
Keep in mind that column-oriented storage isn't a silver bullet for query latency on complex analytics. Even with the most performant columnar databases, you still must consider proper indexing, sorting, partitioning, and replication mechanisms to maintain and optimize the performance of column-oriented databases at scale.
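To see why sorting matters even with columnar storage, consider the simplified Python sketch below. It assumes a hypothetical two-column table kept sorted by `day`; the column names and values are invented, and real databases use sparse indexes and partition pruning rather than a plain binary search, so this only illustrates the principle.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table stored as two aligned columns, sorted by "day".
day = [1, 1, 2, 2, 2, 3, 4, 4]
amount = [5.0, 7.0, 1.0, 2.0, 3.0, 9.0, 4.0, 6.0]


def total_with_sort_key(target):
    """Binary-search the sorted column, then read only the matching slice."""
    start = bisect_left(day, target)
    end = bisect_right(day, target)
    return sum(amount[start:end]), end - start


def total_with_full_scan(target):
    """Without a usable sort order, every row has to be checked."""
    total, rows_read = 0.0, 0
    for d, a in zip(day, amount):
        rows_read += 1
        if d == target:
            total += a
    return total, rows_read


print(total_with_sort_key(2))   # (6.0, 3)
print(total_with_full_scan(2))  # (6.0, 8)
```

Both functions read a single value column, yet the sorted version touches three rows instead of eight. Column orientation reduces how many columns are read; sorting, indexing, and partitioning reduce how many rows are read.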
Columnar database FAQs
Below are some answers to frequently asked questions about columnar databases.
Do columnar databases support SQL?
Of course! There are plenty of columnar databases that implement SQL or SQL-like query languages. Some examples include data warehouses like Amazon Redshift, Google BigQuery, and Snowflake, databases like ClickHouse and Vertica, and real-time platforms like Tinybird. Each database may implement its own "flavor" of SQL, with specialized functions optimized for different use cases. Regardless, the storage format (column-oriented vs. row-oriented) does not determine whether a database can or can't use SQL as its query interface, so many column-oriented databases leverage the familiarity of SQL.
Is SQL a columnar database?
No. SQL is a query language, not a database. In fact, SQL is the most popular database query language in the world, and many column-oriented databases support SQL or SQL-like interfaces for data access. Regardless, the use of SQL does not depend on storage format.
Are real-time databases the same as columnar databases?
Not necessarily. Real-time databases can use a columnar storage format. ClickHouse, Tinybird, Pinot, and Druid are all real-time databases that utilize columnar storage formats. Of course, databases like Firebase and MongoDB can also be used as real-time databases in certain cases, and they don't use columnar storage formats.
What's the difference between columnar and wide-column databases?
Columnar databases store data on disk in columns, whereas wide-column databases still use row-oriented storage. Wide-column databases gain some of the benefits of columnar databases by grouping columns into column families, such that data in columns can be accessed without reading entire rows. Unlike columnar databases, wide-column databases still store and index rows by a primary key, with mappings to the different column families.
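The structural difference can be sketched with plain Python dictionaries. The row keys, family names, and columns below are hypothetical, and both layouts are heavily simplified compared with what a real wide-column or columnar engine writes to disk.

```python
# Wide-column (column-family) layout: rows are stored and indexed by
# primary key; each row maps to families, and each family holds columns.
wide_column_store = {
    "user:1": {
        "profile": {"name": "alice", "country": "ES"},
        "billing": {"current_balance": 120.0},
    },
    "user:2": {
        "profile": {"name": "bob", "country": "US"},
        "billing": {"current_balance": 75.5},
    },
}

# Columnar layout: one contiguous array per column, aligned by position.
columnar_store = {
    "user_id": [1, 2],
    "name": ["alice", "bob"],
    "country": ["ES", "US"],
    "current_balance": [120.0, 75.5],
}


def read_family(store, row_key, family):
    """Wide-column read: locate the row by key, then open one family."""
    return store[row_key][family]


def column_total(store, column):
    """Columnar read: scan a single column across all rows."""
    return sum(store[column])


print(read_family(wide_column_store, "user:2", "billing"))  # {'current_balance': 75.5}
print(column_total(columnar_store, "current_balance"))      # 195.5
```

In the wide-column layout, summing every balance still means visiting each row key in turn, whereas the columnar layout reads one array. Fetching everything about a single user, on the other hand, is a single keyed lookup in the wide-column layout.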
Is Snowflake a columnar database?
Yes. Snowflake uses a columnar storage format as a fundamental part of its architecture. It is a cloud data warehouse optimized for online analytical processing (OLAP), and its most common use cases (business intelligence, data warehousing, and other big data processing) generally benefit from column-oriented storage. That said, Snowflake is not optimized for real-time analytics despite its columnar storage format.
Is MongoDB a columnar database?
No, MongoDB is not a columnar database. MongoDB is a NoSQL, document-oriented database that is designed for storing and retrieving unstructured or semi-structured data. MongoDB can be used for analytical workloads given the right use case and scale, but it is not optimized for online analytical processing over big data.
Read more about when to use columnar databases over MongoDB.
Is MySQL a columnar database?
No. In its default configuration, MySQL is not a columnar database, as it is a relational database management system (RDBMS) that uses row-oriented storage formats designed for transactional workloads.
That said, some projects like MariaDB have forked MySQL and introduced columnar storage engines. For example, MariaDB ColumnStore is designed to provide columnar storage and analytical capabilities with an interface familiar to MySQL users.
Still, solutions like MariaDB aren't purely column-oriented, and they make trade-offs on performance by trying to combine the benefits of a traditional relational database with columnar storage.
Read more about when to use columnar databases over MySQL.
Is Postgres a columnar database?
No. In its native configuration, PostgreSQL (Postgres) is not a columnar database. Like MySQL, it is an online transaction processing (OLTP) RDBMS that uses a row-oriented storage format designed for transactional workloads.
Still, Postgres can be extended to support columnar-like functionality. The open source project cstore_fdw from Citus adds columnar storage to Postgres via a foreign data wrapper (FDW), and extensions such as TimescaleDB enhance Postgres' performance for analytics over time-indexed datasets.
Postgres is one of the most widely adopted and comfortable databases out there, so if columnar-type extensions can give you the performance boosts you need, those are good options. If you need pure columnar performance with very low latency on analytical queries, then you may want to consider a native columnar database with SQL support, like ClickHouse or Tinybird.
Read more about when to use columnar databases over Postgres.
How much data do I need to have before switching to a columnar database?
You don't need to have a large amount of data to use columnar databases. Columnar databases are efficient at filtering and aggregating over columns regardless of the size of the data.
While row-oriented databases can be scaled to handle large amounts of data (even "big data"), at a certain point this becomes increasingly painful to maintain. Broadly speaking (and with some caveats), once your OLTP database grows to 10-100M rows (which, by the way, isn't considered "big data" anymore), you'll likely benefit from a dedicated columnar database for analytics.
If you intend to build analytics use cases from the start, opt for a columnar database for this function, even if your data is small. This way, you won't have to do a painful migration as your data grows.
Do I have to choose between columnar and row-oriented databases?
No. It's very common to use row-oriented and column-oriented databases simultaneously for different use cases. You can use row-oriented databases for user management and authentication, and columnar databases for analytics and time series data. Choose the database that works for the use case you are building, or consider databases that support both row- and column-oriented storage if you want a single, general-purpose database. Just keep in mind that databases that try to do many things well rarely excel at one thing.