--- title: "Outgrowing Postgres: How to run OLAP workloads on Postgres" excerpt: "A deep dive into running analytics on Postgres, from basic optimizations to advanced techniques and knowing when to quit." authors: "Ariel Perez" categories: "The Data Base" createdOn: "2025-01-31 00:00:00" publishedOn: "2025-01-31 00:00:00" updatedOn: "2025-04-24 00:00:00" status: "published" ---

I used to think running large-scale analytics workloads on Postgres was like trying to use a screwdriver as a hammer - technically possible but probably not a great idea. But after years of running both OLTP and OLAP workloads on Postgres at scale, I've learned it's much more nuanced than that. Postgres can actually handle analytics workloads quite well, if you know its limits and how to work within them.

In my previous articles, we looked at handling growing data volumes and increasing user concurrency. Analytics workloads add another dimension of complexity - instead of simple lookups and CRUD operations, you're now dealing with complex aggregations across millions or billions of rows, joins across multiple large tables, and queries that need to scan huge chunks of your data.

The good news? With the right approach, Postgres actually can handle many analytics use cases, from basic reporting to complex business intelligence queries. I've seen it work well for datasets up to several terabytes, providing query response times from seconds to minutes. The bad news? You'll need to make some specific optimizations and architectural choices to get there, and it does require a lot of oversight, planning, and investment.

Understanding OLAP vs OLTP query patterns

Before we get into specific optimizations, let's understand what makes analytics queries fundamentally different from the transactional queries we've discussed in previous articles. Having run both types at scale, I can tell you the differences go way beyond just query complexity.

Here's what we're dealing with:

Dimension	OLTP Workloads	OLAP Workloads
Query Type	Single-record lookups and updates Example: Fetching a user's profile	Large dataset analysis Example: Monthly revenue across all customers
Query Frequency	High (100s-1000s/second)	Lower, unless user-facing dashboards
Query Complexity	Simple, few joins	Complex, multiple joins and aggregations
Performance Needs	Milliseconds or bust	Seconds OK for batch<br>Must match OLTP speed for user-facing
Resource Pattern	CPU-bound Heavy index usage	I/O-bound Large sequential scans
Data Access	< 1% of rows per query	Often > 10% of rows per query
Index Usage	Relies heavily on index seeks	Often bypasses indexes for full scans

I think this table neatly summarizes why we need different strategies for OLAP workloads in Postgres. When your analytics queries start competing with your OLTP workloads for resources, you'll need to get creative with your approach - which is exactly what we'll cover next.

Postgres OLAP optimization strategies

Separate your workloads

The first rule of running analytics on Postgres: don't run analytics queries on your primary OLTP database. This isn't just best practice - it's survival. One complex analytics query scanning millions of rows can bring your entire application to a halt. I've seen it happen, and the incident reviews aren't fun.

Instead, set up a dedicated analytics replica. But not just any replica - one specifically configured for analytical workloads. Here's what this gets you:

Complete isolation from production OLTP traffic
Freedom to tune and optimize specifically for analytics
Ability to scale your analytics infrastructure independently
Protection against runaway queries impacting production
Option to add more read replicas as analytics needs grow

Here's a basic setup using streaming replication:

Configure the primary and replicas by updating the postgresql.conf:

Update your application to point to the analytics replicas where appropriate.

Notice the differences from the suggested settings in the “Outgrowing Postgres: Handling increased user concurrency” article. The key difference is that you’ll want to optimize these replicas for analytics workloads. Skip to the Performance tuning for analytics section below for more find-tuned settings for your replicas.

Key things you’ll want to watch out for:

Replication lag: Analytics queries can cause the replica to fall behind
Resource contention: Even on a replica, poorly optimized queries can cause problems
Data freshness: Your analytics will be slightly behind production due to replication

A common question I get is "How much replication lag is acceptable?" For most analytics workloads, a few seconds or even minutes of lag is fine. But if you need real-time analytics, you'll need to either accept higher costs from a more powerful replica setup or start looking at specialized analytics solutions.

Design your schema for analytics

Your schema design can make or break analytics performance. Let's look at several techniques that can dramatically improve query speed, starting with the basics and moving to more advanced approaches.

Materialized views for pre-aggregation

If you've read my first article in this series on handling growing data volumes, you're already familiar with materialized views. They're particularly valuable for analytics workloads because they let you pre-calculate expensive aggregations:

For a more thorough treatment of materialized views, including concurrent refreshes and maintenance strategies, check out the materialized views section of that article.

Table partitioning

Partitioning is another fundamental technique we covered in the data volumes article. For analytics workloads, time-based partitioning is particularly effective:

Take a look at the table partitioning section of that article for detailed partitioning strategies and maintenance approaches.

Vertical partitioning for analytics

Here's where things get interesting for analytics workloads. Vertical partitioning - splitting tables by columns rather than rows - can significantly improve analytics performance. While we touched on this in that same data volumes article, it deserves special attention for analytics.

Consider a wide events table:

You can split this into separate tables based on query patterns:

This approach has several benefits for analytics:

Queries only need to scan the columns they need
Better cache utilization for frequently accessed columns
More efficient compression for similar data types
Ability to use different storage strategies per table

But - and this is a big but - vertical partitioning isn't free. Here's what you're signing up for:

Write complexity

Every insert now requires multiple queries
Transactions need to span multiple tables
More complex error handling and rollback scenarios
Higher chance of constraint violations
Need for careful transaction management to maintain data consistency

Read complexity

If you need to read columns across the partitions, simple queries become joins
Need to maintain indexes across multiple tables
More complex query planning
Potential for performance cliffs if joins aren't optimized

Maintenance headaches

More tables to manage and monitor
Complex cleanup for orphaned records
Harder to modify schema (changes need to be coordinated)
More complex backup and restore procedures
Additional foreign key constraints to maintain

When should you actually use vertical partitioning? In my experience, it works best when:

Your analytical queries have clear patterns (some columns always used together)
You have very wide tables where most queries use a small subset of columns
You have different access patterns for different column groups
Your write volume is moderate (and you can handle the extra complexity)
You have columns with very different storage needs (like text vs numeric data)

If you're just starting out with analytics on Postgres, I'd recommend exhausting simpler options first. Materialized views and regular partitioning can take you quite far with less complexity. But when you're hitting scan performance issues and you see clear patterns in how columns are accessed, vertical partitioning might be worth the trade-offs.

Columnar storage with Postgres plugins

Postgres has a powerful extension system that lets you add new capabilities to your database. Think of extensions like browser plugins - they extend functionality but run with the same privileges as Postgres itself. That's both powerful and potentially risky.

Understanding Postgres extensions

Extensions are shared libraries that can:

Add new data types and functions
Create new table access methods
Implement new index types
Add foreign data wrappers

They're typically written in C and need to be compiled against your specific Postgres version. While the official Postgres extension repository (PGXN) has some quality controls, you should still evaluate extensions carefully - they run with full database privileges.

Columnar storage options

For analytics workloads, the following columnar storage extensions stand out. However, teams eventually considering dedicated OLAP databases should also evaluate alternatives like ClickHouse®; see our detailed comparison of ClickHouse® vs PostgreSQL with extensions to understand performance trade-offs and migration strategies.

cstore_fdw

The battle-tested option, developed by Citus Data (now part of Microsoft):

Pros:

Mature and stable
Used in production by many companies
Good documentation and community support
Solid compression ratios
Supports both read and write operations
Works with standard Postgres backup tools

Cons:

No support for indexes
Limited to PostgreSQL 12 and older in the open-source version
Writes are slower than regular tables
No concurrent write support
Limited query optimization capabilities

pg_analytics (formerly pg_lakehouse)

A DuckDB-powered option by ParadeDB focused on data lake integration:

Pros:

Native integration with data lakes and object stores
Supports Delta Lake and Apache Iceberg formats
Query pushdown to DuckDB for performance
Good compatibility with Postgres types

Cons:

Read-only access currently
Requires shared_preload_libraries configuration
Limited write capabilities
Early in development cycle

pg_duckdb

A relative newcomer by Hydra and MotherDuck that also embeds DuckDB in Postgres:

Pros:

Leverages DuckDB's high-performance analytics engine
Can directly query data in S3/object storage
Supports Parquet, CSV, JSON file formats
Integrates with MotherDuck cloud service
Active development from DuckDB team

Cons:

Currently read-only for object storage
Some Postgres data types not yet supported
Relatively new (v0.2.0 as of early 2024)
Limited production deployments

pg_mooncake

The newest kid on the block by Mooncake Labs combines DuckDB execution with Iceberg/Delta Lake storage and is available on Neon:

Pros:

Significantly faster analytics queries due to DuckDB's vectorized execution
Better compression ratios
Native integration with data lakes (can query Parquet files)
Modern columnar storage format
Active development

Cons:

Relatively new and less battle-tested
Limited write capabilities (no UPDATE/DELETE)
Some Postgres features not supported yet
More complex setup if using cloud storage
Smaller community and fewer production deployments

When to use a columnar storage plugin

There are definitely some risks you need to consider before going down this route:

Version Lock-in: Extensions need to match your Postgres version exactly
Operational Complexity: Need to maintain extension updates separately
Backup Considerations: Some backup tools may not handle extension data properly
Performance Overhead: Extensions can add CPU and memory overhead
Support Challenges: Limited commercial support options

Despite these challenges, columnar storage can be worth it when:

You have large analytics tables (>100GB)
Queries typically access a subset of columns
You need better compression ratios
Read performance is more important than write performance
You can manage your own Postgres infrastructure

If you're on a managed platform that doesn't support these extensions, you might need to:

Use regular Postgres tables with good indexing strategies
Consider a separate analytics database
Look at managed analytics solutions
Evaluate moving to self-hosted Postgres

Remember: adding extensions is a significant architectural decision. Start with standard Postgres features first, and only add extensions when you have clear evidence they'll solve specific problems in your workload.

While these advanced techniques can give you a huge boost in analytics performance, they also add complexity to your system. Start with the basics (materialized views, partitioning, and analytics replicas) and only move to columnar storage when you have clear evidence they'll help your specific workload.

Analytics query optimization techniques

Up to this point I’ve focused on infrastructure and schema changes - in many situations however, smart query optimization often beats brute force.

Understanding analytics query patterns

Analytics queries are fundamentally different from OLTP queries. They often:

Scan large portions of tables
Perform complex aggregations
Join multiple large tables
Use window functions for trend analysis
Need to handle significant intermediate results

Now I’ll walk you through how to optimize each of these patterns.

Window functions

Window functions are incredibly powerful for analytics. They let you perform calculations across sets of rows without the overhead of self-joins or complex subqueries:

Pro tip: Window functions are processed after aggregations. This means you can combine them with GROUP BY:

Common table expressions (CTEs)

Common Table Expressions (CTEs) are often seen as just a way to make queries more readable. But they're also a powerful optimization tool because of how they work in Postgres but, that also makes them a double-edged sword:

The other side of the sword: By default, CTEs in Postgres are materialized - they're calculated once and reused. This is usually great for analytics but can backfire if your CTE returns a large result set that's only used for a small subset of rows.

Parallel query execution

Postgres can parallelize many analytics operations, but you need to tune it right:

Some operations that benefit most from parallelism:

Sequential scans on large tables
Parallel joins (with appropriate settings)
Large aggregations
Window functions

Keep in mind: While parallel queries use more total CPU, they can finish much faster. Monitor your system resources carefully when enabling parallelism.

Essential EXPLAIN analysis

When optimizing analytics queries, EXPLAIN ANALYZE is your best friend. These are some of the specific things to watch for:

Common analytics query pitfalls

Unneeded ORDER BY in subqueries

Missing DISTINCT ON for latest values

Not using date truncation indexes

Query optimization is iterative. Start with the basics, measure, then optimize further based on real usage patterns. And always test with realistic data volumes - what works for 1M rows might fall apart at 100M. Lastly, it’s amazing how far you can get by simply following “The 5 rules of writing faster SQL queries.”

Indexing for analytics

BRIN indexes are Postgres’ secret weapon for analytical workloads. Unlike B-tree indexes that track every single row, BRIN (Block Range INdex) creates a tiny index that stores metadata about blocks of data. Think of it like chapter summaries in a book instead of a detailed index of every word.

For time-series data or any naturally ordered columns (like IDs that only go up), BRIN shines. It's perfect when physical storage order and the indexed column are highly correlated for your data.

In Crunchy Data's testing, on a 42MB table, the B-tree indexes were 21MB while the BRIN indexes were just 24KB! And for large result sets (100K+ rows), BRIN consistently outperformed B-tree indexes.

Here's how to create one:

The pages_per_range parameter is key - it defaults to 128 but tuning it can really boost performance. For narrow queries (returning ~100 rows), smaller values like 8 or 16 work better. For broader analytical queries, stick with larger values.

But BRIN isn't magic. Skip it when:

Your data is randomly distributed
You need exact row lookups
Your queries are highly selective (returning <1% of rows)
Your table gets lots of UPDATEs in the middle

The sweet spot? Analytical queries that scan large chunks of naturally ordered data. Bonus points if you're tight on disk space - BRIN indexes are tiny.

Want to check if BRIN makes sense? Look at the correlation between storage order and your column:

The closer to 1.0, the better BRIN will perform.

Performance tuning for analytics

No matter how well you write your queries, your postgres instance needs proper tuning to handle analytics workloads efficiently. While I covered basic configuration in previous articles, analytics workloads have specific needs that require different trade-offs.

Memory settings

Memory configuration is where analytics workloads differ most from OLTP. You'll need larger memory allocations for complex operations, but you'll also need to be careful not to overallocate:

I/O configuration

Analytics queries often read large amounts of data. These settings help optimize I/O:

Query planning

The query planner needs different settings for analytics:

Vacuum settings

Large analytics tables need different vacuum settings:

Monitoring configuration

Monitoring becomes even more critical with analytics workloads:

Real-world example

Here's what this might look like in practice. Let's say you have:

32 CPU cores
128GB RAM
NVMe storage
Mixed analytics workload

Your configuration might look like:

Don't just copy these numbers and YOLO it. Monitor your actual memory usage using tools like pg_stat_statements and adjust based on your workload. Below I’ll call out some gotchas and guidance for testing your settings.

Configuration gotchas

Watch out for these common issues:

Memory overallocation
- Too high work_mem can cause OOM kills
- Remember it's per operation, not per connection
Excessive parallelism
- Too many parallel workers can cause thrashing
- Monitor CPU usage and adjust
Aggressive vacuum
- Too aggressive settings can impact query performance
- Monitor vacuum timing and adjust cost limits
Checkpoint tuning
- Too large max_wal_size can cause long recovery times
- Too frequent checkpoints can impact performance

Testing your configuration

Always test configuration changes:

Start with a test environment
Benchmark before and after
Monitor system resources
Test with realistic data volumes
Test failure scenarios

Analytics workloads often reveal configuration problems that OLTP workloads miss. What works for millions of small transactions might fail spectacularly for one large analytical query.

Know your limits

Even with all these optimizations, you'll eventually hit limits running analytics on Postgres. Watch for:

Queries taking minutes or longer
High I/O wait times
Memory pressure from large sorts/joins
Vacuum struggling to keep up
Replication lag on analytics replicas

When you see these signs, it's time to start thinking about dedicated analytics solutions. But that's a topic for an upcoming article!

Takeaways

Postgres can handle many analytics workloads if you:

Isolate analytics traffic to dedicated replicas
Optimize your schema and configuration
Use materialized views strategically
Use techniques for optimal analytics querying
Leverage BRIN indexes where they make sense
Monitor and maintain carefully
Know when to quit and move to specialized tools

It's not a perfect analytics database. But with the right approach, it can take you surprisingly far. Just don't forget to keep an eye on those limits - they'll sneak up on you faster than an angry DBA during a production incident.

In the next article, I’ll walk you through when and how to move your analytics workloads off Postgres. Because sometimes, you really do need a hammer instead of that screwdriver.

Need to move your analytics off Postgres?

Tinybird is data infrastructure for software teams. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more.

You can try Tinybird for free with no time limit. Sign up here.