ClickHouse comes in two architectural flavors: a client-server database that runs as a standalone service, and chDB, an embedded engine that runs directly inside your Python process. The difference matters because it determines whether you're managing infrastructure or just importing a library.
This article explains how chDB differs from ClickHouse server, when the embedded approach makes sense, and what happens when you outgrow a single-process architecture and need to scale to production workloads.
What is chDB and why did ClickHouse acquire it?
chDB is an embedded OLAP engine that runs ClickHouse directly inside your Python process. Instead of connecting to a database server over a network, you import chDB as a library and run SQL queries in the same memory space as your application code.
ClickHouse acquired chDB in March 2024 after the project gained popularity among data scientists and developers. The acquisition made chDB an official part of the ClickHouse family, alongside ClickHouse server and ClickHouse Cloud.
The project started as a way to use ClickHouse without running a server. You install it with pip install chdb and immediately start querying Parquet files, CSV data, or Pandas DataFrames using ClickHouse SQL syntax. No infrastructure setup, no cluster configuration, no network connections.
chDB solves a specific problem: you want ClickHouse's query speed for local analytics but don't want to manage database infrastructure. This makes it useful for Jupyter notebooks, data exploration scripts, desktop applications, and anywhere you want SQL analytics without external dependencies.
How does an embedded engine differ from ClickHouse server?
The core difference is where the code runs. chDB executes queries inside your application's process, while ClickHouse server runs as a separate service that your application connects to over HTTP or TCP.
Query lifecycle and latency paths
When you run a query with chDB, the SQL executes directly in your Python process. There's no network hop, no serialization step, no separate server process. Data can stay in memory and get passed by reference rather than copied, which is why chDB achieves zero-copy data transfer with Pandas DataFrames.
ClickHouse server takes a different path. Your application sends queries over the network, the server processes them in a separate process, and results come back over that same network connection. Even on localhost, you pay the cost of network stack overhead and data serialization.
The tradeoff is isolation. A runaway query in ClickHouse server won't crash your application because it runs in a separate process. With chDB, a query that consumes too much memory will take down your entire application.
Storage format and durability guarantees
chDB uses temporary storage by default. When you query a Parquet file, chDB reads it directly but doesn't create persistent tables unless you explicitly configure them. Tables live in temporary directories and disappear when your process ends.
ClickHouse server stores data in MergeTree tables with persistent storage, background merges, and compression. Tables survive server restarts, and ClickHouse continuously optimizes data layout for faster queries. This persistent model is what allows ClickHouse server to handle terabytes of data in production.
Concurrency and resource isolation
chDB runs in a single process, so concurrent queries compete for the same CPU cores and memory. Start multiple queries from different threads, and they'll fight for resources. Exceed available RAM, and your process crashes.
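Because nothing inside the process arbitrates between queries, one option is to cap the number of in-flight queries yourself. The sketch below uses a semaphore from the standard library; `QueryGate` and `query_fn` are hypothetical names for illustration, not part of chDB.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryGate:
    """Caps how many queries may run at once inside this process."""

    def __init__(self, max_concurrent: int = 2) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, query_fn: Callable[[], T]) -> T:
        # Wait for a free slot, run the query, then release the slot
        # even if the query raises.
        with self._slots:
            return query_fn()
```

A caller would wrap each query in a zero-argument callable, for example `gate.run(lambda: chdb.query(sql, "DataFrame"))`. This limits contention but does not bound the memory a single query can use.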
ClickHouse server provides multi-tenant isolation with configurable resource limits, query quotas, and user permissions. You can run hundreds of concurrent queries from different clients, and ClickHouse schedules them, applies memory limits, and prevents one user from monopolizing resources.
When is chDB the right choice for your workload?
chDB works well when you want ClickHouse's analytical power without database infrastructure. The key is recognizing when the embedded model's limitations don't matter for your specific use case.
Local analytics inside a desktop or edge application
Building a desktop application that analyzes local data files? chDB gives you SQL analytics without requiring users to install and configure a database. A log analysis tool could use chDB to query gigabytes of log files on a user's machine with zero server setup.
Edge computing scenarios work similarly. An IoT device or edge node that processes sensor data locally can use chDB to run aggregations on-device before sending summaries to a central system.
Ad-hoc data science in Python notebooks
Data scientists working in Jupyter notebooks often want to run exploratory queries on CSV files, Parquet datasets, or Pandas DataFrames. chDB makes this trivial: import chdb and start querying without leaving your notebook or setting up infrastructure.
The zero-copy integration with Pandas is particularly useful here. Load data into a DataFrame, run complex SQL aggregations with chDB, and get results back as a DataFrame without copying data between processes.
CI/CD test harnesses for SQL logic
Testing SQL queries in continuous integration pipelines becomes simpler with chDB. Write tests that create temporary tables, run queries, and validate results, all in a single Python test file without external database dependencies. This approach is faster than spinning up a ClickHouse server container for each test run and keeps your test environment self-contained.
Performance benchmarks: chDB vs ClickHouse server vs DuckDB
Performance comparisons depend heavily on workload, data size, and whether you're measuring cold or warm queries. For single-threaded aggregations on datasets under 1GB, chDB and DuckDB perform similarly, both completing typical GROUP BY queries in under a second. ClickHouse server adds network overhead for small queries, so it may be slightly slower for quick analytics on small datasets when running locally.
The picture changes with larger datasets and more complex queries:
| Scenario | chDB | ClickHouse Server | DuckDB |
|---|---|---|---|
| Small dataset (<1GB) | Sub-second | Slightly slower (network) | Sub-second |
| Large dataset (>10GB) | RAM limited | Scales with disk | RAM limited |
| Multi-threaded joins | Good parallelism | Excellent parallelism | Good parallelism |
| Concurrent queries | Single process bottleneck | Excellent isolation | Single process bottleneck |
Single-thread aggregation on large datasets
When you run a single-threaded aggregation on a 5GB dataset, all three engines perform well if the data fits in memory. chDB and DuckDB use your machine's available RAM, while ClickHouse server can spill to disk when memory runs out.
The practical limit for chDB is your process memory. If a query's working set exceeds available RAM, the process can be killed with an out-of-memory error, taking your application with it. ClickHouse server handles this more gracefully by using disk-based algorithms for sorting and aggregation when memory runs low.
Multi-thread join operations
All three engines parallelize joins across CPU cores, but ClickHouse server has more sophisticated query scheduling for concurrent workloads. Running multiple joins simultaneously? ClickHouse server balances resources across queries, while chDB and DuckDB handle parallelism within a single query but don't coordinate across multiple concurrent queries.
JSON and Parquet read times
chDB inherits ClickHouse's extensive format support, including 70+ input and output formats. This means chDB reads JSON, Parquet, CSV, Arrow, and many other formats with the same performance characteristics as ClickHouse server. DuckDB also reads Parquet efficiently but has fewer format options overall.
Developer workflow: installing, querying, and shipping to production
Getting started with chDB takes minutes because it's distributed as a Python package with no external dependencies beyond Python itself.
1. Install the Python package
Install chDB using pip in any Python 3.8+ environment:
```shell
pip install chdb
```
The package includes the entire ClickHouse engine compiled as a shared library, so the download is around 300MB. Once installed, you can import chDB in any Python script or notebook without additional configuration.
2. Write and run a SQL query
The simplest way to use chDB is with the query function, which takes SQL as a string and returns results:
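```python
import chdb

# Read the Parquet file in place (no table is created) and return
# the ten most active users as a Pandas DataFrame.
result = chdb.query("""
    SELECT
        user_id,
        count() AS events,
        max(timestamp) AS last_seen
    FROM file('events.parquet', Parquet)
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""", "DataFrame")

print(result)
```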
This query reads a Parquet file directly without creating a table. The second parameter "DataFrame" tells chDB to return results as a Pandas DataFrame.
3. Package and deploy with your app
Because chDB is a Python package, you include it in your application's dependencies and ship it like any other library. Add chdb to your requirements.txt or pyproject.toml, and it installs when users install your application. For desktop applications, this means users get SQL analytics without installing a database.
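The workflow described here is Python-specific. chDB bindings exist for some other languages, including Go, Rust, and Node.js, but Python is the primary interface. If no binding fits your stack, you'll need ClickHouse server or a managed service instead.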
Limits of embedded ClickHouse and when to migrate
chDB's embedded architecture creates specific constraints that make it unsuitable for certain workloads. Recognizing these limits helps you decide when to migrate to ClickHouse server or a managed service.
Memory ceilings and long-running queries
The most common failure mode with chDB is running out of memory. Because chDB runs in your application process, it shares the same memory space as your application code. A query that tries to aggregate 10GB of data on a machine with 8GB of RAM crashes your entire process.
ClickHouse server handles this differently by using disk-based algorithms when memory runs out. Server-side queries can process datasets larger than RAM by spilling intermediate results to disk, which prevents out-of-memory crashes.
Long-running queries also pose problems for chDB. A query call is synchronous, so if a query takes 30 seconds, the thread that issued it is blocked for that entire time unless you move the work to a background thread or process.
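One way to recover some of the server's isolation while staying embedded is to run heavy queries in a child interpreter with a timeout. The following is a minimal sketch using only the standard library; `run_isolated` is a hypothetical helper, and what the child script does (for example importing chdb and printing a result) is an assumption about how you would wire it up.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(script: str, timeout_s: float = 30.0) -> Tuple[int, str]:
    """Run a Python snippet in a child interpreter; return (exit_code, stdout).

    A crash or out-of-memory kill in the child shows up as a non-zero exit
    code instead of taking down the caller. If the timeout expires, the
    child is killed and subprocess.TimeoutExpired is raised.
    """
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, proc.stdout
```

The cost of this design is that results cross a process boundary as text, so the zero-copy benefit of the embedded model is lost. It fits occasional heavy queries better than hot paths.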
Multi-tenant and auth constraints
chDB has no built-in authentication or multi-tenancy. If multiple users or teams query data, chDB can't enforce access controls or resource limits per user. Every query runs with the same permissions and resource availability.
ClickHouse server provides user management, role-based access control, and query quotas. You can limit how much memory or CPU each user consumes, restrict access to specific databases or tables, and audit who ran which queries. For single-user applications or scripts, chDB's lack of auth doesn't matter. For multi-tenant SaaS applications or shared analytics platforms, you need ClickHouse server's isolation and security features.
Scaling out with Tinybird pipes and APIs
When you outgrow chDB's single-process limitations, migrating to Tinybird gives you managed ClickHouse infrastructure without the operational complexity of running your own cluster. Tinybird handles scaling, replication, and high availability while providing a developer-friendly workflow for defining data pipelines and APIs.
Streaming and real-time ingestion options beyond chDB
chDB was designed for batch analytics on static files, not for continuous data ingestion. This makes it unsuitable for real-time analytics workloads that process streaming data.
Kafka, Pub/Sub, and HTTP ingestion
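ClickHouse server has a native Kafka table engine, and connectors or managed ingestion services cover other streaming platforms such as Google Pub/Sub and AWS Kinesis. You can create Kafka tables that automatically consume messages and, paired with a materialized view, insert them into ClickHouse as they arrive with configurable batching. The Kafka table engine provides at-least-once delivery, so pipelines that need exactly-once behavior must deduplicate or use a connector that supports it.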
chDB has no streaming connectors. You can write Python code to consume from Kafka and insert into chDB tables, but you're responsible for handling failures, batching, and backpressure. This quickly becomes complex and error-prone for production streaming pipelines.
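To give a sense of what that hand-written plumbing involves, here is a minimal sketch of a batching buffer with retry and simple backpressure. `BatchBuffer` and `sink` are hypothetical names: `sink` stands in for whatever function inserts one batch into your table, and the consumer loop that feeds `add` is left out.

```python
from typing import Callable, List, Tuple

Row = Tuple  # one record; the real shape depends on your table


class BatchBuffer:
    """Buffers rows and hands them to `sink` in batches."""

    def __init__(self, sink: Callable[[List[Row]], None],
                 batch_rows: int = 1000, max_buffered: int = 10_000) -> None:
        self._sink = sink
        self._batch_rows = batch_rows
        self._max_buffered = max_buffered
        self._rows: List[Row] = []

    def add(self, row: Row) -> bool:
        """Buffer a row. Returns False when the buffer is full (backpressure)."""
        if len(self._rows) >= self._max_buffered:
            return False  # caller should pause consuming and retry flush()
        self._rows.append(row)
        if len(self._rows) >= self._batch_rows:
            self.flush()
        return True

    def flush(self) -> bool:
        """Send buffered rows to the sink. Rows are kept if the sink raises."""
        if not self._rows:
            return True
        batch = list(self._rows)
        try:
            self._sink(batch)
        except Exception:
            return False  # keep rows for a later retry (at-least-once)
        del self._rows[:len(batch)]
        return True
```

Even this small sketch leaves open questions a production pipeline has to answer: a sink that fails after a partial insert produces duplicates on retry, and buffered rows are lost if the process dies before a flush.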
Materialized view patterns for rollups
Materialized views in ClickHouse pre-aggregate data as it's inserted, which substantially speeds up queries on large datasets.
For example, you might create a materialized view that maintains hourly rollups of user activity, so queries for daily or weekly metrics run very quickly.
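The idea behind such a rollup can be illustrated in plain Python. This is a conceptual sketch of insert-time pre-aggregation, not ClickHouse DDL: in ClickHouse you would express it with `CREATE MATERIALIZED VIEW` and a function like `toStartOfHour`. The `HourlyRollup` class and the `(user_id, timestamp)` event shape are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, Tuple

Event = Tuple[str, datetime]  # (user_id, timestamp); hypothetical event shape


class HourlyRollup:
    """Keeps per-user event counts bucketed by hour, updated on insert."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[datetime, str], int] = defaultdict(int)

    def insert(self, events: Iterable[Event]) -> None:
        # Aggregate at write time: each event only bumps one counter.
        for user_id, ts in events:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            self._counts[(hour, user_id)] += 1

    def daily_events(self, user_id: str, day: date) -> int:
        # Reads scan at most 24 hourly buckets per user and day,
        # not every raw event.
        return sum(
            count
            for (hour, uid), count in self._counts.items()
            if uid == user_id and hour.date() == day
        )
```

The read path touches the small rollup rather than the raw events, which is where the speedup for daily or weekly metrics comes from.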
Operational cost comparison: self-hosted, chDB, and managed ClickHouse
The total cost of running ClickHouse includes more than just infrastructure spend. Engineering time, operational overhead, and opportunity cost all factor in.
Here's how costs break down across deployment options:
- chDB operational costs: No infrastructure costs, minimal engineering time for setup, but limited to single-machine workloads and no production support
- Self-hosted ClickHouse costs: Full control over infrastructure spend, but high engineering time for setup, monitoring, and scaling with dedicated DevOps expertise required
- Managed ClickHouse (Tinybird) costs: Predictable infrastructure costs with usage-based pricing, minimal engineering time, and production support included
Hardware and cloud spend
chDB runs on whatever hardware you're already using, so there's no additional infrastructure cost. Running chDB in a serverless function or container? You pay for the compute time like any other code.
Self-hosted ClickHouse requires provisioning servers with sufficient RAM and fast storage. For a production cluster with replication, you might run 3-6 servers with 64-128GB RAM each, which costs $500-2000/month on AWS or GCP depending on instance types. For comparison with other managed warehouses, Snowflake is reportedly about 3× more expensive than ClickHouse Cloud for comparable performance. Managed services like Tinybird charge based on data volume and query usage.
Engineering and on-call time
chDB requires minimal operational work because there's no server to manage. However, you're responsible for handling errors, optimizing queries, and debugging memory issues when they occur.
Self-hosted ClickHouse demands significant engineering investment. You configure replication, set up monitoring and alerting, tune query performance, manage schema migrations, and handle operational incidents. This easily consumes 20-40% of an engineer's time for a production deployment. According to GitLab's reference architecture, self-managed ClickHouse represents 78.01% of total infrastructure costs for smaller deployments. Managed services eliminate most operational work by handling infrastructure scaling, monitoring, and incident response.
Opportunity cost to product delivery
The hidden cost of database operations is the product work you're not doing. Every hour spent tuning ClickHouse queries or debugging replication issues is an hour not spent shipping features to customers.
chDB minimizes this opportunity cost for simple use cases but forces you to migrate when you outgrow its limitations. Self-hosted ClickHouse gives you full control but requires ongoing operational investment.
Ownership and future roadmap after the acquisition
ClickHouse's acquisition of chDB in 2024 brought the project under the same governance and development process as ClickHouse server. The chDB repository moved to the ClickHouse organization on GitHub, and development is now coordinated with the broader ClickHouse roadmap.
The acquisition ensures chDB stays compatible with ClickHouse server as new features are added. When ClickHouse releases new SQL functions or table engines, they become available in chDB as well, though sometimes with a delay as the embedded packaging catches up. ClickHouse has indicated that chDB will remain focused on embedded use cases rather than trying to compete with ClickHouse server.
Your next step with Tinybird for managed ClickHouse at scale
When your analytics workload outgrows chDB's single-process architecture, Tinybird provides a managed ClickHouse platform designed for developers who want to integrate ClickHouse into their applications without managing infrastructure. Tinybird handles the operational complexity of running ClickHouse at scale: automatic scaling, replication, backups, monitoring, and query optimization.
You define data sources and SQL queries, and Tinybird deploys them as production-ready APIs with authentication, rate limiting, and observability built in. The developer experience focuses on speed. You can go from raw data to a production API in minutes using the Tinybird CLI, which lets you develop and test locally before deploying to the cloud.
Sign up for a free Tinybird account to get started. The free tier includes substantial limits for development and testing, and you can scale to production workloads with usage-based pricing as your application grows.
FAQs about embedded ClickHouse
How much data can chDB reliably handle?
chDB works well with datasets up to a few gigabytes, depending on your machine's available RAM. The practical limit is around 50-70% of your system memory because queries need working space for intermediate results. For datasets larger than 10GB or queries that require significant memory for sorting and aggregation, ClickHouse server is a better choice because it can spill to disk.
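The 50-70% rule of thumb can be turned into a quick pre-flight check. This is a rough sketch: `fits_memory_budget` is a hypothetical helper, you supply the RAM figure yourself, and compressed files such as Parquet usually expand in memory, so on-disk size understates the real footprint.

```python
GIB = 1024 ** 3


def fits_memory_budget(dataset_bytes: int, total_ram_bytes: int,
                       usable_fraction: float = 0.5) -> bool:
    """Return True if the dataset fits within the usable share of RAM.

    usable_fraction is the share of system memory you are willing to give
    the query engine (the article's rule of thumb is 0.5 to 0.7).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return dataset_bytes <= total_ram_bytes * usable_fraction
```

For example, `fits_memory_budget(10 * GIB, 16 * GIB)` is false at the conservative 0.5 setting (an 8 GiB budget) but true at 0.7 (an 11.2 GiB budget), which is the zone where moving to a server starts to make sense.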
Is chDB production ready for multi-tenant SaaS applications?
chDB lacks the isolation, authentication, and resource management features needed for multi-tenant production environments. It runs in a single process with no user permissions or query quotas, so one user's runaway query can crash the entire application. For SaaS applications serving multiple customers, you need ClickHouse server or a managed service like Tinybird that provides proper multi-tenancy.
What is the license difference between chDB and ClickHouse server?
Both chDB and ClickHouse server use the Apache 2.0 license, which allows commercial use, modification, and distribution, subject to its attribution and notice requirements. You can use either in proprietary applications, modify the source code, and distribute your own builds. The license terms are identical because chDB is built directly on ClickHouse's codebase.
How do I migrate schemas from chDB to Tinybird?
Export your table schemas from chDB using SHOW CREATE TABLE to get the DDL statements. In Tinybird, create equivalent data sources using the same column definitions and data types. For queries, copy your SQL into Tinybird pipe files and add any necessary parameters. The SQL dialect is the same because Tinybird runs ClickHouse, so most queries work with little or no modification. The main exception is queries that read local files with file(), which need to reference Tinybird data sources instead.
