Complex SQL queries often become difficult to read when the same filtering or aggregation logic appears multiple times. Common table expressions (CTEs) in ClickHouse solve this by letting you define temporary named result sets using the WITH clause, which you can then reference throughout your query like a regular table.
This article shows you how to write CTEs in ClickHouse, when to use them instead of subqueries or materialized views, and how to deploy CTE-based queries as production API endpoints.
What is a common table expression in ClickHouse
A common table expression (CTE) in ClickHouse is a temporary named result set that you define within a single query using the WITH clause. The CTE exists only during query execution and lets you break down complex logic into readable, reusable parts.
You write a CTE by giving a subquery a name, then reference that name later in your main query. This is particularly useful when you need to filter data in stages or when the same intermediate calculation appears multiple times in your query. Instead of copying the same subquery logic repeatedly, you define it once as a CTE and reference it by name.
Here's what a basic CTE looks like:
WITH active_users AS (
    SELECT user_id
    FROM users
    WHERE status = 'active'
)
SELECT
    au.user_id,
    COUNT(*) AS event_count
FROM events e
JOIN active_users au ON e.user_id = au.user_id
GROUP BY au.user_id
ORDER BY event_count DESC;
In this query, active_users is the CTE. It filters the users table to only active users, then the main query joins this filtered set with the events table to count how many events each active user has generated.
ClickHouse WITH clause syntax
The WITH clause in ClickHouse follows a straightforward pattern where you define one or more CTEs before your main SELECT statement. Each CTE has a name, followed by AS, followed by a subquery in parentheses.
The basic structure looks like this:
WITH
    cte_name AS (
        SELECT ...
    )
SELECT ...
FROM cte_name;
You can define multiple CTEs by separating them with commas:
WITH
    first_cte AS (
        SELECT ...
    ),
    second_cte AS (
        SELECT ...
    )
SELECT ...
FROM first_cte
JOIN second_cte ON ...;
Each CTE must be defined before you reference it: a later CTE can build on an earlier one, but you can't reference a CTE before its definition.
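For example, here's a minimal sketch (reusing the users and events tables from the example above) where the second CTE builds on the first:
WITH
    active_users AS (
        SELECT user_id
        FROM users
        WHERE status = 'active'
    ),
    -- This CTE can reference active_users because it's defined above
    active_user_events AS (
        SELECT e.user_id, e.event_type
        FROM events e
        JOIN active_users au ON e.user_id = au.user_id
    )
SELECT event_type, COUNT(*) AS event_count
FROM active_user_events
GROUP BY event_type;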
Non-recursive CTE example you can copy-paste
Here's a practical example that calculates user engagement metrics using CTEs. This query finds users who have made at least one purchase, then calculates their average session duration.
WITH
    purchasing_users AS (
        SELECT DISTINCT user_id
        FROM events
        WHERE event_type = 'purchase'
    ),
    session_durations AS (
        SELECT
            user_id,
            session_id,
            max(timestamp) - min(timestamp) AS duration_seconds
        FROM events
        WHERE user_id IN (SELECT user_id FROM purchasing_users)
        GROUP BY user_id, session_id
    )
SELECT
    user_id,
    COUNT(DISTINCT session_id) AS total_sessions,
    AVG(duration_seconds) AS avg_session_duration
FROM session_durations
GROUP BY user_id
ORDER BY avg_session_duration DESC
LIMIT 100;
The first CTE (purchasing_users) filters down to users who have completed at least one purchase. The second CTE (session_durations) calculates how long each session lasted for those purchasing users by finding the difference between the earliest and latest event timestamps in each session. The main query then aggregates these durations to show the average session length per user.
This pattern is common in analytics where you filter data in stages and then perform calculations on the filtered results.
Referencing a CTE multiple times in one query
One advantage of CTEs is that you can reference the same named result set multiple times within a single query. This eliminates repeated subquery logic, which both reduces code duplication and makes queries easier to maintain.
Here's an example that references a CTE in two different places:
WITH high_value_customers AS (
    SELECT
        customer_id,
        SUM(order_amount) AS total_spent
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
    HAVING total_spent > 1000
)
SELECT
    'High Value' AS segment,
    COUNT(*) AS customer_count,
    AVG(total_spent) AS avg_spent
FROM high_value_customers
UNION ALL
SELECT
    'Top 10' AS segment,
    COUNT(*) AS customer_count,
    AVG(total_spent) AS avg_spent
FROM (
    SELECT *
    FROM high_value_customers
    ORDER BY total_spent DESC
    LIMIT 10
);
In this query, the high_value_customers CTE appears twice: once in the first SELECT and again in the subquery within the UNION ALL. Without the CTE, you'd write the same aggregation logic twice, which increases the chance of inconsistencies if you later modify the filtering criteria.
CTEs also work well in self-joins where you compare rows from the same filtered dataset:
WITH daily_metrics AS (
    SELECT
        toDate(timestamp) AS date,
        COUNT(*) AS event_count
    FROM events
    GROUP BY date
)
SELECT
    today.date,
    today.event_count AS today_count,
    yesterday.event_count AS yesterday_count,
    today.event_count - yesterday.event_count AS daily_change
FROM daily_metrics today
LEFT JOIN daily_metrics yesterday
    ON yesterday.date = today.date - INTERVAL 1 DAY
ORDER BY today.date DESC;
Here the same CTE joins to itself to compare daily event counts with the previous day's counts.
Does ClickHouse support recursive CTEs
ClickHouse added limited support for recursive CTEs starting in version 24.4, but the implementation differs from databases like PostgreSQL or SQL Server. Recursive CTEs in ClickHouse work for specific use cases but have constraints that affect how you can use them.
A recursive CTE references itself within its own definition, which allows you to perform iterative calculations or traverse hierarchical data. The syntax uses WITH RECURSIVE instead of just WITH:
WITH RECURSIVE counter AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM counter
    WHERE n < 10
)
SELECT * FROM counter;
This query generates numbers from 1 to 10 by starting with 1 and recursively adding 1 until the condition n < 10 is no longer true.
However, ClickHouse's recursive CTE implementation is more restrictive than in other databases. The recursion depth is limited, and certain operations that work in other systems might not be supported. For complex hierarchical queries or deep recursion, you might need alternative approaches like array functions or multiple query steps.
If you're working with hierarchical data like organizational charts or category trees, test your recursive queries carefully. Consider whether a materialized view or a different data modeling approach might be more reliable for your use case.
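For simple sequences like the counter above, you can often avoid recursion entirely. As a minimal sketch, the numbers table function produces the same result without a recursive CTE:
-- Generates 1 through 10 without recursion
SELECT number + 1 AS n
FROM numbers(10);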
CTE performance and memory considerations
CTEs in ClickHouse are not materialized by default. The query planner inlines each CTE's definition at every place you reference it, so a CTE referenced twice is typically computed twice rather than computed once and stored. This behavior has both advantages and tradeoffs depending on your data volume and query structure.
When a CTE's subquery runs, its intermediate results are held in RAM like those of any other subquery, and that memory counts toward the max_memory_usage setting, which controls how much memory a single query can use. Exceeding the limit causes the query to fail, and because each reference re-executes the subquery, an expensive CTE referenced several times multiplies both CPU and memory cost.
For CTEs that produce small to medium-sized result sets (thousands to millions of rows), this overhead typically isn't a concern, and the readability benefit usually outweighs the cost. However, if a CTE scans billions of rows or produces wide intermediate results with many columns, you might hit memory limits.
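If you want to bound a specific query rather than rely on the server-wide default, you can attach the setting to the query itself. The limit below is purely illustrative:
-- Cap this query at roughly 4 GB of memory (illustrative value)
SELECT COUNT(*)
FROM events
SETTINGS max_memory_usage = 4000000000;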
One approach is to filter or aggregate data as early as possible in your CTE definitions:
-- Less efficient: large intermediate result
WITH all_events AS (
    SELECT * FROM events
)
SELECT COUNT(*)
FROM all_events
WHERE event_type = 'purchase';

-- More efficient: filter early
WITH purchase_events AS (
    SELECT * FROM events
    WHERE event_type = 'purchase'
)
SELECT COUNT(*) FROM purchase_events;
The second version filters down to only purchase events in the CTE, which reduces the amount of data held in memory.
When to use a CTE versus a subquery or materialized view
Choosing between CTEs, subqueries, and materialized views depends on your specific use case, query complexity, and performance requirements. Each approach has different characteristics that make it more suitable for certain scenarios.
CTEs work well when you:
- Reference the same intermediate result multiple times in a single query
- Want to improve readability by breaking complex logic into named steps
- Organize multi-stage filtering or aggregation in a clear sequence
Subqueries are better when you:
- Only need the intermediate result once
- Have a simple filtering condition that doesn't benefit from being named
- Want to avoid the overhead of materializing a result set in memory
Materialized views are the right choice when you:
- Run the same aggregation or transformation repeatedly across many queries
- Can accept slightly stale data that updates periodically rather than in real-time
- Want to pre-compute expensive operations to improve query response times
Here's the same logic implemented as a CTE and as a subquery, with a materialized view sketch following below:
-- As a CTE
WITH recent_orders AS (
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date >= today() - INTERVAL 30 DAY
GROUP BY customer_id
)
SELECT * FROM recent_orders WHERE total > 500;
-- As a subquery
SELECT * FROM (
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date >= today() - INTERVAL 30 DAY
GROUP BY customer_id
) WHERE total > 500;
The CTE and subquery versions produce identical results and have similar performance characteristics. The CTE version is more readable if you reference recent_orders multiple times or if your query has several stages of logic.
A materialized view would store the aggregated results persistently, which makes subsequent queries much faster but requires additional storage and adds complexity to your data pipeline.
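As a sketch of that third approach, the same aggregation could be maintained incrementally with a materialized view. The name daily_order_totals is hypothetical, and note that the 30-day filter moves to query time, because a materialized view transforms rows as they're inserted rather than re-evaluating a rolling window:
-- Incrementally maintained daily totals per customer
CREATE MATERIALIZED VIEW daily_order_totals
ENGINE = SummingMergeTree
ORDER BY (customer_id, order_date)
AS SELECT
    customer_id,
    order_date,
    sum(amount) AS total
FROM orders
GROUP BY customer_id, order_date;

-- The rolling 30-day window is applied when you query the view
SELECT customer_id, sum(total) AS total
FROM daily_order_totals
WHERE order_date >= today() - INTERVAL 30 DAY
GROUP BY customer_id
HAVING total > 500;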
Avoiding CTEs with Tinybird's multi-node pipes
Tinybird is a managed ClickHouse service that provides a better alternative to CTEs: multi-node pipes. Instead of nesting logic within a single query using WITH clauses, you break your query into separate, referenceable nodes that make your logic more modular and reusable.
CTEs in ClickHouse have limitations: each CTE only exists within its parent query, making them non-reusable across different queries. Tinybird's multi-node approach solves this differently:
- Individual nodes can be referenced anywhere within the same pipe, similar to how you'd reference a CTE, but cleaner and more testable
- Complete pipes can be referenced across your entire project using SELECT * FROM pipe_name, allowing you to chain entire transformation pipelines together and reuse them project-wide
This means a pipe can contain multiple transformation steps as nodes, and then the entire pipe becomes a reusable building block for other pipes in your project.
Why multi-node pipes are better than CTEs
Problem with CTEs:
-- Every query needs to redefine the same CTE
WITH filtered_events AS (
    SELECT * FROM user_events
    WHERE timestamp >= today() - 30
)
SELECT ... FROM filtered_events ...
Tinybird's solution:
Define filtered_events once as a node within a pipe, then:
- Reference it across multiple nodes in the same pipe
- Or reference the entire pipe from other pipes across your project
1. Create the data source
First, install the Tinybird CLI and start a local development environment:
curl -L tinybird.co | sh
tb login
tb local start
Create a sample data file called user_events.ndjson with event data:
{"user_id": "user_1", "event_type": "login", "timestamp": "2024-01-15 08:30:00", "session_id": "session_100"}
{"user_id": "user_1", "event_type": "pageview", "timestamp": "2024-01-15 08:31:00", "session_id": "session_100"}
{"user_id": "user_2", "event_type": "login", "timestamp": "2024-01-15 09:00:00", "session_id": "session_101"}
{"user_id": "user_2", "event_type": "purchase", "timestamp": "2024-01-15 09:15:00", "session_id": "session_101"}
{"user_id": "user_3", "event_type": "login", "timestamp": "2024-01-15 10:00:00", "session_id": "session_102"}
Generate and populate the data source:
tb datasource generate user_events --file user_events.ndjson
tb build
tb datasource append user_events --file user_events.ndjson
2. Build a multi-node pipe
Create a file called user_activity_summary.pipe with multiple nodes instead of CTEs:
TOKEN user_activity_read READ

DESCRIPTION >
    Calculate user activity metrics using multi-node architecture instead of CTEs

NODE filtered_events
SQL >
    %
    SELECT
        user_id,
        event_type,
        timestamp,
        session_id
    FROM user_events
    WHERE 1=1
    {% if defined(event_type) %}
        AND event_type = {{ String(event_type) }}
    {% end %}
    {% if defined(start_date) %}
        AND timestamp >= {{ DateTime(start_date) }}
    {% end %}

NODE user_activity_endpoint
SQL >
    %
    SELECT
        user_id,
        COUNT(*) AS event_count,
        COUNT(DISTINCT session_id) AS session_count,
        min(timestamp) AS first_event,
        max(timestamp) AS last_event
    FROM filtered_events
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT {{ Int32(limit, 10) }}

TYPE endpoint
This pipe defines two nodes:
- filtered_events: Handles filtering logic with parameterization
- user_activity_endpoint: Aggregates data from the first node
Each node is an independent transformation step. The second node references the first by name, just like you'd reference a CTE, but with key advantages:
- Within-pipe reusability: Multiple nodes in the same pipe can reference filtered_events
- Cross-project reusability: Other pipes can reference the entire user_activity_summary pipe
- Testing: You can test each node independently during development
- Clarity: Each transformation step is clearly separated
- Performance: Tinybird optimizes the entire chain of nodes as a single query
For example, within this same pipe, another node could also use the filtered events:
NODE event_type_breakdown
SQL >
    SELECT
        event_type,
        COUNT(*) AS type_count
    FROM filtered_events
    GROUP BY event_type
Or a completely different pipe could reference the entire user_activity_summary output:
NODE top_users
SQL >
    SELECT *
    FROM user_activity_summary
    WHERE event_count > 10
Test it locally:
tb dev
Then query the endpoint at http://localhost:7181/pipes/user_activity_summary.json?event_type=login&limit=5.
3. Push and test the API
Deploy your data source and pipe to Tinybird Cloud:
tb --cloud deploy
Get the API endpoint URL:
tb --cloud endpoint url user_activity_summary --language curl
This returns a curl command you can run to test your deployed API:
curl -X GET "https://api.us-east.tinybird.co/v0/pipes/user_activity_summary.json?event_type=login&limit=5&token=p.ey..."
The response is a JSON object with your query results:
{
    "meta": [
        {"name": "user_id", "type": "String"},
        {"name": "event_count", "type": "UInt64"},
        {"name": "session_count", "type": "UInt64"},
        {"name": "first_event", "type": "DateTime"},
        {"name": "last_event", "type": "DateTime"}
    ],
    "data": [
        {
            "user_id": "user_1",
            "event_count": 1,
            "session_count": 1,
            "first_event": "2024-01-15 08:30:00",
            "last_event": "2024-01-15 08:30:00"
        }
    ],
    "rows": 1
}
You can now call this API from your application code using any HTTP client. The multi-node query runs on managed ClickHouse infrastructure without you handling cluster scaling, connection pooling, or performance tuning.
Benefits of multi-node pipes over CTEs
This multi-node approach provides several advantages over traditional CTEs:
- Pipe reusability: The entire user_activity_summary pipe can be referenced by other pipes in your project using SELECT * FROM user_activity_summary. This lets you build transformation pipelines once and reuse them across multiple endpoints without duplicating logic.
- Independent testing: During development with tb dev, you can test each node separately to verify the filtering logic works before building the aggregation layer.
- Better debugging: When a query fails, Tinybird shows you which specific node caused the error, making it easier to identify and fix issues.
- Simplified collaboration: Team members can understand and modify individual nodes without parsing complex nested CTEs.
- Version control friendly: Each node change shows up clearly in git diffs, unlike CTE changes buried within large query blocks.
Streaming ingestion and CTE consistency
ClickHouse uses an eventually consistent model for data, which means recently ingested data might not immediately appear in query results. This behavior affects CTEs the same way it affects regular queries.
When you ingest data into ClickHouse, it's written to parts that are merged in the background. Queries read from the parts that are currently available, which might not include the most recently written data until the merge completes. This typically happens within seconds, but the exact timing depends on your table engine and merge settings.
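You can see this parts-based storage directly in the system tables. Here's a quick diagnostic sketch, assuming the events table from earlier:
-- Count the active data parts backing the events table
SELECT count() AS active_parts
FROM system.parts
WHERE table = 'events' AND active;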
If you're using CTEs to filter or aggregate streaming data, be aware that your CTE might not include the absolute latest records. For most analytics use cases, this delay is acceptable because the data is still fresh enough for decision-making.
If you need stronger consistency guarantees, you can use the FINAL modifier in your queries, though this comes with a performance cost:
WITH recent_events AS (
    SELECT *
    FROM events FINAL
    WHERE timestamp >= now() - INTERVAL 1 HOUR
)
SELECT COUNT(*) FROM recent_events;
The FINAL modifier forces ClickHouse to merge parts on the fly, which ensures you see the most up-to-date data but makes queries slower, particularly on large tables.
For real-time dashboards or monitoring systems, test your queries to understand the typical lag between data ingestion and query visibility, then design your CTEs and aggregations accordingly.
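One simple way to estimate that lag, assuming the events table's timestamp column reflects when each row was produced, is to compare the newest visible row against the current time:
-- Rough estimate of ingestion-to-visibility lag, in seconds
SELECT now() - max(timestamp) AS seconds_behind
FROM events;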
Next steps for faster analytics with Tinybird
Tinybird provides a managed ClickHouse platform that handles infrastructure complexity so you can focus on building analytics features. The service includes built-in streaming ingestion, automatic scaling, and API generation that turns your SQL queries into production endpoints.
If you're integrating ClickHouse into an application and want to avoid managing clusters, setting up observability, or tuning performance at scale, Tinybird handles those operational concerns. You can define data pipelines as code, test locally, and deploy to production with standard CI/CD workflows.
Sign up for a free Tinybird plan to start building with managed ClickHouse. The platform includes a CLI for local development, comprehensive documentation, and example projects that demonstrate common analytics patterns using CTEs and other ClickHouse features.
FAQs about ClickHouse CTEs
How many CTEs can a single query contain?
ClickHouse doesn't enforce a hard limit on the number of CTEs you can define in one query. However, each CTE consumes memory during execution, so practical limits depend on your available RAM and the size of each CTE's result set. Most queries use between one and five CTEs.
Can I reference the same CTE in a JOIN operation?
Yes, you can reference the same CTE multiple times in JOIN operations, including self-joins where you join a CTE to itself. This is one of the main advantages of CTEs over subqueries, as it lets you write the logic once and reference it multiple times without duplication.