ClickHouse® queries use SQL syntax extended with specialized functions and clauses built for columnar storage and analytical workloads at scale. The database supports standard SELECT, JOIN, and GROUP BY operations while adding features like LIMIT BY for top-N-per-group queries, SAMPLE for fast prototyping on large datasets, and approximate aggregation functions that trade precision for speed.
This guide covers the query patterns you'll use most often when working with ClickHouse, from basic filtering and aggregation to advanced optimization techniques and troubleshooting slow queries.
Writing your first ClickHouse select
A basic SELECT statement retrieves data from tables, and you can control output format, limit rows, and filter columns using standard SQL clauses.
To start querying ClickHouse, you first need a table with data. Here's how to create a simple events table and run your first queries.
Create table sample_events
The CREATE TABLE statement defines the schema for storing data. In ClickHouse, you specify column names, data types, and a table engine that determines how data is stored and queried.
CREATE TABLE sample_events (
timestamp DateTime,
user_id UInt64,
event_name String,
event_value Float64
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
The MergeTree engine is the most common choice for analytical queries. The ORDER BY clause defines the primary key by default, which determines how data is sorted on disk and affects query performance.
Basic select star
The SELECT * syntax retrieves all columns from a table. This works well for exploring data but can be inefficient on large tables with many columns.
SELECT * FROM sample_events LIMIT 10;
The LIMIT clause restricts the number of rows returned. In production queries, always use LIMIT when exploring data to avoid accidentally pulling millions of rows.
ClickHouse returns results in a default tabular format, but you can specify different output formats using the FORMAT clause. For example, FORMAT JSON returns results as JSON objects, while FORMAT TabSeparated outputs tab-delimited text.
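For example, appending FORMAT JSON to the exploration query above returns the same rows as JSON:
SELECT * FROM sample_events LIMIT 10 FORMAT JSON;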
Selecting specific columns
Selecting only the columns you need improves query performance, especially on wide tables. ClickHouse's columnar storage means it only reads the columns referenced in your query, and per-column compression further reduces the I/O required.
SELECT
toDate(timestamp) AS event_date,
user_id,
event_name
FROM sample_events
LIMIT 100;
Column aliases using AS make results more readable. Functions like toDate() transform data types during the query, converting a DateTime to a Date by truncating the time portion.
Core query patterns for filtering and ordering
Filtering and sorting data efficiently matters more in ClickHouse than in row-based databases. The order in which you filter data and the functions you use can significantly affect query speed.
Where clause basics
The WHERE clause filters rows based on conditions. ClickHouse offers specialized functions that work better with its columnar storage than generic SQL functions.
SELECT user_id, event_name, timestamp
FROM sample_events
WHERE event_name = 'page_view'
AND timestamp >= '2024-01-01'
AND user_id > 1000;
String matching functions like startsWith(), endsWith(), and match() often perform better than SQL LIKE patterns. Date filtering using comparison operators works with both Date and DateTime types.
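A few quick examples against the sample_events table above (the event name patterns are illustrative):
SELECT COUNT(*) FROM sample_events WHERE startsWith(event_name, 'page_');
SELECT COUNT(*) FROM sample_events WHERE endsWith(event_name, '_view');
SELECT COUNT(*) FROM sample_events WHERE match(event_name, '^page_.*_view$');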
For date-based filtering, ClickHouse provides functions that convert dates to integers for faster comparisons:
SELECT user_id, COUNT(*) AS event_count
FROM sample_events
WHERE startsWith(event_name, 'purchase')
AND toYYYYMM(timestamp) = 202401
GROUP BY user_id;
The toYYYYMM() function converts dates to year-month integers, which speeds up filtering when you're grouping by month.
Order by vs limit by
ClickHouse includes a LIMIT BY clause that doesn't exist in standard SQL. While ORDER BY sorts all results and LIMIT restricts total rows, LIMIT BY returns the top N rows per group.
SELECT user_id, event_name, timestamp
FROM sample_events
ORDER BY timestamp DESC
LIMIT 3 BY user_id;
This query returns up to 3 events for each distinct user_id, ordered by timestamp. You can think of LIMIT BY as a more efficient alternative to window functions for top-N-per-group queries.
The difference matters for performance: LIMIT BY can process data as it reads, while window functions typically require more memory and processing.
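For comparison, here's a sketch of the same top-3-per-user query written with a window function; the LIMIT BY version above is typically simpler and cheaper:
SELECT user_id, event_name, timestamp
FROM (
    SELECT
        user_id,
        event_name,
        timestamp,
        row_number() OVER (PARTITION BY user_id ORDER BY timestamp DESC) AS rn
    FROM sample_events
)
WHERE rn <= 3;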
Using sample for fast prototyping
The SAMPLE clause lets you query a fraction of your data, which speeds up development and testing on large tables. ClickHouse uses deterministic sampling, so the same sample returns consistent results. Note that SAMPLE only works on MergeTree-family tables created with a SAMPLE BY expression (a sampling-ready table definition is sketched at the end of this section).
SELECT event_name, COUNT(*) AS count
FROM sample_events
SAMPLE 0.1
GROUP BY event_name;
This query processes approximately 10% of the table. The sampling is deterministic based on the sampling key, so repeated queries return the same rows. You can also ask for an approximate row count: SAMPLE 10000 reads roughly 10,000 rows.
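A minimal sketch of a sampling-ready table (sampled_events is a hypothetical name; intHash32(user_id) is a common sampling expression, and it must appear in the sorting key):
CREATE TABLE sampled_events (
    timestamp DateTime,
    user_id UInt64,
    event_name String
) ENGINE = MergeTree()
ORDER BY (timestamp, intHash32(user_id))
SAMPLE BY intHash32(user_id);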
Aggregations and group by at scale
ClickHouse aggregation functions are optimized for processing billions of rows quickly. Many functions offer approximate versions that trade precision for speed, which is often acceptable for analytics use cases.
Count distinct tricks
Counting unique values is expensive in any database. ClickHouse provides several functions with different speed and accuracy tradeoffs:
- uniq(): Fast approximate count using HyperLogLog, typically within 2% accuracy
- uniqExact(): Precise count but slower and more memory-intensive
- uniqCombined(): Balanced approach using multiple algorithms for better accuracy than uniq()
SELECT
uniq(user_id) AS approx_users,
uniqExact(user_id) AS exact_users,
uniqCombined(user_id) AS combined_users
FROM sample_events;
For most analytics, uniq() provides enough accuracy while being 5-10x faster than uniqExact(). Use uniqExact() only when you need guaranteed precision, such as for billing or compliance reporting.
Approximate functions
Beyond unique counts, ClickHouse offers approximate functions for percentiles, top values, and other statistical calculations.
SELECT
event_name,
quantile(0.95)(event_value) AS p95_value,
topK(5)(user_id) AS top_5_users
FROM sample_events
GROUP BY event_name;
The quantile() function calculates percentiles using sampling, which is much faster than sorting all values. The topK() function returns the K most frequent values using a space-efficient algorithm. These functions work well for dashboards and monitoring, where approximate answers delivered quickly are more valuable than exact answers delivered slowly, and they make sub-second aggregations over billions of rows realistic.
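When you need several percentiles at once, the quantiles() function computes them in a single pass and returns an array:
SELECT
    event_name,
    quantiles(0.5, 0.9, 0.99)(event_value) AS p50_p90_p99
FROM sample_events
GROUP BY event_name;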
With totals for rollups
The WITH TOTALS modifier adds a summary row to your results, which is useful for creating subtotals and grand totals in a single query.
SELECT
event_name,
COUNT(*) AS event_count,
AVG(event_value) AS avg_value
FROM sample_events
GROUP BY event_name
WITH TOTALS;
The final row in the result contains aggregates across all groups. This eliminates the need for a separate query to calculate overall totals.
Joining data the ClickHouse way
Joins in ClickHouse work differently than in traditional databases due to its distributed and columnar architecture. Understanding these differences helps you write faster queries.
Inner vs left join
ClickHouse supports standard SQL join syntax, but join order matters significantly for performance. The table on the right side of the join is loaded into memory, so put the smaller table there.
SELECT
e.user_id,
e.event_name,
u.user_name
FROM sample_events e
INNER JOIN users u ON e.user_id = u.user_id
WHERE e.timestamp >= '2024-01-01';
Always filter the larger table before joining when possible. Moving the WHERE clause to filter sample_events before the join reduces the number of rows that need to be matched. LEFT JOIN works similarly but includes all rows from the left table even when there's no match.
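One way to make the filter-first pattern explicit is to pre-filter the large table in a subquery; a sketch using the same tables:
SELECT
    e.user_id,
    e.event_name,
    u.user_name
FROM (
    SELECT user_id, event_name
    FROM sample_events
    WHERE timestamp >= '2024-01-01'
) AS e
INNER JOIN users u ON e.user_id = u.user_id;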
Array join for nested data
The ARRAY JOIN clause flattens array columns into multiple rows, which is common when working with nested or semi-structured data. The arrayJoin() function is its expression-level equivalent.
SELECT
user_id,
arrayJoin(tags) AS tag
FROM user_profiles
WHERE tag = 'premium';
This query converts an array of tags for each user into separate rows, one per tag. You can then filter, group, or join on the flattened values. ARRAY JOIN is particularly useful for JSON data that contains arrays.
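The clause form expresses the same flattening without a function call; a sketch against the same user_profiles table:
SELECT user_id, tag
FROM user_profiles
ARRAY JOIN tags AS tag
WHERE tag = 'premium';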
Global join on distributed tables
When querying distributed tables across multiple servers, regular joins only see data on each individual server. The GLOBAL keyword broadcasts the right table to all servers.
SELECT
e.user_id,
e.event_count,
u.signup_date
FROM distributed_events e
GLOBAL INNER JOIN users u ON e.user_id = u.user_id;
GLOBAL JOIN has performance costs because it transfers the entire right table to every server. Use it only when necessary, and always put the smaller table on the right side.
Sampling, limit by, and other ClickHouse-specific clauses
ClickHouse extends standard SQL with clauses designed specifically for analytical queries. These extensions often provide simpler syntax and better performance than standard SQL alternatives.
Limit by pattern
The LIMIT BY clause returns the top N rows for each group without requiring window functions or subqueries.
SELECT
user_id,
event_name,
timestamp,
event_value
FROM sample_events
ORDER BY event_value DESC
LIMIT 5 BY user_id;
This returns the 5 highest-value events for each user. The query processes data efficiently because it doesn't need to rank all rows or create temporary tables. You can combine LIMIT BY with OFFSET to skip rows: LIMIT 5 OFFSET 2 BY user_id returns rows 3 through 7 for each user.
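Written out in full, the offset variant looks like this:
SELECT user_id, event_name, event_value
FROM sample_events
ORDER BY event_value DESC
LIMIT 5 OFFSET 2 BY user_id;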
Sample clause syntax
Beyond the percentage-based sampling shown earlier, SAMPLE accepts different formats for controlling how much data to read.
-- Sample 10% of data
SELECT COUNT(*) FROM sample_events SAMPLE 0.1;
-- Sample approximately 1 million rows
SELECT COUNT(*) FROM sample_events SAMPLE 1000000;
-- Sample with offset for A/B testing
SELECT COUNT(*) FROM sample_events SAMPLE 1/2 OFFSET 1/2;
The offset syntax divides data into segments, which is useful for splitting datasets for parallel processing or creating consistent test/control groups.
Final keyword caveats
The FINAL modifier forces ClickHouse to merge all parts of a ReplacingMergeTree or CollapsingMergeTree table at query time, returning only the final version of each row.
SELECT user_id, user_name, last_login
FROM users FINAL
WHERE user_id = 12345;
While FINAL simplifies queries by handling deduplication automatically, it significantly slows down query execution. The database must merge data at query time instead of using pre-merged parts. Alternatives include using materialized views to pre-aggregate data or designing schemas that avoid the need for FINAL.
Optimizing query speed with settings and final
Query performance in ClickHouse depends on both how you write SQL and the settings you use. A few key settings control memory usage, parallelism, and query behavior.
The max_threads setting controls how many CPU cores ClickHouse uses for a query. By default, ClickHouse uses all available cores, but you can limit this for queries that don't need maximum speed:
SELECT COUNT(*) FROM sample_events
SETTINGS max_threads = 4;
The max_memory_usage setting prevents queries from consuming too much RAM. When a query exceeds this limit, ClickHouse stops execution and returns an error:
SELECT user_id, COUNT(*) FROM sample_events
GROUP BY user_id
SETTINGS max_memory_usage = 10000000000;
For queries on tables with frequent updates, avoid FINAL when possible. Instead, use GROUP BY with argMax() to get the latest value for each key:
SELECT
user_id,
argMax(user_name, updated_at) AS current_name
FROM users
GROUP BY user_id;
This approach processes data faster than FINAL because it uses ClickHouse's aggregation optimizations. Common optimization patterns include:
- Filter early: Apply WHERE clauses before JOIN operations to reduce data volume
- Select fewer columns: Only retrieve columns you need, especially on wide tables
- Use primary key order: Query conditions that match the ORDER BY clause run faster (see the sketch below)
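As a sketch of the last point: because sample_events is ordered by (timestamp, user_id), a range filter on the leading timestamp column lets ClickHouse skip entire data granules instead of scanning the whole table:
SELECT COUNT(*)
FROM sample_events
WHERE timestamp >= '2024-01-01' AND timestamp < '2024-02-01';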
Troubleshooting slow queries in production
When queries run slower than expected, ClickHouse provides system tables that show exactly what's happening. These tables contain query logs, performance metrics, and resource usage data.
Using system query_log
The system.query_log table records every query executed on the server, including execution time, memory usage, and rows processed.
SELECT
query_duration_ms,
query,
read_rows,
read_bytes,
memory_usage
FROM system.query_log
WHERE query_duration_ms > 1000
AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;
This query finds the 10 slowest queries that took over 1 second. The read_rows and read_bytes columns show how much data each query processed, which often explains slow performance. You can filter by user, query type, or time range to investigate specific performance issues.
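For example, a sketch that narrows the log to the last hour of finished queries:
SELECT query, query_duration_ms, memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 5;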
Interpreting system metrics
The system.metrics table provides real-time counters for server activity, while system.events shows cumulative statistics since server startup.
SELECT
metric,
value
FROM system.metrics
WHERE metric LIKE '%Memory%'
OR metric LIKE '%Query%';
Key metrics to monitor include:
- MemoryTracking: Current memory usage across all queries
- Query: Number of queries currently executing
- BackgroundPoolTask: Background merge and mutation activity
High values for MemoryTracking indicate queries using too much RAM, while many BackgroundPoolTask entries suggest heavy write activity that might slow down reads.
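To see the cumulative side, query system.events for counters such as Query, SelectQuery, and InsertQuery:
SELECT event, value
FROM system.events
WHERE event IN ('Query', 'SelectQuery', 'InsertQuery');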
Turning a query into a realtime API with Tinybird
Once you've written a ClickHouse query that answers an analytical question, you often need to expose it as an API endpoint so your application can consume the results. Tinybird makes this simple by deploying SQL queries as parameterized REST APIs. You can also connect directly from your application using Python, Java, or Go client libraries.
Create data source
Tinybird data sources are ClickHouse tables that you can populate using the CLI, API, or streaming connectors. To follow these examples, sign up for a free Tinybird account and install the CLI.
First, create a sample data file called events.ndjson:
{"timestamp":"2024-01-15 10:23:45","user_id":1001,"event_name":"page_view","event_value":1.5}
{"timestamp":"2024-01-15 10:24:12","user_id":1002,"event_name":"purchase","event_value":49.99}
{"timestamp":"2024-01-15 10:25:03","user_id":1001,"event_name":"add_to_cart","event_value":29.99}
Then create and populate a data source:
tb datasource create events --file events.ndjson
tb datasource append events --file events.ndjson
Tinybird infers the schema from your NDJSON data, creating appropriate column types automatically. You can also define schemas explicitly using .datasource files for more control.
Pipe file with parameterized SQL
Pipes are Tinybird's way of defining SQL queries that can be deployed as API endpoints. Create a file called top_events.pipe:
TOKEN top_events_read READ
DESCRIPTION >
Get top events by value for a specific user
NODE endpoint
SQL >
%
SELECT
event_name,
SUM(event_value) AS total_value,
COUNT(*) AS event_count
FROM events
WHERE user_id = {{ Int64(user_id, 1001) }}
AND timestamp >= {{ DateTime(start_date, '2024-01-01 00:00:00') }}
GROUP BY event_name
ORDER BY total_value DESC
LIMIT {{ Int32(limit, 10) }}
TYPE endpoint
The {{ }} syntax defines query parameters with types and default values. Tinybird validates these parameters and automatically generates API documentation.
Deploy and call the API
tb --cloud deploy
Tinybird returns a URL for your new API endpoint. You can test it immediately:
curl "https://api.tinybird.co/v0/pipes/top_events.json?user_id=1001&start_date=2024-01-01&token=YOUR_TOKEN"
The API returns JSON with your query results, including metadata about rows processed and query execution time. You can call this endpoint from any application without managing database connections or worrying about connection pooling.
Why developer experience beats database operations
Managing ClickHouse infrastructure involves cluster configuration, replication setup, backup management, and performance tuning. These tasks take time away from building features that directly benefit users.
Tinybird provides managed ClickHouse infrastructure so developers can focus on writing queries and building APIs instead of managing servers. The platform handles cluster scaling, backup automation, and performance optimization automatically. The API-first approach means you don't write custom backend code to expose ClickHouse queries.
Here's how the approaches compare:
| Self-hosted ClickHouse | Tinybird |
| --- | --- |
| Manual cluster setup and configuration | Instant workspace creation |
| Custom API development for each query | SQL pipes deploy as APIs automatically |
| Manual scaling and performance tuning | Automatic scaling based on usage |
| DIY monitoring and alerting | Built-in observability and query analytics |
For teams building analytics features into applications, Tinybird eliminates weeks of infrastructure work. You can go from raw data to production API in hours instead of weeks. Sign up for a free Tinybird account to start building ClickHouse-powered APIs without managing infrastructure.
FAQs about ClickHouse queries
Can I use window functions in ClickHouse?
Yes, ClickHouse supports window functions like ROW_NUMBER(), RANK(), LAG(), and LEAD() with OVER() clauses. They work similarly to standard SQL but benefit from ClickHouse's columnar storage for better performance on large datasets.
Does ClickHouse support multi-statement transactions?
No, ClickHouse doesn't support traditional ACID transactions across multiple statements. Each INSERT or ALTER operation is atomic, but you can't group multiple operations into a single transaction with rollback capability.
How do I secure a Tinybird API endpoint?
Tinybird API endpoints use token-based authentication with different permission levels. You can create read-only tokens for public APIs, write tokens for data ingestion, or admin tokens for schema changes. Tokens are included in the URL or headers of API requests.