When analyzing web traffic or API logs in ClickHouse, URLs with query parameters like https://example.com/products?color=blue&size=10
can create noise that makes it difficult to group and analyze page performance effectively. Query strings fragment what should be identical page views into separate entries, skewing your analytics and making trend analysis nearly impossible.
ClickHouse provides several built-in functions specifically designed to clean URLs and extract meaningful paths from messy web data. This guide covers the most efficient methods to strip query strings from URLs, when to apply URL cleaning in your data pipeline, and how to handle common edge cases that can trip up your analysis.
Quick answer: Strip the query string in one function
The fastest way to extract a URL path without the query string in ClickHouse is the `cutQueryString()` function. It removes the query string, everything from the `?` character onward (up to any `#` fragment), while keeping the protocol, domain, and path structure intact. To drop the fragment as well, use `cutQueryStringAndFragment()`.
SELECT cutQueryString('https://example.com/products/shoes?color=blue&size=10') AS clean_url;
-- Result: https://example.com/products/shoes
1. Use `cutQueryString(url)` on modern servers
The `cutQueryString()` function is your go-to solution for modern ClickHouse installations. It's built specifically for this purpose and handles edge cases automatically.
-- Basic usage
SELECT cutQueryString(url_column) AS clean_url
FROM analytics_events;
-- Web analytics grouping example
SELECT cutQueryString(page_url) AS page, COUNT(*) AS visits
FROM web_logs
GROUP BY page
ORDER BY visits DESC;
2. Fallback: `replaceRegexpAll` for older servers
If your ClickHouse version predates `cutQueryString()`, you can achieve a similar result with a regular expression. This approach works but requires more careful handling of edge cases.
SELECT replaceRegexpAll(url_column, '\\?.*$', '') AS clean_url
FROM your_table;
The regex `\\?.*$` matches a literal question mark followed by any characters until the end of the string.
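The fallback's behavior is easy to sanity-check outside the database. This Python sketch mirrors the same regex; note that because it deletes everything after the first `?`, it also removes any `#` fragment that follows the query string (unlike `cutQueryString()`, which leaves the fragment in place):

```python
import re

def strip_query(url: str) -> str:
    """Mirror the replaceRegexpAll fallback: delete everything from the
    first '?' to the end of the string. Note this also removes any
    '#fragment' that follows the query string."""
    return re.sub(r'\?.*$', '', url)

print(strip_query('https://example.com/products/shoes?color=blue&size=10'))
# https://example.com/products/shoes
```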
`path`, `pathFull`, and `cutQueryString` explained
ClickHouse offers three main functions for working with URL paths. Each serves different use cases, and understanding their differences helps you pick the right tool for your specific situation.
`path(url)` returns only the path
The `path()` function extracts just the path portion of a URL, excluding the protocol, domain, query parameters, and fragment.
SELECT path('https://shop.example.com/category/products?filter=new') AS url_path;
-- Result: /category/products
This function works well when you only care about the hierarchical structure within a website, not the full URL context.
`pathFull(url)` keeps query and fragment
The `pathFull()` function returns the path along with the query parameters and fragment. It's essentially everything after the domain name.
SELECT pathFull('https://shop.example.com/search?q=shoes#results') AS full_path;
-- Result: /search?q=shoes#results
Use this when you want to preserve query parameters but don't need the protocol and domain information.
`cutQueryString(url)` removes the query string
The `cutQueryString()` function keeps the complete URL structure while removing only the query parameters; any `#` fragment is left in place (use `cutQueryStringAndFragment()` to remove both). This preserves the most context while cleaning the URL.
Here's how all three functions compare:
| Function | Input | Output |
|---|---|---|
| `path()` | `https://site.com/page?q=1#top` | `/page` |
| `pathFull()` | `https://site.com/page?q=1#top` | `/page?q=1#top` |
| `cutQueryString()` | `https://site.com/page?q=1#top` | `https://site.com/page#top` |
| `cutQueryStringAndFragment()` | `https://site.com/page?q=1#top` | `https://site.com/page` |
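For sanity-checking pipeline output, the three functions can be approximated client-side. This Python sketch uses `urllib.parse` and assumes well-formed absolute URLs; ClickHouse itself is more permissive with malformed input, so treat these only as rough equivalents:

```python
from urllib.parse import urlsplit, urlunsplit

def ch_path(url: str) -> str:
    # path(): just the path component
    return urlsplit(url).path

def ch_path_full(url: str) -> str:
    # pathFull(): path plus query string and fragment
    p = urlsplit(url)
    return urlunsplit(('', '', p.path, p.query, p.fragment))

def ch_cut_query_string(url: str) -> str:
    # cutQueryString(): drop only the query string, keep everything else
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc, p.path, '', p.fragment))

url = 'https://site.com/page?q=1#top'
print(ch_path(url))              # /page
print(ch_path_full(url))         # /page?q=1#top
print(ch_cut_query_string(url))  # https://site.com/page#top
```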
Syntax and examples for URL path extraction
Understanding the exact syntax and behavior of ClickHouse URL functions helps you avoid common pitfalls. Let's look at the function signatures and some practical examples.
Function signatures and return types
All URL functions in ClickHouse accept a single String parameter and return a String result. They handle NULL values gracefully by returning NULL.
- Function signatures: `cutQueryString(url String) -> String`, `path(url String) -> String`, `pathFull(url String) -> String`
- Malformed input: if the input URL is malformed or empty, these functions typically return an empty string rather than throwing an error
Sample queries with expected output
Here are practical examples you might encounter when analyzing web traffic or API logs:
-- E-commerce product page analysis
SELECT
cutQueryString(page_url) AS clean_page,
COUNT(*) AS page_views,
COUNT(DISTINCT user_id) AS unique_visitors
FROM web_analytics
WHERE page_url LIKE '%/products/%'
GROUP BY clean_page
ORDER BY page_views DESC
LIMIT 10;
-- API endpoint grouping
SELECT
path(request_url) AS endpoint,
AVG(response_time_ms) AS avg_response_time
FROM api_logs
WHERE status_code = 200
GROUP BY endpoint
HAVING COUNT(*) > 1000;
These queries demonstrate how URL cleaning enables proper grouping and analysis of web traffic patterns.
Clean paths at ingest or at query time
You can apply URL cleaning at two different stages in your data pipeline. The choice depends on your query patterns, data volume, and performance requirements.
Materialized view pattern for streaming ingest
Processing URLs during data ingestion creates cleaner datasets and faster queries. Materialized views automatically transform incoming data as it arrives.
-- Create a materialized view that cleans URLs on insert
CREATE MATERIALIZED VIEW clean_web_logs_mv
TO clean_web_logs
AS SELECT
timestamp,
user_id,
cutQueryString(raw_url) AS clean_url,
referrer,
user_agent
FROM raw_web_logs;
This approach works well when you consistently need clean URLs across multiple queries and have high query volume on the same dataset.
On-demand transformation in analytics queries
Applying URL functions directly in SELECT statements offers more flexibility but requires more computation at query time.
-- Transform URLs during analysis
SELECT
cutQueryString(page_url) AS page,
DATE(timestamp) AS date,
COUNT(*) AS daily_visits
FROM web_logs
WHERE timestamp >= today() - INTERVAL 30 DAY
GROUP BY page, date
ORDER BY date DESC, daily_visits DESC;
Choose this approach when you need different URL transformations for different analyses or when storage space is a concern.
Handle trailing slash and case sensitivity
Real-world URLs often have inconsistencies that can skew your analysis. Combining URL cleaning with normalization techniques produces more accurate results.
Canonicalize trailing slash for deduplication
URLs like `/products` and `/products/` typically represent the same page but appear as separate entries in your data. Combining `cutQueryString()` with `trimRight()` standardizes these variations.
-- Remove query strings and trailing slashes
SELECT
trimRight(cutQueryString(page_url), '/') AS canonical_page,
COUNT(*) AS total_visits
FROM web_logs
GROUP BY canonical_page
ORDER BY total_visits DESC;
This technique is particularly important for content management systems and e-commerce platforms where both URL formats are valid.
Lowercase paths for consistent grouping
Some web servers treat URLs as case-sensitive while others don't. Converting cleaned URLs to lowercase ensures consistent grouping regardless of how users typed the URL.
-- Case-insensitive URL analysis
SELECT
lower(cutQueryString(request_url)) AS normalized_url,
COUNT(*) AS request_count,
AVG(response_size_bytes) AS avg_response_size
FROM access_logs
GROUP BY normalized_url
HAVING request_count > 100;
Apply this normalization when analyzing user-generated URLs or when your application accepts URLs in mixed case.
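The full normalization chain can also be validated outside the database. This client-side Python sketch approximates `lower(trimRight(cutQueryString(url), '/'))` from the queries above (it additionally drops any fragment, and assumes well-formed URLs):

```python
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    """Approximate the SQL normalization above: strip the query string
    (and fragment), trim the trailing slash, and lowercase the result."""
    p = urlsplit(url)
    base = f'{p.scheme}://{p.netloc}{p.path}' if p.scheme else p.path
    return base.rstrip('/').lower()

print(canonicalize('https://MySite.com/Products/?color=Blue&size=10'))
# https://mysite.com/products
```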
Performance benchmarks on billion-row tables
URL processing performance becomes critical when working with large datasets. Understanding the computational costs helps you optimize your queries and choose the right approach for your specific situation.
Profiling `cutQueryString` vs. regex
Built-in ClickHouse functions like `cutQueryString()` are optimized for performance and typically outperform regex-based alternatives, especially on large datasets.
-- Performance test: built-in function
SELECT COUNT(DISTINCT cutQueryString(url))
FROM large_url_table;
-- Typical performance: ~50M rows/second
-- Performance test: regex alternative
SELECT COUNT(DISTINCT replaceRegexpAll(url, '\\?.*$', ''))
FROM large_url_table;
-- Typical performance: ~25M rows/second
The performance difference becomes more pronounced with complex URLs and when processing hundreds of millions of rows.
Memory and CPU considerations
URL functions have minimal memory overhead since they process strings in place without creating large intermediate results. However, the choice of function affects CPU usage patterns.
- `cutQueryString()`: minimal CPU overhead, optimized string scanning
- `path()`: moderate CPU usage, requires URL parsing
- Regex functions: higher CPU usage, complex pattern matching
For high-throughput applications, consider preprocessing URLs during ingestion rather than applying functions to every query.
Build clean path web analytics APIs with Tinybird
Once you've learned how to clean and analyze URL paths in ClickHouse, Tinybird makes it simple to deploy your web analytics queries as production-ready API endpoints. This approach lets you serve processed data directly to applications without exposing your database.
Step 1: Create a data source for web analytics events
First, create a data source to store web analytics events with full URLs. Create a `web_analytics.datasource` file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC'),
`session_id` String,
`user_id` String,
`page_url` String,
`referrer` String,
`user_agent` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String)
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id"
Deploy the data source:
tb --cloud deploy
Ingest sample web analytics data using the Events API:
curl -X POST \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
-H "Content-Type: application/json" \
"https://api.tinybird.co/v0/events?name=web_analytics" \
-d '[
{"timestamp": "2024-12-01 14:30:00", "session_id": "sess_001", "user_id": "usr_123", "page_url": "https://mysite.com/products/shoes?color=red&size=10", "referrer": "https://google.com", "user_agent": "Mozilla/5.0", "country": "US", "device_type": "desktop"},
{"timestamp": "2024-12-01 14:31:00", "session_id": "sess_002", "user_id": "usr_456", "page_url": "https://mysite.com/blog/analytics-guide?utm_source=newsletter&utm_campaign=dec2024", "referrer": "https://newsletter.com", "user_agent": "Mozilla/5.0", "country": "UK", "device_type": "mobile"},
{"timestamp": "2024-12-01 14:32:00", "session_id": "sess_003", "user_id": "usr_789", "page_url": "https://mysite.com/contact?ref=footer", "referrer": "https://mysite.com/about", "user_agent": "Mozilla/5.0", "country": "DE", "device_type": "tablet"}
]'
Step 2: Create a pageviews by path API endpoint
Create a pipe that extracts clean URL paths and aggregates pageviews. Create `pageviews_by_path.pipe`:
TOKEN "pageviews_read" READ
NODE pageviews_analysis
SQL >
%
SELECT
cutQueryString(page_url) AS page_path,
count() AS pageviews,
uniq(session_id) AS unique_sessions,
uniq(user_id) AS unique_users,
round(count() / uniq(session_id), 2) AS pages_per_session
FROM web_analytics
WHERE timestamp >= {{DateTime(start_date, '2024-12-01 00:00:00')}}
AND timestamp < {{DateTime(end_date, '2024-12-02 00:00:00')}}
{% if defined(path_filter) %}
AND cutQueryString(page_url) LIKE {{String(path_filter, '%')}}
{% end %}
{% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{% end %}
GROUP BY page_path
ORDER BY pageviews DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 20)}}
{% end %}
TYPE ENDPOINT
Deploy the API endpoint:
tb --cloud deploy
Step 3: Query the pageviews API
Your clean path analytics API is now available. Query it for different use cases:
# Get top pages for yesterday
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-01" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Filter by specific path patterns
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-01&path_filter=%/products/%" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Analyze blog posts with country filter
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-01&path_filter=%/blog/%&country_filter=US" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
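The same queries can be issued from application code. This hypothetical Python helper builds the request URL the curl examples use; the base URL and parameter names come from the `pageviews_by_path` pipe above, and `urlencode` handles percent-escaping the LIKE wildcards (token handling is omitted):

```python
from urllib.parse import urlencode

# Base URL of the pipe endpoint defined above (hypothetical deployment)
BASE_URL = 'https://api.tinybird.co/v0/pipes/pageviews_by_path.json'

def pageviews_url(start_date: str, end_date: str, **filters: str) -> str:
    # urlencode percent-escapes LIKE wildcards such as '%' and '/' automatically
    params = {'start_date': start_date, 'end_date': end_date, **filters}
    return f'{BASE_URL}?{urlencode(params)}'

print(pageviews_url('2024-12-01', '2024-12-01',
                    path_filter='%/blog/%', country_filter='US'))
```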
The API returns clean URL paths without query parameters, enabling clear pageview analysis:
{
"meta": [
{"name": "page_path", "type": "String"},
{"name": "pageviews", "type": "UInt64"},
{"name": "unique_sessions", "type": "UInt64"},
{"name": "unique_users", "type": "UInt64"},
{"name": "pages_per_session", "type": "Float64"}
],
"data": [
{
"page_path": "https://mysite.com/products/shoes",
"pageviews": 1,
"unique_sessions": 1,
"unique_users": 1,
"pages_per_session": 1.0
},
{
"page_path": "https://mysite.com/blog/analytics-guide",
"pageviews": 1,
"unique_sessions": 1,
"unique_users": 1,
"pages_per_session": 1.0
}
],
"rows": 2,
"statistics": {
"elapsed": 0.001,
"rows_read": 3,
"bytes_read": 512
}
}
This approach provides:
- Clean URL grouping by removing query parameters with `cutQueryString()`
- Flexible time range filtering with parameterized start and end dates
- Path pattern matching for analyzing specific sections of your site
- Rich analytics metrics including unique sessions and engagement ratios
This pattern works particularly well for real-time dashboards and applications that need fresh analytics data.
Build ClickHouse-based APIs with Tinybird
ClickHouse is the perfect database for analyzing web clickstream data. As your data grows and your analytics requirements become more complex, however, managing ClickHouse infrastructure can consume significant engineering resources.
Tinybird provides a managed ClickHouse with developer-focused tooling that reduces the infrastructure overhead associated with managing ClickHouse clusters. The platform includes built-in scaling and monitoring capabilities as part of its managed service, as well as API generation.
Instead of spending time on database administration, cluster management, and performance tuning, you can focus on building the analytics features your users actually want. Start building with Tinybird's free plan and deploy your first API quickly.
You can sign up for Tinybird for free here.
FAQs about extracting URL paths in ClickHouse
What happens if the URL has no path?
The `path()` function returns `/` for URLs without an explicit path (like `https://example.com`), while `cutQueryString()` preserves the original URL structure without the query string. For domain-only URLs, `cutQueryString()` simply returns the original URL, since there's no query string to remove.
Can I strip both query string and fragment in one call?
Not with `cutQueryString()` alone — it removes only the query string and leaves any `#` fragment in place. Use `cutQueryStringAndFragment()` to drop both in a single call, or combine `cutQueryString()` with `cutFragment()` when you need more granular control.