Web analytics URLs often arrive cluttered with tracking parameters (those `?utm_source=google&utm_medium=cpc` strings) that make URL analysis messy and grouping difficult. Query parameters serve their purpose for campaign tracking and session management, but they create noise when you're trying to analyze page performance or user behavior patterns.
ClickHouse provides the `cutQueryString()` function for exactly this problem, letting you strip everything after the question mark with a single function call. This guide covers the function's syntax, performance considerations, and production patterns for cleaning URLs at scale, plus how to build selective parameter removal when you need more control than the all-or-nothing approach.
Quick answer with cutQueryString
When you're working with URLs in ClickHouse, you'll often find yourself needing to clean them up. Query parameters (those bits after the question mark, like `?utm_source=google&utm_medium=cpc`) can clutter your analytics and make URL grouping difficult.
ClickHouse gives you the `cutQueryString()` function to handle this exact problem. It removes everything after the question mark, including the question mark itself, leaving you with just the clean base URL.
SELECT cutQueryString('https://example.com/page?utm_source=google&utm_medium=cpc') AS clean_url;
-- Returns: https://example.com/page
Run SELECT cutQueryString(url)
The most straightforward approach takes your URL column and applies the function directly:
SELECT
original_url,
cutQueryString(original_url) AS clean_url
FROM your_table;
This works whether your URLs have query parameters or not—URLs without parameters come back unchanged.
Function syntax and return type
The `cutQueryString()` function is purpose-built for URL manipulation, which makes it more reliable than cobbling together a solution from generic string functions. It takes one argument and returns the cleaned URL.
Arguments
You pass in a single String argument representing the URL you want to clean. The URL can be absolute, like `https://example.com/page?param=value`, or relative, like `/page?param=value`.
Return value
The function returns a String with the original URL minus everything from the first question mark onward. If there's no question mark in your input, you get back the original string exactly as it was.
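A quick check of both behaviors in one query:

```sql
SELECT
    cutQueryString('/page?param=value') AS with_params,  -- '/page'
    cutQueryString('/page') AS without_params;           -- '/page' (unchanged)
```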
Supported versions
`cutQueryString()` is available in ClickHouse 19.14 and later. The function has been around long enough that most production deployments will have access to it.
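You can confirm what your server is running with:

```sql
SELECT version();
```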
Working example on a sample table
Let's walk through a realistic scenario using web analytics data. You're tracking user behavior, but your URLs are cluttered with tracking parameters that make analysis harder than it needs to be.
Create test data
First, we'll set up a table with URLs that mirror what you'd see in real web analytics:
CREATE TABLE web_events (
timestamp DateTime,
user_id UInt32,
page_url String,
referrer String
) ENGINE = MergeTree()
ORDER BY timestamp;
INSERT INTO web_events VALUES
('2024-01-15 10:30:00', 1001, 'https://shop.example.com/products?utm_source=google&utm_medium=cpc&utm_campaign=winter_sale', 'https://google.com'),
('2024-01-15 10:31:00', 1002, 'https://shop.example.com/checkout?session_id=abc123&ref=homepage', 'https://shop.example.com'),
('2024-01-15 10:32:00', 1003, 'https://shop.example.com/about', ''),
('2024-01-15 10:33:00', 1004, 'https://shop.example.com/contact?source=newsletter&campaign=jan2024', 'https://newsletter.example.com');
Compare before and after
Now you can see exactly how `cutQueryString()` transforms your URLs:
SELECT
page_url AS original,
cutQueryString(page_url) AS cleaned
FROM web_events
ORDER BY timestamp;
The transformation becomes clear when you see the results side by side:
- Original: `https://shop.example.com/products?utm_source=google&utm_medium=cpc&utm_campaign=winter_sale` → Cleaned: `https://shop.example.com/products`
- Original: `https://shop.example.com/checkout?session_id=abc123&ref=homepage` → Cleaned: `https://shop.example.com/checkout`
- Original: `https://shop.example.com/about` → Cleaned: `https://shop.example.com/about` (unchanged)
Does cutQueryString remove URL fragments?
Here's something that trips people up: `cutQueryString()` only removes query parameters (everything after `?`) and leaves URL fragments (everything after `#`) alone. This distinction matters when you're working with single-page applications or anchor links.
Behavior with #fragment
Watch what happens when your URL has both query parameters and fragments:
SELECT cutQueryString('https://example.com/page?param=value#section') AS result;
-- Returns: https://example.com/page#section
The query parameters disappear, but the fragment stays put. If you want to remove fragments too, combine `cutQueryString()` with `cutFragment()`:
SELECT cutFragment(cutQueryString('https://example.com/page?param=value#section')) AS clean_url;
-- Returns: https://example.com/page
Performance versus regular expressions
Native ClickHouse URL functions like `cutQueryString()` perform better than regex-based alternatives. The built-in function is optimized specifically for URL parsing and skips the overhead of pattern matching.
Performance comparison
While exact performance depends on your data and hardware, the differences are significant enough to matter:
| Method | Readability | Performance | Use case |
| --- | --- | --- | --- |
| `cutQueryString` | High | Fast | Remove all parameters |
| `replaceRegexpAll` | Medium | Slower | Complex pattern matching |
| `substring` + `position` | Low | Medium | Custom logic needed |
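You can sanity-check the gap on your own hardware with a synthetic comparison over generated URLs; timings depend entirely on your setup, so compare the elapsed time your client reports for each query:

```sql
-- Native function over one million synthetic URLs
SELECT count()
FROM numbers(1000000)
WHERE cutQueryString(concat('https://example.com/p?id=', toString(number)))
      = 'https://example.com/p';

-- Regex equivalent of the same operation
SELECT count()
FROM numbers(1000000)
WHERE replaceRegexpAll(concat('https://example.com/p?id=', toString(number)), '\\?.*$', '')
      = 'https://example.com/p';
```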
For straightforward query parameter removal, `cutQueryString()` gives you the best combination of speed and simplicity. When you need more surgical manipulation, such as removing only specific parameters, `cutURLParameter()` or regex functions become necessary.
Keeping some parameters while dropping others
Sometimes you want to keep certain query parameters while removing others. `cutQueryString()` is an all-or-nothing function: it removes everything after the question mark. For selective parameter removal, you'll need different approaches.
Using replaceRegexpAll
To remove specific parameters while preserving others, a regex pattern works, as long as it consumes only one of the surrounding delimiters:
SELECT replaceRegexpAll(
    'https://example.com/page?keep=yes&remove=no&also_keep=maybe',
    '[?&]remove=[^&]*',
    ''
) AS selective_clean;
-- Returns: https://example.com/page?keep=yes&also_keep=maybe
Note that if the removed parameter happens to be first, this pattern also eats the `?` and leaves a stray `&`, so test it against your real data.
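ClickHouse also ships a dedicated function for this, `cutURLParameter()`, which removes a single named parameter and handles the delimiter edge cases that trip up hand-written regexes:

```sql
SELECT cutURLParameter(
    'https://example.com/page?keep=yes&remove=no&also_keep=maybe',
    'remove'
) AS selective_clean;
-- Returns: https://example.com/page?keep=yes&also_keep=maybe
```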
Regex fallback
For more complex scenarios, you might combine multiple regex operations, or use `extractURLParameter()` to rebuild URLs with only the parameters you want:
SELECT
cutQueryString(url) || '?keep=' || extractURLParameter(url, 'keep') AS rebuilt_url
FROM your_table
WHERE extractURLParameter(url, 'keep') != '';
Automating cleanup with a materialized view
In production, you typically want to clean URLs automatically as data arrives rather than processing them during every query. Materialized views give you an elegant solution for this pattern—they do the work once when data comes in, not every time you query it.
View DDL
Create a materialized view that automatically strips query parameters from incoming URLs:
CREATE MATERIALIZED VIEW clean_web_events
ENGINE = MergeTree()
ORDER BY timestamp
AS SELECT
timestamp,
user_id,
cutQueryString(page_url) AS clean_page_url,
cutQueryString(referrer) AS clean_referrer
FROM web_events;
Real-time verification
After creating the materialized view, verify it's working correctly with new data:
-- Insert new test data
INSERT INTO web_events VALUES
('2024-01-15 11:00:00', 1005, 'https://example.com/new-page?test=123', '');
-- Check the materialized view
SELECT * FROM clean_web_events WHERE user_id = 1005;
-- Shows clean_page_url as 'https://example.com/new-page'
The materialized view processes new data automatically, giving you cleaned URLs without any query-time overhead.
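With the view in place, downstream queries can aggregate directly on the cleaned column. For example, a top-pages report:

```sql
SELECT
    clean_page_url,
    count() AS page_views
FROM clean_web_events
GROUP BY clean_page_url
ORDER BY page_views DESC
LIMIT 10;
```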
Building a URL cleaning API with Tinybird
Let's create a complete example that demonstrates how to build a real-time API for URL cleaning using Tinybird's managed ClickHouse platform.
This walkthrough shows how to ingest web analytics data, process URLs with ClickHouse functions, and expose clean URL analytics through a real-time API endpoint.
1. Create the data source
First, create a data source to store web analytics events:
SCHEMA >
`timestamp` DateTime64(3) `json:$.timestamp`,
`user_id` String `json:$.user_id`,
`page_url` String `json:$.page_url`,
`referrer` String `json:$.referrer`,
`utm_source` String `json:$.utm_source`,
`utm_medium` String `json:$.utm_medium`
ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
ENGINE_PARTITION_KEY toYYYYMM(timestamp)
Now build the project locally:
tb dev
This builds the project on Tinybird Local and creates your data source. Then ingest some sample data using the Events API:
curl -X POST "http://localhost:7181/v0/events?name=web_analytics" \
-H "Content-Type: application/json" \
-d '{"timestamp": "2025-01-15 14:30:15.123", "user_id": "user123", "page_url": "https://shop.example.com/products?utm_source=google&utm_medium=cpc&utm_campaign=winter_sale", "referrer": "https://google.com", "utm_source": "google", "utm_medium": "cpc"}
{"timestamp": "2025-01-15 14:30:16.456", "user_id": "user456", "page_url": "https://shop.example.com/checkout?session_id=abc123&ref=homepage", "referrer": "https://shop.example.com", "utm_source": "", "utm_medium": ""}
{"timestamp": "2025-01-15 14:30:17.789", "user_id": "user789", "page_url": "https://shop.example.com/about", "referrer": "", "utm_source": "", "utm_medium": ""}'
2. Create the pipe files
Create a pipe that processes URL data and provides analytics:
NODE web_events
SQL >
SELECT
timestamp,
user_id,
page_url,
referrer,
utm_source,
utm_medium,
cutQueryString(page_url) AS clean_page_url,
cutQueryString(referrer) AS clean_referrer,
domain(page_url) AS domain,
extractURLParameter(page_url, 'utm_campaign') AS utm_campaign
FROM web_analytics
WHERE timestamp >= now64(3) - INTERVAL 1 HOUR
NODE url_analytics
SQL >
%
SELECT
clean_page_url,
domain,
utm_source,
utm_medium,
COUNT(*) AS page_views,
COUNT(DISTINCT user_id) AS unique_users,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen
FROM web_events
WHERE
    1 = 1
    {% if defined(domain) %}
    AND domain = {{ String(domain) }}
    {% end %}
    {% if defined(utm_source) %}
    AND utm_source = {{ String(utm_source) }}
    {% end %}
    {% if defined(hours_back) %}
    AND timestamp >= now64(3) - INTERVAL {{ Int32(hours_back) }} HOUR
    {% end %}
GROUP BY clean_page_url, domain, utm_source, utm_medium
ORDER BY page_views DESC
LIMIT {{ Int32(limit, 100) }}
TYPE endpoint
3. Deploy and test the API
Deploy your project to Tinybird Cloud:
tb --cloud deploy
Then test your API endpoint:
# Get all page analytics from the last hour
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/url_analytics.json"
# Filter by domain
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/url_analytics.json?domain=shop.example.com"
# Filter by UTM source and time range
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/url_analytics.json?utm_source=google&hours_back=2"
4. Sample API response
The API returns structured data with clean URL analytics:
{
"meta": [
{
"name": "clean_page_url",
"type": "String"
},
{
"name": "domain",
"type": "String"
},
{
"name": "utm_source",
"type": "String"
},
{
"name": "utm_medium",
"type": "String"
},
{
"name": "page_views",
"type": "UInt64"
},
{
"name": "unique_users",
"type": "UInt64"
},
{
"name": "first_seen",
"type": "DateTime64(3)"
},
{
"name": "last_seen",
"type": "DateTime64(3)"
}
],
"data": [
{
"clean_page_url": "https://shop.example.com/products",
"domain": "shop.example.com",
"utm_source": "google",
"utm_medium": "cpc",
"page_views": 1247,
"unique_users": 892,
"first_seen": "2025-01-15 14:30:15.123",
"last_seen": "2025-01-15 15:45:22.789"
},
{
"clean_page_url": "https://shop.example.com/checkout",
"domain": "shop.example.com",
"utm_source": "",
"utm_medium": "",
"page_views": 456,
"unique_users": 234,
"first_seen": "2025-01-15 14:30:16.456",
"last_seen": "2025-01-15 15:30:18.345"
}
],
"rows": 2,
"statistics": {
"elapsed": 0.045,
"rows_read": 2000,
"bytes_read": 89000
}
}
Other useful URL functions
ClickHouse provides several other URL manipulation functions that work well alongside `cutQueryString()` for comprehensive URL processing workflows.
domain
Extract the hostname from a URL using the `domain()` function:
SELECT domain('https://shop.example.com/products?sale=true') AS host;
-- Returns: shop.example.com
cutToFirstSignificantSubdomain
Get the registrable domain, dropping subdomains, with `cutToFirstSignificantSubdomain()`:
SELECT cutToFirstSignificantSubdomain('https://shop.example.com/products') AS base_domain;
-- Returns: example.com
extractURLParameter
Retrieve specific query parameter values before removing them:
SELECT
extractURLParameter(page_url, 'utm_source') AS traffic_source,
cutQueryString(page_url) AS clean_url
FROM web_events;
This combination lets you capture important parameter values while still cleaning the URLs for analysis.
Build real-time web analytics APIs with ClickHouse
Managing ClickHouse infrastructure for URL processing involves complexity—from optimizing storage and query performance to handling URL parsing edge cases and maintaining consistent data quality across distributed systems.
Tinybird's managed ClickHouse service reduces that overhead by abstracting optimization, scaling, and maintenance. You get ClickHouse features such as `cutQueryString()` and the rest of the URL function family without managing database clusters, performance monitoring, or version upgrades.
This lets teams focus on application development rather than database operations or configuring ClickHouse for production workloads.
Sign up for a free Tinybird account to build and test your first ClickHouse-based API in just a few minutes.
Additional resources
- cutQueryString function
- cutFragment function
- cutURLParameter function
- extractURLParameter function
- URL functions overview
- How to build URL hierarchy in ClickHouse
- Extract query parameter in ClickHouse
- Extract URL fragment in ClickHouse
FAQs about stripping query parameters in ClickHouse
How do I backfill an existing table with cleaned URLs without downtime?
Use `ALTER TABLE ... UPDATE` with `cutQueryString()` to modify existing data in place. ClickHouse processes these mutations asynchronously without blocking queries, though large tables may take a while to complete.
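A minimal sketch of such a backfill, using the `web_events` table from earlier (scoping the mutation with a `WHERE` clause keeps it cheap on large tables):

```sql
ALTER TABLE web_events
UPDATE page_url = cutQueryString(page_url)
WHERE page_url LIKE '%?%';

-- Track mutation progress
SELECT mutation_id, is_done
FROM system.mutations
WHERE table = 'web_events' AND NOT is_done;
```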
Can I strip only dynamic tracking params like gclid but keep others?
Use `replaceRegexpAll()` with specific parameter patterns, or `cutURLParameter()` to drop a single named parameter while keeping the rest. `cutQueryString()` itself removes all parameters, so selective removal requires one of those approaches or custom rebuilding logic.
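For a known list of tracking parameters like `gclid`, chaining `cutURLParameter()` calls is the simplest approach:

```sql
SELECT cutURLParameter(
    cutURLParameter('https://example.com/p?gclid=abc&fbclid=def&page=2', 'gclid'),
    'fbclid'
) AS cleaned;
-- Returns: https://example.com/p?page=2
```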
Does cutQueryString handle non-UTF-8 URLs correctly?
ClickHouse's `cutQueryString()` works with any valid URL string encoding. Strings without a question mark come back unchanged rather than raising errors, which makes the function safe for messy real-world data.