URL-encoded strings appear everywhere in web data, from search parameters to API logs. In URL encoding, special characters are converted to percent signs followed by hexadecimal codes, such as %20 for spaces. ClickHouse's decodeURLComponent function reverses this encoding, transforming strings like search%3Fq%3Danalytics%26type%3Ddocs back into readable text.
This guide covers the function's syntax, performance optimization strategies for large datasets, and how to handle edge cases like double encoding and Unicode sequences.
What decodeURLComponent does
ClickHouse includes a built-in decodeURLComponent function that reverses URL encoding, also called percent-encoding. When browsers send data through URLs, they convert special characters into percent signs followed by hexadecimal values: spaces become %20, forward slashes become %2F, and ampersands become %26.
The function takes encoded strings and converts them back to readable text. You'll encounter this frequently when working with web analytics data, API logs, or any dataset containing URLs with encoded parameters.
decodeURLComponent syntax and return type
The function accepts a single String argument and returns a String with the decoded result:
decodeURLComponent(encoded_string)
ClickHouse handles standard percent-encoding sequences and returns the original string unchanged if it encounters invalid encoding patterns. This fail-safe behavior prevents query errors when processing mixed or malformed data.
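Python's standard library behaves in a similarly lenient way, which makes it a quick place to build intuition for this fail-safe behavior (this is plain Python, not ClickHouse: urllib.parse.unquote decodes valid sequences and passes invalid ones through untouched):

```python
from urllib.parse import unquote

# Valid sequences decode normally: %25 -> '%', %20 -> ' '
print(unquote("100%25%20off"))    # -> '100% off'

# '%ZZ' is not valid hex, so it passes through untouched
# instead of raising an error on dirty data
print(unquote("broken%ZZvalue"))  # -> 'broken%ZZvalue'
```

This forgiving behavior is what makes decoding safe to run over mixed or malformed columns without wrapping every call in error handling.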
Quick example on a single string
Here's how decodeURLComponent works with a simple encoded string:
SELECT decodeURLComponent('Hello%20World%21') AS decoded_text;
This returns Hello World!, where %20 becomes a space and %21 becomes an exclamation mark. You can also decode more complex URL components:
SELECT decodeURLComponent('search%3Fq%3Dclickhouse%26type%3Ddocs') AS decoded_url;
-- Returns: search?q=clickhouse&type=docs
Decoding an entire column efficiently
When working with tables containing URL-encoded data, you'll often decode entire columns rather than individual strings.
1. Selecting a decoded computed column
The most straightforward approach creates a computed column in your SELECT statement:
SELECT
original_url,
decodeURLComponent(original_url) AS decoded_url
FROM web_logs
WHERE decoded_url LIKE '%search%';
This approach works well for ad-hoc queries but recalculates the decoding operation every time you run the query.
2. Creating a materialized view for fast repeated access
For frequently accessed decoded data, a materialized view pre-computes and stores the results:
CREATE MATERIALIZED VIEW decoded_urls_mv
ENGINE = MergeTree()
ORDER BY timestamp
AS SELECT
timestamp,
original_url,
decodeURLComponent(original_url) AS decoded_url
FROM web_logs;
This approach trades storage space for query performance, making decoded URL searches nearly instantaneous.
Handling edge cases and double encoding
Real-world URL data often contains encoding quirks that require special attention.
Plus signs vs percent-20
Web forms sometimes encode spaces as plus signs (+) instead of %20, but decodeURLComponent doesn't automatically convert plus signs to spaces. You'll handle this separately:
SELECT
decodeURLComponent(replace(url_param, '+', ' ')) AS properly_decoded
FROM form_submissions;
This pattern first replaces plus signs with spaces, then applies URL decoding to handle other encoded characters.
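The same distinction exists in Python's standard library, which is a handy way to check expectations outside ClickHouse: unquote leaves plus signs alone, while unquote_plus treats them as spaces, mirroring the replace-then-decode pattern above:

```python
from urllib.parse import unquote, unquote_plus

form_value = "fast+real%20time+analytics"

# unquote only decodes percent sequences; '+' survives as a literal plus
print(unquote(form_value))       # -> 'fast+real time+analytics'

# unquote_plus also converts '+' to spaces, as form encoding expects
print(unquote_plus(form_value))  # -> 'fast real time analytics'
```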
Unicode multibyte sequences
URLs containing non-ASCII characters get encoded as multiple percent sequences. For example, the character é becomes %C3%A9 in UTF-8 encoding:
SELECT decodeURLComponent('caf%C3%A9') AS decoded_text;
-- Returns: café
The function automatically handles multibyte sequences, reconstructing the original Unicode characters.
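You can verify the multibyte round trip with Python's urllib, which also encodes non-ASCII text as UTF-8 percent sequences by default (again a Python sketch, not ClickHouse):

```python
from urllib.parse import quote, unquote

# 'é' is two bytes in UTF-8 (0xC3 0xA9), so it becomes two percent sequences
encoded = quote("café")
print(encoded)           # -> 'caf%C3%A9'

# Decoding reassembles the bytes and interprets them as UTF-8
print(unquote(encoded))  # -> 'café'
```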
Detecting double encoding
Sometimes URLs get encoded twice, creating sequences like %2520 (which represents %20). You can detect and handle double encoding:
SELECT
    CASE
        WHEN position(url, '%25') > 0
            THEN decodeURLComponent(decodeURLComponent(url))
        ELSE decodeURLComponent(url)
    END AS final_decoded_url
FROM suspicious_data;
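Outside SQL, the same detect-and-repeat idea can be written as a small loop. This sketch (plain Python, with a hypothetical decode_fully helper) decodes until the string stops changing, with a bound to guard against pathological input:

```python
from urllib.parse import unquote

def decode_fully(value: str, max_rounds: int = 3) -> str:
    """Repeatedly percent-decode until the value stabilizes."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:  # nothing left to decode
            return decoded
        value = decoded
    return value

# '%2520' is a double-encoded space: %25 -> '%', leaving '%20' -> ' '
print(decode_fully("Hello%2520World"))  # -> 'Hello World'
```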
Performance tips at billions of rows
URL decoding operations can become expensive at scale, but several optimization strategies help maintain query performance.
Avoiding repeated decoding in nested subqueries
When your query decodes the same column multiple times, use a WITH clause to decode once and reuse the result:
WITH decoded_data AS (
SELECT
user_id,
decodeURLComponent(search_query) AS clean_query
FROM search_logs
WHERE timestamp > now() - INTERVAL 1 DAY
)
SELECT
user_id,
clean_query,
length(clean_query) AS query_length
FROM decoded_data
WHERE clean_query LIKE '%analytics%';
This pattern eliminates redundant decoding operations and often improves query performance significantly.
Using LowCardinality for decoded text
When decoded URLs contain repeated patterns or limited unique values, wrap the result in LowCardinality for better compression:
CREATE TABLE decoded_search_logs (
timestamp DateTime,
user_id UInt64,
decoded_query LowCardinality(String)
) ENGINE = MergeTree()
ORDER BY timestamp;
LowCardinality works particularly well for decoded search terms, category names, or other URL parameters with high repetition rates.
When to decode on ingest using materialized views
For high-throughput applications, decoding URLs during data insertion rather than query time often provides better performance.
Step 1: Define the target schema
Create a table that stores both raw and decoded URL data:
CREATE TABLE web_events (
timestamp DateTime,
raw_url String,
decoded_url String,
user_agent String
) ENGINE = MergeTree()
ORDER BY timestamp;
Step 2: Create the materialized view with decodeURLComponent
Set up a materialized view that automatically decodes incoming URL data:
CREATE MATERIALIZED VIEW decode_urls_mv TO web_events AS
SELECT
timestamp,
url as raw_url,
decodeURLComponent(url) as decoded_url,
user_agent
FROM raw_web_logs;
Step 3: Backfill existing data
Use INSERT SELECT to decode and populate existing raw URL data:
INSERT INTO web_events
SELECT
timestamp,
url,
decodeURLComponent(url),
user_agent
FROM raw_web_logs
WHERE timestamp < (SELECT min(timestamp) FROM web_events);
This approach moves the computational cost to ingestion time, keeping query response times consistently fast.
Building ClickHouse-based web analytics APIs with Tinybird
Tinybird is a managed service for ClickHouse that makes it simple to build real-time APIs over web clickstream data. Here's how to create a complete web/URL analytics pipeline with data sources, materialized views, and API endpoints.
Step 1: Create a data source for encoded URL strings
First, create a data source to store raw web analytics events with encoded URLs. Create a web_events.datasource
file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC'),
`session_id` String,
`user_id` String,
`encoded_url` String,
`referrer` String,
`user_agent` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String)
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id"
Deploy the data source:
tb --cloud deploy
Ingest some clickstream data using the Events API. Here's an example:
curl -X POST \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
-H "Content-Type: application/json" \
"https://api.tinybird.co/v0/events?name=web_events" \
-d '[
{"timestamp": "2024-12-01 14:30:00", "session_id": "sess_001", "user_id": "usr_123", "encoded_url": "/search%3Fq%3Danalytics%26category%3Dtools", "referrer": "https://google.com", "user_agent": "Mozilla/5.0", "country": "US", "device_type": "desktop"},
{"timestamp": "2024-12-01 14:31:00", "session_id": "sess_002", "user_id": "usr_456", "encoded_url": "/products%2Fdatabase%2Fclickhouse%3Fref%3Dhomepage", "referrer": "https://tinybird.co", "user_agent": "Mozilla/5.0", "country": "UK", "device_type": "mobile"},
{"timestamp": "2024-12-01 14:32:00", "session_id": "sess_003", "user_id": "usr_789", "encoded_url": "/blog%2Freal-time%20analytics%3Futm_source%3Dnewsletter", "referrer": "https://newsletter.com", "user_agent": "Mozilla/5.0", "country": "DE", "device_type": "tablet"}
]'
Step 2: Create a materialized view with URL decoding and aggregation
Create a materialized view that decodes URLs in real-time and performs aggregations. Create a decoded_page_views.datasource file:
SCHEMA >
`page_path` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String),
`hour` DateTime,
`page_views` UInt64,
`unique_sessions` UInt64,
`unique_users` UInt64
ENGINE "SummingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(hour)"
ENGINE_SORTING_KEY "hour, page_path, country, device_type"
Create a materialized view pipe decode_and_aggregate.pipe:
NODE decode_and_aggregate
SQL >
SELECT
decodeURLComponent(encoded_url) AS page_path,
country,
device_type,
toStartOfHour(timestamp) AS hour,
1 AS page_views,
1 AS unique_sessions,
1 AS unique_users
FROM web_events
WHERE page_path IS NOT NULL
AND page_path != ''
TYPE MATERIALIZED
DATASOURCE decoded_page_views
Deploy the materialized view:
tb --cloud deploy
This materialized view automatically:
- Decodes URL-encoded strings in real-time during ingestion
- Aggregates page views by hour, country, and device type
- Stores results in an optimized SummingMergeTree for fast queries
Note that summing a constant 1 per event makes unique_sessions and unique_users event counts rather than true distinct counts; exact uniques would require uniqState columns backed by an AggregatingMergeTree.
Step 3: Create an API endpoint to query the materialized view
Create a pipe to query the materialized view and expose it as an API. Create page_analytics_api.pipe:
TOKEN "page_analytics_read" READ
NODE page_analytics
SQL >
%
SELECT
page_path,
country,
device_type,
toDate(hour) AS date,
toHour(hour) AS hour_of_day,
sum(page_views) AS total_page_views,
sum(unique_sessions) AS total_sessions,
sum(unique_users) AS total_users,
round(sum(page_views) / sum(unique_sessions), 2) AS pages_per_session
FROM decoded_page_views
WHERE hour >= toStartOfDay({{DateTime(start_date, '2024-12-01 00:00:00')}})
AND hour < toStartOfDay({{DateTime(end_date, '2024-12-02 00:00:00')}}) + INTERVAL 1 DAY
{% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{% end %}
{% if defined(device_filter) %}
AND device_type = {{String(device_filter)}}
{% end %}
{% if defined(path_filter) %}
AND page_path LIKE {{String(path_filter, '%')}}
{% end %}
GROUP BY page_path, country, device_type, date, hour_of_day
ORDER BY total_page_views DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 100)}}
{% end %}
TYPE ENDPOINT
Deploy the API endpoint:
tb --cloud deploy
Step 4: Query the API endpoint
Your decoded URL analytics API is now available. Query it using curl:
# Get top pages for a specific date range
curl "https://api.tinybird.co/v0/pipes/page_analytics_api.json?start_date=2024-12-01&end_date=2024-12-01" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Filter by country and device type
curl "https://api.tinybird.co/v0/pipes/page_analytics_api.json?start_date=2024-12-01&end_date=2024-12-01&country_filter=US&device_filter=mobile" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Search for specific pages (the literal % signs in the LIKE pattern are themselves percent-encoded as %25)
curl "https://api.tinybird.co/v0/pipes/page_analytics_api.json?start_date=2024-12-01&end_date=2024-12-01&path_filter=%25search%25" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
The API returns properly decoded URLs and pre-computed analytics:
{
"meta": [
{"name": "page_path", "type": "String"},
{"name": "country", "type": "String"},
{"name": "device_type", "type": "String"},
{"name": "date", "type": "Date"},
{"name": "hour_of_day", "type": "UInt8"},
{"name": "total_page_views", "type": "UInt64"},
{"name": "total_sessions", "type": "UInt64"},
{"name": "total_users", "type": "UInt64"},
{"name": "pages_per_session", "type": "Float64"}
],
"data": [
{
"page_path": "/search?q=analytics&category=tools",
"country": "US",
"device_type": "desktop",
"date": "2024-12-01",
"hour_of_day": 14,
"total_page_views": 1,
"total_sessions": 1,
"total_users": 1,
"pages_per_session": 1.0
},
{
"page_path": "/products/database/clickhouse?ref=homepage",
"country": "UK",
"device_type": "mobile",
"date": "2024-12-01",
"hour_of_day": 14,
"total_page_views": 1,
"total_sessions": 1,
"total_users": 1,
"pages_per_session": 1.0
}
],
"rows": 2,
"statistics": {
"elapsed": 0.002,
"rows_read": 3,
"bytes_read": 1024
}
}
This approach provides:
- Real-time URL decoding during data ingestion via materialized views
- Pre-computed aggregations for sub-100ms API response times
- Flexible filtering by country, device, time range, and page path patterns
- Scalable architecture that handles millions of events per second
Your decoded URL data becomes available as a production-ready API endpoint designed for high throughput and low latency. Start building with Tinybird's free plan to create your own URL analytics APIs.
Related URL and string functions in ClickHouse
ClickHouse provides several complementary functions for comprehensive URL manipulation workflows.
encodeURLComponent
The opposite of decodeURLComponent, this function URL-encodes strings for safe transmission:
SELECT encodeURLComponent('search query with spaces') AS encoded;
-- Returns: search%20query%20with%20spaces
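The encode/decode round trip is easy to sanity-check in Python (a non-ClickHouse sketch; note that Python's quote leaves characters like / unescaped by default, so it is looser than a strict component encoder):

```python
from urllib.parse import quote, unquote

original = "search query with spaces"

# Spaces become %20; unreserved letters pass through unchanged
encoded = quote(original)
print(encoded)                       # -> 'search%20query%20with%20spaces'

# Decoding restores the original string exactly
assert unquote(encoded) == original
```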
extractURLParameter
After decoding URLs, you often extract specific parameter values:
SELECT
decodeURLComponent(extractURLParameter(full_url, 'q')) AS search_term
FROM search_logs;
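For comparison, Python's urllib does the extract-and-decode in one step, since parse_qs percent-decodes parameter values automatically (a Python sketch with an example URL, not the ClickHouse function):

```python
from urllib.parse import urlparse, parse_qs

full_url = "https://example.com/search?q=real%20time%20analytics&type=docs"

# parse_qs splits the query string and percent-decodes each value,
# so no separate unquote call is needed
params = parse_qs(urlparse(full_url).query)
print(params["q"][0])  # -> 'real time analytics'
```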
substring and length helpers
String functions work well alongside URL decoding for text analysis:
- substring(): Extracts portions of decoded URLs for display or storage optimization
- length(): Measures decoded string lengths for validation or analytics
- position(): Finds specific patterns within decoded URL components
Next steps with Tinybird's managed ClickHouse
Tinybird provides a managed ClickHouse service that handles operational complexity while offering the same SQL functions and performance characteristics.
With Tinybird, you get fast deployments, scalable infrastructure, and built-in API generation so you can focus on building features rather than managing databases. The platform includes live schema migrations, built-in observability, and data-as-code workflows that make URL processing pipelines easier to develop and maintain.
Additional resources
- Functions for Working with URLs
- String Functions
- Materialized Views
- Use Materialized Views
- Tinybird Web Analytics Starter Kit
FAQs about URL decoding in ClickHouse
Does decodeURLComponent work on arrays of strings?
The function only accepts single String arguments, but you can use arrayMap(x -> decodeURLComponent(x), array_column) to decode string arrays element by element.
Can decodeURLComponent handle malformed percent sequences?
ClickHouse returns the original string unchanged when it encounters invalid percent-encoding sequences rather than throwing errors, making it safe to use on mixed or dirty data.
Does decoding impact query performance significantly?
String decoding adds computational overhead, so consider using materialized views or pre-computed columns for frequently accessed decoded data, especially when processing billions of rows.