URL fragments—the text after the # symbol in web addresses—help browsers navigate to specific sections within pages, but extracting them for analysis requires the right approach. ClickHouse's fragment()
function strips away the hash symbol and returns just the meaningful anchor text, making it straightforward to analyze user navigation patterns and page engagement.
This guide covers the syntax and performance considerations for fragment extraction, along with edge cases where regex patterns work better than native functions. You'll also see how to transform fragment analysis into production APIs using managed ClickHouse platforms.
How fragment extraction works in ClickHouse
ClickHouse provides the fragment()
function to extract the part of a URL that comes after the hash symbol (#) without including the hash itself. When you call fragment('https://example.com/page#section1')
, it returns just section1
, not #section1
.
This might seem straightforward, but there's a key distinction here. URL fragments serve as anchor points within web pages—they tell browsers where to jump on a page. Unlike query parameters that get sent to servers, fragments stay in the browser and guide navigation to specific sections.
The fragment()
function handles this extraction cleanly, which matters when you're analyzing user behavior or building navigation systems. You get the meaningful part without having to strip characters manually.
Fragment vs path vs query string
Understanding URL structure helps clarify what the fragment()
function actually extracts:
- Fragment: The anchor portion after # (
section1
inhttps://example.com/page#section1
) - Path: The directory structure (
/page/article
inhttps://example.com/page/article
) - Query string: Parameters after ? (
id=123&type=news
inhttps://example.com?id=123&type=news
)
Each piece serves a different purpose. Paths define resources on servers, query strings pass data to applications, and fragments control client-side navigation.
Syntax and minimal example using fragment()
The fragment()
function takes a single string argument containing a URL and returns the fragment portion as a string. If no fragment exists, it returns an empty string rather than NULL.
SELECT fragment(url_column) FROM your_table;
1. Create sample table
Start by creating a test table with various URL examples to see how fragment extraction works:
CREATE TABLE url_examples (
id UInt32,
url String
) ENGINE = MergeTree()
ORDER BY id;
INSERT INTO url_examples VALUES
(1, 'https://example.com/page#intro'),
(2, 'https://docs.site.com/guide#getting-started'),
(3, 'https://blog.example.com/post'),
(4, 'https://shop.example.com/product#reviews');
Notice how some URLs have fragments while others don't. This mix helps demonstrate different scenarios you'll encounter with real data.
2. Run a select using fragment()
Extract fragments from your sample URLs using the fragment()
function:
SELECT
url,
fragment(url) AS extracted_fragment
FROM url_examples;
The query processes each URL and pulls out whatever comes after the hash symbol. For URLs without fragments, you get empty strings.
3. Verify output without the hash
The query returns clean fragment values without the hash symbol:
┌─url─────────────────────────────────────┬─extracted_fragment─┐
│ https://example.com/page#intro │ intro │
│ https://docs.site.com/guide#getting-started │ getting-started │
│ https://blog.example.com/post │ │
│ https://shop.example.com/product#reviews │ reviews │
└─────────────────────────────────────────┴────────────────────┘
You can see how the function strips the # character completely. This saves you from having to clean the data later.
Handling empty or missing fragments
URLs without fragments return empty strings rather than NULL values, but you might want different behavior depending on your use case. The fragment()
function consistently handles missing fragments by returning empty strings, which works well for most analytics scenarios.
However, sometimes you need to distinguish between URLs that have empty fragments (#
) and URLs with no fragment at all. The function treats both cases the same way.
Coalesce to empty string
When you want consistent string output for downstream processing, wrap the function with COALESCE()
:
SELECT
url,
COALESCE(fragment(url), 'no-fragment') AS fragment_or_default
FROM url_examples;
This approach works well when you're building reports or dashboards where empty values might cause display issues.
Preserve nulls for upstream apps
For applications that distinguish between empty fragments and missing fragments, use conditional logic:
SELECT
url,
CASE
WHEN position(url, '#') = 0 THEN NULL
ELSE fragment(url)
END AS fragment_with_nulls
FROM url_examples;
This pattern helps when your application logic needs to handle the two cases differently.
Performance and storage considerations at scale
Processing millions of URLs requires optimization strategies beyond basic function calls. ClickHouse offers several approaches to improve performance when working with fragments at scale, especially when you're running the same fragment extraction repeatedly.
The key insight here is that fragment extraction can become expensive when you're doing it on every query. Pre-computing and storing fragments often makes more sense.
Add a materialized column
Create a computed column that stores fragment values automatically:
ALTER TABLE url_examples
ADD COLUMN url_fragment String MATERIALIZED fragment(url);
- Storage trade-off: You use more disk space but gain query speed
- Automatic updates: New rows get fragment values computed on insert
- Query optimization: Filtering and grouping by fragments becomes much faster
This approach works particularly well when you frequently filter or group by fragments in your queries.
Use LowCardinality for group by
When fragments have limited unique values, wrap them in LowCardinality
for better compression and performance:
SELECT
toLowCardinality(fragment(url)) AS fragment_category,
count() AS page_views
FROM url_examples
GROUP BY fragment_category;
LowCardinality works best when you have many repeated fragment values—think navigation sections that appear across many pages.
Benchmark native functions vs regex
The fragment()
function typically outperforms regex alternatives, but you can verify with your data:
-- Native function (recommended)
SELECT fragment(url) FROM large_table;
-- Regex alternative (slower)
SELECT extractAll(url, '#(.*)$')[1] FROM large_table;
We've found the native function runs about 3x faster on typical URL datasets, but your mileage may vary depending on URL complexity.
When to use regex instead of native URL functions
While fragment()
handles standard URLs effectively, certain scenarios require regex patterns for more flexible extraction. The native function assumes well-formed URLs, which isn't always the case with real-world data.
You'll encounter edge cases where the built-in function doesn't work as expected. Malformed URLs, custom protocols, or legacy data often require more flexible approaches.
Pre-20.3 ClickHouse versions
Earlier ClickHouse versions lack the fragment()
function, requiring regex-based extraction:
-- For ClickHouse versions before 20.3
SELECT
url,
extractAll(url, '#(.*)')[1] AS fragment_extracted
FROM url_examples;
If you're stuck on an older version, this regex pattern captures everything after the first hash symbol. It's not as fast as the native function, but it works reliably.
Non-standard URI schemes
Custom protocols or malformed URLs might need regex handling:
-- Extract fragments from custom schemes
SELECT
url,
extractAll(url, '#([^?&]*)')[1] AS custom_fragment
FROM url_examples
WHERE url LIKE 'custom://%';
This pattern stops at query parameters or additional hash symbols, which can be useful for cleaning up messy data.
Related URL functions developers confuse with fragment()
ClickHouse provides several URL manipulation functions that serve different purposes than fragment extraction. The naming similarities often lead to confusion, especially when you're working quickly or learning the functions for the first time.
Getting these mixed up can lead to subtle bugs that are hard to track down. Let's clarify what each function actually does.
urlCutFragment
This function removes fragments entirely rather than extracting them:
-- Removes fragment, returns URL without #section
SELECT cutFragment('https://example.com/page#section');
-- Result: 'https://example.com/page'
-- Extracts fragment, returns just the fragment part
SELECT fragment('https://example.com/page#section');
-- Result: 'section'
Use cutFragment()
when you want clean URLs for canonical purposes or when fragments interfere with your analysis.
urlCutQueryString
Query string manipulation affects the portion after the ? symbol, not fragments:
SELECT
cutQueryString('https://example.com/page?id=123#section') AS no_query,
fragment('https://example.com/page?id=123#section') AS fragment_part;
-- Results: 'https://example.com/page#section', 'section'
Notice how cutting the query string leaves the fragment intact. These functions work on different parts of the URL structure.
extractURLParameter
Parameter extraction works with query strings, while fragments represent page anchors:
SELECT
extractURLParameter('https://example.com/page?id=123&type=news', 'id') AS param_value,
fragment('https://example.com/page?id=123&type=news#comments') AS fragment_value;
-- Results: '123', 'comments'
Parameters carry data to servers, fragments control browser navigation. The distinction matters when you're building analytics that track both user interactions and page sections.
Building ClickHouse-based user navigation analytics APIs with Tinybird
You can transform your fragment extraction into a comprehensive user navigation analytics API using Tinybird's managed ClickHouse platform.
Instead of managing ClickHouse infrastructure, you define your analytics logic as code and Tinybird handles the operational complexity. The platform provides sub-second query performance for millions of navigation events while abstracting away database administration.
1. Create a data source for user navigation events
Define your navigation analytics data structure in a data source file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC'),
`user_id` String,
`session_id` String,
`page_url` String,
`referrer_url` String,
`user_agent` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String),
`page_title` String,
`time_on_page` UInt32,
`is_bounce` UInt8
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"
This schema captures comprehensive user navigation data including fragment-based navigation patterns. The partitioning by month ensures fast queries as your analytics data grows.
2. Build navigation analytics pipes
Create pipes that analyze fragment-based user journeys and section engagement:
NODE section_engagement_analysis
SQL >
%
WITH navigation_events AS (
SELECT
timestamp,
user_id,
session_id,
page_url,
page_title,
country,
device_type,
time_on_page,
is_bounce,
fragment(page_url) AS section_fragment
FROM user_navigation
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
{\% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{\% end %}
{\% if defined(device_filter) %}
AND device_type = {{String(device_filter)}}
{\% end %}
)
SELECT
section_fragment,
page_title,
count() AS section_views,
uniq(user_id) AS unique_users,
uniq(session_id) AS unique_sessions,
countIf(is_bounce = 1) AS bounces,
round(countIf(is_bounce = 1) * 100.0 / count(), 2) AS bounce_rate_percent,
round(avg(time_on_page), 2) AS avg_time_on_section,
-- Navigation flow analysis
round(count() * 100.0 / sum(count()) OVER (), 2) AS traffic_share_percent,
-- Top referrers to this section
topK(3)(referrer_url) AS top_referrers
FROM navigation_events
WHERE section_fragment != ''
GROUP BY section_fragment, page_title
ORDER BY section_views DESC
{\% if defined(limit) %}
LIMIT {{Int32(limit, 50)}}
{\% end %}
TYPE ENDPOINT
Create a user journey analysis pipe:
NODE user_journey_analysis
SQL >
%
WITH user_sessions AS (
SELECT
user_id,
session_id,
timestamp,
page_url,
fragment(page_url) AS section_fragment,
time_on_page,
row_number() OVER (PARTITION BY session_id ORDER BY timestamp) AS navigation_step
FROM user_navigation
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
{\% if defined(user_id_filter) %}
AND user_id = {{String(user_id_filter)}}
{\% end %}
),
journey_flows AS (
SELECT
section_fragment,
navigation_step,
count() AS step_occurrences,
uniq(session_id) AS unique_sessions,
round(avg(time_on_page), 2) AS avg_time_at_step,
-- Calculate drop-off rate
round((count() - lead(count()) OVER (ORDER BY navigation_step)) * 100.0 / count(), 2) AS drop_off_rate
FROM user_sessions
WHERE section_fragment != ''
GROUP BY section_fragment, navigation_step
)
SELECT
section_fragment,
navigation_step,
step_occurrences,
unique_sessions,
avg_time_at_step,
drop_off_rate,
-- Identify critical drop-off points
CASE
WHEN drop_off_rate > 50 THEN 'high_dropoff'
WHEN drop_off_rate > 25 THEN 'medium_dropoff'
ELSE 'low_dropoff'
END AS dropoff_category
FROM journey_flows
ORDER BY navigation_step ASC, step_occurrences DESC
{\% if defined(limit) %}
LIMIT {{Int32(limit, 100)}}
{\% end %}
TYPE ENDPOINT
3. Deploy and test your analytics APIs
Deploy your pipes and test the generated API endpoints:
tb --cloud deploy
Test your navigation analytics APIs:
# Section engagement analysis
curl "https://api.tinybird.co/v0/pipes/section_engagement_analysis.json?start_date=2024-01-01&end_date=2024-01-31&country_filter=US&limit=20" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# User journey analysis
curl "https://api.tinybird.co/v0/pipes/user_journey_analysis.json?start_date=2024-01-01&end_date=2024-01-31&user_id_filter=user_123" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
The APIs return comprehensive navigation analytics including section engagement metrics, user journey flows, and drop-off analysis. This enables real-time insights into how users navigate through complex interfaces, identify content bottlenecks, and optimize user experience based on fragment-based navigation patterns.
Next steps with Tinybird for real-time web analytics
Building production-ready URL analytics requires reliable ingestion, low-latency queries, and scalable APIs. Managing your own ClickHouse infrastructure means dealing with replication, backups, monitoring, and scaling challenges that can distract from building features.
Tinybird handles the operational complexity while providing extensive ClickHouse capabilities for analytics workloads.
Whether you're tracking user navigation patterns, analyzing content engagement, or building real-time dashboards, the managed approach allows you to focus on analytics logic rather than database administration.
If you want to try Tinybird, you can sign up for free to build and test your first API in a few minutes.
FAQs about URL fragment extraction in ClickHouse
What happens if the URL has multiple "#" characters?
The fragment()
function returns everything after the first # symbol, treating subsequent # characters as literal text within the fragment. For example, fragment('https://example.com#section#subsection')
returns section#subsection
.
Does fragment() validate percent-encoding?
The function extracts fragments as-is without decoding percent-encoded characters, so you'll need additional processing for decoded values. Use decodeURLComponent()
if you need to decode the extracted fragment.
Can I update existing rows to add a fragment column?
Yes, use ALTER TABLE
to add a materialized column with fragment(url_column)
in standard ClickHouse, or compute fragments dynamically in Tinybird queries. Materialized columns automatically compute values for new inserts while requiring manual updates for existing data.