When you're analyzing web traffic or building domain-based analytics in ClickHouse, raw URLs like news.bbc.co.uk and www.bbc.co.uk may represent the same core domain but get aggregated separately. The cutToFirstSignificantSubdomain() function solves this by extracting the registrable domain – the core organizational identifier – stripping away prefixes while respecting complex TLD structures.
This guide covers the function's syntax and behavior, practical implementation patterns, and how to build real-time APIs that normalize domains at scale.
What is the first significant subdomain?
In ClickHouse terms, the first significant subdomain is the label that was actually registered – bbc in news.bbc.co.uk. cutToFirstSignificantSubdomain() returns that label together with its public suffix (bbc.co.uk), which is the registrable domain. In practice, this strips away non-essential prefixes like www, m, or api while preserving the organization's core domain identity.
Think of it this way: when you see news.bbc.co.uk, the function returns bbc.co.uk. It ignores the news part because it's just a subdomain, but keeps bbc.co.uk because that's where the actual domain registration happens.
Here's how it works with different URL patterns:
- support.google.co.jp becomes google.co.jp
- m.example.com becomes example.com
- api.internal.company.org becomes company.org
The function handles edge cases automatically. IPv4 and IPv6 addresses return empty strings since they're not domains, and port numbers get stripped from the result entirely.
When should you use cutToFirstSignificantSubdomain?
Devs building web analytics views use this function when they want to group metrics by organization rather than tracking every single subdomain separately. Instead of having separate entries for blog.company.com, shop.company.com, and support.company.com, you get unified analytics under company.com.
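For instance, here's a minimal sketch of that grouping – page_views is a hypothetical events table with a url column:
SELECT
    cutToFirstSignificantSubdomain(url) AS org_domain,
    count() AS pageviews
FROM page_views
GROUP BY org_domain
ORDER BY pageviews DESC;
-- blog.company.com, shop.company.com, and support.company.com all roll up into 'company.com'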
Content categorization becomes much cleaner too. Rather than maintaining complex rules for every possible subdomain combination, you can normalize sources to a single domain identity for deduplication and classification work.
The function exists because maintaining your own public suffix list is a nightmare, while building regex patterns for domain parsing often produces wrong results. ClickHouse's built-in function handles the complexity of multi-level TLDs and international domains without you having to think about it.
You'll find this particularly useful when building dashboards, attribution models, or any system where consistent domain normalization matters more than preserving exact subdomain details.
Syntax of cutToFirstSignificantSubdomain
The function signature is straightforward: cutToFirstSignificantSubdomain(URL: String) → String. It takes any string containing a URL or host and returns the normalized domain as a string.
The URL parameter accepts full URLs with schemes, bare hostnames, or even malformed strings. Only properly formatted URLs produce meaningful results though – garbage input returns an empty string rather than throwing an error.
This makes the function safe to use in aggregations and joins without worrying about null handling or error catching.
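As an illustration, the normalized domain also makes a safe join key – events and domain_owners below are hypothetical tables, and malformed URLs simply produce empty keys that never match:
SELECT
    d.owner,
    count() AS events
FROM
(
    SELECT cutToFirstSignificantSubdomain(url) AS registrable_domain
    FROM events
) AS e
INNER JOIN domain_owners AS d USING (registrable_domain)
GROUP BY d.owner
ORDER BY events DESC;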
Quick example query
Here's how the function behaves with different types of input:
SELECT cutToFirstSignificantSubdomain('https://news.bbc.co.uk/world') AS result;
-- Result: 'bbc.co.uk'
SELECT cutToFirstSignificantSubdomain('http://www.github.com/docs') AS result;
-- Result: 'github.com'
SELECT cutToFirstSignificantSubdomain('https://store.google.co.jp/cart') AS result;
-- Result: 'google.co.jp'
And some edge cases to show robustness:
SELECT cutToFirstSignificantSubdomain('http://localhost:8080') AS result;
-- Result: '' (localhost isn't a public domain)
SELECT cutToFirstSignificantSubdomain('not a url') AS result;
-- Result: '' (invalid input)
SELECT cutToFirstSignificantSubdomain('https://192.168.1.1/path') AS result;
-- Result: '' (IP addresses return empty)
The function handles internationalized domains as well by converting them to punycode internally, so domains with non-ASCII characters work correctly.
SELECT cutToFirstSignificantSubdomain('https://bücher.example.de/path') AS result;
-- Result: 'example.de'
SELECT cutToFirstSignificantSubdomain('https://新闻.example.com/news') AS result;
-- Result: 'example.com'
Handling multi-level TLDs and IDNs
.co.uk and other public suffixes
The function uses ClickHouse's built-in public suffix list to correctly interpret complex TLD structures. This means domains like .co.uk, .com.au, and .gov.uk are properly recognized as single units rather than being split incorrectly.
For a.b.example.co.uk, the function correctly returns example.co.uk because it understands that .co.uk is a single public suffix. Without this knowledge, naive parsing might return co.uk or b.example.co.uk depending on the heuristic, or fail entirely.
Government and educational domains benefit particularly from this handling since they often use multi-level TLD structures that trip up simple string splitting approaches.
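For example (assuming the embedded public suffix list covers these suffixes, which it generally does since it tracks the standard list):
SELECT cutToFirstSignificantSubdomain('https://www.dwp.gov.uk/benefits') AS result;
-- Expected result: 'dwp.gov.uk' (gov.uk is treated as the public suffix)
SELECT cutToFirstSignificantSubdomain('https://library.anu.edu.au/catalogue') AS result;
-- Expected result: 'anu.edu.au' (edu.au is treated as the public suffix)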
Internationalized domain names
Non-ASCII domains get automatically normalized using punycode encoding under the hood. This ensures consistent results regardless of whether your input uses Unicode characters or their encoded equivalents.
When processing domains like https://bücher.de, the function converts the Unicode characters to their punycode representation before applying the extraction logic. Your results will show the punycode form, which maintains consistency across different systems and character encodings.
Unicode in URL paths or query parameters doesn't affect domain extraction – the function focuses solely on the host portion of the URL.
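For instance, non-ASCII characters in the path or query string have no effect on the extracted domain:
SELECT cutToFirstSignificantSubdomain('https://shop.example.com/søk?q=bücher') AS result;
-- Expected result: 'example.com' – only the host is considered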
Performance benchmarks vs regex parsing
CPU usage comparison
ClickHouse's native implementation significantly outperforms regex-based alternatives because it avoids the backtracking and state management overhead that complex regular expressions require. The C++ implementation processes domains using optimized string operations rather than pattern matching.
Vectorized execution means the function can process entire columns efficiently in parallel, while regex solutions typically evaluate row-by-row with much higher per-operation overhead. This difference becomes dramatic when processing millions of URLs.
Memory allocation comparison
The built-in function minimizes temporary allocations by reusing internal buffers and avoiding the capture groups that regex engines create. Regular expressions for URL parsing often require multiple allocation cycles for backreferences and intermediate results.
Lower memory pressure translates to more stable performance under concurrent load, since there's less garbage collection overhead and fewer cache misses from scattered allocations.
Throughput under concurrency
When multiple queries run simultaneously, the native function scales naturally with ClickHouse's parallel execution model. Complex regex patterns often become bottlenecks under concurrency due to their computational complexity and inability to leverage vectorization.
In practice, you'll see 3-5x better throughput with the native function compared to equivalent regex solutions, especially when processing wide tables with many URL columns.
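If you want a rough sense of the difference on your own hardware, a deliberately simple comparison over synthetic URLs looks something like this – the regex is a naive stand-in (it grabs the last two labels, which is also wrong for multi-level TLDs), and exact timings will vary by version and machine:
-- Native public-suffix-aware extraction over 10 million synthetic URLs
SELECT count()
FROM (SELECT concat('https://sub', toString(number % 1000), '.example.co.uk/path') AS url FROM numbers(10000000))
WHERE cutToFirstSignificantSubdomain(url) != '';
-- Naive regex alternative: slower, and it returns 'co.uk' here instead of 'example.co.uk'
SELECT count()
FROM (SELECT concat('https://sub', toString(number % 1000), '.example.co.uk/path') AS url FROM numbers(10000000))
WHERE extract(url, '([^./]+\\.[^./]+)(?:/|$)') != '';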
Persisting the parsed domain in a materialized view
1. Create table with raw URL
Start by creating a table to store your raw URL data:
CREATE TABLE events_raw
(
ts DateTime,
url String,
user_id UInt64
)
ENGINE = MergeTree
ORDER BY ts;
This table captures the incoming data exactly as received, without any preprocessing. The simple structure makes it fast for inserts while providing a foundation for downstream transformations.
2. Define materialized view with function
Create a target table to store the processed results:
CREATE TABLE events_domains
(
ts DateTime,
url String,
first_domain String,
user_id UInt64
)
ENGINE = MergeTree
ORDER BY (ts, first_domain);
Then create the materialized view that automatically processes new data:
CREATE MATERIALIZED VIEW mv_events_domains
TO events_domains
AS
SELECT
ts,
url,
cutToFirstSignificantSubdomain(url) AS first_domain,
user_id
FROM events_raw;
The materialized view triggers automatically on each insert to events_raw, extracting the first significant subdomain and storing it alongside the original data.
3. Verify results
Test your setup with sample data:
INSERT INTO events_raw VALUES
(now(), 'https://news.bbc.co.uk/world', 12345),
(now(), 'http://www.github.com/docs', 67890),
(now(), 'http://localhost:8080/test', 11111);
SELECT ts, url, first_domain, user_id
FROM events_domains
ORDER BY ts
LIMIT 10;
You'll see results with bbc.co.uk for the BBC news URL, github.com for the GitHub documentation, and an empty string for localhost since it's not a public domain. This approach gives you both the raw data for debugging and the processed domain for fast analytics queries.
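From here, analytics queries can hit the pre-processed column directly – for example, a quick top-domains report:
SELECT
    first_domain,
    count() AS events,
    uniqExact(user_id) AS users
FROM events_domains
WHERE first_domain != ''
GROUP BY first_domain
ORDER BY events DESC
LIMIT 20;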
Alternatives and related URL functions
domain
The domain() function returns the complete host component exactly as it appears in the URL, including all subdomains. For https://news.bbc.co.uk/path, it returns news.bbc.co.uk rather than the normalized bbc.co.uk.
Use domain() when you need fine-grained subdomain analysis or want to preserve the exact host for logging and debugging purposes. It's particularly useful for security analysis where the specific subdomain matters.
topLevelDomain
The topLevelDomain() function extracts only the final label of the domain. For https://news.bbc.co.uk, it returns uk – the raw top-level domain, not the full public suffix like co.uk.
This function works well for market segmentation analysis, geographic inference based on country-code TLDs, or understanding the distribution of domain types in your dataset.
cutWWW
The cutWWW() function provides lighter normalization by removing a leading 'www.' prefix from the URL's host. For www.github.com, it returns github.com without considering public suffix rules.
You can combine functions for custom normalization: cutWWW(domain(url)) – or the equivalent built-in domainWithoutWWW(url) – gives you subdomain preservation with www trimming, while cutToFirstSignificantSubdomain(url) provides full public-suffix-aware normalization.
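A quick side-by-side makes the differences concrete (the values in comments are the expected outputs):
SELECT
    domain('https://www.news.bbc.co.uk/world') AS full_host,                           -- 'www.news.bbc.co.uk'
    domainWithoutWWW('https://www.news.bbc.co.uk/world') AS host_without_www,          -- 'news.bbc.co.uk'
    topLevelDomain('https://www.news.bbc.co.uk/world') AS tld,                         -- 'uk'
    cutToFirstSignificantSubdomain('https://www.news.bbc.co.uk/world') AS registrable; -- 'bbc.co.uk'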
Example: Building ClickHouse-based conversion funnel analytics APIs with Tinybird
A common e-commerce analytics pattern built on ClickHouse involves tracking user conversion funnels in multi-tenant environments across different domains and subdomains. Using cutToFirstSignificantSubdomain(), you can normalize domains to analyze conversion rates by organization rather than getting fragmented metrics across subdomains.
Here's how to build a complete funnel analytics API that tracks pageviews → cart additions → checkouts across normalized domains using Tinybird's managed ClickHouse platform.
Step 1: Create a data source for e-commerce events
First, create a data source to store your e-commerce tracking events. Create an ecommerce_events.datasource file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC') `json:$.timestamp`,
`session_id` String `json:$.session_id`,
`user_id` String `json:$.user_id`,
`event_type` LowCardinality(String) `json:$.event_type`,
`page_url` String `json:$.page_url`,
`referrer_url` String `json:$.referrer_url`,
`product_id` String `json:$.product_id`,
`product_category` LowCardinality(String) `json:$.product_category`,
`revenue` Float64 `json:$.revenue`,
`user_agent` String `json:$.user_agent`,
`country` LowCardinality(String) `json:$.country`
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id"
Step 2: Create a materialized view for funnel aggregation
Create a target table to store pre-aggregated funnel metrics by normalized domain. Create a funnel_metrics.datasource file:
SCHEMA >
`normalized_domain` String,
`event_date` Date,
`product_category` LowCardinality(String),
`country` LowCardinality(String),
`pageviews` UInt64,
`cart_additions` UInt64,
`checkouts` UInt64,
`total_revenue` Float64,
`unique_sessions` UInt64,
`unique_users` UInt64
ENGINE "SummingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(event_date)"
ENGINE_SORTING_KEY "event_date, normalized_domain, product_category, country"
Create a materialized view pipe to aggregate funnel events. Create funnel_aggregation.pipe:
NODE funnel_aggregation
SQL >
SELECT
cutToFirstSignificantSubdomain(page_url) AS normalized_domain,
toDate(timestamp) AS event_date,
product_category,
country,
countIf(event_type = 'pageview') AS pageviews,
countIf(event_type = 'add_to_cart') AS cart_additions,
countIf(event_type = 'checkout') AS checkouts,
sumIf(revenue, event_type = 'checkout') AS total_revenue,
uniq(session_id) AS unique_sessions,
uniq(user_id) AS unique_users
FROM ecommerce_events
WHERE normalized_domain != ''
GROUP BY
normalized_domain,
event_date,
product_category,
country
TYPE MATERIALIZED
DATASOURCE funnel_metrics
Step 3: Create conversion funnel analytics APIs
Create a pipe for detailed funnel analysis. Create conversion_funnel_api.pipe:
NODE funnel_analysis
SQL >
%
SELECT
normalized_domain,
product_category,
country,
sum(pageviews) AS total_pageviews,
sum(cart_additions) AS total_cart_additions,
sum(checkouts) AS total_checkouts,
sum(total_revenue) AS total_revenue,
sum(unique_sessions) AS total_sessions,
sum(unique_users) AS total_users,
-- Conversion rates
round((sum(cart_additions) * 100.0) / nullIf(sum(pageviews), 0), 2) AS pageview_to_cart_rate,
round((sum(checkouts) * 100.0) / nullIf(sum(cart_additions), 0), 2) AS cart_to_checkout_rate,
round((sum(checkouts) * 100.0) / nullIf(sum(pageviews), 0), 2) AS overall_conversion_rate,
-- Revenue metrics
round(sum(total_revenue) / nullIf(sum(checkouts), 0), 2) AS average_order_value,
round(sum(total_revenue) / nullIf(sum(unique_users), 0), 2) AS revenue_per_user
FROM funnel_metrics
WHERE event_date >= {{Date(start_date, '2024-12-01')}}
AND event_date <= {{Date(end_date, '2024-12-31')}}
{% if defined(domain_filter) %}
AND normalized_domain = {{String(domain_filter)}}
{% end %}
{% if defined(category_filter) %}
AND product_category = {{String(category_filter)}}
{% end %}
{% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{% end %}
GROUP BY normalized_domain, product_category, country
ORDER BY total_revenue DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 50)}}
{% end %}
TYPE ENDPOINT
Create a daily funnel trends API. Create daily_funnel_trends.pipe:
NODE daily_trends
SQL >
%
SELECT
event_date,
normalized_domain,
sum(pageviews) AS daily_pageviews,
sum(cart_additions) AS daily_cart_additions,
sum(checkouts) AS daily_checkouts,
sum(total_revenue) AS daily_revenue,
-- Daily conversion rates
round((sum(cart_additions) * 100.0) / nullIf(sum(pageviews), 0), 2) AS daily_pageview_to_cart_rate,
round((sum(checkouts) * 100.0) / nullIf(sum(cart_additions), 0), 2) AS daily_cart_to_checkout_rate,
round((sum(checkouts) * 100.0) / nullIf(sum(pageviews), 0), 2) AS daily_overall_conversion_rate,
-- Running totals for trend analysis
sum(sum(pageviews)) OVER (
PARTITION BY normalized_domain
ORDER BY event_date
ROWS UNBOUNDED PRECEDING
) AS cumulative_pageviews,
sum(sum(checkouts)) OVER (
PARTITION BY normalized_domain
ORDER BY event_date
ROWS UNBOUNDED PRECEDING
) AS cumulative_checkouts
FROM funnel_metrics
WHERE event_date >= {{Date(start_date, '2024-12-01')}}
AND event_date <= {{Date(end_date, '2024-12-31')}}
{% if defined(domain_filter) %}
AND normalized_domain = {{String(domain_filter)}}
{% end %}
GROUP BY event_date, normalized_domain
ORDER BY event_date DESC, daily_revenue DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 100)}}
{% end %}
TYPE ENDPOINT
Step 4: Test locally with sample data
Start the local Tinybird development server:
tb dev
This starts a local Tinybird instance at http://localhost:7181. Your data sources and pipes are automatically deployed to the local environment.
Ingest sample e-commerce funnel data using the local Events API:
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
"http://localhost:7181/v0/events?name=ecommerce_events" \
-d '[
{"timestamp": "2024-12-01 10:30:00", "session_id": "sess_001", "user_id": "user_123", "event_type": "pageview", "page_url": "https://shop.example.com/products", "referrer_url": "https://www.google.com", "product_id": "", "product_category": "", "revenue": 0, "user_agent": "Mozilla/5.0", "country": "US"},
{"timestamp": "2024-12-01 10:32:00", "session_id": "sess_001", "user_id": "user_123", "event_type": "add_to_cart", "page_url": "https://shop.example.com/products/laptop", "referrer_url": "https://shop.example.com/products", "product_id": "laptop_001", "product_category": "Electronics", "revenue": 0, "user_agent": "Mozilla/5.0", "country": "US"},
{"timestamp": "2024-12-01 10:35:00", "session_id": "sess_001", "user_id": "user_123", "event_type": "checkout", "page_url": "https://checkout.example.com/complete", "referrer_url": "https://shop.example.com/cart", "product_id": "laptop_001", "product_category": "Electronics", "revenue": 1299.99, "user_agent": "Mozilla/5.0", "country": "US"},
{"timestamp": "2024-12-01 11:15:00", "session_id": "sess_002", "user_id": "user_456", "event_type": "pageview", "page_url": "https://m.example.com/products", "referrer_url": "https://facebook.com", "product_id": "", "product_category": "", "revenue": 0, "user_agent": "Mobile Safari", "country": "CA"},
{"timestamp": "2024-12-01 11:18:00", "session_id": "sess_002", "user_id": "user_456", "event_type": "add_to_cart", "page_url": "https://m.example.com/products/phone", "referrer_url": "https://m.example.com/products", "product_id": "phone_002", "product_category": "Electronics", "revenue": 0, "user_agent": "Mobile Safari", "country": "CA"}
]'
Test your funnel analytics APIs locally:
# Test the main funnel analysis API
curl "http://localhost:7181/v0/pipes/conversion_funnel_api.json?start_date=2024-12-01&end_date=2024-12-31"
# Test with domain filtering
curl "http://localhost:7181/v0/pipes/conversion_funnel_api.json?domain_filter=example.com&category_filter=Electronics"
# Test daily trends API
curl "http://localhost:7181/v0/pipes/daily_funnel_trends.json?domain_filter=example.com&start_date=2024-12-01&end_date=2024-12-07"
You should see the domain normalization in action – the shop.example.com, m.example.com, and checkout.example.com URLs all get normalized to example.com in the results.
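You can also query the local instance directly to confirm what landed in the materialized target – for example with tb sql, if your CLI version includes that command:
SELECT
    normalized_domain,
    sum(pageviews) AS pageviews,
    sum(checkouts) AS checkouts
FROM funnel_metrics
GROUP BY normalized_domain
ORDER BY pageviews DESC;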
Step 5: Deploy to production
Once you've tested your funnel analytics locally and verified the domain normalization works correctly, deploy to your Tinybird cloud workspace:
tb --cloud deploy
Your APIs are now available in production. Query them using your cloud workspace URLs:
# Production API calls (set $TINYBIRD_HOST and $TINYBIRD_TOKEN for your workspace)
curl "$TINYBIRD_HOST/v0/pipes/conversion_funnel_api.json?start_date=2024-12-01&end_date=2024-12-31" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
curl "$TINYBIRD_HOST/v0/pipes/daily_funnel_trends.json?domain_filter=example.com" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
The APIs return comprehensive funnel analytics with domain normalization:
{
"meta": [
{"name": "normalized_domain", "type": "String"},
{"name": "product_category", "type": "String"},
{"name": "country", "type": "String"},
{"name": "total_pageviews", "type": "UInt64"},
{"name": "total_cart_additions", "type": "UInt64"},
{"name": "total_checkouts", "type": "UInt64"},
{"name": "total_revenue", "type": "Float64"},
{"name": "total_sessions", "type": "UInt64"},
{"name": "total_users", "type": "UInt64"},
{"name": "pageview_to_cart_rate", "type": "Float64"},
{"name": "cart_to_checkout_rate", "type": "Float64"},
{"name": "overall_conversion_rate", "type": "Float64"},
{"name": "average_order_value", "type": "Float64"},
{"name": "revenue_per_user", "type": "Float64"}
],
"data": [
{
"normalized_domain": "example.com",
"product_category": "Electronics",
"country": "US",
"total_pageviews": 1,
"total_cart_additions": 1,
"total_checkouts": 1,
"total_revenue": 1299.99,
"total_sessions": 1,
"total_users": 1,
"pageview_to_cart_rate": 100.0,
"cart_to_checkout_rate": 100.0,
"overall_conversion_rate": 100.0,
"average_order_value": 1299.99,
"revenue_per_user": 1299.99
}
],
"rows": 1,
"statistics": {
"elapsed": 0.003,
"rows_read": 5,
"bytes_read": 640
}
}
This approach provides:
- Domain normalization using cutToFirstSignificantSubdomain() to unify metrics across subdomains
- Complete funnel tracking from pageviews through cart additions to final checkouts
- Pre-aggregated metrics via materialized views for low-latency API responses
- Conversion rate calculations at each funnel stage with percentage breakdowns
- Revenue analytics including average order value and revenue per user
- Time-series analysis with daily trends and cumulative metrics
- Multi-dimensional filtering by domain, product category, and geographic region
Your e-commerce funnel data becomes available as production-ready APIs that automatically normalize domains while providing detailed conversion insights. Start building with Tinybird's free plan to create your own ClickHouse-based web analytics APIs.
What to do next with ClickHouse and Tinybird
Working with URL parsing in production often means dealing with streaming data, real-time transformations, and the need to expose insights via APIs for your app to consume, all while maintaining sub-second query performance. Managing ClickHouse infrastructure for these requirements can quickly become complex.
Tinybird abstracts much of the infrastructure management required for ClickHouse, providing broad access to ClickHouse functions like cutToFirstSignificantSubdomain. You can stream URLs, normalize domains in real-time, and expose the results through low-latency REST APIs, with much of the operational overhead handled by the platform.
The platform includes built-in data ingestion via HTTP or Kafka, automatic schema evolution, and rapid API publishing, allowing you to focus on building features rather than managing database infrastructure.
You can sign up for free to build and test your first ClickHouse-backed API in a few minutes.
Additional resources
- Functions for Working with URLs
- String Functions
- Materialized Views
- SummingMergeTree Engine
- Real-time analytics API at scale with billions of rows
- Tinybird vs. ClickHouse: What's the difference?
- Web Analytics Dashboard Starter Kit
FAQs about extracting the first significant subdomain
Which ClickHouse version introduced cutToFirstSignificantSubdomain?
The function has been available in ClickHouse for many years as part of the core URL functions library. If you're running any supported stable release, it will be present. You can verify by running SELECT version() and checking system.functions for the complete list of available functions.
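For example, this query lists the related functions available in your build:
SELECT name
FROM system.functions
WHERE name ILIKE '%significantsubdomain%'
ORDER BY name;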
What happens if the URL has no scheme or host?
Malformed inputs or strings without a valid host component return an empty string rather than throwing an error. For bare hostnames without schemes, the function attempts to parse them, but including a scheme like http:// or https:// ensures more consistent results.
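As a quick check (expected results in comments):
SELECT cutToFirstSignificantSubdomain('news.bbc.co.uk') AS bare_host;
-- Expected result: 'bbc.co.uk' – a bare hostname is typically parsed as a host
SELECT cutToFirstSignificantSubdomain('https://news.bbc.co.uk') AS with_scheme;
-- Expected result: 'bbc.co.uk'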
Does the function keep or strip port numbers?
Port numbers are automatically stripped from the result. Whether your input is https://api.example.com:8443/path or example.com:3000, the function returns the clean registrable domain without any port information.
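For example (expected results in comments):
SELECT cutToFirstSignificantSubdomain('https://api.example.com:8443/path') AS with_port;
-- Expected result: 'example.com'
SELECT cutToFirstSignificantSubdomain('example.com:3000') AS bare_with_port;
-- Expected result: 'example.com'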