ClickHouse is great for analyzing web traffic or building domain classification systems. In these cases, extracting the top-level domain from URLs is a frequent task. ClickHouse provides the topLevelDomain() function specifically for this purpose, returning just the domain extension (com, org, or uk) from any URL string.
This guide covers how to use the function effectively, handle edge cases with complex domains, and how optimize performance when processing millions of URLs through materialized views and API endpoints created using managed ClickHouse services like Tinybird.
What topLevelDomain() does and why it matters
ClickHouse gives you the topLevelDomain() function to pull the top-level domain (TLD) from any URL string. The function grabs the rightmost part of a domain name (e.g. com, org, uk, or net) which sits at the highest level in the domain hierarchy.
This becomes useful when you're analyzing web traffic patterns, sorting domains by region, or building security systems that flag suspicious domain extensions. You might want to group user sessions by country-specific TLDs or filter out certain extensions from your analytics dashboard.
The function works with URLs whether they include the protocol prefix or not, making it flexible for different data sources.
Function signature and basic example
The topLevelDomain() function takes a single string parameter containing a URL and returns the TLD as a string. Here's how it looks in practice:
SELECT topLevelDomain('https://example.com/path?param=value');
-- Returns: com
You can test it with different URL formats to see how it handles various cases:
SELECT
topLevelDomain('https://www.google.com') AS with_www,
topLevelDomain('subdomain.github.io') AS subdomain,
topLevelDomain('example.org.uk') AS multi_part_tld;
Parameter and return types
The function expects a String type input and returns a String containing just the TLD portion. When you pass in a malformed URL or something without a valid domain, the function returns an empty string rather than throwing an error.
This graceful handling means your queries won't break when they encounter bad data, though you'll want to filter out empty results in most cases.
Minimal select query
Here's the simplest way to test the function:
SELECT topLevelDomain('https://tinybird.co');
-- Output: co
Notice how the function strips away everything except the final domain extension. This clean output makes it perfect for domain classification tasks where you only care about the TLD type.
Handling subdomains and www prefixes
The topLevelDomain() function extracts the TLD regardless of how complex your subdomains get. Whether you're dealing with www.example.com, api.v2.service.example.com, or just example.com, the function consistently returns the same TLD.
However, you might want to clean up domains before extracting the TLD, especially when your data has inconsistent URL formats.
Using cutWWW() before topLevelDomain()
While topLevelDomain() handles www prefixes correctly on its own, you can chain it with cutWWW() for cleaner domain processing:
SELECT
topLevelDomain(cutWWW('www.example.com')) AS clean_tld,
domain(cutWWW('www.example.com')) AS clean_domain;
-- Returns: com, example.com
This approach becomes useful when you're building materialized views or APIs that require consistent domain formatting across your data pipeline.
Using domain() for the full domain
Sometimes you want the main domain rather than just the TLD. The domain() function extracts the domain portion while topLevelDomain() gives you only the extension:
SELECT
domain('https://blog.example.com/post') AS full_domain,
topLevelDomain('https://blog.example.com/post') AS just_tld;
-- Returns: blog.example.com, com
Dealing with multi-part TLDs and edge cases
The topLevelDomain() function handles most common TLD formats, but it has limitations with complex multi-part TLDs and internationalized domains. Understanding where it falls short helps you build more robust URL processing pipelines.
SELECT
topLevelDomain('example.co.uk') AS uk_domain,
topLevelDomain('test.com.au') AS au_domain,
topLevelDomain('192.168.1.1') AS ip_address;
-- Returns: uk, au, (empty string)
Limitations with complex TLDs
The function extracts only the final part of multi-segment TLDs. For example.co.uk, you get uk rather than co.uk. This behavior might not match your expectations if you're doing geographic analysis based on country-specific TLD patterns.
- Geographic analysis:
.co.ukdomains appear asuk, losing the commercial designation - IP addresses: The function returns empty strings for IP addresses since they lack traditional TLD structure
- Malformed URLs: Invalid or incomplete URLs return empty strings rather than errors
Working with internationalized domains
Internationalized domain names (IDNs) and punycode domains can produce unexpected results. The function handles basic ASCII domains reliably but may not process non-Latin characters as expected.
For more accurate TLD parsing with complex cases, you might need to use regex patterns or external domain classification data alongside the built-in function.
topLevelDomain() vs domain() and other URL functions
ClickHouse offers several URL functions that serve different purposes. Choosing the right function depends on whether you want the full domain, just the TLD, or other URL components.
| Function | Input: https://blog.example.com/post | Output |
|---|---|---|
topLevelDomain() | Extract TLD only | com |
domain() | Extract full domain | blog.example.com |
cutWWW() | Remove www prefix | https://blog.example.com/post |
protocol() | Extract protocol | https |
When to use each function
Choose topLevelDomain() when you're categorizing domains by extension type or building TLD-based filters. Use domain() when you want the complete domain for grouping or analysis, and protocol() for security analysis or URL scheme validation.
The performance characteristics are similar across URL functions since they all parse the same string, though topLevelDomain() does slightly less work by stopping at the TLD extraction.
Performance tips for large URL datasets
When processing millions of URLs, how you structure your queries and tables can significantly impact performance. The topLevelDomain() function itself runs fast, but you can optimize further through smart table design.
Pre-computing TLD values during data ingestion often outperforms extracting them at query time, especially for frequently accessed analytics queries.
Exposing domain analytics through Tinybird APIs
Tinybird enables you to publish ClickHouse queries as REST APIs optimized for analytics workloads, abstracting infrastructure management.
Creating a Tinybird data source
First, create a data source to store web analytics data that includes URLs for TLD analysis. Here's an example url_analytics.datasource file:
SCHEMA >
`timestamp` DateTime,
`user_id` String,
`session_id` String,
`url` String,
`referrer` String,
`user_agent` String
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"
This schema captures essential web analytics data with URLs that we can analyze using topLevelDomain(). Deploy the data source:
tb --cloud deploy
Next, populate it with sample web analytics events using the Tinybird Events API:
# Stream web analytics events with diverse URL patterns
curl -X POST \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
-H "Content-Type: application/json" \
"$TINYBIRD_HOST/v0/events?name=url_analytics" \
-d '[
{"timestamp": "2024-01-01 12:30:45", "user_id": "user_001", "session_id": "sess_001", "url": "https://www.example.com/products", "referrer": "https://google.com", "user_agent": "Chrome/118.0"},
{"timestamp": "2024-01-01 12:31:12", "user_id": "user_002", "session_id": "sess_002", "url": "https://blog.github.io/posts/1", "referrer": "direct", "user_agent": "Safari/16.0"},
{"timestamp": "2024-01-01 12:31:45", "user_id": "user_003", "session_id": "sess_003", "url": "https://shop.amazon.co.uk/books", "referrer": "https://bing.com", "user_agent": "Firefox/119.0"},
{"timestamp": "2024-01-01 12:32:18", "user_id": "user_004", "session_id": "sess_004", "url": "https://docs.tinybird.co/getting-started", "referrer": "https://twitter.com", "user_agent": "Chrome/118.0"},
{"timestamp": "2024-01-01 12:32:51", "user_id": "user_005", "session_id": "sess_005", "url": "https://news.bbc.com.au/technology", "referrer": "direct", "user_agent": "Safari/16.0"}
]'
This sample data includes various TLD patterns (.com, .io, .co.uk, .com.au) that demonstrate how topLevelDomain() handles different domain structures in real web analytics scenarios.
Creating a Tinybird pipe
Save this query as a .pipe file to create an API endpoint:
TOKEN "read_token" READ
DESCRIPTION >
Get TLD distribution for a specific time range
NODE endpoint
DESCRIPTION >
Returns top domains by TLD with visit counts
SQL >
%
SELECT
topLevelDomain(url) AS tld,
count() AS visits,
uniq(user_id) AS unique_visitors
FROM url_analytics
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp < {{DateTime(end_date, '2024-01-02 00:00:00')}}
AND topLevelDomain(url) != ''
GROUP BY tld
ORDER BY visits DESC
LIMIT {{Int32(limit, 10)}}
Deploying the API
Deploy your pipe to create the API endpoint:
tb --cloud deploy
Tinybird generates a REST endpoint that accepts start_date, end_date, and limit query parameters, returning JSON results. Sub-second latency can be achieved for many analytical queries, depending on query complexity and data volume.
Ready to build fast web analytics with zero ops?
Managing ClickHouse infrastructure for web analytics can become complex as your data grows. Tinybird provides a managed ClickHouse platform that reduces operational complexity, handles deployments, and exposes your queries as production-ready APIs.
Tinybird's developer-focused analytics platform abstracts infrastructure management and enables rapid API development on top of ClickHouse. You can sign up for free to use Tinybird and deploy your first web analytics API in a few minutes.
Additional Resources
- ClickHouse topLevelDomain() function docs
- Tinybird Web Analytics Starter Kit for building real-time analytics APIs and dashboards over web clickstream data
- Tinybird vs. ClickHouse: What's the difference?
FAQs about extracting top-level domains in ClickHouse
Does topLevelDomain() work with IP addresses?
The function returns an empty string for IP addresses since they don't follow traditional domain name structure. IP addresses like 192.168.1.1 or 10.0.0.1 don't have TLDs in the conventional sense.
What happens when the URL column contains NULL values?
The function returns NULL when the input URL is NULL. Use coalesce(topLevelDomain(url), 'unknown') or similar conditional logic if you want to handle NULL values explicitly in your queries.
Can I extract country-specific TLD information?
The function returns only the final TLD segment, so .co.uk becomes uk and .com.au becomes au. For full country-code TLD analysis, you might want additional logic to reconstruct the complete TLD or use external domain classification data.
