When you're analyzing web traffic, referrer data, or any dataset containing URLs, you'll often need to extract just the hostname portion by stripping away protocols, paths, and query parameters to focus on the domain itself. ClickHouse provides the domain() function specifically for this purpose, turning messy URLs into clean, groupable hostnames.
This guide covers the domain() function syntax, edge cases with malformed URLs, performance optimization techniques, and how to build real-time hostname extraction APIs using managed ClickHouse platforms like Tinybird.
How the domain function extracts the hostname
The domain() function in ClickHouse pulls the hostname straight out of any URL string you throw at it. It strips away everything else—the protocol, the path, query parameters—and gives you just the domain portion.
SELECT domain('https://www.example.com/path?query=value');
-- Returns: www.example.com
You'll find this particularly useful when analyzing web logs or referrer data where you want to group by the source domain rather than dealing with full URLs.
Argument type
The function expects a single String argument containing your URL. ClickHouse doesn't care if the URL follows standard formatting or uses custom protocols—it'll parse what it can.
SELECT domain('ftp://files.example.com/downloads/');
-- Returns: files.example.com
Return type
What comes back is a clean String with just the hostname. Here's what gets stripped away during the extraction process:
- Protocol removal:
https://disappears entirely - Path elimination:
/path/to/pagegets cut off - Query parameter stripping:
?param=valuevanishes - Fragment removal:
#sectiongets dropped
SELECT domain('https://api.example.com:8080/v1/users?limit=10#results');
-- Returns: api.example.com:8080
Notice how port numbers stick around—that's intentional and often what you want for distinguishing different services.
Syntax and return type of domain
The complete function signature is straightforward: domain(url). That's it. The URL parameter is a String, and you get a String back with the extracted hostname.
SELECT
domain('https://www.tinybird.co/blog') AS with_www,
domain('http://localhost:3000') AS with_port,
domain('example.com/page') AS without_protocol;
Basic usage example
Here's how the function handles different URL formats in a single query:
SELECT
url,
domain(url) AS extracted_domain
FROM (
SELECT arrayJoin([
'https://www.google.com/search',
'http://localhost:8080/api',
'ftp://files.company.com',
'api.service.com/v2/endpoint'
]) AS url
);
This demonstrates the function's consistent behavior across various URL schemes and formats. Even URLs missing protocols get parsed correctly, though you'll want to be careful with malformed data.
Handling Protocol-Less, Port, and IPv6 URLs
Real-world URL data rarely comes in perfect formats. The domain() function handles common edge cases, though some scenarios require extra attention.
Missing protocol
URLs without http:// or https:// still work fine with the domain function. ClickHouse assumes the first part before any slash is the hostname.
SELECT
domain('example.com/path') AS no_protocol,
domain('subdomain.example.com') AS subdomain_only;
-- Returns: example.com, subdomain.example.com
However, completely malformed URLs might give you unexpected results. If your data source has inconsistent URL formatting, consider cleaning it first.
URLs with ports
Port numbers become part of the extracted hostname, which is usually what you want for distinguishing different services on the same host.
SELECT
domain('https://api.service.com:443/endpoint') AS https_port,
domain('http://localhost:8080/dev') AS local_port;
-- Returns: api.service.com:443, localhost:8080
IPv6 literals
IPv6 addresses in URLs use bracket notation, and the domain() function preserves this format in the output.
SELECT domain('http://[2001:db8::1]:8080/path') AS ipv6_domain;
-- Returns: [2001:db8::1]:8080
Removing the www prefix with domainWithoutWWW
Sometimes you want to normalize domains by stripping the common www prefix. ClickHouse provides domainWithoutWWW() for exactly this purpose.
| Input URL | domain() | domainWithoutWWW() |
|---|---|---|
| https://www.example.com | <www.example.com> | example.com |
| https://blog.example.com | blog.example.com | blog.example.com |
| https://example.com | example.com | example.com |
SELECT
url,
domain(url) AS with_www,
domainWithoutWWW(url) AS without_www
FROM web_analytics
WHERE url LIKE '%www.%'
LIMIT 5;
This becomes particularly useful for analytics where you want to treat www.example.com and example.com as the same domain.
Comparing domain, hostName, and other URL helpers
ClickHouse offers several functions that might seem similar but serve completely different purposes. Getting these mixed up is a common source of confusion.
domain
The domain() function extracts hostnames from URL strings in your data. This is what you'll use for parsing web logs, referrer data, or any dataset containing URLs.
hostName
The hostName() function returns the hostname of the actual ClickHouse server, not anything from your data. It takes no parameters and is used for system monitoring.
SELECT hostName(); -- Returns the ClickHouse server's hostname
SELECT domain('https://example.com'); -- Returns 'example.com' from the URL
cutURLParameter
The cutURLParameter() function removes specific query parameters while keeping the rest of the URL intact, including the domain.
SELECT cutURLParameter('https://example.com?utm_source=email&id=123', 'utm_source');
-- Returns: https://example.com?id=123
Persisting computed hostnames in a materialized view
When you're processing high volumes of URLs, repeatedly computing domain extraction gets expensive. Materialized views let you extract and store hostnames once, then query the pre-computed results.
Creating the view
A materialized view can automatically extract domains from incoming URL data and store them efficiently:
CREATE MATERIALIZED VIEW web_traffic_domains
ENGINE = MergeTree()
ORDER BY (domain, timestamp)
AS SELECT
timestamp,
domain(url) AS domain,
domainWithoutWWW(url) AS normalized_domain,
count() AS page_views
FROM web_traffic
GROUP BY timestamp, domain, normalized_domain;
This approach works particularly well when you have predictable query patterns that always group by domain.
Backfilling Historical Data
You can populate the materialized view with existing data using an INSERT INTO SELECT pattern:
INSERT INTO web_traffic_domains
SELECT
timestamp,
domain(url) AS domain,
domainWithoutWWW(url) AS normalized_domain,
count() AS page_views
FROM web_traffic
WHERE timestamp >= '2024-01-01'
GROUP BY timestamp, domain, normalized_domain;
Query performance and indexing tips
Domain extraction at scale requires some optimization techniques to keep queries fast. Here are the most effective approaches we've found.
Using LOWCARDINALITY
Domain names repeat frequently in web analytics data, making them perfect candidates for the LOWCARDINALITY type:
CREATE TABLE web_analytics (
timestamp DateTime,
url String,
domain LowCardinality(String) MATERIALIZED domain(url),
user_id UInt64
) ENGINE = MergeTree()
ORDER BY (timestamp, domain);
This optimization reduces memory usage and speeds up aggregations on domain data. The trade-off is slightly higher insertion overhead, but it's usually worth it for frequently queried columns.
Pushing function into index
Creating indexes on domain expressions enables faster filtering when you frequently query by hostname:
ALTER TABLE web_analytics
ADD INDEX domain_idx domain(url) TYPE bloom_filter GRANULARITY 1;
Be mindful that function-based indexes add overhead during data insertion, so only create them for columns you filter on regularly.
Building a real-time web analytics API With Tinybird
Tinybird provides a managed ClickHouse environment that allows users to expose domain extraction through REST APIs, without requiring direct infrastructure management.
Create the data source
First, define a data source for URL ingestion using Tinybird's .datasource syntax:
SCHEMA >
`timestamp` DateTime,
`url` String,
`user_id` String,
`session_id` String
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"
This schema accommodates typical web analytics data while optimizing for time-series queries.
Write the pipe
Create a .pipe file that processes URLs and extracts hostnames with parameterized filtering:
NODE hostname_analysis
SQL >%
SELECT
domain(url) AS hostname,
domainWithoutWWW(url) AS normalized_hostname,
count() AS visits,
uniq(user_id) AS unique_users,
max(timestamp) AS last_seen
FROM web_analytics
WHERE timestamp >= now() - INTERVAL {{String(hours, '24')}} HOUR
{\% if defined(domain_filter) %}
AND domain(url) LIKE {{String(domain_filter, '%')}}
{\% end %}
GROUP BY hostname, normalized_hostname
ORDER BY visits DESC
LIMIT {{Int32(limit, 100)}}
Publish the endpoint
Deploy your pipe as an API endpoint using Tinybird's command-line interface:
tb --cloud deploy
Once deployed, you can query your hostname API with curl:
curl "https://api.tinybird.co/v0/pipes/hostname_analysis.json?token=YOUR_TOKEN&hours=6&limit=50"
The API returns real-time web analytics with sub-second latency. Tinybird abstracts operational complexity of ClickHouse and exposes analytics via REST API endpoints.
Common mistakes when parsing domains
Several pitfalls commonly occur when working with domain extraction in production environments. Being aware of them helps you build more robust data pipelines.
Missing http:// scheme
URLs without proper schemes can produce unexpected parsing results. The domain() function assumes the first component before a slash is the hostname, which might not always be correct:
- Complete URLs work fine:
domain('www.example.com')returns'www.example.com' - Paths with domains work:
domain('example.com/path')returns'example.com' - Relative paths fail:
domain('/relative/path')returns empty string - Malformed data is unpredictable: Random strings produce inconsistent results
Consider data validation before applying domain extraction to ensure consistent results across your dataset.
Non-ASCII characters
Internationalized domain names (IDN) contain non-ASCII characters that require special encoding. ClickHouse handles these reasonably well, but be aware of potential encoding issues:
SELECT domain('https://münchen.example.com/page');
-- May return punycode representation or preserve Unicode depending on input
For applications serving international users, consider how you want to handle and display domain names with special characters.
Start building ClickHouse-based APIs with Tinybird
Rather than managing ClickHouse infrastructure yourself, Tinybird provides a managed platform that exposes most core ClickHouse SQL functions, including URL functions.
You can build and deploy real-time analytics APIs with sub-second query latency and built-in security. Tinybird is designed for scalability, but 'automatic scaling' may not be guaranteed in all contexts.
Whether you're processing web analytics, building user-facing dashboards, or creating internal monitoring tools, Tinybird's developer-centric approach means you can focus on your application logic rather than database administration. Skip the infrastructure work and ship your first API in minutes.
FAQs about extracting hostnames in ClickHouse
How does the domain function handle uppercase letters in URLs?
The domain function preserves the original case of the hostname as provided in the input URL string. ClickHouse does not automatically convert hostnames to lowercase during extraction.
Does the domain function validate if the extracted hostname is a real domain?
No. The domain function performs string parsing only and does not validate whether the extracted hostname corresponds to an actual registered domain or reachable server.
Can I extract hostnames from array columns containing multiple URLs?
Yes. Use arrayMap(domain, url_array) to apply the domain function to each URL in an array column, returning an array of extracted hostnames.
