Name: Tinybird
Brand: Tinybird
Rating: 5.0 (10 reviews)

When you're analyzing web traffic, referrer data, or any dataset containing URLs, you'll often need to extract just the hostname portion by stripping away protocols, paths, and query parameters to focus on the domain itself. ClickHouse® provides the domain() function specifically for this purpose, turning messy URLs into clean, groupable hostnames.

This guide covers the domain() function syntax, edge cases with malformed URLs, performance optimization techniques, and how to build real-time hostname extraction APIs using managed ClickHouse® platforms like Tinybird.

How the `domain` function extracts the hostname

The domain() function in ClickHouse® pulls the hostname straight out of any URL string you throw at it. It strips away everything else—the protocol, the path, query parameters—and gives you just the domain portion.

SELECT domain('https://www.example.com/path?query=value');
-- Returns: www.example.com

You'll find this particularly useful when analyzing web logs or referrer data where you want to group by the source domain rather than dealing with full URLs.

Argument type

The function expects a single String argument containing your URL. ClickHouse® doesn't care if the URL follows standard formatting or uses custom protocols—it'll parse what it can.

SELECT domain('ftp://files.example.com/downloads/');
-- Returns: files.example.com

Return type

What comes back is a clean String with just the hostname. Here's what gets stripped away during the extraction process:

Protocol removal: https:// disappears entirely
Path elimination: /path/to/page gets cut off
Query parameter stripping: ?param=value vanishes
Fragment removal: #section gets dropped

SELECT domain('https://api.example.com:8080/v1/users?limit=10#results');
-- Returns: api.example.com:8080

Notice how port numbers stick around—that's intentional and often what you want for distinguishing different services.

Syntax and return type of `domain`

The complete function signature is straightforward: domain(url). That's it. The URL parameter is a String, and you get a String back with the extracted hostname.

SELECT
    domain('https://www.tinybird.co/blog') AS with_www,
    domain('http://localhost:3000') AS with_port,
    domain('example.com/page') AS without_protocol;

Basic usage example

Here's how the function handles different URL formats in a single query:

SELECT
    url,
    domain(url) AS extracted_domain
FROM (
    SELECT arrayJoin([
        'https://www.google.com/search',
        'http://localhost:8080/api',
        'ftp://files.company.com',
        'api.service.com/v2/endpoint'
    ]) AS url
);

This demonstrates the function's consistent behavior across various URL schemes and formats. Even URLs missing protocols get parsed correctly, though you'll want to be careful with malformed data.

Handling Protocol-Less, Port, and IPv6 URLs

Real-world URL data rarely comes in perfect formats. The domain() function handles common edge cases, though some scenarios require extra attention.

Missing protocol

URLs without http:// or https:// still work fine with the domain function. ClickHouse® assumes the first part before any slash is the hostname.

SELECT
    domain('example.com/path') AS no_protocol,
    domain('subdomain.example.com') AS subdomain_only;
-- Returns: example.com, subdomain.example.com

However, completely malformed URLs might give you unexpected results. If your data source has inconsistent URL formatting, consider cleaning it first.

URLs with ports

Port numbers become part of the extracted hostname, which is usually what you want for distinguishing different services on the same host.

SELECT
    domain('https://api.service.com:443/endpoint') AS https_port,
    domain('http://localhost:8080/dev') AS local_port;
-- Returns: api.service.com:443, localhost:8080

IPv6 literals

IPv6 addresses in URLs use bracket notation, and the domain() function preserves this format in the output.

SELECT domain('http://[2001:db8::1]:8080/path') AS ipv6_domain;
-- Returns: [2001:db8::1]:8080

Removing the `www` prefix with `domainWithoutWWW`

Sometimes you want to normalize domains by stripping the common www prefix. ClickHouse® provides domainWithoutWWW() for exactly this purpose.

Input URL	`domain()`	`domainWithoutWWW()`
https://www.example.com	<www.example.com>	example.com
https://blog.example.com	blog.example.com	blog.example.com
https://example.com	example.com	example.com

SELECT
    url,
    domain(url) AS with_www,
    domainWithoutWWW(url) AS without_www
FROM web_analytics
WHERE url LIKE '%www.%'
LIMIT 5;

This becomes particularly useful for analytics where you want to treat www.example.com and example.com as the same domain.

Comparing `domain`, `hostName`, and other URL helpers

ClickHouse® offers several functions that might seem similar but serve completely different purposes. Getting these mixed up is a common source of confusion.

`domain`

The domain() function extracts hostnames from URL strings in your data. This is what you'll use for parsing web logs, referrer data, or any dataset containing URLs.

`hostName`

The hostName() function returns the hostname of the actual ClickHouse® server, not anything from your data. It takes no parameters and is used for system monitoring.

SELECT hostName(); -- Returns the ClickHouse® server's hostname
SELECT domain('https://example.com'); -- Returns 'example.com' from the URL

`cutURLParameter`

The cutURLParameter() function removes specific query parameters while keeping the rest of the URL intact, including the domain.

SELECT cutURLParameter('https://example.com?utm_source=email&id=123', 'utm_source');
-- Returns: https://example.com?id=123

Persisting computed hostnames in a materialized view

When you're processing high volumes of URLs, repeatedly computing domain extraction gets expensive. Materialized views let you extract and store hostnames once, then query the pre-computed results.

Creating the view

A materialized view can automatically extract domains from incoming URL data and store them efficiently:

CREATE MATERIALIZED VIEW web_traffic_domains
ENGINE = MergeTree()
ORDER BY (domain, timestamp)
AS SELECT
    timestamp,
    domain(url) AS domain,
    domainWithoutWWW(url) AS normalized_domain,
    count() AS page_views
FROM web_traffic
GROUP BY timestamp, domain, normalized_domain;

This approach works particularly well when you have predictable query patterns that always group by domain.

Backfilling Historical Data

You can populate the materialized view with existing data using an INSERT INTO SELECT pattern:

INSERT INTO web_traffic_domains
SELECT
    timestamp,
    domain(url) AS domain,
    domainWithoutWWW(url) AS normalized_domain,
    count() AS page_views
FROM web_traffic
WHERE timestamp >= '2024-01-01'
GROUP BY timestamp, domain, normalized_domain;

Query performance and indexing tips

Domain extraction at scale requires some optimization techniques to keep queries fast. Here are the most effective approaches we've found.

Using `LOWCARDINALITY`

Domain names repeat frequently in web analytics data, making them perfect candidates for the LOWCARDINALITY type:

CREATE TABLE web_analytics (
    timestamp DateTime,
    url String,
    domain LowCardinality(String) MATERIALIZED domain(url),
    user_id UInt64
) ENGINE = MergeTree()
ORDER BY (timestamp, domain);

This optimization reduces memory usage and speeds up aggregations on domain data. The trade-off is slightly higher insertion overhead, but it's usually worth it for frequently queried columns.

Pushing function into index

Creating indexes on domain expressions enables faster filtering when you frequently query by hostname:

ALTER TABLE web_analytics
ADD INDEX domain_idx domain(url) TYPE bloom_filter GRANULARITY 1;

Be mindful that function-based indexes add overhead during data insertion, so only create them for columns you filter on regularly.

Building a real-time web analytics API With Tinybird

Tinybird provides a managed ClickHouse® environment that allows users to expose domain extraction through REST APIs, without requiring direct infrastructure management.

Create the data source

First, define a data source for URL ingestion using Tinybird's .datasource syntax:

SCHEMA >
    `timestamp` DateTime,
    `url` String,
    `user_id` String,
    `session_id` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"

This schema accommodates typical web analytics data while optimizing for time-series queries.

Write the pipe

Create a .pipe file that processes URLs and extracts hostnames with parameterized filtering:

NODE hostname_analysis
SQL >%
    SELECT
        domain(url) AS hostname,
        domainWithoutWWW(url) AS normalized_hostname,
        count() AS visits,
        uniq(user_id) AS unique_users,
        max(timestamp) AS last_seen
    FROM web_analytics
    WHERE timestamp >= now() - INTERVAL {{String(hours, '24')}} HOUR
        {\% if defined(domain_filter) %}
        AND domain(url) LIKE {{String(domain_filter, '%')}}
        {\% end %}
    GROUP BY hostname, normalized_hostname
    ORDER BY visits DESC
    LIMIT {{Int32(limit, 100)}}

Publish the endpoint

Deploy your pipe as an API endpoint using Tinybird's command-line interface:

tb --cloud deploy

Once deployed, you can query your hostname API with curl:

curl "https://api.tinybird.co/v0/pipes/hostname_analysis.json?token=YOUR_TOKEN&hours=6&limit=50"

The API returns real-time web analytics with sub-second latency. Tinybird abstracts operational complexity of ClickHouse® and exposes analytics via REST API endpoints.

Common mistakes when parsing domains

Several pitfalls commonly occur when working with domain extraction in production environments. Being aware of them helps you build more robust data pipelines.

Missing `http://` scheme

URLs without proper schemes can produce unexpected parsing results. The domain() function assumes the first component before a slash is the hostname, which might not always be correct:

Complete URLs work fine: domain('www.example.com') returns 'www.example.com'
Paths with domains work: domain('example.com/path') returns 'example.com'
Relative paths fail: domain('/relative/path') returns empty string
Malformed data is unpredictable: Random strings produce inconsistent results

Consider data validation before applying domain extraction to ensure consistent results across your dataset.

Non-ASCII characters

Internationalized domain names (IDN) contain non-ASCII characters that require special encoding. ClickHouse® handles these reasonably well, but be aware of potential encoding issues:

SELECT domain('https://münchen.example.com/page');
-- May return punycode representation or preserve Unicode depending on input

For applications serving international users, consider how you want to handle and display domain names with special characters.

Start building ClickHouse®-based APIs with Tinybird

Rather than managing ClickHouse® infrastructure yourself, Tinybird provides a managed platform that exposes most core ClickHouse® SQL functions, including URL functions.

You can build and deploy real-time analytics APIs with sub-second query latency and built-in security. Tinybird is designed for scalability, but 'automatic scaling' may not be guaranteed in all contexts.

Whether you're processing web analytics, building user-facing dashboards, or creating internal monitoring tools, Tinybird's developer-centric approach means you can focus on your application logic rather than database administration. Skip the infrastructure work and ship your first API in minutes.

FAQs about extracting hostnames in ClickHouse®

How does the `domain` function handle uppercase letters in URLs?

The domain function preserves the original case of the hostname as provided in the input URL string. ClickHouse® does not automatically convert hostnames to lowercase during extraction.

Does the `domain` function validate if the extracted hostname is a real domain?

No. The domain function performs string parsing only and does not validate whether the extracted hostname corresponds to an actual registered domain or reachable server.

Can I extract hostnames from array columns containing multiple URLs?

Yes. Use arrayMap(domain, url_array) to apply the domain function to each URL in an array column, returning an array of extracted hostnames.

Skip the infra work. Deploy your first ClickHouse^®
project now.

Our Columns:

Skip the infra work. Deploy your first ClickHouse^®
project now.

Skip the infra work. Deploy your first ClickHouse^®
project now.

How to get the hostname from a URL in ClickHouse^®

How the `domain` function extracts the hostname

Argument type

Return type

Syntax and return type of `domain`

Basic usage example

Handling Protocol-Less, Port, and IPv6 URLs

Missing protocol

URLs with ports

IPv6 literals

Removing the `www` prefix with `domainWithoutWWW`

Comparing `domain`, `hostName`, and other URL helpers

`domain`

`hostName`

`cutURLParameter`

Persisting computed hostnames in a materialized view

Creating the view

Backfilling Historical Data

Query performance and indexing tips

Using `LOWCARDINALITY`

Pushing function into index

Building a real-time web analytics API With Tinybird

Create the data source

Write the pipe

Publish the endpoint

Common mistakes when parsing domains

Missing `http://` scheme

Non-ASCII characters

Start building ClickHouse®-based APIs with Tinybird

FAQs about extracting hostnames in ClickHouse®

How does the `domain` function handle uppercase letters in URLs?

Does the `domain` function validate if the extracted hostname is a real domain?

Can I extract hostnames from array columns containing multiple URLs?

Skip the infra work. Deploy your first ClickHouse^®
project now.

Skip the infra work. Deploy your first ClickHouse® project now.

Our Columns:

Skip the infra work. Deploy your first ClickHouse® project now.

Skip the infra work. Deploy your first ClickHouse® project now.

How to get the hostname from a URL in ClickHouse®

How the domain function extracts the hostname

Argument type

Return type

Syntax and return type of domain

Basic usage example

Handling Protocol-Less, Port, and IPv6 URLs

Missing protocol

URLs with ports

IPv6 literals

Removing the www prefix with domainWithoutWWW

Comparing domain, hostName, and other URL helpers

domain

hostName

cutURLParameter

Persisting computed hostnames in a materialized view

Creating the view

Backfilling Historical Data

Query performance and indexing tips

Using LOWCARDINALITY

Pushing function into index

Building a real-time web analytics API With Tinybird

Create the data source

Write the pipe

Publish the endpoint

Common mistakes when parsing domains

Missing http:// scheme

Non-ASCII characters

Start building ClickHouse®-based APIs with Tinybird

FAQs about extracting hostnames in ClickHouse®

How does the domain function handle uppercase letters in URLs?

Does the domain function validate if the extracted hostname is a real domain?

Can I extract hostnames from array columns containing multiple URLs?

Skip the infra work. Deploy your first ClickHouse® project now.

Skip the infra work. Deploy your first ClickHouse^®
project now.

Skip the infra work. Deploy your first ClickHouse^®
project now.

Skip the infra work. Deploy your first ClickHouse^®
project now.

How to get the hostname from a URL in ClickHouse^®

How the `domain` function extracts the hostname

Syntax and return type of `domain`

Removing the `www` prefix with `domainWithoutWWW`

Comparing `domain`, `hostName`, and other URL helpers

`domain`

`hostName`

`cutURLParameter`

Using `LOWCARDINALITY`

Missing `http://` scheme

How does the `domain` function handle uppercase letters in URLs?

Does the `domain` function validate if the extracted hostname is a real domain?

Skip the infra work. Deploy your first ClickHouse^®
project now.