---
title: "How to extract RFC-compliant hostnames from URLs in ClickHouse®"
excerpt: "Learn how to extract RFC-compliant hostnames from URLs in ClickHouse® using domainRFC(), handle edge cases, optimize performance, and build production APIs."
authors: "Cameron Archer"
categories: "AI Resources"
createdOn: "2025-09-30 17:18:12"
publishedOn: "2025-10-03 10:30:00"
updatedOn: "2025-10-03 10:30:00"
status: "published"
---

When you're processing millions of URLs in ClickHouse®, the difference between getting `example.com` and an empty string can break your entire analytics pipeline. The `domainRFC()` function extracts RFC 3986-compliant hostnames from URLs, handling edge cases that the simpler `domain()` function can't parse correctly.

This guide covers the syntax differences between ClickHouse®'s hostname extraction functions, performance optimization strategies for large datasets, and how to build production-ready APIs that process URL data reliably.

## What makes a hostname RFC-compliant in ClickHouse®

In ClickHouse®, the `domainRFC` function extracts hostnames from URLs following the RFC 3986 standard. This function handles complex URL structures that the simpler `domain` function can't parse correctly.

RFC 3986 defines how URLs work (everything from the protocol (`https://`) to the hostname (`example.com`) to the path (`/api/data`)). When you're dealing with URLs that contain user credentials, ports, or unusual formatting, RFC compliance becomes important for getting consistent results.

Here's what I mean: if you have a URL like `https://user:password@api.example.com:8080/data`, the RFC-compliant approach correctly identifies `api.example.com` as the hostname while ignoring the authentication details and port number.

## The domainRFC function versus domain

ClickHouse® gives you two options for hostname extraction, but they behave very differently. The `domain()` function prioritizes speed over accuracy, while `domainRFC()` follows web standards even when URLs get messy.

Let me show you exactly what happens with a tricky URL:

```sql
SELECT
    domain('http://user:password@example.com:8080/path') AS domain_result,
    domainRFC('http://user:password@example.com:8080/path') AS rfc_result;
```

| domain_result | rfc_result |
|---------------|------------|
| (empty string)| example.com|

The `domain()` function gives up when it sees user credentials in the URL. Meanwhile, `domainRFC()` correctly extracts the hostname you actually want. This difference matters when you're processing real-world data where URLs come from browsers, APIs, and log files that don't always follow perfect formatting.

## Syntax and minimal query to extract hostnames

The `domainRFC()` function takes one parameter—the URL string—and returns the hostname as a string.

### 1. Run a one-line select

```sql
SELECT domainRFC('https://clickhouse.com/docs/sql-reference/') AS hostname;
```

This returns `clickhouse.com`. The function automatically strips away the protocol, path, query parameters, and everything else that isn't the hostname.

### 2. Cast result and check nullability

The function returns a `String` type, but you'll get an empty string when the URL is malformed. You can check for valid results like this:

```sql
SELECT
    url,
    domainRFC(url) AS hostname,
    hostname != '' AS is_valid
FROM my_table;
```

When ClickHouse® can't find a valid hostname in your URL, `domainRFC()` returns an empty string rather than throwing an error that would break your query.

## Edge cases you handle in production

Real URL data gets messy fast. Web logs contain everything from IPv6 addresses to internationalized domains to completely broken URLs that somehow still need processing.

### IPv6 literals

URLs can contain IPv6 addresses wrapped in square brackets:

```sql
SELECT domainRFC('http://[2001:db8::1]:8080/path') AS ipv6_host;
```

The function returns `[2001:db8::1]`, keeping the bracket notation that distinguishes IPv6 addresses from regular hostnames.

### Internationalized domain names

Unicode domain names get processed according to RFC standards:

```sql
SELECT domainRFC('https://例え.テスト/path') AS unicode_domain;
```

The function handles internationalized domain names (IDNs) by returning the Unicode representation as it appears in the original URL.

### URLs with ports

Port numbers get stripped automatically:

```sql
SELECT domainRFC('https://api.example.com:443/v1/data') AS hostname_no_port;
```

This returns `api.example.com`, removing the `:443` port specification that's redundant for HTTPS URLs anyway.

- **Missing schemes:** URLs without `http://` or `https://` may return empty strings
- **Invalid characters:** Special characters get handled according to RFC encoding rules
- **Empty inputs:** Null or empty strings return empty results without breaking your query

## Performance tactics for large tables

When you're processing millions of URLs, hostname extraction can slow down your queries significantly. However, a few optimization strategies can make a significant difference in performance.

### 1. Create a materialized column

Pre-compute hostnames when data arrives instead of parsing URLs repeatedly:

```sql
ALTER TABLE url_logs
ADD COLUMN hostname String MATERIALIZED domainRFC(url);
```

This approach calculates the hostname once during insertion and stores it permanently. Queries against the `hostname` column run much faster than calling `domainRFC()` every time you run analytics.

### 2. Use projections for rollups

Create projections that pre-aggregate data by hostname:

```sql
ALTER TABLE url_logs
ADD PROJECTION hostname_stats (
    SELECT hostname, count(), sum(bytes)
    GROUP BY hostname
);
```

Projections work like [materialized views](https://www.tinybird.co/docs/forward/work-with-data/optimize/materialized-views) that ClickHouse® maintains automatically. When you query traffic patterns by domain, ClickHouse® uses the pre-computed aggregations instead of scanning the entire table.

### 3. Apply low-cardinality dictionaries

When your data contains many repeated hostnames, use `LowCardinality` encoding:

```sql
CREATE TABLE url_logs (
    timestamp DateTime,
    url String,
    hostname LowCardinality(String) MATERIALIZED domainRFC(url)
) ENGINE = MergeTree()
ORDER BY timestamp;
```

This optimization works best when you have thousands of URLs but only hundreds of unique hostnames—a common pattern in web analytics where users visit many pages on the same domains.

### 4. Avoid repeated parsing in joins

Extract hostnames once and reuse the result throughout complex queries:

```sql
WITH extracted AS (
    SELECT url, domainRFC(url) AS hostname
    FROM url_logs
    WHERE timestamp >= yesterday()
)
SELECT
    e.hostname,
    count(*) AS requests,
    d.category
FROM extracted e
JOIN domain_categories d ON e.hostname = d.hostname
GROUP BY e.hostname, d.category;
```

This pattern prevents calling `domainRFC()` multiple times for the same URLs within a single query, which can add up to significant time savings on large datasets.

## Building ClickHouse®-based web and URL analytics APIs with Tinybird

Let's create a complete example that demonstrates how to build a real-time API for hostname extraction using [Tinybird's](https://www.tinybird.co) managed ClickHouse® platform.

This walkthrough shows how to ingest web analytics data, process URLs with ClickHouse® functions, and expose hostname analytics through a real-time API endpoint.

### 1. Create the data source

First, create a data source to store web analytics events:

```tinybird
SCHEMA >
    `timestamp` DateTime64(3) `json:$.timestamp`,
    `url` String `json:$.url`,
    `user_agent` String `json:$.user_agent`,
    `status_code` UInt16 `json:$.status_code`,
    `user_id` String `json:$.user_id`

ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
ENGINE_PARTITION_KEY toYYYYMM(timestamp)
```

Now build the project locally:

```bash
tb dev
```

This builds the project on [Tinybird Local](https://www.tinybird.co/docs/forward/install-tinybird/local) and creates your data source. Then ingest some sample data using the Events API:

```bash
curl -X POST "http://localhost:7181/v0/events?name=web_analytics" \
  -H "Content-Type: application/json" \
  -d '{"timestamp": "2025-10-03 14:30:15.123", "url": "https://api.example.com:8080/v1/data?param=value", "user_agent": "Mozilla/5.0", "status_code": 200, "user_id": "user123"}
{"timestamp": "2025-10-03 14:30:16.456", "url": "https://www.shop.example.com/products?category=electronics", "user_agent": "Chrome/91.0", "status_code": 200, "user_id": "user456"}
{"timestamp": "2025-10-03 14:30:17.789", "url": "https://user:pass@secure.example.com:8443/admin", "user_agent": "Firefox/89.0", "status_code": 401, "user_id": "user789"}'
```

### 2. Create the pipe files

Create a pipe that processes URL data and provides analytics:

```tinybird
NODE web_events
SQL >
    SELECT
        timestamp,
        url,
        user_agent,
        status_code,
        user_id,
        domainRFC(url) AS hostname,
        path(url) AS url_path,
        protocol(url) AS scheme,
        queryString(url) AS query_params
    FROM web_analytics
    WHERE timestamp >= now64(3) - INTERVAL 1 HOUR

NODE hostname_analytics
SQL >
    %
    SELECT
        hostname,
        scheme,
        COUNT(*) AS request_count,
        COUNT(DISTINCT user_id) AS unique_users,
        COUNTIf(status_code >= 400) AS error_count,
        MIN(timestamp) AS first_seen,
        MAX(timestamp) AS last_seen
    FROM web_events
    WHERE
        {\% if defined(hostname_filter) %}
            hostname LIKE {{ String(hostname_filter) }}
        {\% end %}
        {\% if defined(hours_back) %}
            AND timestamp >= now64(3) - INTERVAL {{ Int32(hours_back) }} HOUR
        {\% end %}
    GROUP BY hostname, scheme
    ORDER BY request_count DESC
    LIMIT {{ Int32(limit, 100) }}

TYPE endpoint
```

### 3. Deploy and test the API

Deploy your project to Tinybird Cloud:

```bash
tb --cloud deploy
```

Then test your API endpoint:

```bash
# Get all hostname analytics from the last hour
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/hostname_analytics.json"

# Filter by hostname pattern
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/hostname_analytics.json?hostname_filter=%example.com"

# Filter by time range
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/hostname_analytics.json?hours_back=2"
```

### 4. Sample API response

The API returns structured data with hostname analytics:

```json
{
  "meta": [
    {
      "name": "hostname",
      "type": "String"
    },
    {
      "name": "scheme",
      "type": "String"
    },
    {
      "name": "request_count",
      "type": "UInt64"
    },
    {
      "name": "unique_users",
      "type": "UInt64"
    },
    {
      "name": "error_count",
      "type": "UInt64"
    },
    {
      "name": "first_seen",
      "type": "DateTime64(3)"
    },
    {
      "name": "last_seen",
      "type": "DateTime64(3)"
    }
  ],
  "data": [
    {
      "hostname": "api.example.com",
      "scheme": "https",
      "request_count": 1247,
      "unique_users": 892,
      "error_count": 23,
      "first_seen": "2025-10-03 14:30:15.123",
      "last_seen": "2025-10-03 15:45:22.789"
    },
    {
      "hostname": "shop.example.com",
      "scheme": "https",
      "request_count": 456,
      "unique_users": 234,
      "error_count": 5,
      "first_seen": "2025-10-03 14:30:16.456",
      "last_seen": "2025-10-03 15:30:18.345"
    }
  ],
  "rows": 2,
  "statistics": {
    "elapsed": 0.045,
    "rows_read": 2000,
    "bytes_read": 89000
  }
}
```

## Common mistakes and how to debug them

Hostname extraction looks simple until you encounter real-world data complexity. Understanding common pitfalls helps you build more robust URL processing pipelines.

### Choosing domain instead of domainRFC

The `domain()` function fails silently with many URL formats, returning empty strings where you'd expect valid hostnames:

```sql
SELECT
    url,
    domain(url) AS domain_result,
    domainRFC(url) AS rfc_result
FROM (
    SELECT 'ftp://user:pass@files.example.com/data' AS url
    UNION ALL
    SELECT 'https://api.example.com:8443/v1'
)
WHERE domain_result != rfc_result;
```

Always use `domainRFC()` unless you've specifically tested `domain()` against your complete dataset and confirmed it handles all your URL patterns correctly.

### Ignoring null results on malformed URLs

Malformed URLs produce empty strings that can break downstream processing:

```sql
SELECT
    url,
    domainRFC(url) AS hostname,
    CASE
        WHEN hostname = '' THEN 'invalid_url'
        ELSE hostname
    END AS processed_hostname
FROM url_table
WHERE url IS NOT NULL;
```

This pattern handles invalid URLs explicitly rather than letting empty strings propagate through your analysis and create confusing results.

### Parsing inside a subquery without limit

Nested hostname extraction without proper limits can cause performance issues:

```sql
-- Avoid this pattern
SELECT hostname, count(*)
FROM (
    SELECT domainRFC(url) AS hostname
    FROM large_url_table  -- millions of rows
    WHERE hostname IN (
        SELECT domainRFC(competitor_url)
        FROM competitor_urls  -- parsed repeatedly
    )
)
GROUP BY hostname;
```

Instead, extract hostnames once and join on the results, or use materialized columns to avoid repeated parsing overhead.

## Related URL parsing functions to know

ClickHouse® provides a complete suite of URL manipulation functions that work alongside `domainRFC()` for comprehensive URL analysis.

### path

Extract URL path components after the hostname:

```sql
SELECT
    domainRFC(url) AS hostname,
    path(url) AS url_path
FROM url_logs
WHERE url_path LIKE '/api/%';
```

The `path()` function returns everything between the hostname and query string, including the leading slash.

### protocol

Get scheme information to understand connection types:

```sql
SELECT
    protocol(url) AS scheme,
    domainRFC(url) AS hostname,
    count(*) AS requests
FROM url_logs
GROUP BY scheme, hostname
ORDER BY requests DESC;
```

This helps analyze traffic patterns across HTTP, HTTPS, FTP, and other protocol types in your data.

### queryString

Parse query parameters for detailed request analysis:

```sql
SELECT
    domainRFC(url) AS hostname,
    queryString(url) AS params,
    count(*) AS requests
FROM url_logs
WHERE params != ''
GROUP BY hostname, params;
```

Query strings contain the parameters passed after the `?` character in URLs, which can be valuable for understanding user behavior patterns.

### topLevelDomain

Extract TLD information for geographic or organizational analysis:

```sql
SELECT
    topLevelDomain(domainRFC(url)) AS tld,
    count(*) AS requests
FROM url_logs
GROUP BY tld
ORDER BY requests DESC;
```

This function works on the output of `domainRFC()` to identify domains by their top-level domain suffix like `.com`, `.org`, or country codes.

| Function | Purpose | Example Input | Example Output |
| -------- | ------- | ------------- | -------------- |
| `domainRFC()` | Extract hostname | `https://api.example.com:443/v1?key=value` | `api.example.com` |
| `path()` | Extract path | `https://api.example.com:443/v1?key=value` | `/v1` |
| `protocol()` | Extract scheme | `https://api.example.com:443/v1?key=value` | `https` |
| `queryString()` | Extract parameters | `https://api.example.com:443/v1?key=value` | `key=value` |
| `topLevelDomain()` | Extract TLD | `api.example.com` | `com` |

## Using Tinybird as a managed ClickHouse®

Managing ClickHouse® infrastructure for hostname extraction involves complexity—from optimizing storage and query performance to handling URL edge cases and maintaining consistent data quality across distributed systems.

Tinybird's [managed ClickHouse® service](https://www.tinybird.co/clickhouse) reduces infrastructure overhead by providing a managed ClickHouse® environment that abstracts optimization, scaling, and maintenance. Tinybird exposes ClickHouse® features such as `domainRFC()` and URL parsing without requiring direct infrastructure management, including database clusters, performance monitoring, or version upgrades.

This enables teams to focus on application development rather than database operations or configuring ClickHouse® for production workloads. Tinybird's platform supports developer efficiency through managed scaling and workflow integrations.

[Sign up for a free Tinybird account](https://cloud.tinybird.co/signup) to build and test your first ClickHouse®-based API in just a few minutes.

## Additional resources

- [domainRFC function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#domainrfc)
- [domain function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#domain)
- [path function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#path)
- [protocol function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#protocol)
- [queryString function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#querystring)
- [topLevelDomain function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#topleveldomain)
- [URL functions overview](https://clickhouse.com/docs/en/sql-reference/functions/url-functions)
- [How to extract domain without www in ClickHouse®](https://www.tinybird.co/blog-posts/clickhouse-extract-domain-without-www)
- [Extract URL query string in ClickHouse®](https://www.tinybird.co/blog-posts/extract-url-query-string-clickhouse)
- [How to build URL hierarchy in ClickHouse®](https://www.tinybird.co/blog-posts/how-to-build-url-hierarchy-clickhouse)

## FAQs about extracting RFC-compliant hostnames in ClickHouse®

### Which ClickHouse® version introduced domainRFC?

The `domainRFC()` function was added in ClickHouse® version 21.4 as an improvement over the original `domain()` function. Earlier versions only had the less reliable `domain()` function available.

### How do I strip subdomains like www from the extracted hostname?

Use `domainWithoutWWW(domainRFC(url))` to remove common www prefixes from the extracted hostname. This function specifically handles the www subdomain while preserving other subdomain structures.

### Does domainRFC handle ftp or custom protocol schemes?

Yes, `domainRFC()` extracts hostnames from any valid URL regardless of the protocol scheme used. It works with HTTP, HTTPS, FTP, SSH, and custom protocols as long as the URL follows standard formatting.

### Is domainRFC deterministic for materialized views?

Yes, `domainRFC()` produces consistent results for the same input, making it safe for materialized view calculations. The function's deterministic behavior ensures that materialized columns and projections update correctly during data ingestion.
