---
title: "How to extract the first significant subdomain from URLs in ClickHouse®"
excerpt: "Learn how to use ClickHouse®'s cutToFirstSignificantSubdomain function to extract core domains from URLs, with syntax examples, performance tips, and real-world use cases."
authors: "Cameron Archer"
categories: "AI Resources"
createdOn: "2025-09-30 17:25:43"
publishedOn: "2025-09-30 17:25:43"
updatedOn: "2025-09-30 17:25:43"
status: "published"
---

When you're analyzing web traffic or building domain-based analytics in ClickHouse{% sup %}®{% /sup %}, raw URLs like `news.bbc.co.uk` and `www.bbc.co.uk` may represent the same core domain but can get aggregated separately. The `cutToFirstSignificantSubdomain()` function solves this by extracting the registrable domain – the core organizational identifier that strips away prefixes while respecting complex TLD structures.

This guide covers the function's syntax and behavior, practical implementation patterns, and how to build real-time APIs that normalize domains at scale.

## What is the first significant subdomain?

The first significant subdomain is the portion of a host from the "first significant subdomain" onward: in practice, the registrable domain, meaning the organization's registered label plus its public suffix. This strips away non-essential prefixes like `www`, `m`, or `api` while preserving the organization's core domain identity.

Think of it this way: when you see `news.bbc.co.uk`, the first significant subdomain is `bbc.co.uk`. The function ignores the `news` part because it's just a subdomain, but keeps `bbc.co.uk` because that's where the actual domain registration happens.

Here's how it works with different URL patterns:

- `support.google.co.jp` becomes `google.co.jp`
- `m.example.com` becomes `example.com`
- `api.internal.company.org` becomes `company.org`

The function handles edge cases automatically. IPv4 and IPv6 addresses return empty strings since they're not domains, and port numbers get stripped from the result entirely.

## When should you use cutToFirstSignificantSubdomain?

Devs building web analytics views use this function when they want to group metrics by organization rather than tracking every single subdomain separately. Instead of having separate entries for `blog.company.com`, `shop.company.com`, and `support.company.com`, you get unified analytics under `company.com`.
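For instance, a traffic rollup by organization is a single `GROUP BY` away. The `events` table and its `url` and `user_id` columns here are illustrative, not from any real schema:

```sql
SELECT
    cutToFirstSignificantSubdomain(url) AS org_domain,
    count() AS hits,
    uniq(user_id) AS visitors
FROM events
WHERE org_domain != ''  -- drops IPs, localhost, and malformed URLs
GROUP BY org_domain
ORDER BY hits DESC
LIMIT 20;
```

Every subdomain of `company.com` rolls up into a single `org_domain` row, which is usually what a top-domains dashboard wants.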

Content categorization becomes much cleaner too. Rather than maintaining complex rules for every possible subdomain combination, you can normalize sources to a single domain identity for deduplication and classification work.

The function exists because maintaining your own public suffix list is a nightmare, while building regex patterns for domain parsing often produces wrong results. ClickHouse{% sup %}®{% /sup %}'s built-in function handles the complexity of multi-level TLDs and international domains without you having to think about it.

You'll find this particularly useful when building dashboards, attribution models, or any system where consistent domain normalization matters more than preserving exact subdomain details.

## Syntax of cutToFirstSignificantSubdomain

The function signature is straightforward: `cutToFirstSignificantSubdomain(URL: String) → String`. It takes any string containing a URL or host and returns the normalized domain as a string.

The URL parameter accepts full URLs with schemes, bare hostnames, or even malformed strings. Only properly formatted URLs produce meaningful results though – garbage input returns an empty string rather than throwing an error.

This makes the function safe to use in aggregations and joins without worrying about null handling or error catching.

## Quick example query

Here's how the function behaves with different types of input:

```sql
SELECT cutToFirstSignificantSubdomain('https://news.bbc.co.uk/world') AS result;
-- Result: 'bbc.co.uk'

SELECT cutToFirstSignificantSubdomain('http://www.github.com/docs') AS result;
-- Result: 'github.com'

SELECT cutToFirstSignificantSubdomain('https://store.google.co.jp/cart') AS result;
-- Result: 'google.co.jp'
```

And some edge cases to show robustness:

```sql
SELECT cutToFirstSignificantSubdomain('http://localhost:8080') AS result;
-- Result: '' (localhost isn't a public domain)

SELECT cutToFirstSignificantSubdomain('not a url') AS result;
-- Result: '' (invalid input)

SELECT cutToFirstSignificantSubdomain('https://192.168.1.1/path') AS result;
-- Result: '' (IP addresses return empty)
```

The function also handles internationalized domains. The host is processed exactly as provided, with labels split on dots, so domains containing non-ASCII characters work correctly.

```sql
SELECT cutToFirstSignificantSubdomain('https://bücher.example.de/path') AS result;
-- Result: 'example.de'

SELECT cutToFirstSignificantSubdomain('https://新闻.example.com/news') AS result;
-- Result: 'example.com'
```

## Handling multi-level TLDs and IDNs

### .co.uk and other public suffixes

The function understands complex TLD structures: by default it applies a built-in heuristic for common multi-level suffixes, and ClickHouse{% sup %}®{% /sup %} also offers a `cutToFirstSignificantSubdomainCustom()` variant that accepts a user-defined TLD list for full control. This means domains like `.co.uk`, `.com.au`, and `.gov.uk` are properly recognized as single units rather than being split incorrectly.

For `a.b.example.co.uk`, the function correctly returns `example.co.uk` because it understands that `.co.uk` is a single public suffix. Without this knowledge, naive parsing might incorrectly return `b.example.co.uk` or fail entirely.

Government and educational domains benefit particularly from this handling since they often use multi-level TLD structures that trip up simple string splitting approaches.
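You can verify this behavior with a quick query:

```sql
SELECT
    cutToFirstSignificantSubdomain('https://a.b.example.co.uk/path') AS uk_domain,
    cutToFirstSignificantSubdomain('https://shop.example.com.au/cart') AS au_domain;
-- uk_domain: 'example.co.uk', au_domain: 'example.com.au'
```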

### Internationalized domain names

Non-ASCII domains are handled by operating on the host exactly as you supply it: labels are split on dots, so Unicode characters pass through the extraction logic unchanged.

When processing a URL like `https://bücher.de`, the result comes back in the same form as the input. If you need ASCII-normalized (punycode) output for consistency across systems, recent ClickHouse{% sup %}®{% /sup %} versions provide `idnaEncode()` to convert the hostname before extraction.

Unicode in URL paths or query parameters doesn't affect domain extraction – the function focuses solely on the host portion of the URL.
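If you do want punycode-normalized output, for example to join against an ASCII-only reference table, one approach is to encode the host before extraction. Note that `idnaEncode()` is only available in recent ClickHouse versions, so treat this as a sketch:

```sql
SELECT cutToFirstSignificantSubdomain(
    concat('https://', idnaEncode('bücher.de'), '/path')
) AS result;
-- Result: 'xn--bcher-kva.de'
```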

## Performance benchmarks vs regex parsing

### CPU usage comparison

ClickHouse{% sup %}®{% /sup %}'s native implementation significantly outperforms regex-based alternatives because it avoids the backtracking and state management overhead that complex regular expressions require. The C++ implementation processes domains using optimized string operations rather than pattern matching.

Vectorized execution means the function can process entire columns efficiently in parallel, while regex solutions typically evaluate row-by-row with much higher per-operation overhead. This difference becomes dramatic when processing millions of URLs.

### Memory allocation comparison

The built-in function minimizes temporary allocations by reusing internal buffers and avoiding the capture groups that regex engines create. Regular expressions for URL parsing often require multiple allocation cycles for backreferences and intermediate results.

Lower memory pressure translates to more stable performance under concurrent load, since there's less allocator contention and fewer cache misses from scattered allocations.

### Throughput under concurrency

When multiple queries run simultaneously, the native function scales naturally with ClickHouse{% sup %}®{% /sup %}'s parallel execution model. Complex regex patterns often become bottlenecks under concurrency due to their computational complexity and inability to leverage vectorization.

In practice, the native function typically delivers several times the throughput of equivalent regex solutions, especially when processing wide tables with many URL columns.
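To see the gap on your own hardware, you can time both approaches over synthetic data in `clickhouse-client` and compare the reported elapsed times. The regex below is a deliberately naive "last two host labels" pattern, which also illustrates the correctness problem:

```sql
-- Native, suffix-aware extraction: counts 1,000 distinct registrable domains
SELECT count(DISTINCT cutToFirstSignificantSubdomain(url)) AS domains
FROM (
    SELECT concat('https://sub.example', toString(number % 1000), '.co.uk/p') AS url
    FROM numbers(10000000)
);

-- Naive regex: grabs the last two labels, collapsing everything to 'co.uk'
SELECT count(DISTINCT extract(url, '([^./:]+\\.[^./:]+)/')) AS domains
FROM (
    SELECT concat('https://sub.example', toString(number % 1000), '.co.uk/p') AS url
    FROM numbers(10000000)
);
```

Beyond the timing difference, the second query returns a single bogus "domain", which is exactly the kind of silent error hand-rolled parsing invites.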

## Persisting the parsed domain in a materialized view

### 1. Create table with raw URL

Start by creating a table to store your raw URL data:

```sql
CREATE TABLE events_raw
(
    ts DateTime,
    url String,
    user_id UInt64
)
ENGINE = MergeTree
ORDER BY ts;
```

This table captures the incoming data exactly as received, without any preprocessing. The simple structure makes it fast for inserts while providing a foundation for downstream transformations.

### 2. Define materialized view with function

Create a target table to store the processed results:

```sql
CREATE TABLE events_domains
(
    ts DateTime,
    url String,
    first_domain String,
    user_id UInt64
)
ENGINE = MergeTree
ORDER BY (ts, first_domain);
```

Then create the [materialized view](https://tinybird.co/docs/forward/work-with-data/optimize/materialized-views) that automatically processes new data:

```sql
CREATE MATERIALIZED VIEW mv_events_domains
TO events_domains
AS
SELECT
    ts,
    url,
    cutToFirstSignificantSubdomain(url) AS first_domain,
    user_id
FROM events_raw;
```

The materialized view triggers automatically on each insert to `events_raw`, extracting the first significant subdomain and storing it alongside the original data.

### 3. Verify results

Test your setup with sample data:

```sql
INSERT INTO events_raw VALUES
    (now(), 'https://news.bbc.co.uk/world', 12345),
    (now(), 'http://www.github.com/docs', 67890),
    (now(), 'http://localhost:8080/test', 11111);

SELECT ts, url, first_domain, user_id
FROM events_domains
ORDER BY ts
LIMIT 10;
```

You'll see results with `bbc.co.uk` for the BBC news URL, `github.com` for the GitHub documentation, and an empty string for localhost since it's not a public domain. This approach gives you both the raw data for debugging and the processed domain for fast analytics queries.

## Alternatives and related URL functions

### domain

The `domain()` function returns the complete host component of a URL, including all subdomains. For `https://news.bbc.co.uk/path`, it returns `news.bbc.co.uk` rather than the normalized `bbc.co.uk`.

Use `domain()` when you need fine-grained subdomain analysis or want to preserve the exact host for logging and debugging purposes. It's particularly useful for security analysis where the specific subdomain matters.

### topLevelDomain

The `topLevelDomain()` function extracts only the final label of a host, the top-level domain in the strict sense. For `https://news.bbc.co.uk`, it returns `uk`; unlike `cutToFirstSignificantSubdomain()`, it is not public-suffix aware, so it won't return `co.uk`.

This function works well for geographic inference based on country-code TLDs or understanding the distribution of domain types in your dataset.

### cutWWW

The `cutWWW()` function provides lighter normalization by removing at most one leading `www.` from the URL's host. For `https://www.github.com/docs`, it returns `https://github.com/docs` without considering public suffix rules.

You can combine functions for custom normalization: `domainWithoutWWW(url)` gives you subdomain preservation with only the vanity `www.` trimmed, while `cutToFirstSignificantSubdomain(url)` provides full public-suffix-aware normalization.
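Seeing these side by side makes the differences concrete (using ClickHouse's built-in `domain()` and `topLevelDomain()` for the host and TLD):

```sql
SELECT
    domain(url) AS full_host,
    topLevelDomain(url) AS tld,
    cutToFirstSignificantSubdomain(url) AS registrable
FROM (SELECT 'https://news.bbc.co.uk/world' AS url);
-- full_host: 'news.bbc.co.uk', tld: 'uk', registrable: 'bbc.co.uk'
```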

## Example: Building ClickHouse{% sup %}®{% /sup %}-based conversion funnel analytics APIs with Tinybird

A common e-commerce analytics pattern built on ClickHouse{% sup %}®{% /sup %} involves tracking user conversion funnels in multi-tenant environments across different domains and subdomains. Using `cutToFirstSignificantSubdomain()`, you can normalize domains to analyze conversion rates by organization rather than getting fragmented metrics across subdomains.

Here's how to build a complete funnel analytics API that tracks pageviews → cart additions → checkouts across normalized domains using Tinybird's [managed ClickHouse{% sup %}®{% /sup %} platform](https://www.tinybird.co/clickhouse).

### Step 1: Create a data source for e-commerce events

First, create a data source to store your e-commerce tracking events. Create an `ecommerce_events.datasource` file:

```tinybird
SCHEMA >
    `timestamp` DateTime64(3, 'UTC') `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `user_id` String `json:$.user_id`,
    `event_type` LowCardinality(String) `json:$.event_type`,
    `page_url` String `json:$.page_url`,
    `referrer_url` String `json:$.referrer_url`,
    `product_id` String `json:$.product_id`,
    `product_category` LowCardinality(String) `json:$.product_category`,
    `revenue` Float64 `json:$.revenue`,
    `user_agent` String `json:$.user_agent`,
    `country` LowCardinality(String) `json:$.country`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id"
```

### Step 2: Create a materialized view for funnel aggregation

Create a target table to store pre-aggregated funnel metrics by normalized domain. Create a `funnel_metrics.datasource`:

```tinybird
SCHEMA >
    `normalized_domain` String,
    `event_date` Date,
    `product_category` LowCardinality(String),
    `country` LowCardinality(String),
    `pageviews` UInt64,
    `cart_additions` UInt64,
    `checkouts` UInt64,
    `total_revenue` Float64,
    `unique_sessions` UInt64,
    `unique_users` UInt64

ENGINE "SummingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(event_date)"
ENGINE_SORTING_KEY "event_date, normalized_domain, product_category, country"
```

Create a materialized view pipe to aggregate funnel events. Create `funnel_aggregation.pipe`:

```tinybird
NODE funnel_aggregation
SQL >
    SELECT
        cutToFirstSignificantSubdomain(page_url) AS normalized_domain,
        toDate(timestamp) AS event_date,
        product_category,
        country,
        countIf(event_type = 'pageview') AS pageviews,
        countIf(event_type = 'add_to_cart') AS cart_additions,
        countIf(event_type = 'checkout') AS checkouts,
        sumIf(revenue, event_type = 'checkout') AS total_revenue,
        uniq(session_id) AS unique_sessions,
        uniq(user_id) AS unique_users
    FROM ecommerce_events
    WHERE normalized_domain != ''
    GROUP BY
        normalized_domain,
        event_date,
        product_category,
        country

TYPE MATERIALIZED
DATASOURCE funnel_metrics
```

### Step 3: Create conversion funnel analytics APIs

Create a pipe for detailed funnel analysis. Create `conversion_funnel_api.pipe`:

```tinybird
NODE funnel_analysis
SQL >
    %
    SELECT
        normalized_domain,
        product_category,
        country,
        sum(pageviews) AS total_pageviews,
        sum(cart_additions) AS total_cart_additions,
        sum(checkouts) AS total_checkouts,
        sum(total_revenue) AS total_revenue,
        sum(unique_sessions) AS total_sessions,
        sum(unique_users) AS total_users,
        -- Conversion rates
        round((sum(cart_additions) * 100.0) / nullIf(sum(pageviews), 0), 2) AS pageview_to_cart_rate,
        round((sum(checkouts) * 100.0) / nullIf(sum(cart_additions), 0), 2) AS cart_to_checkout_rate,
        round((sum(checkouts) * 100.0) / nullIf(sum(pageviews), 0), 2) AS overall_conversion_rate,
        -- Revenue metrics
        round(sum(total_revenue) / nullIf(sum(checkouts), 0), 2) AS average_order_value,
        round(sum(total_revenue) / nullIf(sum(unique_users), 0), 2) AS revenue_per_user
    FROM funnel_metrics
    WHERE event_date >= {{Date(start_date, '2024-12-01')}}
      AND event_date <= {{Date(end_date, '2024-12-31')}}
    {\% if defined(domain_filter) %}
      AND normalized_domain = {{String(domain_filter)}}
    {\% end %}
    {\% if defined(category_filter) %}
      AND product_category = {{String(category_filter)}}
    {\% end %}
    {\% if defined(country_filter) %}
      AND country = {{String(country_filter)}}
    {\% end %}
    GROUP BY normalized_domain, product_category, country
    ORDER BY total_revenue DESC
    {\% if defined(limit) %}
    LIMIT {{Int32(limit, 50)}}
    {\% end %}

TYPE ENDPOINT
```

Create a daily funnel trends API. Create `daily_funnel_trends.pipe`:

```tinybird
NODE daily_trends
SQL >
    %
    SELECT
        event_date,
        normalized_domain,
        sum(pageviews) AS daily_pageviews,
        sum(cart_additions) AS daily_cart_additions,
        sum(checkouts) AS daily_checkouts,
        sum(total_revenue) AS daily_revenue,
        -- Daily conversion rates
        round((sum(cart_additions) * 100.0) / nullIf(sum(pageviews), 0), 2) AS daily_pageview_to_cart_rate,
        round((sum(checkouts) * 100.0) / nullIf(sum(cart_additions), 0), 2) AS daily_cart_to_checkout_rate,
        round((sum(checkouts) * 100.0) / nullIf(sum(pageviews), 0), 2) AS daily_overall_conversion_rate,
        -- Running totals for trend analysis
        sum(sum(pageviews)) OVER (
            PARTITION BY normalized_domain
            ORDER BY event_date
            ROWS UNBOUNDED PRECEDING
        ) AS cumulative_pageviews,
        sum(sum(checkouts)) OVER (
            PARTITION BY normalized_domain
            ORDER BY event_date
            ROWS UNBOUNDED PRECEDING
        ) AS cumulative_checkouts
    FROM funnel_metrics
    WHERE event_date >= {{Date(start_date, '2024-12-01')}}
      AND event_date <= {{Date(end_date, '2024-12-31')}}
    {\% if defined(domain_filter) %}
      AND normalized_domain = {{String(domain_filter)}}
    {\% end %}
    GROUP BY event_date, normalized_domain
    ORDER BY event_date DESC, daily_revenue DESC
    {\% if defined(limit) %}
    LIMIT {{Int32(limit, 100)}}
    {\% end %}

TYPE ENDPOINT
```

### Step 4: Test locally with sample data

Start the local Tinybird development server:

```bash
tb dev
```

This starts a local Tinybird instance at `http://localhost:7181`. Your data sources and pipes are automatically deployed to the local environment.

Ingest sample e-commerce funnel data using the local [Events API](https://www.tinybird.co/docs/forward/get-data-in/events-api):

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN" \
  "http://localhost:7181/v0/events?name=ecommerce_events" \
  -d '[
    {"timestamp": "2024-12-01 10:30:00", "session_id": "sess_001", "user_id": "user_123", "event_type": "pageview", "page_url": "https://shop.example.com/products", "referrer_url": "https://www.google.com", "product_id": "", "product_category": "", "revenue": 0, "user_agent": "Mozilla/5.0", "country": "US"},
    {"timestamp": "2024-12-01 10:32:00", "session_id": "sess_001", "user_id": "user_123", "event_type": "add_to_cart", "page_url": "https://shop.example.com/products/laptop", "referrer_url": "https://shop.example.com/products", "product_id": "laptop_001", "product_category": "Electronics", "revenue": 0, "user_agent": "Mozilla/5.0", "country": "US"},
    {"timestamp": "2024-12-01 10:35:00", "session_id": "sess_001", "user_id": "user_123", "event_type": "checkout", "page_url": "https://checkout.example.com/complete", "referrer_url": "https://shop.example.com/cart", "product_id": "laptop_001", "product_category": "Electronics", "revenue": 1299.99, "user_agent": "Mozilla/5.0", "country": "US"},
    {"timestamp": "2024-12-01 11:15:00", "session_id": "sess_002", "user_id": "user_456", "event_type": "pageview", "page_url": "https://m.example.com/products", "referrer_url": "https://facebook.com", "product_id": "", "product_category": "", "revenue": 0, "user_agent": "Mobile Safari", "country": "CA"},
    {"timestamp": "2024-12-01 11:18:00", "session_id": "sess_002", "user_id": "user_456", "event_type": "add_to_cart", "page_url": "https://m.example.com/products/phone", "referrer_url": "https://m.example.com/products", "product_id": "phone_002", "product_category": "Electronics", "revenue": 0, "user_agent": "Mobile Safari", "country": "CA"}
  ]'
```

Test your funnel analytics APIs locally:

```bash
# Test the main funnel analysis API
curl "http://localhost:7181/v0/pipes/conversion_funnel_api.json?start_date=2024-12-01&end_date=2024-12-31"

# Test with domain filtering
curl "http://localhost:7181/v0/pipes/conversion_funnel_api.json?domain_filter=example.com&category_filter=Electronics"

# Test daily trends API
curl "http://localhost:7181/v0/pipes/daily_funnel_trends.json?domain_filter=example.com&start_date=2024-12-01&end_date=2024-12-07"
```

You should see the domain normalization in action: `shop.example.com`, `m.example.com`, and `checkout.example.com` URLs all get normalized to `example.com` in the results.

### Step 5: Deploy to production

Once you've tested your funnel analytics locally and verified the domain normalization works correctly, deploy to your Tinybird cloud workspace:

```bash
tb --cloud deploy
```

Your APIs are now available in production. Query them using your cloud workspace URLs:

```bash
# Production API calls (set $TINYBIRD_HOST and $TINYBIRD_TOKEN for your workspace)
curl "$TINYBIRD_HOST/v0/pipes/conversion_funnel_api.json?start_date=2024-12-01&end_date=2024-12-31" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"

curl "$TINYBIRD_HOST/v0/pipes/daily_funnel_trends.json?domain_filter=example.com" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"
```

The APIs return comprehensive funnel analytics with domain normalization:

```json
{
    "meta": [
        {"name": "normalized_domain", "type": "String"},
        {"name": "product_category", "type": "String"},
        {"name": "country", "type": "String"},
        {"name": "total_pageviews", "type": "UInt64"},
        {"name": "total_cart_additions", "type": "UInt64"},
        {"name": "total_checkouts", "type": "UInt64"},
        {"name": "total_revenue", "type": "Float64"},
        {"name": "total_sessions", "type": "UInt64"},
        {"name": "total_users", "type": "UInt64"},
        {"name": "pageview_to_cart_rate", "type": "Float64"},
        {"name": "cart_to_checkout_rate", "type": "Float64"},
        {"name": "overall_conversion_rate", "type": "Float64"},
        {"name": "average_order_value", "type": "Float64"},
        {"name": "revenue_per_user", "type": "Float64"}
    ],
    "data": [
        {
            "normalized_domain": "example.com",
            "product_category": "Electronics",
            "country": "US",
            "total_pageviews": 1,
            "total_cart_additions": 1,
            "total_checkouts": 1,
            "total_revenue": 1299.99,
            "total_sessions": 1,
            "total_users": 1,
            "pageview_to_cart_rate": 100.0,
            "cart_to_checkout_rate": 100.0,
            "overall_conversion_rate": 100.0,
            "average_order_value": 1299.99,
            "revenue_per_user": 1299.99
        }
    ],
    "rows": 1,
    "statistics": {
        "elapsed": 0.003,
        "rows_read": 5,
        "bytes_read": 640
    }
}
```

This approach provides:

- **Domain normalization** using `cutToFirstSignificantSubdomain()` to unify metrics across subdomains
- **Complete funnel tracking** from pageviews through cart additions to final checkouts
- **Pre-aggregated metrics** via materialized views for sub-millisecond API responses
- **Conversion rate calculations** at each funnel stage with percentage breakdowns
- **Revenue analytics** including average order value and revenue per user
- **Time-series analysis** with daily trends and cumulative metrics
- **Multi-dimensional filtering** by domain, product category, and geographic region

Your e-commerce funnel data becomes available as production-ready APIs that automatically normalize domains while providing detailed conversion insights. [Start building with Tinybird's free plan](https://cloud.tinybird.co/signup) to create your own ClickHouse{% sup %}®{% /sup %}-based web analytics APIs.

## What to do next with ClickHouse{% sup %}®{% /sup %} and Tinybird

Working with URL parsing in production often means dealing with streaming data, real-time transformations, and the need to expose insights via APIs for your app to consume, all while maintaining sub-second query performance. Managing ClickHouse{% sup %}®{% /sup %} infrastructure for these requirements can quickly become complex.

[Tinybird](https://www.tinybird.co) abstracts much of the infrastructure management required for ClickHouse{% sup %}®{% /sup %}, providing broad access to ClickHouse{% sup %}®{% /sup %} functions like `cutToFirstSignificantSubdomain`. You can stream URLs, normalize domains in real-time, and expose the results through low-latency REST APIs, with much of the operational overhead handled by the platform.

The platform includes built-in data ingestion via HTTP or Kafka, automatic schema evolution, and rapid API publishing, allowing you to focus on building features rather than managing database infrastructure.

[You can sign up for free](https://cloud.tinybird.co/signup) to build and test your first ClickHouse{% sup %}®{% /sup %}-backed API in a few minutes.

## Additional resources

- [Functions for Working with URLs](https://clickhouse.com/docs/en/sql-reference/functions/url-functions)
- [String Functions](https://clickhouse.com/docs/en/sql-reference/functions/string-functions)
- [Materialized Views](https://clickhouse.com/docs/en/engines/table-engines/special/materializedview)
- [SummingMergeTree Engine](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/summingmergetree)
- [Real-time analytics API at scale with billions of rows](https://www.tinybird.co/blog-posts/real-time-analytics-with-billion-rows-at-scale)
- [Tinybird vs. ClickHouse{% sup %}®{% /sup %}: What's the difference?](https://www.tinybird.co/blog-posts/tinybird-vs-clickhouse)
- [Web Analytics Dashboard Starter Kit](https://www.tinybird.co/templates/web-analytics-starter-kit)

## FAQs about extracting the first significant subdomain

### Which ClickHouse{% sup %}®{% /sup %} version introduced cutToFirstSignificantSubdomain?

The function has been available in ClickHouse{% sup %}®{% /sup %} for many years as part of the core URL functions library. If you're running any supported stable release, it will be present. You can verify by running `SELECT version()` and checking `system.functions` for the complete list of available functions.
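A quick way to confirm the function exists in your deployment:

```sql
SELECT name
FROM system.functions
WHERE name LIKE '%SignificantSubdomain%';
```

This also lists the related variants, such as `firstSignificantSubdomain` and the custom-suffix-list versions, whose availability depends on your version.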

### What happens if the URL has no scheme or host?

Malformed inputs or strings without a valid host component return an empty string rather than throwing an error. For bare hostnames without schemes, the function attempts to parse them, but including a scheme like `http://` or `https://` ensures more consistent results.
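For example, the same registrable domain comes back whether or not a scheme is present:

```sql
SELECT
    cutToFirstSignificantSubdomain('news.bbc.co.uk') AS bare_host,
    cutToFirstSignificantSubdomain('https://news.bbc.co.uk') AS with_scheme;
-- Both return 'bbc.co.uk'
```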

### Does the function keep or strip port numbers?

Port numbers are automatically stripped from the result. Whether your input is `https://api.example.com:8443/path` or `example.com:3000`, the function returns the clean registrable domain without any port information.
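To illustrate:

```sql
SELECT
    cutToFirstSignificantSubdomain('https://api.example.com:8443/path') AS scheme_and_port,
    cutToFirstSignificantSubdomain('example.com:3000') AS bare_host_port;
-- Both return 'example.com'
```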
