---
title: "How to extract domains without www from URLs in ClickHouse®"
excerpt: "Learn how to use ClickHouse®'s domainWithoutWWW function to extract clean domains from URLs, plus advanced techniques for processing entire tables at scale."
authors: "Cameron Archer"
categories: "AI Resources"
createdOn: "2025-10-03 17:22:12"
publishedOn: "2025-10-03 17:22:12"
updatedOn: "2025-10-03 17:22:12"
status: "published"
---

When you're analyzing web traffic or building domain-level analytics, URLs arrive in frustratingly inconsistent formats, some with www prefixes, others without, making it impossible to group `www.example.com` and `example.com` as the same domain. ClickHouse®'s `domainWithoutWWW` function solves this by extracting clean domain names from any URL format while automatically stripping the www prefix. This guide walks you through the function's syntax and edge cases, shows you how to process entire tables efficiently, and demonstrates how to build real-time domain analytics APIs using scalable infrastructure.

## What `domainWithoutWWW` does and when to use it

The `domainWithoutWWW` function extracts the domain from a URL and automatically removes any leading "www." prefix if present. When you pass `'https://www.example.com/path'` to this function, it returns `'example.com'`, stripping away the protocol, path, query parameters, and the www prefix in one operation.

This function becomes essential when you're working with web analytics data where URLs might inconsistently include or exclude the www prefix. Web traffic often comes from both `www.example.com` and `example.com`, but you want to treat them as the same domain for analytics purposes.

You'll find it particularly useful for grouping page views by domain, normalizing referrer data, or preparing URLs for domain-level aggregations. The function handles URL parsing complexity internally, so you don't have to write custom string manipulation logic to achieve consistent domain extraction across your dataset.

## Syntax and parameters

The function follows a straightforward syntax pattern:

```sql
domainWithoutWWW(url)
```

The single parameter accepts a String type (or Nullable(String) if your data contains null values). You can pass full URLs with protocols, bare hostnames, or even URLs with complex query strings and fragments, and the function intelligently extracts just the domain portion.

Here's how it works with a simple example:

```sql
SELECT domainWithoutWWW('https://www.example.com/path?query=value') AS clean_domain;
-- Returns: 'example.com'
```

## Return type and edge cases

The function returns a String (or Nullable(String) if your input column allows nulls). However, you'll encounter specific behaviors with malformed or unusual inputs that are worth understanding upfront.

When the function can't parse a valid domain from the input, it typically returns an empty string rather than throwing an error. This graceful handling means your queries won't break on messy data, though you might want to filter out empty results depending on your use case.

### Empty Or Null Input

Empty strings and null values behave predictably:

```sql
SELECT domainWithoutWWW('') AS empty_result;
-- Returns: ''

SELECT domainWithoutWWW(NULL) AS null_result;
-- Returns: NULL
```

If you're working with a Nullable(String) column and want to avoid nulls in your results, you can use `coalesce`:

```sql
SELECT coalesce(domainWithoutWWW(url), '') AS clean_domain
FROM your_table;
```

### Internationalized Domains

ClickHouse® URL functions work with ASCII hostnames, so internationalized domains work best when already converted to punycode format:

```sql
SELECT domainWithoutWWW('https://www.xn--e1afmkfd.xn--p1ai') AS punycode_domain;
-- Returns: 'xn--e1afmkfd.xn--p1ai'
```

Unicode hostnames passed directly might produce inconsistent results depending on your ClickHouse® build and locale settings.

### Missing Protocol

The function can extract domains from both full URLs and bare hostnames, though results vary with malformed strings:

```sql
SELECT domainWithoutWWW('www.example.com') AS bare_hostname;
-- Returns: 'example.com'

SELECT domainWithoutWWW('example.com/some/path') AS with_path;
-- Returns: 'example.com'

SELECT domainWithoutWWW('/just/a/path') AS path_only;
-- Returns: ''
```

## Quick examples on single URLs

Let's walk through practical examples that demonstrate the function's behavior across different URL formats you'll encounter in real data.

Standard www URLs get cleaned as expected:

```sql
SELECT domainWithoutWWW('http://www.example.com') AS http_www;
-- Returns: 'example.com'

SELECT domainWithoutWWW('https://www.example.com:8443?q=1') AS https_with_port;
-- Returns: 'example.com'
```

URLs without www remain unchanged (except for protocol and path removal):

```sql
SELECT domainWithoutWWW('https://example.com') AS no_www;
-- Returns: 'example.com'

SELECT domainWithoutWWW('sub.example.com') AS subdomain;
-- Returns: 'sub.example.com'
```

Complex URLs with paths, queries, and fragments still extract cleanly:

- `domainWithoutWWW('https://www.blog.example.com/path/article?id=123#section')` returns `'blog.example.com'`
- `domainWithoutWWW('ftp://www.example.co.uk/files/download')` returns `'example.co.uk'`

Notice how the function preserves subdomains like "blog" while removing only the "www" prefix specifically.

## Processing an entire ClickHouse® table

Moving from single values to processing entire tables unlocks the real power of domain normalization at scale. Let's assume you have a table structure like this:

```sql
CREATE TABLE visits (
    ts DateTime,
    url String,
    user_id UInt64
) ENGINE = MergeTree
ORDER BY ts;
```

### Selecting Clean Domains

Extract and analyze cleaned domains across your dataset:

```sql
SELECT
    domainWithoutWWW(url) AS clean_domain,
    count() AS visits,
    uniq(user_id) AS unique_users
FROM visits
WHERE ts >= now() - INTERVAL 7 DAY
GROUP BY clean_domain
ORDER BY visits DESC
LIMIT 10;
```

You can filter out malformed URLs that produce empty domains:

```sql
SELECT *
FROM visits
WHERE domainWithoutWWW(url) != ''
  AND ts >= today() - 1;
```

### Updating In Place

For tables where you frequently query by clean domain, consider adding a dedicated column:

```sql
ALTER TABLE visits ADD COLUMN clean_domain String DEFAULT '';
ALTER TABLE visits UPDATE clean_domain = domainWithoutWWW(url) WHERE 1;
```

**Performance consideration:** ALTER...UPDATE operations rewrite data and can be expensive on large tables. Schedule during low-traffic periods or consider the materialized column approach instead.

A materialized column automatically computes the clean domain for new data:

```sql
ALTER TABLE visits ADD COLUMN clean_domain String MATERIALIZED domainWithoutWWW(url);
```

### Materialized View Pattern

For high-frequency analytics queries, you can precompute domain aggregations using [materialized views](https://tinybird.co/docs/forward/work-with-data/optimize/materialized-views):

```sql
CREATE TABLE domain_stats (
    clean_domain String,
    visits AggregateFunction(count),
    unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree
ORDER BY clean_domain;

CREATE MATERIALIZED VIEW visits_to_domain_stats
TO domain_stats
AS SELECT
    domainWithoutWWW(url) AS clean_domain,
    countState() AS visits,
    uniqState(user_id) AS unique_users
FROM visits
GROUP BY clean_domain;
```

This pattern precomputes domain-level metrics at insert time, dramatically speeding up dashboard queries that aggregate by domain.

## `domainWithoutWWW` vs `cutWWW` vs `domain`

ClickHouse® provides several related functions for URL and domain processing, each optimized for different use cases:

- **`domainWithoutWWW`:** Extracts domain from full URLs and removes www prefix
- **`cutWWW`:** Removes www prefix from domain strings only
- **`domain`:** Extracts domain from URLs but preserves www

Here's how they differ in practice:

```sql
-- domainWithoutWWW: Full URL processing with www removal
SELECT domainWithoutWWW('https://www.example.com/path') AS result;
-- Returns: 'example.com'

-- cutWWW: String processing for already-extracted domains
SELECT cutWWW('www.example.com') AS result;
-- Returns: 'example.com'

-- domain: Full URL processing but keeps www
SELECT domain('https://www.example.com/path') AS result;
-- Returns: 'www.example.com'
```

Choose `domainWithoutWWW` when you're working with raw URL data and want normalized domain grouping. Use `cutWWW` for post-processing already-extracted domains, and `domain` when the www distinction matters for your analysis.

## Performance considerations at scale

URL parsing involves string operations that can become expensive at scale, especially when processing billions of rows. Understanding performance characteristics helps you architect efficient data pipelines.

### Function Costs In CPU Cycles

String parsing adds computational overhead proportional to your URL complexity and data volume. URLs with long paths, many query parameters, or unusual encoding patterns require more CPU cycles to process.

The function also creates new string objects for each result, which impacts memory allocation patterns and garbage collection in high-throughput scenarios.

### Using Projections Or Views

Precomputing cleaned domains often yields better query performance than on-the-fly parsing:

```sql
ALTER TABLE visits ADD PROJECTION domain_projection (
    SELECT
        domainWithoutWWW(url) AS clean_domain,
        count()
    GROUP BY clean_domain
);
```

Materialized views can aggregate by clean domain during data ingestion, eliminating parsing overhead from analytical queries entirely.

### Benchmark Results

In practice, on-the-fly parsing works well for exploratory queries and datasets under 100M rows. Beyond that scale, or for frequently-accessed dashboard queries, precomputation through materialized columns or views typically provides 2-5x performance improvements.

The exact performance gain depends on your URL complexity, query patterns, and hardware configuration, but the pattern holds consistently across different workloads.

## Building a real-time, ClickHouse®-based web analytics API with Tinybird

Let's create a complete example that ingests web traffic data and exposes an [API endpoint](https://www.tinybird.co/docs/api-reference/pipe-api/api-endpoints) returning domain-level analytics using `domainWithoutWWW`.

This walkthrough shows how domain normalization integrates into real-world analytics workflows, from data ingestion through API deployment. You'll see exactly how the function works in practice.

[Tinybird](https://www.tinybird.co) is a managed service that makes it easy to build real-time analytics APIs powered by ClickHouse®. It handles the infrastructure, scaling, and API generation so you can focus on writing SQL and building analytics features.

In this example, we'll use Tinybird to:

- Create a data source for ingesting web traffic events
- Write a pipe that processes and aggregates the data using `domainWithoutWWW`
- Deploy an API endpoint that serves domain-level analytics

The workflow demonstrates how to combine ClickHouse®'s powerful URL processing with Tinybird's API capabilities to build production-ready analytics services.

### 1. Create the data source

First, create a [data source](https://www.tinybird.co/docs/forward/get-data-in/data-sources) to store your web traffic events:

```tinybird
SCHEMA >
    `timestamp` DateTime64(3, 'UTC') `json:$.timestamp`,
    `url` String `json:$.url`,
    `user_id` UInt32 `json:$.user_id`,
    `session_id` String `json:$.session_id`,
    `page_title` String `json:$.page_title`,
    `referrer` String `json:$.referrer`

ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
ENGINE_PARTITION_KEY toYYYYMM(timestamp)
```

Now build the data source locally:

```bash
tb dev
```

This builds the data source on Tinybird Local (`localhost:7181`), which you'll need to have running. Then ingest some sample data using the Events API:

```bash
curl -X POST "http://localhost:7181/v0/events?name=web_traffic" \
  -H "Content-Type: application/json" \
  -d '{"timestamp": "2025-10-03 14:30:15.123", "url": "https://www.example.com/products", "user_id": 1001, "session_id": "sess_abc123", "page_title": "Products Page", "referrer": "https://www.google.com/search"}
{"timestamp": "2025-10-03 14:30:16.456", "url": "https://blog.example.com/article", "user_id": 1002, "session_id": "sess_def456", "page_title": "Blog Article", "referrer": "https://www.facebook.com"}
{"timestamp": "2025-10-03 14:30:17.789", "url": "https://www.github.com/tinybird", "user_id": 1003, "session_id": "sess_ghi789", "page_title": "GitHub Profile", "referrer": "https://www.twitter.com"}'
```

### 2. Write the pipe query

Create a [pipe](https://www.tinybird.co/docs/forward/work-with-data/pipes) that aggregates traffic by normalized domain:

```tinybird
NODE traffic_events
SQL >
    SELECT
        timestamp,
        domainWithoutWWW(url) AS clean_domain,
        user_id,
        session_id,
        page_title,
        domainWithoutWWW(referrer) AS clean_referrer
    FROM web_traffic
    WHERE domainWithoutWWW(url) != ''
      AND timestamp >= now() - INTERVAL 7 DAY

NODE domain_analytics
SQL >
    %
    SELECT
        clean_domain,
        COUNT(*) AS total_visits,
        COUNT(DISTINCT user_id) AS unique_users,
        COUNT(DISTINCT session_id) AS unique_sessions,
        COUNT(DISTINCT clean_referrer) AS unique_referrers,
        MIN(timestamp) AS first_visit,
        MAX(timestamp) AS last_visit
    FROM traffic_events
    WHERE
        {% if defined(domain) %}
            clean_domain = {{ String(domain) }}
        {% end %}
        {% if defined(date_from) %}
            AND timestamp >= {{ DateTime(date_from) }}
        {% end %}
        {% if defined(date_to) %}
            AND timestamp <= {{ DateTime(date_to) }}
        {% end %}
    GROUP BY clean_domain
    ORDER BY total_visits DESC
    LIMIT {{ Int32(limit, 100) }}

TYPE endpoint
```

The `domainWithoutWWW` function ensures consistent domain grouping regardless of www prefixes in your URLs.

### 3. Deploy and call the endpoint

Deploy your pipe to create the API endpoint:

```bash
tb --cloud deploy
```

Then call your new endpoint:

```bash
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/domain_analytics.json?domain=example.com&limit=5"
```

The response returns neatly aggregated data with normalized domains:

```json
{
  ...,
  "data": [
    {
      "clean_domain": "example.com",
      "total_visits": 1247,
      "unique_users": 892,
      "unique_sessions": 1156,
      "unique_referrers": 23,
      "first_visit": "2025-10-03 14:30:15",
      "last_visit": "2025-10-03 15:45:22"
    }
  ],
  "rows": 1,
  "statistics": {
    "elapsed": 0.023,
    "rows_read": 1000,
    "bytes_read": 45000
  }
}
```

## Next steps with ClickHouse® and Tinybird

ClickHouse® is greating for normalizing URL data to build clean domain-level analytics. For large datasets, ClickHouse® allows you to precompute clean domains through materialized columns or views to optimize query performance.

When you're ready to operationalize your ClickHouse®-based analytics workflows into APIs, Tinybird enables you to deploy production APIs over your data transformations and manages the operational complexity of the underlying infrastructure.

[Get started with a free Tinybird account](https://cloud.tinybird.co/signup) to create real-time APIs over your domain analysis data.

## FAQs about extracting domains in ClickHouse®

### Does `domainWithoutWWW` handle multiple subdomains like blog.www.example.com?

The function only strips a leading "www." prefix, leaving other subdomains intact. So `domainWithoutWWW('https://blog.www.example.com/article')` returns `'blog.example.com'`, removing the www but preserving the blog subdomain.

### Is the `domainWithoutWWW` function available in older ClickHouse® versions?

Yes, `domainWithoutWWW` has been part of ClickHouse®'s core URL functions for many years and works across most maintained versions. You won't encounter compatibility issues unless you're running a very outdated ClickHouse® installation.

### Can I remove both the protocol and www together in one operation?

Absolutely. The `domainWithoutWWW` function automatically handles both, extracting the hostname from any valid URL format and removing the www prefix in a single operation. `domainWithoutWWW('https://www.example.com/path?query=1')` gives you `'example.com'` directly.

## Additional resources

- [domainWithoutWWW function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#domainwithoutwww)
- [cutWWW function](https://clickhouse.com/docs/en/sql-reference/functions/string-functions#cutwww)
- [domain function](https://clickhouse.com/docs/en/sql-reference/functions/url-functions#domain)
- [URL functions in ClickHouse®](https://clickhouse.com/docs/en/sql-reference/functions/url-functions)
- [String functions in ClickHouse®](https://clickhouse.com/docs/en/sql-reference/functions/string-functions)
- [Working with URLs in Tinybird](https://www.tinybird.co/docs/sql-reference/functions/url-functions)
- [How to extract URL fragments without hash symbols (#) in ClickHouse®](https://www.tinybird.co/blog-posts/extract-url-fragment-clickhouse)
- [How to extract the protocol of a URL in ClickHouse®](https://www.tinybird.co/blog-posts/extract-url-protocol-clickhouse)
- [How to extract query parameters from URLs in ClickHouse®](https://www.tinybird.co/blog-posts/extract-query-parameter-clickhouse)
