---
title: "How to extract the protocol of a URL in ClickHouse®"
excerpt: "Learn how to extract URL protocols in ClickHouse® using the protocol() function with practical examples, performance tips, and real-time API implementation."
authors: "Cameron Archer"
categories: "AI Resources"
createdOn: "2025-10-02 17:22:42"
publishedOn: "2025-10-02 17:22:42"
updatedOn: "2025-10-02 17:22:42"
status: "published"
---

URL parsing in ClickHouse{% sup %}®{% /sup %} becomes straightforward when you know which function to use. The `protocol()` function extracts the scheme portion from any URL string, returning values like `https`, `http`, `ftp`, or an empty string for malformed inputs.

This article covers the function's syntax and behavior, performance optimization techniques for large datasets, and how to build real-time APIs that analyze URL protocols at scale.

## What the `protocol` function does

The `protocol()` function extracts the scheme portion from a URL string and returns values like `https`, `http`, `ftp`, `mailto`, or `file`. When you pass a URL like `https://example.com/path?q=1`, the function returns just `https`.

If the input lacks a recognizable scheme or isn't formatted as a valid URL, the function returns an empty string. This behavior makes it reliable for parsing mixed URL data where some entries might be incomplete or malformed.

## Basic syntax and quick example

The function signature is straightforward: `protocol(url)` where `url` is any string expression. Here's how it works with different URL types:

```sql
SELECT
    protocol('https://clickhouse.com/docs') AS https_example,
    protocol('http://example.org') AS http_example,
    protocol('ftp://mirror.example.net/file.iso') AS ftp_example;
```

This query returns `https`, `http`, and `ftp` respectively.

### Simple select

You can extract protocols from both literal strings and table columns. If you have a table called `urls` with a `url` column, you can analyze all protocols at once:

```sql
SELECT url, protocol(url) AS proto
FROM urls
LIMIT 10;
```

### Nested use in `GROUP BY`

One of the most common use cases is counting protocol distribution across your dataset:

```sql
SELECT
    protocol(url) AS proto,
    count() AS c
FROM urls
GROUP BY proto
ORDER BY c DESC;
```

This shows you whether your traffic is predominantly HTTPS, HTTP, or includes other protocols like FTP or mailto links.

## Return values and edge cases

The function returns the substring before the first colon when it matches a valid URL scheme pattern. Understanding edge cases helps you handle real-world data more effectively.

### Missing protocol

URLs without schemes return empty strings, which is often what you want for filtering or conditional logic:

```sql
SELECT
    protocol('example.com/path') AS missing,          -- Returns ''
    protocol('/relative/path?q=1') AS relative,       -- Returns ''
    protocol('://bad') AS malformed,                  -- Returns ''
    protocol('example') AS plain_text;                -- Returns ''
```

### Invalid URLs

Completely invalid URLs that don't begin with a recognizable scheme pattern also return empty strings:

```sql
SELECT
    protocol('not a url') AS invalid,                 -- Returns ''
    protocol('http//missing-colon.com') AS broken;    -- Returns ''
```

### Non-web schemes

The function supports many schemes beyond HTTP and HTTPS, making it useful for diverse URL datasets:

- `protocol('mailto:user@example.com')` → `'mailto'`
- `protocol('tel:+14155550123')` → `'tel'`
- `protocol('file:///var/log/system.log')` → `'file'`
- `protocol('ssh://user@host')` → `'ssh'`
- `protocol('s3://bucket/key')` → `'s3'`
- `protocol('data:text/plain;base64,SGVsbG8=')` → `'data'`

Even custom or uncommon schemes work as long as they follow the general "scheme:" pattern.

## Performance tips for large datasets

When processing millions or billions of URLs, a few optimizations can significantly reduce memory usage, CPU load, and query latency.

### Use of `LowCardinality`

Protocols are inherently low-cardinality data—you'll typically see only a handful of distinct values across your entire dataset. Store extracted protocols as [`LowCardinality(String)`](https://tinybird.co/docs/sql-reference/data-types/lowcardinality) to reduce memory usage and speed up `GROUP BY` operations:

```sql
CREATE TABLE urls_enriched
(
    ts DateTime,
    user_id UInt64,
    url String,
    proto LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
```

### Push-down filters

Filter by protocol early in your query to reduce the amount of data ClickHouse{% sup %}®{% /sup %} scans:

```sql
SELECT count()
FROM urls
WHERE protocol(url) = 'https';
```

If you've precomputed the protocol as a column, filtering on that column avoids recomputation entirely:

```sql
SELECT count()
FROM urls
WHERE proto = 'https';
```

### Avoiding repeated computation with projections

Projections let you precompute protocol extraction so queries can skip per-row function calls. Here's how to set one up:

```sql
ALTER TABLE urls
ADD PROJECTION p_by_proto
(
    SELECT
        protocol(url) AS proto,
        user_id,
        ts
    ORDER BY proto, ts
);
```

After running `OPTIMIZE TABLE urls FINAL`, queries that group or filter by protocol can leverage the projection automatically for much faster performance.

## When to use `protocol` versus string functions

The `protocol()` function is purpose-built for URL parsing and handles edge cases that generic string functions might miss. However, there are situations where alternatives might make sense.

### Situations favoring substring

If your input follows a strict format and you only need a quick split before the first colon, `splitByChar` can be marginally faster:

```sql
SELECT splitByChar(':', url)[1] AS fast_proto
FROM urls;
```

This approach will misclassify malformed inputs and won't return empty strings consistently for non-URLs. Use it only when you're certain about your data quality.

### Regex trade-offs

Regex functions like `extractGroups` can handle highly irregular inputs but come with significant downsides:

- **Flexibility:** Maximum control over custom parsing rules and complex pattern matching
- **Performance cost:** Higher CPU usage and slower execution compared to built-in functions
- **Complexity:** More difficult to maintain and debug than simple function calls

Use regex only when you need advanced matching beyond what `protocol()` provides, such as extracting specific parts of custom URL schemes.

## Building a materialized view with protocol

Creating a pipeline that automatically extracts protocols during data ingestion eliminates the need for recomputation at query time. This approach is particularly useful when you're processing high-volume URL data.

### Create table and ingest

First, set up your raw data table:

```sql
CREATE TABLE urls_raw
(
    ts DateTime,
    user_id UInt64,
    url String
)
ENGINE = MergeTree
ORDER BY (ts, user_id);

INSERT INTO urls_raw VALUES
('2025-01-01 00:00:00', 1, 'https://example.com/a'),
('2025-01-01 00:00:01', 2, 'http://example.com/b'),
('2025-01-01 00:00:02', 3, 'mailto:user@example.com'),
('2025-01-01 00:00:03', 4, 'example.com/no-scheme');
```

### Define materialized view

Create an enriched table and a [materialized view](https://tinybird.co/docs/forward/work-with-data/optimize/materialized-views) that automatically populates it:

```sql
CREATE TABLE urls_enriched
(
    ts DateTime,
    user_id UInt64,
    url String,
    proto LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (ts, user_id);

CREATE MATERIALIZED VIEW mv_urls_proto
TO urls_enriched
AS
SELECT
    ts,
    user_id,
    url,
    protocol(url) AS proto
FROM urls_raw;
```

### Verify results

Check that your materialized view is working correctly:

```sql
SELECT url, proto
FROM urls_enriched
ORDER BY ts;
```

You'll see protocol values of `'https'`, `'http'`, `'mailto'`, and an empty string for the URL without a scheme.

## Building ClickHouse{% sup %}®{% /sup %}-based web security analytics APIs with Tinybird

You can transform basic protocol extraction queries into a comprehensive web security analytics API using Tinybird's [managed ClickHouse{% sup %}®{% /sup %} platform](https://www.tinybird.co/clickhouse).

Instead of managing ClickHouse{% sup %}®{% /sup %} infrastructure, you define your security analytics logic as code and [Tinybird](https://www.tinybird.co) handles the operational complexity. The platform provides sub-second query performance for millions of web requests while abstracting away database administration.

### 1. Create a data source for web security events

Define your security analytics data structure in a [data source](https://www.tinybird.co/docs/forward/get-data-in/data-sources) file:

```tinybird
SCHEMA >
    `timestamp` DateTime64(3, 'UTC'),
    `request_id` String,
    `user_id` String,
    `session_id` String,
    `url` String,
    `referrer_url` String,
    `user_agent` String,
    `ip_address` String,
    `country` LowCardinality(String),
    `device_type` LowCardinality(String),
    `response_status` UInt16,
    `response_time_ms` UInt32,
    `is_bot` UInt8,
    `threat_score` Float32

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"

JSONPATH >
    {
        "timestamp": "$.timestamp",
        "request_id": "$.request_id",
        "user_id": "$.user_id",
        "session_id": "$.session_id",
        "url": "$.url",
        "referrer_url": "$.referrer_url",
        "user_agent": "$.user_agent",
        "ip_address": "$.ip_address",
        "country": "$.country",
        "device_type": "$.device_type",
        "response_status": "$.response_status",
        "response_time_ms": "$.response_time_ms",
        "is_bot": "$.is_bot",
        "threat_score": "$.threat_score"
    }
```

This schema captures comprehensive web security data including protocol-based security analysis. The partitioning by month ensures fast queries as your security analytics data grows.

### 2. Ingest example data using the Events API

Add sample web security events to test your analytics using the [Events API](https://www.tinybird.co/docs/forward/get-data-in/events-api). First, start your local Tinybird development environment:

```bash
# Start Tinybird locally
tb dev
```

Then ingest sample data to your local instance:

```bash
# Ingest sample web security events to localhost
curl -X POST "http://localhost:7181/v0/events" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "timestamp": "2024-01-15T10:30:00.000Z",
      "request_id": "req_001",
      "user_id": "user_123",
      "session_id": "sess_abc",
      "url": "https://example.com/dashboard",
      "referrer_url": "https://google.com/search",
      "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "ip_address": "192.168.1.100",
      "country": "US",
      "device_type": "Desktop",
      "response_status": 200,
      "response_time_ms": 150,
      "is_bot": 0,
      "threat_score": 0.1
    },
    {
      "timestamp": "2024-01-15T10:31:00.000Z",
      "request_id": "req_002",
      "user_id": "user_456",
      "session_id": "sess_def",
      "url": "http://example.com/login",
      "referrer_url": "https://example.com/",
      "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
      "ip_address": "10.0.0.50",
      "country": "CA",
      "device_type": "Mobile",
      "response_status": 200,
      "response_time_ms": 200,
      "is_bot": 0,
      "threat_score": 0.2
    },
    {
      "timestamp": "2024-01-15T10:32:00.000Z",
      "request_id": "req_003",
      "user_id": "user_789",
      "session_id": "sess_ghi",
      "url": "https://example.com/api/data",
      "referrer_url": "https://example.com/dashboard",
      "user_agent": "curl/7.68.0",
      "ip_address": "203.0.113.1",
      "country": "GB",
      "device_type": "Bot",
      "response_status": 403,
      "response_time_ms": 50,
      "is_bot": 1,
      "threat_score": 0.8
    },
    {
      "timestamp": "2024-01-15T10:33:00.000Z",
      "request_id": "req_004",
      "user_id": "user_321",
      "session_id": "sess_jkl",
      "url": "ftp://files.example.com/download",
      "referrer_url": "https://example.com/",
      "user_agent": "FileZilla/3.50.0",
      "ip_address": "198.51.100.10",
      "country": "DE",
      "device_type": "Desktop",
      "response_status": 200,
      "response_time_ms": 500,
      "is_bot": 0,
      "threat_score": 0.3
    },
    {
      "timestamp": "2024-01-15T10:34:00.000Z",
      "request_id": "req_005",
      "user_id": "user_654",
      "session_id": "sess_mno",
      "url": "https://example.com/admin",
      "referrer_url": "https://example.com/login",
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
      "ip_address": "172.16.0.5",
      "country": "US",
      "device_type": "Desktop",
      "response_status": 200,
      "response_time_ms": 120,
      "is_bot": 0,
      "threat_score": 0.1
    }
  ]'
```

This sample data includes various protocols (HTTPS, HTTP, FTP), different device types, geographic locations, and security scenarios to test your analytics pipes.

### 3. Build security analytics pipes

Create [pipes](https://www.tinybird.co/docs/forward/work-with-data/pipes) that analyze protocol-based security patterns and compliance metrics:

```tinybird
NODE security_events
SQL >
    %
    SELECT
        timestamp,
        request_id,
        user_id,
        session_id,
        url,
        referrer_url,
        country,
        device_type,
        response_status,
        response_time_ms,
        is_bot,
        threat_score,
        protocol(url) AS protocol_type
    FROM web_security_events
    WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
      AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
    {\% if defined(country_filter) %}
      AND country = {{String(country_filter)}}
    {\% end %}
    {\% if defined(protocol_filter) %}
      AND protocol(url) = {{String(protocol_filter)}}
    {\% end %}

NODE protocol_security_analysis
SQL >
    %
    SELECT
        protocol_type,
        count() AS total_requests,
        uniq(user_id) AS unique_users,
        uniq(session_id) AS unique_sessions,
        countIf(response_status >= 400) AS error_requests,
        round(countIf(response_status >= 400) * 100.0 / count(), 2) AS error_rate_percent,
        round(avg(response_time_ms), 2) AS avg_response_time_ms,
        countIf(is_bot = 1) AS bot_requests,
        round(avg(threat_score), 2) AS avg_threat_score,
        -- Security compliance metrics
        round(countIf(protocol_type = 'https') * 100.0 / count(), 2) AS https_adoption_percent,
        -- Top error patterns by protocol
        topK(5)(response_status) AS top_error_codes,
        -- Geographic distribution of protocol usage
        topK(3)(country) AS top_countries
    FROM security_events
    WHERE protocol_type != ''
    GROUP BY protocol_type
    ORDER BY total_requests DESC
    {\% if defined(limit) %}
    LIMIT {{Int32(limit, 20)}}
    {\% end %}

TYPE ENDPOINT
```

Create a protocol migration analysis pipe:

```tinybird
NODE protocol_trends
SQL >
    %
    SELECT
        toDate(timestamp) AS date,
        protocol(url) AS protocol_type,
        count() AS daily_requests,
        uniq(user_id) AS daily_users,
        countIf(response_status >= 400) AS daily_errors,
        round(avg(response_time_ms), 2) AS avg_response_time
    FROM web_security_events
    WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
      AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
    {\% if defined(protocol_filter) %}
      AND protocol(url) = {{String(protocol_filter)}}
    {\% end %}
    GROUP BY date, protocol_type

NODE migration_metrics
SQL >
    %
    SELECT
        protocol_type,
        date,
        daily_requests,
        daily_users,
        daily_errors,
        avg_response_time,
        -- Calculate week-over-week growth
        round((daily_requests - lag(daily_requests, 7) OVER (PARTITION BY protocol_type ORDER BY date)) * 100.0 / lag(daily_requests, 7) OVER (PARTITION BY protocol_type ORDER BY date), 2) AS wow_growth_percent,
        -- Calculate error rate trends
        round(daily_errors * 100.0 / daily_requests, 2) AS daily_error_rate
    FROM protocol_trends

NODE protocol_migration_analysis
SQL >
    %
    SELECT
        protocol_type,
        date,
        daily_requests,
        daily_users,
        daily_errors,
        daily_error_rate,
        avg_response_time,
        wow_growth_percent,
        -- Identify migration patterns
        CASE
            WHEN protocol_type = 'https' AND wow_growth_percent > 10 THEN 'accelerating_https'
            WHEN protocol_type = 'http' AND wow_growth_percent < -10 THEN 'declining_http'
            WHEN protocol_type = 'https' AND daily_error_rate < 1 THEN 'stable_https'
            ELSE 'mixed_trends'
        END AS migration_status
    FROM migration_metrics
    ORDER BY date DESC, daily_requests DESC
    {\% if defined(limit) %}
    LIMIT {{Int32(limit, 100)}}
    {\% end %}

TYPE ENDPOINT
```

### 4. Test locally and deploy to cloud

Test your security analytics APIs locally first:

```bash
# Test protocol security analysis locally
curl "http://localhost:7181/v0/pipes/protocol_security_analysis.json?start_date=2024-01-15&end_date=2024-01-15&limit=10" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"
```

**Example local API response:**

```json
{
  "meta": [
    {
      "name": "protocol_type",
      "type": "String"
    },
    {
      "name": "total_requests",
      "type": "UInt64"
    },
    {
      "name": "unique_users",
      "type": "UInt64"
    },
    {
      "name": "unique_sessions",
      "type": "UInt64"
    },
    {
      "name": "error_requests",
      "type": "UInt64"
    },
    {
      "name": "error_rate_percent",
      "type": "Float64"
    },
    {
      "name": "avg_response_time_ms",
      "type": "Float64"
    },
    {
      "name": "bot_requests",
      "type": "UInt64"
    },
    {
      "name": "avg_threat_score",
      "type": "Float64"
    },
    {
      "name": "https_adoption_percent",
      "type": "Float64"
    },
    {
      "name": "top_error_codes",
      "type": "Array(UInt16)"
    },
    {
      "name": "top_countries",
      "type": "Array(String)"
    }
  ],
  "data": [
    {
      "protocol_type": "https",
      "total_requests": 3,
      "unique_users": 3,
      "unique_sessions": 3,
      "error_requests": 0,
      "error_rate_percent": 0.0,
      "avg_response_time_ms": 123.33,
      "bot_requests": 1,
      "avg_threat_score": 0.33,
      "https_adoption_percent": 100.0,
      "top_error_codes": [200, 200, 200],
      "top_countries": ["US", "US", "GB"]
    },
    {
      "protocol_type": "http",
      "total_requests": 1,
      "unique_users": 1,
      "unique_sessions": 1,
      "error_requests": 0,
      "error_rate_percent": 0.0,
      "avg_response_time_ms": 200.0,
      "bot_requests": 0,
      "avg_threat_score": 0.2,
      "https_adoption_percent": 0.0,
      "top_error_codes": [200],
      "top_countries": ["CA"]
    },
    {
      "protocol_type": "ftp",
      "total_requests": 1,
      "unique_users": 1,
      "unique_sessions": 1,
      "error_requests": 0,
      "error_rate_percent": 0.0,
      "avg_response_time_ms": 500.0,
      "bot_requests": 0,
      "avg_threat_score": 0.3,
      "https_adoption_percent": 0.0,
      "top_error_codes": [200],
      "top_countries": ["DE"]
    }
  ],
  "rows": 3,
  "statistics": {
    "elapsed": 0.001234,
    "rows_read": 5,
    "bytes_read": 1024
  }
}
```

Once you're satisfied with your local testing, deploy to Tinybird's cloud infrastructure:

```bash
# Deploy to cloud
tb --cloud deploy
```

Test your production APIs:

```bash
# Protocol security analysis
curl "https://api.tinybird.co/v0/pipes/protocol_security_analysis.json?start_date=2024-01-01&end_date=2024-01-31&country_filter=US&limit=20" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"

# Protocol migration analysis
curl "https://api.tinybird.co/v0/pipes/protocol_migration_analysis.json?start_date=2024-01-01&end_date=2024-01-31&protocol_filter=https" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"
```

The APIs return comprehensive security analytics including protocol adoption rates, error patterns, threat scores, and migration trends. This enables real-time insights into web security posture, HTTPS adoption progress, and protocol-based threat detection across your entire web infrastructure.

## Why a managed ClickHouse{% sup %}®{% /sup %} service simplifies this work

Managing ClickHouse{% sup %}®{% /sup %} infrastructure involves complex decisions around cluster sizing, storage configuration, ingestion pipelines, and API hosting. A managed service like [Tinybird](https://www.tinybird.co) removes operational burdens while providing additional capabilities.

The protocol extraction patterns we've covered work the same way whether you're running self-hosted ClickHouse{% sup %}®{% /sup %} or using Tinybird. However, with Tinybird, you can go from raw URL data to a production API significantly faster than with traditional approaches.

Key advantages of managed services include:

- **High-throughput ingestion:** Built-in [connectors](https://tinybird.co/docs/forward/get-data-in/connectors) handle streaming data from various sources without custom pipeline code
- **SQL-based transformations:** Create data pipelines using familiar SQL syntax instead of complex ETL frameworks
- **Production-grade APIs:** Token-authenticated HTTP endpoints with automatic scaling and monitoring
- **Zero infrastructure management:** Focus on building features instead of managing clusters, storage, and networking

For teams looking to build real-time analytics without the operational complexity, [managed ClickHouse{% sup %}®{% /sup %} services offer compelling advantages](https://www.tinybird.co/blog-posts/managed-clickhouse-options) over self-hosted deployments.

[Sign up for Tinybird's free plan](https://cloud.tinybird.co/signup) to start building real-time URL analytics APIs without the infrastructure complexity.

## FAQs about extracting URL protocols in ClickHouse{% sup %}®{% /sup %}

### Can I index the protocol column for faster filtering?

You can create indexes, but protocols are low-cardinality data by nature. Using `LowCardinality(String)` typically provides better memory usage and performance than secondary indexes for `GROUP BY` and filter operations. If you still need an index for highly selective filters, consider bloom filter indexes on the protocol column.

### How do I update existing rows to include protocol data?

You have several options depending on your data size and downtime tolerance. Add a new column and backfill it, create a new enriched table with `INSERT SELECT`, or use a materialized view for new data while migrating old data as a batch job. The materialized view approach is often the most practical for large datasets.

### Does the protocol function allocate memory for each call?

The `protocol()` function is lightweight and optimized for performance. However, repeatedly computing it at read time over large datasets still consumes CPU cycles. Precomputing protocols via materialized views or projections is usually more efficient, and storing results as `LowCardinality(String)` further reduces memory usage while speeding up aggregations.

## Additional resources

- [URL Functions](https://clickhouse.com/docs/en/sql-reference/functions/url-functions)
- [String Functions](https://clickhouse.com/docs/en/sql-reference/functions/string-functions)
- [Array Functions](https://clickhouse.com/docs/en/sql-reference/functions/array-functions)
- [Materialized Views](https://clickhouse.com/docs/en/engines/table-engines/special/materializedview)
- [Tinybird vs ClickHouse{% sup %}®{% /sup %}](https://www.tinybird.co/blog-posts/tinybird-vs-clickhouse)
- [How to build URL hierarchy in ClickHouse{% sup %}®{% /sup %}](https://www.tinybird.co/blog-posts/how-to-build-url-hierarchy-clickhouse)
- [Extract query parameters in ClickHouse{% sup %}®{% /sup %}](https://www.tinybird.co/blog-posts/extract-query-parameter-clickhouse)
- [Strip URL query strings in ClickHouse{% sup %}®{% /sup %}](https://www.tinybird.co/blog-posts/strip-url-query-string-clickhouse)
- [Web Analytics Starter Kit Template](https://github.com/tinybirdco/web-analytics-starter-kit)
