URL parsing in ClickHouse becomes straightforward when you know which function to use. The protocol() function extracts the scheme portion from any URL string, returning values like https, http, ftp, or an empty string for malformed inputs.
This article covers the function's syntax and behavior, performance optimization techniques for large datasets, and how to build real-time APIs that analyze URL protocols at scale.
What the protocol function does
The protocol() function extracts the scheme portion from a URL string and returns values like https, http, ftp, mailto, or file. When you pass a URL like https://example.com/path?q=1, the function returns just https.
If the input lacks a recognizable scheme or isn't formatted as a valid URL, the function returns an empty string. This behavior makes it reliable for parsing mixed URL data where some entries might be incomplete or malformed.
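That makes spotting malformed or scheme-less rows a one-liner. This sketch assumes a urls table with a url column, like the one used in the examples below:
SELECT count() AS missing_scheme_count
FROM urls
WHERE protocol(url) = ''; -- empty string means no recognizable scheme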
Basic syntax and quick example
The function signature is straightforward: protocol(url), where url is any string expression. Here's how it works with different URL types:
SELECT
protocol('https://clickhouse.com/docs') AS https_example,
protocol('http://example.org') AS http_example,
protocol('ftp://mirror.example.net/file.iso') AS ftp_example;
This query returns https, http, and ftp, respectively.
Simple select
You can extract protocols from both literal strings and table columns. If you have a table called urls with a url column, you can analyze all protocols at once:
SELECT url, protocol(url) AS proto
FROM urls
LIMIT 10;
Nested use in GROUP BY
One of the most common use cases is counting protocol distribution across your dataset:
SELECT
protocol(url) AS proto,
count() AS c
FROM urls
GROUP BY proto
ORDER BY c DESC;
This shows you whether your traffic is predominantly HTTPS, HTTP, or includes other protocols like FTP or mailto links.
Return values and edge cases
The function returns the substring before the first colon when it matches a valid URL scheme pattern. Understanding edge cases helps you handle real-world data more effectively.
Missing protocol
URLs without schemes return empty strings, which is often what you want for filtering or conditional logic:
SELECT
protocol('example.com/path') AS missing, -- Returns ''
protocol('/relative/path?q=1') AS relative, -- Returns ''
protocol('://bad') AS malformed, -- Returns ''
protocol('example') AS plain_text; -- Returns ''
Invalid URLs
Completely invalid URLs that don't begin with a recognizable scheme pattern also return empty strings:
SELECT
protocol('not a url') AS invalid, -- Returns ''
protocol('http//missing-colon.com') AS broken; -- Returns ''
Non-web schemes
The function supports many schemes beyond HTTP and HTTPS, making it useful for diverse URL datasets:
protocol('mailto:user@example.com')          → 'mailto'
protocol('tel:+14155550123')                 → 'tel'
protocol('file:///var/log/system.log')       → 'file'
protocol('ssh://user@host')                  → 'ssh'
protocol('s3://bucket/key')                  → 's3'
protocol('data:text/plain;base64,SGVsbG8=')  → 'data'
Even custom or uncommon schemes work as long as they follow the general "scheme:" pattern.
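For example, an invented app-specific scheme (the URL below is purely illustrative) still parses cleanly:
SELECT protocol('myapp://settings/profile') AS custom_scheme; -- returns 'myapp'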
Performance tips for large datasets
When processing millions or billions of URLs, a few optimizations can significantly reduce memory usage, CPU load, and query latency.
Use of LowCardinality
Protocols are inherently low-cardinality data: you'll typically see only a handful of distinct values across your entire dataset. Store extracted protocols as LowCardinality(String) to reduce memory usage and speed up GROUP BY operations:
CREATE TABLE urls_enriched
(
ts DateTime,
user_id UInt64,
url String,
proto LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
Push-down filters
Filter by protocol early in your query so downstream stages process fewer rows. Keep in mind that calling protocol(url) in a WHERE clause still evaluates the function on every scanned row:
SELECT count()
FROM urls
WHERE protocol(url) = 'https';
If you've precomputed the protocol as a column, filtering on that column avoids recomputation entirely:
SELECT count()
FROM urls
WHERE proto = 'https';
Avoiding repeated computation with projections
Projections let you precompute protocol extraction so queries can skip per-row function calls. Here's how to set one up:
ALTER TABLE urls
ADD PROJECTION p_by_proto
(
SELECT
protocol(url) AS proto,
user_id,
ts
ORDER BY proto, ts
);
To backfill the projection for existing data, run ALTER TABLE urls MATERIALIZE PROJECTION p_by_proto; new inserts populate it automatically. Once materialized, queries that group or filter by protocol can leverage the projection automatically for much faster performance.
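For instance, the protocol-distribution query from earlier can now be served from the projection instead of the base table. Whether the optimizer actually picks it up depends on your ClickHouse version and settings, so check with EXPLAIN if in doubt:
SELECT
    protocol(url) AS proto,
    count() AS c
FROM urls
GROUP BY proto
ORDER BY c DESC;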
When to use protocol versus string functions
The protocol() function is purpose-built for URL parsing and handles edge cases that generic string functions might miss. However, there are situations where alternatives might make sense.
Situations favoring substring
If your input follows a strict format and you only need a quick split before the first colon, splitByChar can be marginally faster:
SELECT splitByChar(':', url)[1] AS fast_proto
FROM urls;
This approach will misclassify malformed inputs and won't return empty strings consistently for non-URLs. Use it only when you're certain about your data quality.
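If you still want the split-based approach with a minimal guard, one option is to fall back to an empty string when the input has no '://' separator. This is a rough sketch, not a drop-in replacement for protocol() (for one thing, it ignores schemes like mailto: that have no slashes):
SELECT
    -- Guarded split: only trust the prefix when a '://' separator is present
    if(position(url, '://') > 0, splitByChar(':', url)[1], '') AS guarded_proto
FROM urls;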
Regex trade-offs
Regex functions like extractGroups can handle highly irregular inputs but come with significant downsides:
- Flexibility: Maximum control over custom parsing rules and complex pattern matching
- Performance cost: Higher CPU usage and slower execution compared to built-in functions
- Complexity: More difficult to maintain and debug than simple function calls
Use regex only when you need advanced matching beyond what protocol() provides, such as extracting specific parts of custom URL schemes.
Building a materialized view with protocol
Creating a pipeline that automatically extracts protocols during data ingestion eliminates the need for recomputation at query time. This approach is particularly useful when you're processing high-volume URL data.
Create table and ingest
First, set up your raw data table:
CREATE TABLE urls_raw
(
ts DateTime,
user_id UInt64,
url String
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
INSERT INTO urls_raw VALUES
('2025-01-01 00:00:00', 1, 'https://example.com/a'),
('2025-01-01 00:00:01', 2, 'http://example.com/b'),
('2025-01-01 00:00:02', 3, 'mailto:user@example.com'),
('2025-01-01 00:00:03', 4, 'example.com/no-scheme');
Define materialized view
Create an enriched table and a materialized view that automatically populates it:
CREATE TABLE urls_enriched
(
ts DateTime,
user_id UInt64,
url String,
proto LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
CREATE MATERIALIZED VIEW mv_urls_proto
TO urls_enriched
AS
SELECT
ts,
user_id,
url,
protocol(url) AS proto
FROM urls_raw;
Verify results
Check that your materialized view is working correctly:
SELECT url, proto
FROM urls_enriched
ORDER BY ts;
You'll see protocol values of 'https', 'http', 'mailto', and an empty string for the URL without a scheme.
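Given the four sample rows, the result looks like this:
url                        proto
https://example.com/a      https
http://example.com/b       http
mailto:user@example.com    mailto
example.com/no-scheme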
Building ClickHouse-based web security analytics APIs with Tinybird
You can transform basic protocol extraction queries into a comprehensive web security analytics API using Tinybird's managed ClickHouse platform.
Instead of managing ClickHouse infrastructure, you define your security analytics logic as code and Tinybird handles the operational complexity. The platform provides sub-second query performance for millions of web requests while abstracting away database administration.
1. Create a data source for web security events
Define your security analytics data structure in a data source file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC'),
`request_id` String,
`user_id` String,
`session_id` String,
`url` String,
`referrer_url` String,
`user_agent` String,
`ip_address` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String),
`response_status` UInt16,
`response_time_ms` UInt32,
`is_bot` UInt8,
`threat_score` Float32
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"
JSONPATH >
{
"timestamp": "$.timestamp",
"request_id": "$.request_id",
"user_id": "$.user_id",
"session_id": "$.session_id",
"url": "$.url",
"referrer_url": "$.referrer_url",
"user_agent": "$.user_agent",
"ip_address": "$.ip_address",
"country": "$.country",
"device_type": "$.device_type",
"response_status": "$.response_status",
"response_time_ms": "$.response_time_ms",
"is_bot": "$.is_bot",
"threat_score": "$.threat_score"
}
This schema captures comprehensive web security data including protocol-based security analysis. The partitioning by month ensures fast queries as your security analytics data grows.
2. Ingest example data using the Events API
Add sample web security events to test your analytics using the Events API. First, start your local Tinybird development environment:
# Start Tinybird locally
tb dev
Then ingest sample data to your local instance:
# Ingest sample web security events to localhost
curl -X POST "http://localhost:7181/v0/events" \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
-H "Content-Type: application/json" \
-d '[
{
"timestamp": "2024-01-15T10:30:00.000Z",
"request_id": "req_001",
"user_id": "user_123",
"session_id": "sess_abc",
"url": "https://example.com/dashboard",
"referrer_url": "https://google.com/search",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"ip_address": "192.168.1.100",
"country": "US",
"device_type": "Desktop",
"response_status": 200,
"response_time_ms": 150,
"is_bot": 0,
"threat_score": 0.1
},
{
"timestamp": "2024-01-15T10:31:00.000Z",
"request_id": "req_002",
"user_id": "user_456",
"session_id": "sess_def",
"url": "http://example.com/login",
"referrer_url": "https://example.com/",
"user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
"ip_address": "10.0.0.50",
"country": "CA",
"device_type": "Mobile",
"response_status": 200,
"response_time_ms": 200,
"is_bot": 0,
"threat_score": 0.2
},
{
"timestamp": "2024-01-15T10:32:00.000Z",
"request_id": "req_003",
"user_id": "user_789",
"session_id": "sess_ghi",
"url": "https://example.com/api/data",
"referrer_url": "https://example.com/dashboard",
"user_agent": "curl/7.68.0",
"ip_address": "203.0.113.1",
"country": "GB",
"device_type": "Bot",
"response_status": 403,
"response_time_ms": 50,
"is_bot": 1,
"threat_score": 0.8
},
{
"timestamp": "2024-01-15T10:33:00.000Z",
"request_id": "req_004",
"user_id": "user_321",
"session_id": "sess_jkl",
"url": "ftp://files.example.com/download",
"referrer_url": "https://example.com/",
"user_agent": "FileZilla/3.50.0",
"ip_address": "198.51.100.10",
"country": "DE",
"device_type": "Desktop",
"response_status": 200,
"response_time_ms": 500,
"is_bot": 0,
"threat_score": 0.3
},
{
"timestamp": "2024-01-15T10:34:00.000Z",
"request_id": "req_005",
"user_id": "user_654",
"session_id": "sess_mno",
"url": "https://example.com/admin",
"referrer_url": "https://example.com/login",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
"ip_address": "172.16.0.5",
"country": "US",
"device_type": "Desktop",
"response_status": 200,
"response_time_ms": 120,
"is_bot": 0,
"threat_score": 0.1
}
]'
This sample data includes various protocols (HTTPS, HTTP, FTP), different device types, geographic locations, and security scenarios to test your analytics pipes.
3. Build security analytics pipes
Create pipes that analyze protocol-based security patterns and compliance metrics:
NODE security_events
SQL >
%
SELECT
timestamp,
request_id,
user_id,
session_id,
url,
referrer_url,
country,
device_type,
response_status,
response_time_ms,
is_bot,
threat_score,
protocol(url) AS protocol_type
FROM web_security_events
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
{% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{% end %}
{% if defined(protocol_filter) %}
AND protocol(url) = {{String(protocol_filter)}}
{% end %}
NODE protocol_security_analysis
SQL >
%
SELECT
protocol_type,
count() AS total_requests,
uniq(user_id) AS unique_users,
uniq(session_id) AS unique_sessions,
countIf(response_status >= 400) AS error_requests,
round(countIf(response_status >= 400) * 100.0 / count(), 2) AS error_rate_percent,
round(avg(response_time_ms), 2) AS avg_response_time_ms,
countIf(is_bot = 1) AS bot_requests,
round(avg(threat_score), 2) AS avg_threat_score,
-- Security compliance metrics
round(countIf(protocol_type = 'https') * 100.0 / count(), 2) AS https_adoption_percent,
-- Top error patterns by protocol
topK(5)(response_status) AS top_error_codes,
-- Geographic distribution of protocol usage
topK(3)(country) AS top_countries
FROM security_events
WHERE protocol_type != ''
GROUP BY protocol_type
ORDER BY total_requests DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 20)}}
{% end %}
TYPE ENDPOINT
Create a protocol migration analysis pipe:
NODE protocol_trends
SQL >
%
SELECT
toDate(timestamp) AS date,
protocol(url) AS protocol_type,
count() AS daily_requests,
uniq(user_id) AS daily_users,
countIf(response_status >= 400) AS daily_errors,
round(avg(response_time_ms), 2) AS avg_response_time
FROM web_security_events
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
{% if defined(protocol_filter) %}
AND protocol(url) = {{String(protocol_filter)}}
{% end %}
GROUP BY date, protocol_type
NODE migration_metrics
SQL >
%
SELECT
protocol_type,
date,
daily_requests,
daily_users,
daily_errors,
avg_response_time,
-- Calculate week-over-week growth (ClickHouse exposes this as lagInFrame rather than a bare lag())
round((daily_requests - lagInFrame(daily_requests, 7) OVER (PARTITION BY protocol_type ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)) * 100.0 / lagInFrame(daily_requests, 7) OVER (PARTITION BY protocol_type ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW), 2) AS wow_growth_percent,
-- Calculate error rate trends
round(daily_errors * 100.0 / daily_requests, 2) AS daily_error_rate
FROM protocol_trends
NODE protocol_migration_analysis
SQL >
%
SELECT
protocol_type,
date,
daily_requests,
daily_users,
daily_errors,
daily_error_rate,
avg_response_time,
wow_growth_percent,
-- Identify migration patterns
CASE
WHEN protocol_type = 'https' AND wow_growth_percent > 10 THEN 'accelerating_https'
WHEN protocol_type = 'http' AND wow_growth_percent < -10 THEN 'declining_http'
WHEN protocol_type = 'https' AND daily_error_rate < 1 THEN 'stable_https'
ELSE 'mixed_trends'
END AS migration_status
FROM migration_metrics
ORDER BY date DESC, daily_requests DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 100)}}
{% end %}
TYPE ENDPOINT
4. Test locally and deploy to cloud
Test your security analytics APIs locally first:
# Test protocol security analysis locally
curl "http://localhost:7181/v0/pipes/protocol_security_analysis.json?start_date=2024-01-15&end_date=2024-01-15&limit=10" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
Example local API response:
{
"meta": [
{
"name": "protocol_type",
"type": "String"
},
{
"name": "total_requests",
"type": "UInt64"
},
{
"name": "unique_users",
"type": "UInt64"
},
{
"name": "unique_sessions",
"type": "UInt64"
},
{
"name": "error_requests",
"type": "UInt64"
},
{
"name": "error_rate_percent",
"type": "Float64"
},
{
"name": "avg_response_time_ms",
"type": "Float64"
},
{
"name": "bot_requests",
"type": "UInt64"
},
{
"name": "avg_threat_score",
"type": "Float64"
},
{
"name": "https_adoption_percent",
"type": "Float64"
},
{
"name": "top_error_codes",
"type": "Array(UInt16)"
},
{
"name": "top_countries",
"type": "Array(String)"
}
],
"data": [
{
"protocol_type": "https",
"total_requests": 3,
"unique_users": 3,
"unique_sessions": 3,
"error_requests": 0,
"error_rate_percent": 0.0,
"avg_response_time_ms": 123.33,
"bot_requests": 1,
"avg_threat_score": 0.33,
"https_adoption_percent": 100.0,
"top_error_codes": [200, 200, 200],
"top_countries": ["US", "US", "GB"]
},
{
"protocol_type": "http",
"total_requests": 1,
"unique_users": 1,
"unique_sessions": 1,
"error_requests": 0,
"error_rate_percent": 0.0,
"avg_response_time_ms": 200.0,
"bot_requests": 0,
"avg_threat_score": 0.2,
"https_adoption_percent": 0.0,
"top_error_codes": [200],
"top_countries": ["CA"]
},
{
"protocol_type": "ftp",
"total_requests": 1,
"unique_users": 1,
"unique_sessions": 1,
"error_requests": 0,
"error_rate_percent": 0.0,
"avg_response_time_ms": 500.0,
"bot_requests": 0,
"avg_threat_score": 0.3,
"https_adoption_percent": 0.0,
"top_error_codes": [200],
"top_countries": ["DE"]
}
],
"rows": 3,
"statistics": {
"elapsed": 0.001234,
"rows_read": 5,
"bytes_read": 1024
}
}
Once you're satisfied with your local testing, deploy to Tinybird's cloud infrastructure:
# Deploy to cloud
tb --cloud deploy
Test your production APIs:
# Protocol security analysis
curl "https://api.tinybird.co/v0/pipes/protocol_security_analysis.json?start_date=2024-01-01&end_date=2024-01-31&country_filter=US&limit=20" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Protocol migration analysis
curl "https://api.tinybird.co/v0/pipes/protocol_migration_analysis.json?start_date=2024-01-01&end_date=2024-01-31&protocol_filter=https" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
The APIs return comprehensive security analytics including protocol adoption rates, error patterns, threat scores, and migration trends. This enables real-time insights into web security posture, HTTPS adoption progress, and protocol-based threat detection across your entire web infrastructure.
Why a managed ClickHouse service simplifies this work
Managing ClickHouse infrastructure involves complex decisions around cluster sizing, storage configuration, ingestion pipelines, and API hosting. A managed service like Tinybird removes operational burdens while providing additional capabilities.
The protocol extraction patterns we've covered work the same way whether you're running self-hosted ClickHouse or using Tinybird. However, with Tinybird, you can go from raw URL data to a production API significantly faster than with traditional approaches.
Key advantages of managed services include:
- High-throughput ingestion: Built-in connectors handle streaming data from various sources without custom pipeline code
- SQL-based transformations: Create data pipelines using familiar SQL syntax instead of complex ETL frameworks
- Production-grade APIs: Token-authenticated HTTP endpoints with automatic scaling and monitoring
- Zero infrastructure management: Focus on building features instead of managing clusters, storage, and networking
For teams looking to build real-time analytics without the operational complexity, managed ClickHouse services offer compelling advantages over self-hosted deployments.
Sign up for Tinybird's free plan to start building real-time URL analytics APIs without the infrastructure complexity.
FAQs about extracting URL protocols in ClickHouse
Can I index the protocol column for faster filtering?
You can create indexes, but protocols are low-cardinality data by nature. Using LowCardinality(String) typically provides better memory usage and performance than secondary indexes for GROUP BY and filter operations. If you still need an index for highly selective filters, consider bloom filter indexes on the protocol column.
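As a sketch, assuming the proto column from the urls_enriched table above, a bloom filter skip index could be added like this (the index name and granularity are illustrative):
ALTER TABLE urls_enriched
    ADD INDEX idx_proto proto TYPE bloom_filter(0.01) GRANULARITY 4;
-- Build the index for data that existed before the ALTER
ALTER TABLE urls_enriched MATERIALIZE INDEX idx_proto;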
How do I update existing rows to include protocol data?
You have several options depending on your data size and downtime tolerance: add a new column and backfill it, create a new enriched table with INSERT SELECT, or use a materialized view for new data while migrating old data as a batch job. The materialized view approach is often the most practical for large datasets.
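A minimal sketch of the add-and-backfill option, assuming the urls table from earlier; note that MATERIALIZE COLUMN runs as an asynchronous mutation, so check system.mutations before depending on the stored values:
-- New rows get the value automatically via the DEFAULT expression
ALTER TABLE urls
    ADD COLUMN proto LowCardinality(String) DEFAULT protocol(url);
-- Rewrite existing parts so the value is stored rather than computed on read
ALTER TABLE urls MATERIALIZE COLUMN proto;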
Does the protocol function allocate memory for each call?
The protocol() function is lightweight and optimized for performance. However, repeatedly computing it at read time over large datasets still consumes CPU cycles. Precomputing protocols via materialized views or projections is usually more efficient, and storing results as LowCardinality(String) further reduces memory usage while speeding up aggregations.