URL parsing in ClickHouse becomes straightforward when you know which function to use. The protocol() function extracts the scheme portion from any URL string, returning values like https, http, ftp, or an empty string for malformed inputs.
This article covers the function's syntax and behavior, performance optimization techniques for large datasets, and how to build real-time APIs that analyze URL protocols at scale.
What the protocol function does
The protocol() function extracts the scheme portion from a URL string and returns values like https, http, ftp, mailto, or file. When you pass a URL like https://example.com/path?q=1, the function returns just https.
If the input lacks a recognizable scheme or isn't formatted as a valid URL, the function returns an empty string. This behavior makes it reliable for parsing mixed URL data where some entries might be incomplete or malformed.
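That makes spotting malformed or scheme-less rows a one-liner. This sketch assumes a urls table with a url column, like the one used in the examples below:
SELECT count() AS missing_scheme_count
FROM urls
WHERE protocol(url) = ''; -- empty string means no recognizable scheme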
Basic syntax and quick example
The function signature is straightforward: protocol(url), where url is any string expression. Here's how it works with different URL types:
SELECT
protocol('https://clickhouse.com/docs') AS https_example,
protocol('http://example.org') AS http_example,
protocol('ftp://mirror.example.net/file.iso') AS ftp_example;
This query returns https, http, and ftp, respectively.
Simple select
You can extract protocols from both literal strings and table columns. If you have a table called urls with a url column, you can analyze all protocols at once:
SELECT url, protocol(url) AS proto
FROM urls
LIMIT 10;
Nested use in GROUP BY
One of the most common use cases is counting protocol distribution across your dataset:
SELECT
protocol(url) AS proto,
count() AS c
FROM urls
GROUP BY proto
ORDER BY c DESC;
This shows you whether your traffic is predominantly HTTPS, HTTP, or includes other protocols like FTP or mailto links.
Return values and edge cases
The function returns the substring before the first colon when it matches a valid URL scheme pattern. Understanding edge cases helps you handle real-world data more effectively.
Missing protocol
URLs without schemes return empty strings, which is often what you want for filtering or conditional logic:
SELECT
protocol('example.com/path') AS missing, -- Returns ''
protocol('/relative/path?q=1') AS relative, -- Returns ''
protocol('://bad') AS malformed, -- Returns ''
protocol('example') AS plain_text; -- Returns ''
Invalid URLs
Completely invalid URLs that don't begin with a recognizable scheme pattern also return empty strings:
SELECT
protocol('not a url') AS invalid, -- Returns ''
protocol('http//missing-colon.com') AS broken; -- Returns ''
Non-web schemes
The function supports many schemes beyond HTTP and HTTPS, making it useful for diverse URL datasets:
protocol('mailto:user@example.com')          → 'mailto'
protocol('tel:+14155550123')                 → 'tel'
protocol('file:///var/log/system.log')       → 'file'
protocol('ssh://user@host')                  → 'ssh'
protocol('s3://bucket/key')                  → 's3'
protocol('data:text/plain;base64,SGVsbG8=')  → 'data'
Even custom or uncommon schemes work as long as they follow the general "scheme:" pattern.
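For example, an invented app-specific scheme (the URL below is purely illustrative) still parses cleanly:
SELECT protocol('myapp://settings/profile') AS custom_scheme; -- returns 'myapp'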
Performance tips for large datasets
When processing millions or billions of URLs, a few optimizations can significantly reduce memory usage, CPU load, and query latency.
Use of LowCardinality
Protocols are inherently low-cardinality data: you'll typically see only a handful of distinct values across your entire dataset. Store extracted protocols as LowCardinality(String) to reduce memory usage and speed up GROUP BY operations:
CREATE TABLE urls_enriched
(
ts DateTime,
user_id UInt64,
url String,
proto LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
Push-down filters
Filter by protocol early in your query so downstream stages process fewer rows. Keep in mind that calling protocol(url) in a WHERE clause still evaluates the function on every scanned row:
SELECT count()
FROM urls
WHERE protocol(url) = 'https';
If you've precomputed the protocol as a column, filtering on that column avoids recomputation entirely:
SELECT count()
FROM urls
WHERE proto = 'https';
Avoiding repeated computation with projections
Projections let you precompute protocol extraction so queries can skip per-row function calls. Here's how to set one up:
ALTER TABLE urls
ADD PROJECTION p_by_proto
(
SELECT
protocol(url) AS proto,
user_id,
ts
ORDER BY proto, ts
);
To backfill the projection for existing data, run ALTER TABLE urls MATERIALIZE PROJECTION p_by_proto; new inserts populate it automatically. Once materialized, queries that group or filter by protocol can leverage the projection automatically for much faster performance.
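For instance, the protocol-distribution query from earlier can now be served from the projection instead of the base table. Whether the optimizer actually picks it up depends on your ClickHouse version and settings, so check with EXPLAIN if in doubt:
SELECT
    protocol(url) AS proto,
    count() AS c
FROM urls
GROUP BY proto
ORDER BY c DESC;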
When to use protocol versus string functions
The protocol() function is purpose-built for URL parsing and handles edge cases that generic string functions might miss. However, there are situations where alternatives might make sense.
Situations favoring substring
If your input follows a strict format and you only need a quick split before the first colon, splitByChar can be marginally faster:
SELECT splitByChar(':', url)[1] AS fast_proto
FROM urls;
This approach will misclassify malformed inputs and won't return empty strings consistently for non-URLs. Use it only when you're certain about your data quality.
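If you still want the split-based approach with a minimal guard, one option is to fall back to an empty string when the input has no '://' separator. This is a rough sketch, not a drop-in replacement for protocol() (for one thing, it ignores schemes like mailto: that have no slashes):
SELECT
    -- Guarded split: only trust the prefix when a '://' separator is present
    if(position(url, '://') > 0, splitByChar(':', url)[1], '') AS guarded_proto
FROM urls;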
Regex trade-offs
Regex functions like extractGroups can handle highly irregular inputs but come with significant downsides:
- Flexibility: Maximum control over custom parsing rules and complex pattern matching
- Performance cost: Higher CPU usage and slower execution compared to built-in functions
- Complexity: More difficult to maintain and debug than simple function calls
Use regex only when you need advanced matching beyond what protocol() provides, such as extracting specific parts of custom URL schemes.
Building a materialized view with protocol
Creating a pipeline that automatically extracts protocols during data ingestion eliminates the need for recomputation at query time. This approach is particularly useful when you're processing high-volume URL data.
Create table and ingest
First, set up your raw data table:
CREATE TABLE urls_raw
(
ts DateTime,
user_id UInt64,
url String
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
INSERT INTO urls_raw VALUES
('2025-01-01 00:00:00', 1, 'https://example.com/a'),
('2025-01-01 00:00:01', 2, 'http://example.com/b'),
('2025-01-01 00:00:02', 3, 'mailto:user@example.com'),
('2025-01-01 00:00:03', 4, 'example.com/no-scheme');
Define materialized view
Create an enriched table and a materialized view that automatically populates it:
CREATE TABLE urls_enriched
(
ts DateTime,
user_id UInt64,
url String,
proto LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (ts, user_id);
CREATE MATERIALIZED VIEW mv_urls_proto
TO urls_enriched
AS
SELECT
ts,
user_id,
url,
protocol(url) AS proto
FROM urls_raw;
Verify results
Check that your materialized view is working correctly:
SELECT url, proto
FROM urls_enriched
ORDER BY ts;
You'll see protocol values of 'https', 'http', 'mailto', and an empty string for the URL without a scheme.
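Given the four sample rows, the result looks like this:
url                        proto
https://example.com/a      https
http://example.com/b       http
mailto:user@example.com    mailto
example.com/no-scheme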
Building ClickHouse-based web security analytics APIs with Tinybird
You can transform basic protocol extraction queries into a comprehensive web security analytics API using Tinybird's managed ClickHouse platform.
Instead of managing ClickHouse infrastructure, you define your security analytics logic as code and Tinybird handles the operational complexity. The platform provides sub-second query performance for millions of web requests while abstracting away database administration.
1. Create a data source for web security events
Define your security analytics data structure in a data source file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC'),
`request_id` String,
`user_id` String,
`session_id` String,
`url` String,
`referrer_url` String,
`user_agent` String,
`ip_address` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String),
`response_status` UInt16,
`response_time_ms` UInt32,
`is_bot` UInt8,
`threat_score` Float32
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"
JSONPATH >
{
"timestamp": "$.timestamp",
"request_id": "$.request_id",
"user_id": "$.user_id",
"session_id": "$.session_id",
"url": "$.url",
"referrer_url": "$.referrer_url",
"user_agent": "$.user_agent",
"ip_address": "$.ip_address",
"country": "$.country",
"device_type": "$.device_type",
"response_status": "$.response_status",
"response_time_ms": "$.response_time_ms",
"is_bot": "$.is_bot",
"threat_score": "$.threat_score"
}
This schema captures comprehensive web security data including protocol-based security analysis. The partitioning by month ensures fast queries as your security analytics data grows.
2. Ingest example data using the Events API
Add sample web security events to test your analytics using the Events API. First, start your local Tinybird development environment:
# Start Tinybird locally
tb dev
Then ingest sample data to your local instance:
# Ingest sample web security events to localhost
curl -X POST "http://localhost:7181/v0/events" \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
-H "Content-Type: application/json" \
-d '[
{
"timestamp": "2024-01-15T10:30:00.000Z",
"request_id": "req_001",
"user_id": "user_123",
"session_id": "sess_abc",
"url": "https://example.com/dashboard",
"referrer_url": "https://google.com/search",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"ip_address": "192.168.1.100",
"country": "US",
"device_type": "Desktop",
"response_status": 200,
"response_time_ms": 150,
"is_bot": 0,
"threat_score": 0.1
},
{
"timestamp": "2024-01-15T10:31:00.000Z",
"request_id": "req_002",
"user_id": "user_456",
"session_id": "sess_def",
"url": "http://example.com/login",
"referrer_url": "https://example.com/",
"user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
"ip_address": "10.0.0.50",
"country": "CA",
"device_type": "Mobile",
"response_status": 200,
"response_time_ms": 200,
"is_bot": 0,
"threat_score": 0.2
},
{
"timestamp": "2024-01-15T10:32:00.000Z",
"request_id": "req_003",
"user_id": "user_789",
"session_id": "sess_ghi",
"url": "https://example.com/api/data",
"referrer_url": "https://example.com/dashboard",
"user_agent": "curl/7.68.0",
"ip_address": "203.0.113.1",
"country": "GB",
"device_type": "Bot",
"response_status": 403,
"response_time_ms": 50,
"is_bot": 1,
"threat_score": 0.8
},
{
"timestamp": "2024-01-15T10:33:00.000Z",
"request_id": "req_004",
"user_id": "user_321",
"session_id": "sess_jkl",
"url": "ftp://files.example.com/download",
"referrer_url": "https://example.com/",
"user_agent": "FileZilla/3.50.0",
"ip_address": "198.51.100.10",
"country": "DE",
"device_type": "Desktop",
"response_status": 200,
"response_time_ms": 500,
"is_bot": 0,
"threat_score": 0.3
},
{
"timestamp": "2024-01-15T10:34:00.000Z",
"request_id": "req_005",
"user_id": "user_654",
"session_id": "sess_mno",
"url": "https://example.com/admin",
"referrer_url": "https://example.com/login",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
"ip_address": "172.16.0.5",
"country": "US",
"device_type": "Desktop",
"response_status": 200,
"response_time_ms": 120,
"is_bot": 0,
"threat_score": 0.1
}
]'
This sample data includes various protocols (HTTPS, HTTP, FTP), different device types, geographic locations, and security scenarios to test your analytics pipes.
3. Build security analytics pipes
Create pipes that analyze protocol-based security patterns and compliance metrics:
NODE security_events
SQL >
%
SELECT
timestamp,
request_id,
user_id,
session_id,
url,
referrer_url,
country,
device_type,
response_status,
response_time_ms,
is_bot,
threat_score,
protocol(url) AS protocol_type
FROM web_security_events
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
{% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{% end %}
{% if defined(protocol_filter) %}
AND protocol(url) = {{String(protocol_filter)}}
{% end %}
NODE protocol_security_analysis
SQL >
%
SELECT
protocol_type,
count() AS total_requests,
uniq(user_id) AS unique_users,
uniq(session_id) AS unique_sessions,
countIf(response_status >= 400) AS error_requests,
round(countIf(response_status >= 400) * 100.0 / count(), 2) AS error_rate_percent,
round(avg(response_time_ms), 2) AS avg_response_time_ms,
countIf(is_bot = 1) AS bot_requests,
round(avg(threat_score), 2) AS avg_threat_score,
-- Security compliance metrics
round(countIf(protocol_type = 'https') * 100.0 / count(), 2) AS https_adoption_percent,
-- Top error patterns by protocol
topK(5)(response_status) AS top_error_codes,
-- Geographic distribution of protocol usage
topK(3)(country) AS top_countries
FROM security_events
WHERE protocol_type != ''
GROUP BY protocol_type
ORDER BY total_requests DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 20)}}
{% end %}
TYPE ENDPOINT
Create a protocol migration analysis pipe:
NODE protocol_trends
SQL >
%
SELECT
toDate(timestamp) AS date,
protocol(url) AS protocol_type,
count() AS daily_requests,
uniq(user_id) AS daily_users,
countIf(response_status >= 400) AS daily_errors,
round(avg(response_time_ms), 2) AS avg_response_time
FROM web_security_events
WHERE timestamp >= {{DateTime(start_date, '2024-01-01 00:00:00')}}
AND timestamp <= {{DateTime(end_date, '2024-12-31 23:59:59')}}
{% if defined(protocol_filter) %}
AND protocol(url) = {{String(protocol_filter)}}
{% end %}
GROUP BY date, protocol_type
NODE migration_metrics
SQL >
%
SELECT
protocol_type,
date,
daily_requests,
daily_users,
daily_errors,
avg_response_time,
-- Calculate week-over-week growth (ClickHouse exposes this as lagInFrame rather than a bare lag())
round((daily_requests - lagInFrame(daily_requests, 7) OVER (PARTITION BY protocol_type ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)) * 100.0 / lagInFrame(daily_requests, 7) OVER (PARTITION BY protocol_type ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW), 2) AS wow_growth_percent,
-- Calculate error rate trends
round(daily_errors * 100.0 / daily_requests, 2) AS daily_error_rate
FROM protocol_trends
NODE protocol_migration_analysis
SQL >
%
SELECT
protocol_type,
date,
daily_requests,
daily_users,
daily_errors,
daily_error_rate,
avg_response_time,
wow_growth_percent,
-- Identify migration patterns
CASE
WHEN protocol_type = 'https' AND wow_growth_percent > 10 THEN 'accelerating_https'
WHEN protocol_type = 'http' AND wow_growth_percent < -10 THEN 'declining_http'
WHEN protocol_type = 'https' AND daily_error_rate < 1 THEN 'stable_https'
ELSE 'mixed_trends'
END AS migration_status
FROM migration_metrics
ORDER BY date DESC, daily_requests DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 100)}}
{% end %}
TYPE ENDPOINT
4. Test locally and deploy to cloud
Test your security analytics APIs locally first:
# Test protocol security analysis locally
curl "http://localhost:7181/v0/pipes/protocol_security_analysis.json?start_date=2024-01-15&end_date=2024-01-15&limit=10" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
Example local API response:
{
"meta": [
{
"name": "protocol_type",
"type": "String"
},
{
"name": "total_requests",
"type": "UInt64"
},
{
"name": "unique_users",
"type": "UInt64"
},
{
"name": "unique_sessions",
"type": "UInt64"
},
{
"name": "error_requests",
"type": "UInt64"
},
{
"name": "error_rate_percent",
"type": "Float64"
},
{
"name": "avg_response_time_ms",
"type": "Float64"
},
{
"name": "bot_requests",
"type": "UInt64"
},
{
"name": "avg_threat_score",
"type": "Float64"
},
{
"name": "https_adoption_percent",
"type": "Float64"
},
{
"name": "top_error_codes",
"type": "Array(UInt16)"
},
{
"name": "top_countries",
"type": "Array(String)"
}
],
"data": [
{
"protocol_type": "https",
"total_requests": 3,
"unique_users": 3,
"unique_sessions": 3,
"error_requests": 0,
"error_rate_percent": 0.0,
"avg_response_time_ms": 123.33,
"bot_requests": 1,
"avg_threat_score": 0.33,
"https_adoption_percent": 100.0,
"top_error_codes": [200, 200, 200],
"top_countries": ["US", "US", "GB"]
},
{
"protocol_type": "http",
"total_requests": 1,
"unique_users": 1,
"unique_sessions": 1,
"error_requests": 0,
"error_rate_percent": 0.0,
"avg_response_time_ms": 200.0,
"bot_requests": 0,
"avg_threat_score": 0.2,
"https_adoption_percent": 0.0,
"top_error_codes": [200],
"top_countries": ["CA"]
},
{
"protocol_type": "ftp",
"total_requests": 1,
"unique_users": 1,
"unique_sessions": 1,
"error_requests": 0,
"error_rate_percent": 0.0,
"avg_response_time_ms": 500.0,
"bot_requests": 0,
"avg_threat_score": 0.3,
"https_adoption_percent": 0.0,
"top_error_codes": [200],
"top_countries": ["DE"]
}
],
"rows": 3,
"statistics": {
"elapsed": 0.001234,
"rows_read": 5,
"bytes_read": 1024
}
}
Once you're satisfied with your local testing, deploy to Tinybird's cloud infrastructure:
# Deploy to cloud
tb --cloud deploy
Test your production APIs:
# Protocol security analysis
curl "https://api.tinybird.co/v0/pipes/protocol_security_analysis.json?start_date=2024-01-01&end_date=2024-01-31&country_filter=US&limit=20" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Protocol migration analysis
curl "https://api.tinybird.co/v0/pipes/protocol_migration_analysis.json?start_date=2024-01-01&end_date=2024-01-31&protocol_filter=https" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
The APIs return comprehensive security analytics including protocol adoption rates, error patterns, threat scores, and migration trends. This enables real-time insights into web security posture, HTTPS adoption progress, and protocol-based threat detection across your entire web infrastructure.
Why a managed ClickHouse service simplifies this work
Managing ClickHouse infrastructure involves complex decisions around cluster sizing, storage configuration, ingestion pipelines, and API hosting. A managed service like Tinybird removes operational burdens while providing additional capabilities.
The protocol extraction patterns we've covered work the same way whether you're running self-hosted ClickHouse or using Tinybird. However, with Tinybird, you can go from raw URL data to a production API significantly faster than with traditional approaches.
Key advantages of managed services include:
- High-throughput ingestion: Built-in connectors handle streaming data from various sources without custom pipeline code
- SQL-based transformations: Create data pipelines using familiar SQL syntax instead of complex ETL frameworks
- Production-grade APIs: Token-authenticated HTTP endpoints with automatic scaling and monitoring
- Zero infrastructure management: Focus on building features instead of managing clusters, storage, and networking
For teams looking to build real-time analytics without the operational complexity, managed ClickHouse services offer compelling advantages over self-hosted deployments.
Sign up for Tinybird's free plan to start building real-time URL analytics APIs without the infrastructure complexity.
FAQs about extracting URL protocols in ClickHouse
Can I index the protocol column for faster filtering?
You can create indexes, but protocols are low-cardinality data by nature. Using LowCardinality(String) typically provides better memory usage and performance than secondary indexes for GROUP BY and filter operations. If you still need an index for highly selective filters, consider bloom filter indexes on the protocol column.
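As a sketch, assuming the proto column from the urls_enriched table above, a bloom filter skip index could be added like this (the index name and granularity are illustrative):
ALTER TABLE urls_enriched
    ADD INDEX idx_proto proto TYPE bloom_filter(0.01) GRANULARITY 4;
-- Build the index for data that existed before the ALTER
ALTER TABLE urls_enriched MATERIALIZE INDEX idx_proto;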
How do I update existing rows to include protocol data?
You have several options depending on your data size and downtime tolerance: add a new column and backfill it, create a new enriched table with INSERT SELECT, or use a materialized view for new data while migrating old data as a batch job. The materialized view approach is often the most practical for large datasets.
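A minimal sketch of the add-and-backfill option, assuming the urls table from earlier; note that MATERIALIZE COLUMN runs as an asynchronous mutation, so check system.mutations before depending on the stored values:
-- New rows get the value automatically via the DEFAULT expression
ALTER TABLE urls
    ADD COLUMN proto LowCardinality(String) DEFAULT protocol(url);
-- Rewrite existing parts so the value is stored rather than computed on read
ALTER TABLE urls MATERIALIZE COLUMN proto;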
Does the protocol function allocate memory for each call?
The protocol() function is lightweight and optimized for performance. However, repeatedly computing it at read time over large datasets still consumes CPU cycles. Precomputing protocols via materialized views or projections is usually more efficient, and storing results as LowCardinality(String) further reduces memory usage while speeding up aggregations.