When you need to extract a port number from a URL in ClickHouse, you use ClickHouse's built-in port() function.
Keep in mind that port() returns 0 when no explicit port exists, which means HTTP and HTTPS URLs without ports need special handling to return standard values of 80 and 443 respectively.
This guide covers ClickHouse's URL parsing functions, conditional logic for default ports, edge case handling, and performance optimization techniques for large-scale URL analysis.
Extracting port numbers with built-in functions
ClickHouse provides the port() function to directly extract port numbers from URLs, returning the port as a UInt16 value when explicitly specified, or 0 when no port is present. This built-in function handles URL parsing efficiently and covers most standard URL formats you'll encounter in log analysis or web traffic data.
The function works consistently across different URL types, though you'll want to understand its behavior with missing ports before building production queries.
urlExtractPort syntax
The port() function (also available as urlExtractPort()) takes a single string parameter containing the URL and returns the port number as UInt16:
SELECT port(url_string) AS extracted_port
Here's how it works with different URL formats:
SELECT
port('http://example.com:8080/path') AS explicit_port,
port('https://example.com/secure') AS no_port,
port('http://localhost:3000') AS localhost_port;
-- Results:
-- explicit_port: 8080
-- no_port: 0
-- localhost_port: 3000
Notice that URLs without explicit ports return 0, not NULL. This behavior becomes important when you're building conditional logic around port values.
Quick example query
When working with real URL data, you'll often have a mix of URLs with and without explicit ports. The function consistently returns 0 for URLs without ports, making it predictable for downstream logic:
WITH sample_urls AS (
SELECT arrayJoin([
'https://api.example.com:443/v1/data',
'http://localhost:8080/admin',
'https://secure.site.com/login',
'http://dev.internal:9000/metrics'
]) AS url
)
SELECT
url,
port(url) AS extracted_port
FROM sample_urls;
This query demonstrates the function's behavior across different scenarios—explicit HTTPS ports, development servers, production URLs without ports, and internal services with custom ports.
Returning default ports when none are present
Since port() returns 0 when no explicit port exists, you'll often want to substitute standard default ports based on the protocol. HTTP typically uses port 80, while HTTPS uses port 443, and you can implement this logic using conditional statements.
The key challenge here is determining which protocol you're dealing with so you can apply the right default.
Case for http and https
The most reliable approach uses a CASE statement that checks both the extracted port and the URL protocol to determine the appropriate default:
SELECT
url,
CASE
WHEN port(url) = 0 AND protocol(url) = 'https' THEN 443
WHEN port(url) = 0 AND protocol(url) = 'http' THEN 80
ELSE port(url)
END AS port_with_default
FROM your_table;
Alternatively, you can use ClickHouse's multiIf() function for a more compact syntax:
SELECT
url,
multiIf(
port(url) != 0, port(url),
protocol(url) = 'https', 443,
protocol(url) = 'http', 80,
port(url)
) AS port_with_default
FROM your_table;
Both approaches give you the same result, though multiIf() can be easier to read when you have multiple conditions to check.
Generic fallback using coalesce pattern
For more complex scenarios where you want custom defaults or handle additional protocols, you can combine nullIf() with coalesce() to create flexible fallback logic:
SELECT
url,
coalesce(
nullIf(port(url), 0),
multiIf(
protocol(url) = 'https', 443,
protocol(url) = 'http', 80,
protocol(url) = 'ftp', 21,
8080 -- Generic fallback
)
) AS port_with_custom_default
FROM your_table;
This pattern works well when you're dealing with diverse URL types and want different default behaviors for different protocols. The nullIf(port(url), 0) converts zero values to NULL, allowing coalesce() to fall through to your default logic.
Handling edge cases and invalid URLs
URL parsing can encounter various edge cases that affect port extraction, particularly with IPv6 addresses, malformed URLs, or URLs containing authentication information. Understanding these scenarios helps you build robust queries that handle real-world data.
Most of these edge cases won't break your queries, but they might return unexpected results if you're not prepared for them.
IPv6 and user info in authority
IPv6 addresses in URLs require square brackets and can complicate port extraction, but ClickHouse's port() function handles them correctly:
SELECT
port('http://[2001:db8::1]:8080/path') AS ipv6_port,
port('https://user:pass@example.com:9000/api') AS userinfo_port,
port('http://[::1]/local') AS ipv6_no_port;
-- Results:
-- ipv6_port: 8080
-- userinfo_port: 9000
-- ipv6_no_port: 0
The function correctly identifies ports even when userinfo (username:password) precedes the hostname, and it properly handles IPv6 literal addresses enclosed in brackets.
Malformed or empty strings
When working with log data or user input, you'll encounter invalid URLs that could cause issues. The port() function returns 0 for malformed URLs, but you might want explicit error handling:
SELECT
url,
CASE
WHEN url = '' OR url IS NULL THEN NULL
WHEN protocol(url) = '' THEN 0 -- Likely malformed
ELSE port(url)
END AS safe_port_extraction
FROM (
SELECT arrayJoin(['', 'not-a-url', 'http://valid.com:8080', NULL]) AS url
);
This approach distinguishes between legitimately missing ports and malformed input, which can be valuable for data quality monitoring.
Performance tips and anti-patterns
When extracting ports from large datasets, the choice of extraction method significantly impacts query performance. ClickHouse's built-in URL functions are optimized for speed and correctness, while alternatives like regular expressions carry performance penalties.
The performance difference becomes especially pronounced when you're processing millions of URLs in a single query.
Avoid regexpExtract for simple ports
While you might be tempted to use regular expressions for port extraction, the built-in port() function offers substantial performance advantages:
-- Slow: regex-based extraction
SELECT regexpExtract(url, ':(\d+)', 1) AS regex_port FROM large_table;
-- Fast: built-in function
SELECT port(url) AS builtin_port FROM large_table;
The performance difference becomes pronounced with large datasets because of several factors:
- Built-in optimization: Native URL parsing leverages optimized C++ code paths
- Error handling: Better handling of edge cases without regex complexity
- Type safety: Returns proper UInt16 values instead of strings
Materialized view on ingestion
For frequently queried port information, consider pre-computing port values during data ingestion rather than extracting them repeatedly at query time:
CREATE MATERIALIZED VIEW url_analysis_mv
ENGINE = MergeTree()
ORDER BY timestamp
AS SELECT
timestamp,
url,
port(url) AS extracted_port,
multiIf(
port(url) != 0, port(url),
protocol(url) = 'https', 443,
protocol(url) = 'http', 80,
port(url)
) AS port_with_default
FROM raw_logs;
This pattern works particularly well for analytics workloads where the same URL analysis happens repeatedly across different queries. You pay the extraction cost once during ingestion instead of every time you query the data.
Building a reusable port-extraction view or API endpoint
Creating reusable components for port extraction with default handling eliminates code duplication and ensures consistent logic across your analytics. You can implement this as a standard ClickHouse view or leverage Tinybird's pipe architecture for more flexible parameterization.
The choice between views and pipes often depends on whether you need parameterization and how you plan to expose the logic to other systems.
Creating a helper view in plain ClickHouse
A well-designed view encapsulates the port extraction logic and default handling, making it easy to reuse across different queries:
CREATE VIEW url_port_extractor AS
SELECT
url,
port(url) AS raw_port,
protocol(url) AS url_protocol,
multiIf(
port(url) != 0, port(url),
protocol(url) = 'https', 443,
protocol(url) = 'http', 80,
protocol(url) = 'ftp', 21,
protocol(url) = 'ssh', 22,
8080
) AS port_with_defaults,
CASE
WHEN port(url) = 0 THEN 'default'
ELSE 'explicit'
END AS port_type
FROM (
SELECT column_name AS url FROM your_source_table
);
This view provides multiple port-related fields that different queries can use as needed, while centralizing the extraction logic. You can then query this view like any other table, and ClickHouse will optimize the underlying operations.
Defining a Tinybird pipe for the same logic
Tinybird pipes offer more flexibility for parameterized port extraction, allowing you to customize default values based on query parameters:
NODE url_port_analysis
SQL >
%
SELECT
url,
port(url) AS raw_port,
protocol(url) AS url_protocol,
multiIf(
port(url) != 0, port(url),
protocol(url) = 'https', {{ Int16(https_default, 443) }},
protocol(url) = 'http', {{ Int16(http_default, 80) }},
{{ Int16(fallback_default, 8080) }}
) AS port_with_defaults
FROM {{ String(source_table, 'access_logs') }}
WHERE url IS NOT NULL
{\% if defined(protocol_filter) %}
AND protocol(url) = {{ String(protocol_filter) }}
{\% end %}
This pipe accepts parameters for custom default ports and optional protocol filtering, making it adaptable to different use cases while maintaining the core extraction logic. The parameterization means you can reuse the same pipe for different scenarios without duplicating code.
Exposing the query as an API endpoint with Tinybird
You can deploy your port extraction logic (plus any additional web analytics queries) into a production-ready API endpoint that applications can consume directly. Tinybird abstracts infrastructure management for ClickHouse, allowing developers to focus on query logic and API design.
Once you have your pipe working locally, getting it deployed as an API takes just a few commands.
Deploy and test the endpoint
To create an API endpoint from your pipe, you'll need to define it in a .pipe file with the TYPE ENDPOINT configuration. First, create your pipe file with the endpoint type:
NODE url_port_analysis
SQL >
%
SELECT
url,
port(url) AS raw_port,
protocol(url) AS url_protocol,
multiIf(
port(url) != 0, port(url),
protocol(url) = 'https', {{ Int16(https_default, 443) }},
protocol(url) = 'http', {{ Int16(http_default, 80) }},
{{ Int16(fallback_default, 8080) }}
) AS port_with_defaults
FROM {{ String(source_table, 'access_logs') }}
WHERE url IS NOT NULL
{\% if defined(protocol_filter) %}
AND protocol(url) = {{ String(protocol_filter) }}
{\% end %}
TYPE ENDPOINT
Then deploy your pipe to Tinybird:
# Deploy the pipe as an API endpoint
tb --cloud deploy
After deployment, you can test the endpoint with curl to verify it handles different URL formats correctly:
# Test with various URL types
curl "https://api.tinybird.co/v0/pipes/port_extractor_api.json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d "source_table=access_logs&https_default=8443"
Tinybird's managed service for ClickHouse reduces the need for manual infrastructure management. You can monitor API usage and performance through Tinybird's built-in observability features.
FAQs about extracting port numbers in ClickHouse
What value does urlExtractPort return when the URL has no port?
The port() function returns 0 (zero) as a UInt16 value when no explicit port is specified in the URL. This differs from returning NULL, which means you can safely use the result in mathematical operations or comparisons without additional null checking.
Does urlExtractPort support IPv6 literals?
Yes, the function correctly handles IPv6 addresses enclosed in square brackets within URLs. For example, port('http://[2001:db8::1]:8080/path') returns 8080, while port('http://[::1]/local') returns 0 since no port is specified.
Can I filter rows using urlExtractPort in a where clause?
Yes, you can use the port() function in WHERE clauses, but consider the performance implications for large datasets. For frequently filtered port values, consider creating a materialized column or using a materialized view to pre-compute port information during data ingestion rather than extracting it repeatedly at query time.
