When analyzing web traffic or API logs in ClickHouse, URLs with query parameters like https://example.com/products?color=blue&size=10
can create noise that makes it difficult to group and analyze page performance effectively. Query strings fragment what should be identical page views into separate entries, skewing your analytics and making trend analysis nearly impossible.
ClickHouse provides several built-in functions specifically designed to clean URLs and extract meaningful paths from messy web data. This guide covers the most efficient methods to strip query strings from URLs, when to apply URL cleaning in your data pipeline, and how to handle common edge cases that can trip up your analysis.
Quick answer: Strip the query string in one function
The fastest way to extract a URL path without the query string in ClickHouse is the `cutQueryString()` function. It removes the query string, everything from the `?` character onward (up to any `#` fragment), while keeping the protocol, domain, and path structure intact. To drop the fragment as well, use `cutQueryStringAndFragment()`.
SELECT cutQueryString('https://example.com/products/shoes?color=blue&size=10') AS clean_url;
-- Result: https://example.com/products/shoes
1. Use `cutQueryString(url)` on modern servers
The `cutQueryString()` function is your go-to solution for modern ClickHouse installations. It's built specifically for this purpose and handles edge cases automatically.
-- Basic usage
SELECT cutQueryString(url_column) AS clean_url
FROM analytics_events;
-- Web analytics grouping example
SELECT cutQueryString(page_url) AS page, COUNT(*) AS visits
FROM web_logs
GROUP BY page
ORDER BY visits DESC;
2. Fallback: `replaceRegexpAll` for older servers
If your ClickHouse version predates `cutQueryString()`, you can achieve a similar result with a regular expression. This approach works but requires more careful handling of edge cases.
SELECT replaceRegexpAll(url_column, '\\?.*$', '') AS clean_url
FROM your_table;
The regex `\\?.*$` matches a literal question mark followed by any characters until the end of the string.
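The fallback's behavior is easy to sanity-check outside the database. This Python sketch mirrors the same regex; note that because it deletes everything after the first `?`, it also removes any `#` fragment that follows the query string (unlike `cutQueryString()`, which leaves the fragment in place):

```python
import re

def strip_query(url: str) -> str:
    """Mirror the replaceRegexpAll fallback: delete everything from the
    first '?' to the end of the string. Note this also removes any
    '#fragment' that follows the query string."""
    return re.sub(r'\?.*$', '', url)

print(strip_query('https://example.com/products/shoes?color=blue&size=10'))
# https://example.com/products/shoes
```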
`path`, `pathFull`, and `cutQueryString` explained
ClickHouse offers three main functions for working with URL paths. Each serves different use cases, and understanding their differences helps you pick the right tool for your specific situation.
`path(url)` returns only the path
The `path()` function extracts just the path portion of a URL, excluding the protocol, domain, query parameters, and fragment.
SELECT path('https://shop.example.com/category/products?filter=new') AS url_path;
-- Result: /category/products
This function works well when you only care about the hierarchical structure within a website, not the full URL context.
`pathFull(url)` keeps query and fragment
The `pathFull()` function returns the path along with the query parameters and fragment. It's essentially everything after the domain name.
SELECT pathFull('https://shop.example.com/search?q=shoes#results') AS full_path;
-- Result: /search?q=shoes#results
Use this when you want to preserve query parameters but don't need the protocol and domain information.
`cutQueryString(url)` removes the query string
The `cutQueryString()` function keeps the complete URL structure while removing only the query parameters; any `#` fragment is left in place (use `cutQueryStringAndFragment()` to remove both). This preserves the most context while cleaning the URL.
Here's how all three functions compare:
| Function | Input | Output |
|---|---|---|
| `path()` | `https://site.com/page?q=1#top` | `/page` |
| `pathFull()` | `https://site.com/page?q=1#top` | `/page?q=1#top` |
| `cutQueryString()` | `https://site.com/page?q=1#top` | `https://site.com/page#top` |
| `cutQueryStringAndFragment()` | `https://site.com/page?q=1#top` | `https://site.com/page` |
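For sanity-checking pipeline output, the three functions can be approximated client-side. This Python sketch uses `urllib.parse` and assumes well-formed absolute URLs; ClickHouse itself is more permissive with malformed input, so treat these only as rough equivalents:

```python
from urllib.parse import urlsplit, urlunsplit

def ch_path(url: str) -> str:
    # path(): just the path component
    return urlsplit(url).path

def ch_path_full(url: str) -> str:
    # pathFull(): path plus query string and fragment
    p = urlsplit(url)
    return urlunsplit(('', '', p.path, p.query, p.fragment))

def ch_cut_query_string(url: str) -> str:
    # cutQueryString(): drop only the query string, keep everything else
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc, p.path, '', p.fragment))

url = 'https://site.com/page?q=1#top'
print(ch_path(url))              # /page
print(ch_path_full(url))         # /page?q=1#top
print(ch_cut_query_string(url))  # https://site.com/page#top
```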
Syntax and examples for URL path extraction
Understanding the exact syntax and behavior of ClickHouse URL functions helps you avoid common pitfalls. Let's look at the function signatures and some practical examples.
Function signatures and return types
All URL functions in ClickHouse accept a single String parameter and return a String result. They handle NULL values gracefully by returning NULL.
- Function signatures: `cutQueryString(url String) -> String`, `path(url String) -> String`, `pathFull(url String) -> String`
- Malformed input: if the input URL is malformed or empty, these functions typically return an empty string rather than throwing an error
Sample queries with expected output
Here are practical examples you might encounter when analyzing web traffic or API logs:
-- E-commerce product page analysis
SELECT
cutQueryString(page_url) AS clean_page,
COUNT(*) AS page_views,
COUNT(DISTINCT user_id) AS unique_visitors
FROM web_analytics
WHERE page_url LIKE '%/products/%'
GROUP BY clean_page
ORDER BY page_views DESC
LIMIT 10;
-- API endpoint grouping
SELECT
path(request_url) AS endpoint,
AVG(response_time_ms) AS avg_response_time
FROM api_logs
WHERE status_code = 200
GROUP BY endpoint
HAVING COUNT(*) > 1000;
These queries demonstrate how URL cleaning enables proper grouping and analysis of web traffic patterns.
Clean paths at ingest or at query time
You can apply URL cleaning at two different stages in your data pipeline. The choice depends on your query patterns, data volume, and performance requirements.
Materialized view pattern for streaming ingest
Processing URLs during data ingestion creates cleaner datasets and faster queries. Materialized views automatically transform incoming data as it arrives.
-- Create a materialized view that cleans URLs on insert
CREATE MATERIALIZED VIEW clean_web_logs_mv
TO clean_web_logs
AS SELECT
timestamp,
user_id,
cutQueryString(raw_url) AS clean_url,
referrer,
user_agent
FROM raw_web_logs;
This approach works well when you consistently need clean URLs across multiple queries and have high query volume on the same dataset.
On-demand transformation in analytics queries
Applying URL functions directly in SELECT statements offers more flexibility but requires more computation at query time.
-- Transform URLs during analysis
SELECT
cutQueryString(page_url) AS page,
DATE(timestamp) AS date,
COUNT(*) AS daily_visits
FROM web_logs
WHERE timestamp >= today() - INTERVAL 30 DAY
GROUP BY page, date
ORDER BY date DESC, daily_visits DESC;
Choose this approach when you need different URL transformations for different analyses or when storage space is a concern.
Handle trailing slash and case sensitivity
Real-world URLs often have inconsistencies that can skew your analysis. Combining URL cleaning with normalization techniques produces more accurate results.
Canonicalize trailing slash for deduplication
URLs like `/products` and `/products/` typically represent the same page but appear as separate entries in your data. Combining `cutQueryString()` with `trimRight()` standardizes these variations.
-- Remove query strings and trailing slashes
SELECT
trimRight(cutQueryString(page_url), '/') AS canonical_page,
COUNT(*) AS total_visits
FROM web_logs
GROUP BY canonical_page
ORDER BY total_visits DESC;
This technique is particularly important for content management systems and e-commerce platforms where both URL formats are valid.
Lowercase paths for consistent grouping
Some web servers treat URLs as case-sensitive while others don't. Converting cleaned URLs to lowercase ensures consistent grouping regardless of how users typed the URL.
-- Case-insensitive URL analysis
SELECT
lower(cutQueryString(request_url)) AS normalized_url,
COUNT(*) AS request_count,
AVG(response_size_bytes) AS avg_response_size
FROM access_logs
GROUP BY normalized_url
HAVING request_count > 100;
Apply this normalization when analyzing user-generated URLs or when your application accepts URLs in mixed case.
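The full normalization chain can also be validated outside the database. This client-side Python sketch approximates `lower(trimRight(cutQueryString(url), '/'))` from the queries above (it additionally drops any fragment, and assumes well-formed URLs):

```python
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    """Approximate the SQL normalization above: strip the query string
    (and fragment), trim the trailing slash, and lowercase the result."""
    p = urlsplit(url)
    base = f'{p.scheme}://{p.netloc}{p.path}' if p.scheme else p.path
    return base.rstrip('/').lower()

print(canonicalize('https://MySite.com/Products/?color=Blue&size=10'))
# https://mysite.com/products
```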
Performance benchmarks on billion-row tables
URL processing performance becomes critical when working with large datasets. Understanding the computational costs helps you optimize your queries and choose the right approach for your specific situation.
Profiling `cutQueryString` vs. regex
Built-in ClickHouse functions like `cutQueryString()` are optimized for performance and typically outperform regex-based alternatives, especially on large datasets.
-- Performance test: built-in function
SELECT COUNT(DISTINCT cutQueryString(url))
FROM large_url_table;
-- Typical performance: ~50M rows/second
-- Performance test: regex alternative
SELECT COUNT(DISTINCT replaceRegexpAll(url, '\\?.*$', ''))
FROM large_url_table;
-- Typical performance: ~25M rows/second
The performance difference becomes more pronounced with complex URLs and when processing hundreds of millions of rows.
Memory and CPU considerations
URL functions have minimal memory overhead since they process strings in place without creating large intermediate results. However, the choice of function affects CPU usage patterns.
- `cutQueryString()`: minimal CPU overhead, optimized string scanning
- `path()`: moderate CPU usage, requires URL parsing
- Regex functions: higher CPU usage, complex pattern matching
For high-throughput applications, consider preprocessing URLs during ingestion rather than applying functions to every query.
Build clean path web analytics APIs with Tinybird
Once you've learned how to clean and analyze URL paths in ClickHouse, Tinybird makes it simple to deploy your web analytics queries as production-ready API endpoints. This approach lets you serve processed data directly to applications without exposing your database.
Step 1: Create a data source for web analytics events
First, create a data source to store web analytics events with full URLs. Create a `web_analytics.datasource` file:
SCHEMA >
`timestamp` DateTime64(3, 'UTC'),
`session_id` String,
`user_id` String,
`page_url` String,
`referrer` String,
`user_agent` String,
`country` LowCardinality(String),
`device_type` LowCardinality(String)
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id"
Deploy the data source:
tb --cloud deploy
Ingest sample web analytics data using the Events API:
curl -X POST \
-H "Authorization: Bearer $TINYBIRD_TOKEN" \
-H "Content-Type: application/json" \
"https://api.tinybird.co/v0/events?name=web_analytics" \
-d '[
{"timestamp": "2024-12-01 14:30:00", "session_id": "sess_001", "user_id": "usr_123", "page_url": "https://mysite.com/products/shoes?color=red&size=10", "referrer": "https://google.com", "user_agent": "Mozilla/5.0", "country": "US", "device_type": "desktop"},
{"timestamp": "2024-12-01 14:31:00", "session_id": "sess_002", "user_id": "usr_456", "page_url": "https://mysite.com/blog/analytics-guide?utm_source=newsletter&utm_campaign=dec2024", "referrer": "https://newsletter.com", "user_agent": "Mozilla/5.0", "country": "UK", "device_type": "mobile"},
{"timestamp": "2024-12-01 14:32:00", "session_id": "sess_003", "user_id": "usr_789", "page_url": "https://mysite.com/contact?ref=footer", "referrer": "https://mysite.com/about", "user_agent": "Mozilla/5.0", "country": "DE", "device_type": "tablet"}
]'
Step 2: Create a pageviews by path API endpoint
Create a pipe that extracts clean URL paths and aggregates pageviews. Create `pageviews_by_path.pipe`:
TOKEN "pageviews_read" READ
NODE pageviews_analysis
SQL >
%
SELECT
cutQueryString(page_url) AS page_path,
count() AS pageviews,
uniq(session_id) AS unique_sessions,
uniq(user_id) AS unique_users,
round(count() / uniq(session_id), 2) AS pages_per_session
FROM web_analytics
WHERE timestamp >= {{DateTime(start_date, '2024-12-01 00:00:00')}}
AND timestamp < {{DateTime(end_date, '2024-12-02 00:00:00')}}
{% if defined(path_filter) %}
AND cutQueryString(page_url) LIKE {{String(path_filter, '%')}}
{% end %}
{% if defined(country_filter) %}
AND country = {{String(country_filter)}}
{% end %}
GROUP BY page_path
ORDER BY pageviews DESC
{% if defined(limit) %}
LIMIT {{Int32(limit, 20)}}
{% end %}
TYPE ENDPOINT
Deploy the API endpoint:
tb --cloud deploy
Step 3: Query the pageviews API
Your clean path analytics API is now available. Query it for different use cases:
# Get top pages for yesterday
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-01" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Filter by specific path patterns
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-01&path_filter=%/products/%" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
# Analyze blog posts with country filter
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-01&path_filter=%/blog/%&country_filter=US" \
-H "Authorization: Bearer $TINYBIRD_TOKEN"
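The same queries can be issued from application code. This hypothetical Python helper builds the request URL the curl examples use; the base URL and parameter names come from the `pageviews_by_path` pipe above, and `urlencode` handles percent-escaping the LIKE wildcards (token handling is omitted):

```python
from urllib.parse import urlencode

# Base URL of the pipe endpoint defined above (hypothetical deployment)
BASE_URL = 'https://api.tinybird.co/v0/pipes/pageviews_by_path.json'

def pageviews_url(start_date: str, end_date: str, **filters: str) -> str:
    # urlencode percent-escapes LIKE wildcards such as '%' and '/' automatically
    params = {'start_date': start_date, 'end_date': end_date, **filters}
    return f'{BASE_URL}?{urlencode(params)}'

print(pageviews_url('2024-12-01', '2024-12-01',
                    path_filter='%/blog/%', country_filter='US'))
```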
The API returns clean URL paths without query parameters, enabling clear pageview analysis:
{
"meta": [
{"name": "page_path", "type": "String"},
{"name": "pageviews", "type": "UInt64"},
{"name": "unique_sessions", "type": "UInt64"},
{"name": "unique_users", "type": "UInt64"},
{"name": "pages_per_session", "type": "Float64"}
],
"data": [
{
"page_path": "https://mysite.com/products/shoes",
"pageviews": 1,
"unique_sessions": 1,
"unique_users": 1,
"pages_per_session": 1.0
},
{
"page_path": "https://mysite.com/blog/analytics-guide",
"pageviews": 1,
"unique_sessions": 1,
"unique_users": 1,
"pages_per_session": 1.0
}
],
"rows": 2,
"statistics": {
"elapsed": 0.001,
"rows_read": 3,
"bytes_read": 512
}
}
This approach provides:
- Clean URL grouping by removing query parameters with `cutQueryString()`
- Flexible time range filtering with parameterized start and end dates
- Path pattern matching for analyzing specific sections of your site
- Rich analytics metrics including unique sessions and engagement ratios
This pattern works particularly well for real-time dashboards and applications that need fresh analytics data.
Build ClickHouse-based APIs with Tinybird
ClickHouse is the perfect database for analyzing web clickstream data. As your data grows and your analytics requirements become more complex, however, managing ClickHouse infrastructure can consume significant engineering resources.
Tinybird provides a managed ClickHouse with developer-focused tooling that reduces the infrastructure overhead associated with managing ClickHouse clusters. The platform includes built-in scaling and monitoring capabilities as part of its managed service, as well as API generation.
Instead of spending time on database administration, cluster management, and performance tuning, you can focus on building the analytics features your users actually want. Start building with Tinybird's free plan and deploy your first API quickly.
You can sign up for Tinybird for free here.
FAQs about extracting URL paths in ClickHouse
What happens if the URL has no path?
The `path()` function returns `/` for URLs without an explicit path (like `https://example.com`), while `cutQueryString()` preserves the original URL structure without the query string. For domain-only URLs, `cutQueryString()` simply returns the original URL, since there's no query string to remove.
Can I strip both query string and fragment in one call?
Not with `cutQueryString()` alone — it removes only the query string and leaves any `#` fragment in place. Use `cutQueryStringAndFragment()` to drop both in a single call, or combine `cutQueryString()` with `cutFragment()` when you need more granular control.