---
title: "How to extract URL paths without query strings in ClickHouse®"
excerpt: "Strip URL query strings in ClickHouse with one function. Stop writing regex hacks. This approach is cleaner and faster."
authors: "Cameron Archer"
categories: "AI Resources"
createdOn: "2025-09-30 17:17:26"
publishedOn: "2025-09-30 17:17:26"
updatedOn: "2025-09-30 17:17:26"
status: "published"
---

When analyzing web traffic or API logs in ClickHouse{% sup %}®{% /sup %}, URLs with query parameters like `https://example.com/products?color=blue&size=10` can create noise that makes it difficult to group and analyze page performance effectively. Query strings fragment what should be identical page views into separate entries, skewing your analytics and making trend analysis nearly impossible.

ClickHouse{% sup %}®{% /sup %} provides several built-in functions specifically designed to clean URLs and extract meaningful paths from messy web data. This guide covers the most efficient methods to strip query strings from URLs, when to apply URL cleaning in your data pipeline, and how to handle common edge cases that can trip up your analysis.

## Quick answer: Strip the query string in one function

The fastest way to extract a URL path without the query string in ClickHouse{% sup %}®{% /sup %} is the `cutQueryString()` function. It removes the query string, including the `?` character itself, while keeping the protocol, domain, and path structure intact. Note that it leaves any `#` fragment in place; use `cutQueryStringAndFragment()` when you want both removed.

```sql
SELECT cutQueryString('https://example.com/products/shoes?color=blue&size=10') AS clean_url;
-- Result: https://example.com/products/shoes
```

### 1. Use `cutQueryString(url)` for versions ≥21.8

The `cutQueryString()` function is your go-to solution for modern ClickHouse{% sup %}®{% /sup %} installations. It's built specifically for this purpose and handles edge cases automatically.

```sql
-- Basic usage
SELECT cutQueryString(url_column) AS clean_url
FROM analytics_events;

-- Web analytics grouping example
SELECT cutQueryString(page_url) AS page, COUNT(*) AS visits
FROM web_logs
GROUP BY page
ORDER BY visits DESC;
```

### 2. Fallback `replaceRegexpAll` for older servers

If you're running a ClickHouse{% sup %}®{% /sup %} version before 21.8, you can get much the same result with a regular expression. This approach works, but it also strips any `#` fragment (which `cutQueryString()` preserves) and requires more careful handling of edge cases.

```sql
SELECT replaceRegexpAll(url_column, '\\?.*$', '') AS clean_url
FROM your_table;
```

The regex `\\?.*$` matches a literal question mark followed by any characters until the end of the string.
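On a URL that carries a fragment, the two approaches diverge: the regex removes everything from the `?` to the end of the string, fragment included, while the built-in function cuts only the query string. A quick comparison:

```sql
SELECT
    replaceRegexpAll('https://site.com/page?q=1#top', '\\?.*$', '') AS regex_cut,
    cutQueryString('https://site.com/page?q=1#top') AS builtin_cut;
-- regex_cut:   https://site.com/page
-- builtin_cut: https://site.com/page#top
```

If you want the regex fallback to match the built-in's behavior exactly, you would need a pattern that stops at the `#` rather than running to the end of the string.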

## `path`, `pathFull`, and `cutQueryString` explained

ClickHouse{% sup %}®{% /sup %} offers three main functions for working with URL paths. Each serves different use cases, and understanding their differences helps you pick the right tool for your specific situation.

### `path(url)` returns only the path

The `path()` function extracts just the path portion of a URL, excluding the protocol, domain, query parameters, and fragments.

```sql
SELECT path('https://shop.example.com/category/products?filter=new') AS url_path;
-- Result: /category/products
```

This function works well when you only care about the hierarchical structure within a website, not the full URL context.

### `pathFull(url)` keeps query and fragment

The `pathFull()` function returns the path along with query parameters and fragments. It's essentially everything after the domain name.

```sql
SELECT pathFull('https://shop.example.com/search?q=shoes#results') AS full_path;
-- Result: /search?q=shoes#results
```

Use this when you want to preserve query parameters but don't need the protocol and domain information.
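If you need the individual pieces rather than the combined tail, ClickHouse{% sup %}®{% /sup %} also provides `queryString()` and `fragment()`, which return each part without its leading `?` or `#`:

```sql
SELECT
    queryString('https://shop.example.com/search?q=shoes#results') AS qs,
    fragment('https://shop.example.com/search?q=shoes#results') AS frag;
-- qs:   q=shoes
-- frag: results
```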

### `cutQueryString(url)` removes the query string

The `cutQueryString()` function keeps the complete URL structure while removing only the query string. Any `#` fragment survives the cut; reach for `cutQueryStringAndFragment()` when you want both gone. This preserves the most context while cleaning the URL.

Here's how all three functions compare:

| Function | Input | Output |
| -------- | ----- | ------ |
| `path()` | `https://site.com/page?q=1#top` | `/page` |
| `pathFull()` | `https://site.com/page?q=1#top` | `/page?q=1#top` |
| `cutQueryString()` | `https://site.com/page?q=1#top` | `https://site.com/page#top` |
| `cutQueryStringAndFragment()` | `https://site.com/page?q=1#top` | `https://site.com/page` |
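For URLs that carry a `#` fragment, `cutQueryStringAndFragment()` strips both pieces in one call. A quick side-by-side:

```sql
SELECT
    cutQueryString('https://site.com/page?q=1#top') AS no_query,
    cutQueryStringAndFragment('https://site.com/page?q=1#top') AS no_query_or_fragment;
-- no_query:             https://site.com/page#top
-- no_query_or_fragment: https://site.com/page
```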

## Syntax and examples for URL path extraction

Understanding the exact syntax and behavior of ClickHouse{% sup %}®{% /sup %} URL functions helps you avoid common pitfalls. Let's look at the function signatures and some practical examples.

### Function signatures and return types

All URL functions in ClickHouse{% sup %}®{% /sup %} accept a single String parameter and return a String result. They handle NULL values gracefully by returning NULL.

- **Function signatures:** `cutQueryString(url String) -> String`, `path(url String) -> String`, `pathFull(url String) -> String`
- **NULL handling:** If the input URL is malformed or empty, these functions typically return an empty string rather than throwing an error
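Because these functions cut on delimiters rather than fully validating the URL, they degrade gracefully on odd inputs. A small probe you can run to confirm the behavior on your server version (outputs intentionally omitted, since they can vary by version):

```sql
SELECT
    cutQueryString('') AS empty_input,
    cutQueryString('no-scheme/path?x=1') AS scheme_less,
    path('https://example.com') AS domain_only;
```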

### Sample queries with expected output

Here are practical examples you might encounter when analyzing web traffic or API logs:

```sql
-- E-commerce product page analysis
SELECT
    cutQueryString(page_url) AS clean_page,
    COUNT(*) AS page_views,
    COUNT(DISTINCT user_id) AS unique_visitors
FROM web_analytics
WHERE page_url LIKE '%/products/%'
GROUP BY clean_page
ORDER BY page_views DESC
LIMIT 10;

-- API endpoint grouping
SELECT
    path(request_url) AS endpoint,
    AVG(response_time_ms) AS avg_response_time
FROM api_logs
WHERE status_code = 200
GROUP BY endpoint
HAVING COUNT(*) > 1000;
```

These queries demonstrate how URL cleaning enables proper grouping and analysis of web traffic patterns.

## Clean paths at ingest or at query time

You can apply URL cleaning at two different stages in your data pipeline. The choice depends on your query patterns, data volume, and performance requirements.

### Materialized view pattern for streaming ingest

Processing URLs during data ingestion creates cleaner datasets and faster queries. [Materialized views](https://tinybird.co/docs/forward/work-with-data/optimize/materialized-views) automatically transform incoming data as it arrives.

```sql
-- Create a materialized view that cleans URLs on insert
CREATE MATERIALIZED VIEW clean_web_logs_mv
TO clean_web_logs
AS SELECT
    timestamp,
    user_id,
    cutQueryString(raw_url) AS clean_url,
    referrer,
    user_agent
FROM raw_web_logs;
```

This approach works well when you consistently need clean URLs across multiple queries and have high query volume on the same dataset.
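The `TO clean_web_logs` target table must exist before the materialized view is created. Here's a minimal sketch, with column types assumed to mirror the source table:

```sql
-- Target table that receives the cleaned rows from the materialized view
CREATE TABLE clean_web_logs
(
    timestamp DateTime,
    user_id String,
    clean_url String,
    referrer String,
    user_agent String
)
ENGINE = MergeTree
ORDER BY (timestamp, user_id);
```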

### On-demand transformation in analytics queries

Applying URL functions directly in SELECT statements offers more flexibility but requires more computation at query time.

```sql
-- Transform URLs during analysis
SELECT
    cutQueryString(page_url) AS page,
    DATE(timestamp) AS date,
    COUNT(*) AS daily_visits
FROM web_logs
WHERE timestamp >= today() - INTERVAL 30 DAY
GROUP BY page, date
ORDER BY date DESC, daily_visits DESC;
```

Choose this approach when you need different URL transformations for different analyses or when storage space is a concern.

## Handle trailing slash and case sensitivity

Real-world URLs often have inconsistencies that can skew your analysis. Combining URL cleaning with normalization techniques produces more accurate results.

### Canonicalize trailing slash for deduplication

URLs like `/products` and `/products/` typically represent the same page but appear as separate entries in your data. Combining `cutQueryString()` with `trimRight()` standardizes these variations.

```sql
-- Remove query strings and trailing slashes
SELECT
    trimRight(cutQueryString(page_url), '/') AS canonical_page,
    COUNT(*) AS total_visits
FROM web_logs
GROUP BY canonical_page
ORDER BY total_visits DESC;
```

This technique is particularly important for content management systems and e-commerce platforms where both URL formats are valid.
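One edge case worth knowing: `trimRight(..., '/')` also removes the slash from a bare root URL, so the homepage `https://site.com/` and the host `https://site.com` collapse to the same entry — usually what you want for grouping, but easy to miss:

```sql
SELECT
    trimRight('https://site.com/products/', '/') AS page_with_slash,
    trimRight('https://site.com/', '/') AS bare_root;
-- page_with_slash: https://site.com/products
-- bare_root:       https://site.com
```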

### Lowercase paths for consistent grouping

Some web servers treat URLs as case-sensitive while others don't. Converting cleaned URLs to lowercase ensures consistent grouping regardless of how users typed the URL.

```sql
-- Case-insensitive URL analysis
SELECT
    lower(cutQueryString(request_url)) AS normalized_url,
    COUNT(*) AS request_count,
    AVG(response_size_bytes) AS avg_response_size
FROM access_logs
GROUP BY normalized_url
HAVING request_count > 100;
```

Apply this normalization when analyzing user-generated URLs or when your application accepts URLs in mixed case.

## Performance benchmarks on billion-row tables

URL processing performance becomes critical when working with large datasets. Understanding the computational costs helps you optimize your queries and choose the right approach for your specific situation.

### Profiling `cutQueryString` vs. regex

Built-in ClickHouse{% sup %}®{% /sup %} functions like `cutQueryString()` are optimized for performance and typically outperform regex-based alternatives, especially on large datasets.

```sql
-- Performance test: built-in function
SELECT COUNT(DISTINCT cutQueryString(url))
FROM large_url_table;
-- Typical performance: ~50M rows/second

-- Performance test: regex alternative
SELECT COUNT(DISTINCT replaceRegexpAll(url, '\\?.*$', ''))
FROM large_url_table;
-- Typical performance: ~25M rows/second
```

The performance difference becomes more pronounced with complex URLs and when processing hundreds of millions of rows.

### Memory and CPU considerations

URL functions have low memory overhead since each call produces a single output string per row without large intermediate allocations. However, the choice of function affects CPU usage patterns.

- **`cutQueryString()`:** Minimal CPU overhead, optimized string scanning
- **`path()`:** Moderate CPU usage, requires URL parsing
- **Regex functions:** Higher CPU usage, complex pattern matching

For high-throughput applications, consider preprocessing URLs during ingestion rather than applying functions to every query.

## Build clean path web analytics APIs with Tinybird

Once you've learned how to clean and analyze URL paths in ClickHouse{% sup %}®{% /sup %}, [Tinybird](https://www.tinybird.co) makes it simple to deploy your web analytics queries as production-ready [API endpoints](https://tinybird.co/docs/forward/work-with-data/publish-data/endpoints). This approach lets you serve processed data directly to applications without exposing your database.

### Step 1: Create a data source for web analytics events

First, create a data source to store web analytics events with full URLs. Create a `web_analytics.datasource` file:

```tinybird
SCHEMA >
    `timestamp` DateTime64(3, 'UTC'),
    `session_id` String,
    `user_id` String,
    `page_url` String,
    `referrer` String,
    `user_agent` String,
    `country` LowCardinality(String),
    `device_type` LowCardinality(String)

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id"
```

Deploy the data source:

```bash
tb --cloud deploy
```

Ingest sample web analytics data using the [Events API](https://www.tinybird.co/docs/forward/get-data-in/events-api):

```bash
curl -X POST \
  -H "Authorization: Bearer $TINYBIRD_TOKEN" \
  -H "Content-Type: application/json" \
  "https://api.tinybird.co/v0/events?name=web_analytics" \
  -d '[
    {"timestamp": "2024-12-01 14:30:00", "session_id": "sess_001", "user_id": "usr_123", "page_url": "https://mysite.com/products/shoes?color=red&size=10", "referrer": "https://google.com", "user_agent": "Mozilla/5.0", "country": "US", "device_type": "desktop"},
    {"timestamp": "2024-12-01 14:31:00", "session_id": "sess_002", "user_id": "usr_456", "page_url": "https://mysite.com/blog/analytics-guide?utm_source=newsletter&utm_campaign=dec2024", "referrer": "https://newsletter.com", "user_agent": "Mozilla/5.0", "country": "UK", "device_type": "mobile"},
    {"timestamp": "2024-12-01 14:32:00", "session_id": "sess_003", "user_id": "usr_789", "page_url": "https://mysite.com/contact?ref=footer", "referrer": "https://mysite.com/about", "user_agent": "Mozilla/5.0", "country": "DE", "device_type": "tablet"}
  ]'
```

### Step 2: Create a pageviews by path API endpoint

Create a pipe that extracts clean URL paths and aggregates pageviews. Create `pageviews_by_path.pipe`:

```tinybird
TOKEN "pageviews_read" READ

NODE pageviews_analysis
SQL >
    %
    SELECT
        cutQueryString(page_url) AS page_path,
        count() AS pageviews,
        uniq(session_id) AS unique_sessions,
        uniq(user_id) AS unique_users,
        round(count() / uniq(session_id), 2) AS pages_per_session
    FROM web_analytics
    WHERE timestamp >= {{DateTime(start_date, '2024-12-01 00:00:00')}}
      AND timestamp < {{DateTime(end_date, '2024-12-02 00:00:00')}}
    {% if defined(path_filter) %}
      AND cutQueryString(page_url) LIKE {{String(path_filter, '%')}}
    {% end %}
    {% if defined(country_filter) %}
      AND country = {{String(country_filter)}}
    {% end %}
    GROUP BY page_path
    ORDER BY pageviews DESC
    {% if defined(limit) %}
    LIMIT {{Int32(limit, 20)}}
    {% end %}

TYPE ENDPOINT
```

Deploy the API endpoint:

```bash
tb --cloud deploy
```

### Step 3: Query the pageviews API

Your clean path analytics API is now available. Query it for different use cases:

```bash
# Get top pages for a single day (the end date is exclusive)
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-02" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"

# Filter by specific path patterns (LIKE wildcards are URL-encoded as %25)
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-02&path_filter=%25/products/%25" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"

# Analyze blog posts with country filter
curl "https://api.tinybird.co/v0/pipes/pageviews_by_path.json?start_date=2024-12-01&end_date=2024-12-02&path_filter=%25/blog/%25&country_filter=US" \
  -H "Authorization: Bearer $TINYBIRD_TOKEN"
```

The API returns clean URL paths without query parameters, enabling clear pageview analysis:

```json
{
    "meta": [
        {"name": "page_path", "type": "String"},
        {"name": "pageviews", "type": "UInt64"},
        {"name": "unique_sessions", "type": "UInt64"},
        {"name": "unique_users", "type": "UInt64"},
        {"name": "pages_per_session", "type": "Float64"}
    ],
    "data": [
        {
            "page_path": "https://mysite.com/products/shoes",
            "pageviews": 1,
            "unique_sessions": 1,
            "unique_users": 1,
            "pages_per_session": 1.0
        },
        {
            "page_path": "https://mysite.com/blog/analytics-guide",
            "pageviews": 1,
            "unique_sessions": 1,
            "unique_users": 1,
            "pages_per_session": 1.0
        }
    ],
    "rows": 2,
    "statistics": {
        "elapsed": 0.001,
        "rows_read": 3,
        "bytes_read": 512
    }
}
```

This approach provides:

- **Clean URL grouping** by removing query parameters with `cutQueryString()`
- **Flexible time range filtering** with parameterized start and end dates
- **Path pattern matching** for analyzing specific sections of your site
- **Rich analytics metrics** including unique sessions and engagement ratios

This pattern works particularly well for real-time dashboards and applications that need fresh analytics data.

## Build ClickHouse{% sup %}®{% /sup %}-based APIs with Tinybird

ClickHouse{% sup %}®{% /sup %} is the perfect database for analyzing web clickstream data. As your data grows and your analytics requirements become more complex, however, managing ClickHouse{% sup %}®{% /sup %} infrastructure can consume significant engineering resources.

Tinybird provides a [managed ClickHouse{% sup %}®{% /sup %}](https://www.tinybird.co/clickhouse) with developer-focused tooling that reduces the infrastructure overhead associated with managing ClickHouse{% sup %}®{% /sup %} clusters. The platform includes built-in scaling and monitoring capabilities as part of its managed service, as well as API generation.

Instead of spending time on database administration, cluster management, and performance tuning, you can focus on building the analytics features your users actually want. Start building with [Tinybird's free plan](https://www.tinybird.co/blog-posts/tinybird-vs-clickhouse) and deploy your first API quickly.

[Sign up for Tinybird for free here](https://cloud.tinybird.co/signup).

## Additional resources

- [Functions for Working with URLs](https://clickhouse.com/docs/en/sql-reference/functions/url-functions)
- [String Functions](https://clickhouse.com/docs/en/sql-reference/functions/string-functions)
- [Tinybird Web Analytics Starter Kit](https://www.tinybird.co/templates/web-analytics-starter-kit)

## FAQs about extracting URL paths in ClickHouse{% sup %}®{% /sup %}

### What happens if the URL has no path?

The `path()` function returns `/` for URLs without explicit paths (like `https://example.com`), while `cutQueryString()` preserves the original URL structure without the query string. For domain-only URLs, `cutQueryString()` simply returns the original URL since there's no query string to remove.

### Can I strip both query string and fragment in one call?

Yes, with `cutQueryStringAndFragment()`, which removes the query string and the fragment identifier (along with their `?` and `#` markers) in a single call. `cutQueryString()` on its own removes only the query string and leaves the fragment in place, so chain it with `cutFragment()` only if the combined function isn't available on your version.
