---
title: "Real time error monitoring — 3 ways to implement in 2026"
excerpt: "Real time error monitoring options compared: Sentry and ELK stack, ClickHouse® with Grafana for high-volume logs, and Tinybird API-first for embedded error analytics. Pick the right architecture."
authors: "Tinybird"
categories: "AI Resources"
createdOn: "2026-04-13 00:00:00"
publishedOn: "2026-04-13 00:00:00"
updatedOn: "2026-04-13 00:00:00"
status: "published"
---

These are the main options for **real time error monitoring**:

1. **Sentry / ELK stack** — developer-facing error tracking and log aggregation
2. **ClickHouse® + Grafana** — high-volume error analytics with live dashboards and alerting
3. **Tinybird API-first** — error ingestion via Events API with REST endpoints for dashboards and alerting

**Real time error monitoring** means capturing errors and exceptions as they occur, alerting on regressions within seconds, and making the data queryable for debugging and trend analysis.

The right approach depends on volume, audience, and whether you need the error data to power product features. Sentry is the standard for developer-facing error tracking at moderate volume. ClickHouse® + Grafana handles [streaming data](https://www.ibm.com/think/topics/streaming-data) at high scale. Tinybird adds an API layer when the same data needs to power both monitoring and product-facing features.

Before picking, answer: What is your **error event volume**? Who is the primary consumer — developers, SRE dashboards, or product UIs? Do you need **custom SQL** to correlate errors with business metrics?

## **Three ways to implement real time error monitoring**

### **Option 1: Sentry / ELK stack**

The most widely adopted approach for developer-facing **real time error monitoring**. **Sentry** captures exceptions with full stack traces, breadcrumbs, and user context. The **ELK stack** (Elasticsearch, Logstash, Kibana) aggregates logs from multiple services and makes them searchable.

**How it works (Sentry):** install the Sentry SDK in your application. Sentry captures unhandled exceptions and manual `captureException` calls, enriches them with context, and sends them to Sentry's ingest pipeline.

**Sentry SDK setup (Python/FastAPI example):**

```python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

sentry_sdk.init(
    dsn="https://your-dsn@sentry.io/your-project",
    integrations=[FastApiIntegration(), SqlalchemyIntegration()],
    traces_sample_rate=0.1,        # 10% of transactions for performance monitoring
    profiles_sample_rate=0.01,     # 1% profiling
    environment="production",
    release="myapp@1.2.3",
    before_send=lambda event, hint: event  # filter or enrich events here
)
```

**Manual error capture with custom context:**

```python
try:
    charge_order(order, payload)  # any operation that can raise
except Exception as e:
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("order_id", order.id)
        scope.set_tag("user_tier", user.tier)
        scope.set_extra("request_payload", payload)
        sentry_sdk.capture_exception(e)
```

**ELK stack: Filebeat shipping application logs:**

```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    fields:
      service: myapp
      env: production
    json.keys_under_root: true
    json.add_error_key: true

output.logstash:
  hosts: ["logstash-host:5044"]
```

**When this fits:**

- Your error volume is moderate (under ~10,000 events/second)
- The primary consumers are **developers** who need stack traces, breadcrumbs, and release-correlated errors
- You want out-of-the-box alerting, issue grouping, and integrations with GitHub, Slack, and PagerDuty

**Trade-offs:** Sentry pricing scales with event volume — at millions of errors per day, costs grow quickly. Sentry's query model is optimized for issue grouping and deduplication, not arbitrary SQL analytics over raw error streams. The ELK stack has significant operational overhead (index management, shard sizing, Elasticsearch tuning). Neither is designed for [user-facing analytics](https://www.tinybird.co/blog/user-facing-analytics) features in product UIs.

**Prerequisites:** Sentry account (cloud or self-hosted), Sentry SDK for your language/framework; or ELK stack deployed (Elasticsearch 8+, Logstash, Kibana, Filebeat).

### **Option 2: ClickHouse® + Grafana — high-volume error analytics**

For teams generating millions of error events per day, ClickHouse® is purpose-built for the workload. Its columnar storage and vectorized execution handle high-frequency event ingestion and arbitrary aggregation queries at [low latency](https://www.cisco.com/site/us/en/learn/topics/cloud-networking/what-is-low-latency.html), making it ideal for [real-time logs analytics](https://www.tinybird.co/blog/real-time-logs-analytics-architectures) pipelines.

**How it works:** ingest error events into ClickHouse® via Kafka (using the Kafka table engine) or HTTP insert. Define a schema optimized for time-series error queries. Build Grafana dashboards and alert rules backed by ClickHouse® queries using the Grafana ClickHouse® datasource plugin.

**ClickHouse® errors table:**

```sql
CREATE TABLE error_events (
  event_id     UInt64,
  service_name LowCardinality(String),
  error_type   LowCardinality(String),
  error_msg    String,
  stack_trace  String,
  user_id      UInt64,
  request_id   String,
  http_status  UInt16,
  env          LowCardinality(String),
  release      LowCardinality(String),
  event_time   DateTime64(3),
  updated_at   DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(event_time)
-- event_id in the sort key so ReplacingMergeTree deduplicates per event,
-- not across distinct errors sharing the same service, type, and millisecond
ORDER BY (service_name, error_type, event_time, event_id)
```

**Kafka table engine for real-time ingest:**

```sql
CREATE TABLE error_events_kafka (
  event_id     UInt64,
  service_name String,
  error_type   String,
  error_msg    String,
  stack_trace  String,
  user_id      UInt64,
  request_id   String,
  http_status  UInt16,
  env          String,
  release      String,
  event_time   DateTime64(3)
)
ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'kafka-broker:9092',
  kafka_topic_list = 'error-events',
  kafka_group_name = 'clickhouse-consumer',
  kafka_format = 'JSONEachRow';

-- Materialized view to move from Kafka → persistent table
CREATE MATERIALIZED VIEW error_events_mv TO error_events AS
SELECT *, now() AS updated_at FROM error_events_kafka;
```

**Grafana panel query — error rate by service (last 1 hour):**

```sql
SELECT
  toStartOfMinute(event_time) AS time,
  service_name,
  count() AS error_count
FROM error_events
WHERE $__timeFilter(event_time)
  AND env = 'production'
GROUP BY time, service_name
ORDER BY time
```

**When this fits:**

- Error event volume is **very high** (millions per day) and Sentry cost or ingest limits are constraints
- You need **custom analytical queries** — grouping by release, correlating errors with latency, joining with user data
- Your team already uses ClickHouse® or Grafana and wants to consolidate monitoring on the same stack

**Trade-offs:** requires operating ClickHouse® and Kafka infrastructure, or using ClickHouse® Cloud. Grafana dashboards work well for ops teams but are not suitable for embedding in product UIs. No out-of-the-box error grouping, deduplication, or stack trace parsing — you build your own aggregation logic in SQL. Alerts are configurable in Grafana but require more setup than Sentry's built-in issue workflows.

**Prerequisites:** ClickHouse® (self-managed or Cloud), Kafka for event streaming, Grafana 9.0+ with ClickHouse® datasource plugin, application instrumentation to send errors to Kafka.

### **Option 3: Tinybird API-first — error ingestion with REST endpoints**

Tinybird's **Events API** provides a simple HTTPS ingest endpoint for error events. Define SQL Pipes over the incoming data and publish them as REST API endpoints. The same data powers Grafana dashboards (via the Infinity plugin or direct endpoint consumption), application alerting webhooks, and product-facing error analytics — without operating Kafka or ClickHouse® directly.

**How it works:** send error events to Tinybird's Events API via HTTP POST from your application or error handler. Define Pipes for the queries you need — error rate, top errors by service, errors per release. Publish each Pipe as an endpoint. Dashboards and alerting systems call the endpoint directly.

**Send error events to Tinybird Events API:**

```bash
curl -s \
  -H "Authorization: Bearer $TINYBIRD_TOKEN" \
  -d '{"event_id":1234567,"service_name":"api-gateway","error_type":"TimeoutError","error_msg":"upstream timeout after 30s","user_id":789,"http_status":504,"env":"production","release":"v1.2.3","event_time":"2026-04-13 14:22:01"}' \
  "https://api.tinybird.co/v0/events?name=error_events"
```

**Application-level error handler (Python):**

```python
import requests
import os
import json
from datetime import datetime, timezone

TINYBIRD_TOKEN = os.environ["TINYBIRD_TOKEN"]

def send_error_to_tinybird(error: Exception, context: dict) -> None:
    payload = {
        "event_id": context.get("event_id", 0),  # numeric id matching the UInt64 column
        "service_name": context.get("service", "unknown"),
        "error_type": type(error).__name__,
        "error_msg": str(error)[:1000],
        "user_id": context.get("user_id", 0),
        "http_status": context.get("status_code", 500),
        "env": os.environ.get("ENV", "production"),
        "release": os.environ.get("RELEASE", "unknown"),
        "event_time": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    }
    try:
        requests.post(
            "https://api.tinybird.co/v0/events?name=error_events",
            headers={
                "Authorization": f"Bearer {TINYBIRD_TOKEN}",
                "Content-Type": "application/json"
            },
            data=json.dumps(payload),
            timeout=2
        )
    except Exception:
        # Never let monitoring break the application.
        # In production, increment a local counter/metric here instead of silently passing,
        # so you can alert on systematic ingest failures.
        pass
```

**Tinybird Pipe: error rate per service (last 24h):**

```sql
-- error_rate_by_service Pipe
SELECT
  service_name,
  error_type,
  count() AS total_errors,
  countIf(http_status >= 500) AS server_errors,
  countIf(http_status = 429) AS rate_limit_errors
FROM error_events
WHERE event_time >= now() - INTERVAL 24 HOUR
  AND env = 'production'
GROUP BY service_name, error_type
ORDER BY total_errors DESC
LIMIT 50
```

Publish this Pipe as a REST endpoint. Your alerting system polls `https://api.tinybird.co/v0/pipes/error_rate_by_service.json` and triggers a PagerDuty alert when `server_errors` exceeds a threshold. The same endpoint can also power a [real-time dashboard](https://www.tinybird.co/blog/real-time-dashboards-are-they-worth-it) panel in your internal monitoring UI.
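A minimal polling sketch in Python, using only the standard library. The endpoint URL assumes the Pipe name above; the threshold and the `print`-based alert hook are placeholders for your own integration:

```python
import json
import urllib.request

# Hypothetical threshold; tune to your SLO.
PIPE_URL = "https://api.tinybird.co/v0/pipes/error_rate_by_service.json"
SERVER_ERROR_THRESHOLD = 100

def breaching_services(rows, threshold=SERVER_ERROR_THRESHOLD):
    """Return (service_name, server_errors) pairs over the threshold."""
    return [(row["service_name"], row["server_errors"])
            for row in rows
            if row["server_errors"] >= threshold]

def poll_and_alert(token: str) -> None:
    """Fetch the published Pipe endpoint and alert on breaches.
    Run from cron, a scheduled Lambda, or a long-lived loop."""
    request = urllib.request.Request(
        PIPE_URL, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request, timeout=5) as response:
        rows = json.load(response)["data"]  # Tinybird returns rows under "data"
    for service, count in breaching_services(rows):
        # Replace with your PagerDuty/Slack integration.
        print(f"ALERT: {service} logged {count} 5xx errors in the last 24h")
```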

**When this fits:**

- You want **minimal infrastructure** — no Kafka, no self-hosted ClickHouse®, no Elasticsearch cluster
- The same error data needs to power **both monitoring dashboards and product features** (e.g., show users their own error history)
- You need error analytics accessible via APIs for downstream alerting systems, Slack bots, or status page integrations

**Trade-offs:** Tinybird requires data to be ingested into its platform — it does not query your existing ClickHouse® cluster or databases directly. The Events API has throughput limits depending on your plan. For very high-volume error streams, evaluate whether event batching or Kafka-based ingest is needed.
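If batching is needed, buffer events in the application and flush them as a single NDJSON body (one JSON object per line). A sketch, assuming the same `error_events` Data Source as above:

```python
import json
import urllib.request

EVENTS_URL = "https://api.tinybird.co/v0/events?name=error_events"

def to_ndjson(events) -> str:
    """Serialize a list of event dicts as NDJSON, one object per line,
    so many buffered events travel in a single POST."""
    return "\n".join(json.dumps(event, separators=(",", ":"))
                     for event in events)

def send_batch(events, token: str) -> int:
    """POST a batch of buffered error events in one request.
    Batching amortizes per-request overhead when error volume spikes."""
    body = to_ndjson(events).encode("utf-8")
    request = urllib.request.Request(
        EVENTS_URL,
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```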

**Prerequisites:** Tinybird account, application instrumentation to POST error events to the Events API, Tinybird API token.

### **Summary: picking the right option**

| Criterion | Sentry / ELK | ClickHouse® + Grafana | Tinybird API-first |
|---|---|---|---|
| **Best for** | Developer error tracking | High-volume ops analytics | Ops + product API consumers |
| **Setup complexity** | Low (Sentry) / High (ELK) | High | Low |
| **Ops overhead** | Low (Sentry) / High (ELK) | High (self-managed) | Low (managed) |
| **Custom SQL queries** | Limited | Full | Full |
| **Product embedding** | No | No | Yes (REST API) |
| **Ingest volume ceiling** | Medium (Sentry pricing) | Very high | High |

## **Decision framework: what to choose for real time error monitoring**

Pick based on volume, audience, and whether you need API access to error data:

- **Sentry / ELK** if your primary consumers are developers who need stack traces, issue grouping, and release-correlated errors. Sentry is the fastest path to developer-facing error tracking. ELK is appropriate when you need full-text log search across structured and unstructured logs.
- **ClickHouse® + Grafana** if error volume is very high (millions per day), you need full SQL analytics over raw error streams, and your team can operate ClickHouse® and Kafka infrastructure. Best for SRE teams with existing ClickHouse® deployments.
- **Tinybird API-first** if you want the ClickHouse®-powered analytics capability without infrastructure overhead, need the same data accessible via REST APIs for multiple consumers, or want to embed error analytics in product UIs alongside the ops dashboard.

**Bottom line:** for most engineering teams, Sentry handles developer-facing error tracking well. When volume, cost, or API access requirements push beyond Sentry's model, Tinybird provides ClickHouse®-backed [real-time analytics](https://www.tinybird.co/blog/real-time-analytics-a-definitive-guide) with REST API publishing and minimal operational overhead.

## **What is real time error monitoring and why does it matter?**

**Real time error monitoring** captures, stores, and queries application errors as they occur — with latency measured in seconds. The goal: detect regressions before users report them, alert on-call engineers within seconds of a spike, and provide queryable error history.

It matters because MTTD (mean time to detection) directly affects MTTR (mean time to resolution). Batch log processing with a 15-minute delay makes incident response reactive. [Real-time data processing](https://www.tinybird.co/blog/real-time-data-processing) enables alerting within seconds.

The challenge at scale: error events spike during incidents, precisely when the pipeline must be most reliable. Architectures that buffer errors via [streaming data](https://www.ibm.com/think/topics/streaming-data) pipelines handle spikes gracefully; synchronous writes to databases under error load amplify the incident.

## **Schema and pipeline design**

### **Practical schema rules for error monitoring**

**Rule 1: use `LowCardinality(String)` for categorical fields.** `service_name`, `error_type`, `env`, `release` — fields queried in `GROUP BY` and `WHERE` constantly. `LowCardinality` reduces memory and speeds up aggregations.

**Rule 2: keep stack traces out of `ORDER BY`.** Include them for retrieval but never as an aggregation dimension.

**Rule 3: partition by month.** `PARTITION BY toYYYYMM(event_time)` limits scans for recent time-window queries.

**Rule 4: use `DateTime64(3)` for millisecond precision.** Error correlation across services requires sub-second ordering.

### **Example: ClickHouse® error event schema with pre-aggregated MV**

```sql
CREATE TABLE error_events (
  event_id     UInt64,
  service_name LowCardinality(String),
  error_type   LowCardinality(String),
  error_msg    String,
  stack_trace  String,
  user_id      UInt64,
  http_status  UInt16,
  env          LowCardinality(String),
  release      LowCardinality(String),
  event_time   DateTime64(3),
  updated_at   DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(event_time)
-- event_id in the sort key so ReplacingMergeTree deduplicates per event,
-- not across distinct errors sharing the same service, type, and millisecond
ORDER BY (service_name, error_type, event_time, event_id)
```

**Materialized view for error rate dashboards:**

```sql
CREATE MATERIALIZED VIEW error_rate_per_minute_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(minute)
ORDER BY (service_name, error_type, env, minute)
AS SELECT
  toStartOfMinute(event_time) AS minute,
  service_name,
  error_type,
  env,
  countState() AS total_errors_state,
  uniqState(user_id) AS affected_users_state,
  countIfState(http_status >= 500) AS server_errors_state
FROM error_events
GROUP BY minute, service_name, error_type, env
```

Query: `SELECT minute, service_name, countMerge(total_errors_state) AS errors, uniqMerge(affected_users_state) AS users, countMerge(server_errors_state) AS server_errors FROM error_rate_per_minute_mv WHERE minute >= now() - INTERVAL 1 HOUR GROUP BY minute, service_name ORDER BY minute`.

### **Failure modes**

1. **Monitoring system unavailable during the incident.** Synchronous error reporting blocks if the monitoring service is slow or down. Send error events **asynchronously** (fire-and-forget with a short timeout). Never let monitoring instrumentation propagate exceptions to user-facing code.

2. **Event volume spike overwhelming the ingest pipeline.** Error rates spike 100x during incidents — precisely when coverage matters most. Size ingest for peak volume using buffering (Kafka, Tinybird Events API batching). Apply application-side rate limiting per error type to protect both the app and the pipeline.

3. **High-cardinality `GROUP BY` timing out.** Grouping by `request_id`, raw URL, or `user_agent` produces result sets that exhaust ClickHouse® memory. Pre-aggregate by minute/hour in materialized views. Normalize error messages into types before using them as dashboard dimensions.

4. **Clock skew causing gaps.** Distributed service clocks drift. Add a `received_at DateTime DEFAULT now()` column populated at ingest alongside `event_time`, and use `received_at` for partition pruning in dashboards.

5. **Stack trace storage bloat.** Full stack traces are 1–50 KB per event. Store truncated traces (first 2 KB) in ClickHouse® and push full traces to object storage (S3) keyed by event ID.
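Failure mode 1 can be sketched as a bounded queue drained by a daemon thread. This is illustrative, not a production library; pass it `send_error_to_tinybird` from Option 3 or any other transport:

```python
import queue
import threading

class AsyncErrorReporter:
    """Fire-and-forget error reporting: the request path only enqueues;
    a daemon thread does the network I/O, and a bounded queue drops
    events under backpressure instead of blocking the application."""

    def __init__(self, send_fn, max_queued: int = 10_000):
        self._queue = queue.Queue(maxsize=max_queued)
        self._send = send_fn  # e.g. a function that POSTs to your ingest API
        self.dropped = 0
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, event: dict) -> None:
        try:
            self._queue.put_nowait(event)  # never blocks the request path
        except queue.Full:
            self.dropped += 1  # count drops; alert on them separately

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # transport errors must not kill the worker
            finally:
                self._queue.task_done()
```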
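For failure mode 2, a per-error-type cap can be as simple as a fixed-window counter; the limits here are arbitrary:

```python
import time
from collections import defaultdict

class ErrorTypeRateLimiter:
    """Fixed-window counter per error type: during an incident one error
    class can spike 100x, so cap each type while still letting a sample
    of every type through to the pipeline."""

    def __init__(self, max_per_window: int = 50, window_seconds: float = 60.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        # error_type -> [window_start, events_in_window]
        self._windows = defaultdict(lambda: [0.0, 0])

    def allow(self, error_type: str, now=None) -> bool:
        """Return True if this event fits the current window's budget."""
        now = time.monotonic() if now is None else now
        window = self._windows[error_type]
        if now - window[0] >= self.window_seconds:
            window[0], window[1] = now, 0  # start a fresh window
        if window[1] < self.max_per_window:
            window[1] += 1
            return True
        return False  # over budget: drop or sample this event
```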
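For failure mode 3, a hypothetical normalizer that collapses volatile tokens (UUIDs, hex addresses, numbers) before messages become dashboard dimensions:

```python
import re

# Illustrative normalization rules; extend with patterns from your own logs.
_RULES = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                r"[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<n>"),
]

def normalize_error_msg(msg: str) -> str:
    """Turn a raw error message into a low-cardinality dimension
    safe to use in GROUP BY."""
    for pattern, placeholder in _RULES:
        msg = pattern.sub(placeholder, msg)
    return msg
```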
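For failure mode 5, a sketch of the truncation split: the 2 KB head stays in the ClickHouse® row, and the returned key tells the caller where to upload the full trace (the bucket layout is an assumption; upload with your object-storage client of choice):

```python
TRACE_HEAD_BYTES = 2048  # keep only the first 2 KB inline

def split_stack_trace(event_id: int, stack_trace: str):
    """Return (truncated_trace, object_key_or_None). Small traces are
    stored inline; large ones get an object-storage key derived from
    the event id so the full trace can be fetched on demand."""
    encoded = stack_trace.encode("utf-8")
    if len(encoded) <= TRACE_HEAD_BYTES:
        return stack_trace, None  # small trace: store inline, skip S3
    truncated = encoded[:TRACE_HEAD_BYTES].decode("utf-8", errors="ignore")
    return truncated, f"stack-traces/{event_id}.txt"
```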

## **Why ClickHouse® for real time error monitoring**

ClickHouse®'s columnar storage and vectorized execution suit the error monitoring query pattern: time-series aggregations over high-frequency events, filtered by service, error type, and environment, across time windows from minutes to months.

`LowCardinality` fields, partition pruning on `event_time`, and pre-aggregation via materialized views combine to make [real-time anomaly detection](https://www.tinybird.co/blog/real-time-anomaly-detection) dashboards return in milliseconds even across billions of historical error events. The Kafka table engine provides [real-time data ingestion](https://www.tinybird.co/blog/real-time-data-ingestion) from error event streams with no intermediate ETL.

## **Why Tinybird is a strong fit for real time error monitoring**

Most error monitoring systems are single-consumer. When teams need the same error data to also power product features — a status page, per-user error history, alerting APIs — they build a second pipeline.

Tinybird eliminates this. Ingest via the Events API, define Pipes for each query, publish REST endpoints. The same endpoint that powers the ops dashboard also serves the status page API and alerting webhook. One ingestion path, multiple consumers.

For SRE teams that need [faster SQL queries](https://www.tinybird.co/blog/5-rules-for-writing-faster-sql-queries) over large error histories without operating ClickHouse® infrastructure, Tinybird provides the managed path with the API layer already built in.

Next step: instrument your top three error types to send events to the Tinybird Events API, create a Pipe for error rate per service over the last hour, publish the endpoint, and connect it to your alerting system. Validate alert latency before replacing your primary pipeline.

## **Frequently Asked Questions (FAQs)**

### **What is the difference between real time error monitoring and log aggregation?**

**Error monitoring** (Sentry-style) captures exceptions with stack traces and groups them by issue. **Log aggregation** (ELK-style) indexes all log output for full-text search. Real time error monitoring can be built on top of log aggregation (filter for ERROR-level events) or independently with an SDK. ClickHouse® and Tinybird handle both patterns as structured rows queryable with SQL.

### **How do I alert on real time error monitoring data?**

With **Grafana + ClickHouse®**: configure alert rules backed by ClickHouse® queries — alert when 500-error count in the last 5 minutes exceeds a threshold. With **Tinybird**: poll a published Pipe endpoint from a scheduled job or Lambda function and trigger alerts when the metric exceeds your SLO.

### **Can I use real time error monitoring for user-facing error reporting features?**

Yes, with Tinybird. Define a Pipe that filters errors for a specific `user_id` parameter, publish it as a REST endpoint with a scoped token, and call it from your application to show users their own error history or current service status. Standard Sentry and ELK architectures are not designed for this pattern — they are internal ops tools, not API services for product features.

### **What event volume does real time error monitoring need to handle?**

A service handling 1,000 req/s at 1% error rate generates 10 events/second — well within Sentry's and Tinybird's ingest capacity. At 100,000 req/s and 0.1% error rate: 100 events/second. During incidents, rates spike 10–100x. Design for peak volume and use buffering (Kafka, Tinybird batching) to absorb spikes.

### **Does real time error monitoring work for frontend JavaScript errors?**

Yes. Sentry has a first-class JavaScript SDK for browser and Node.js error capture. For ClickHouse®/Tinybird approaches, instrument your frontend with a custom error handler that sends structured error events (error type, message, stack trace, user ID, URL) to a **backend relay endpoint**, which then forwards them to Tinybird. Do not POST from the browser directly with your Tinybird ingest token — tokens that appear in client-side JavaScript are exposed and can be abused. The relay can also scrub PII from stack traces before ingest. The same ClickHouse® schema and Pipes handle frontend and backend errors together.
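A minimal sketch of the scrubbing step such a relay might run before forwarding events server-side; the regex patterns and placeholder strings are illustrative, not an exhaustive PII policy:

```python
import re

# Illustrative PII patterns; extend with phone numbers, IPs, etc.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_BEARER = re.compile(r"(?i)bearer\s+[\w.~+/-]+=*")

def scrub_frontend_event(event: dict) -> dict:
    """Redact emails and bearer tokens from the string fields of a
    frontend error event. The relay then POSTs the sanitized event
    with the ingest token, which never reaches the browser."""
    clean = {}
    for key, value in event.items():
        if isinstance(value, str):
            value = _EMAIL.sub("<email>", value)
            value = _BEARER.sub("<token>", value)
        clean[key] = value
    return clean
```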

### **How do I correlate errors with deploys in real time error monitoring?**

Include a `release` or `version` field in every error event. In Sentry, this is the `release` tag and Sentry automatically compares error rates before and after a deployment. In ClickHouse®/Tinybird, add `release LowCardinality(String)` to the schema and filter or group by `release` in dashboard queries. Create a Pipe that compares error rates for the last two releases to detect regression automatically.
