Two weeks ago, we published a benchmark comparing the ability of 19 large language models (and one human) to generate correct and efficient analytical SQL queries based on natural language prompts. The response across Reddit, Hacker News, Twitter, and elsewhere has been generous, thoughtful, and sometimes brutally honest.
This post summarizes the most recurring feedback and shares how we're thinking about incorporating it.
"Exactness" is a flawed scoring method
This came up repeatedly: models often generated SQL that produced results similar to the human baseline, but with a different structure or presentation, and were penalized unfairly for it. For example, one of the benchmark questions was, "How has the total number of stars changed over time by year?"
```sql
SELECT
    toYear(created_at) AS year,
    count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY year
ORDER BY year
```
The human-generated SQL simply filtered to `WatchEvent` entries and aggregated them by year with `count()`, returning only the two years that actually contain star data.
| year | stars   |
|------|---------|
| 2016 | 814207  |
| 2017 | 6490905 |
GPT-4o-mini produced a slightly different but still valid version: it used `countIf(event_type = 'WatchEvent')` instead of a `WHERE` filter, grouped by every year in the table, and applied a `LIMIT 10`, which produced rows of zeros for years with no star data.
```sql
SELECT
    toYear(created_at) AS year,
    countIf(event_type = 'WatchEvent') AS total_stars
FROM github_events
GROUP BY year
ORDER BY year
LIMIT 10
```
Despite both outputs conveying the same analytical meaning, GPT-4o-mini received an exactness score of just 20/100. This was because our comparison was based, among other criteria, on normalized RMSE over numerical values, combined with result set similarity (column alignment, row count, sorting). That method fails in corner cases like this where padding with zeroes causes large relative differences.
| year | stars   |
|------|---------|
| 2011 | 0       |
| 2014 | 0       |
| 2015 | 0       |
| 2016 | 814207  |
| 2017 | 6490905 |
| 2018 | 0       |
| 2019 | 0       |
| 2020 | 0       |
We used exactness heuristically, paired with manual review of outputs, to identify correctness. But it's true that exact string or row-level comparison does not capture the semantic validity of the result.
What we can do:
- Add a second grading pass based on result equivalence (i.e., does the query return the same output?)
- Consider a model consensus metric: give the question, query, and result to other models and ask "how correct is this?" to reduce human bias in the ground truth (this needs infra for self-reflection loops, possibly via evaluation chains)
- Use ClickHouse's `EXCEPT` and `INTERSECT` features, or hashes of sorted result sets, to compare outputs (see the sketch below)
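The `EXCEPT` route, for instance, can run entirely inside ClickHouse. Below is a minimal sketch that reuses the two star-count queries from the example above; each direction of the check lists rows that appear in one result set but not the other, so two empty results would mean the outputs match row for row.

```sql
-- Rows produced by the human query but not by the model query.
-- Running the same statement with the two halves swapped covers the other
-- direction; if both checks come back empty, the result sets are identical.
SELECT year, stars
FROM
(
    SELECT toYear(created_at) AS year, count() AS stars
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY year
)
EXCEPT
SELECT year, total_stars
FROM
(
    SELECT toYear(created_at) AS year, countIf(event_type = 'WatchEvent') AS total_stars
    FROM github_events
    GROUP BY year
    ORDER BY year
    LIMIT 10
)
```

A cheaper variant compares a single fingerprint per result set, e.g. `cityHash64(arraySort(groupArray((year, stars))))`, instead of the full row sets.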
Query complexity was too low
Several commenters argued that most queries in the benchmark were too simple: single-table `SELECT`s, basic filters, `GROUP BY` clauses, and straightforward aggregations. These are not representative of the kinds of SQL queries used in actual analytical workflows.
In particular, people wanted to see:
- CTE-heavy or nested subqueries
- Multi-table joins with join type variation (LEFT, INNER, anti-joins)
- Time bucketing and window functions
- Query rewriting for performance (e.g. minimize table scans)
- Schema inference and ambiguity handling
What we can do:
- Introduce a tiered difficulty system and flag queries as "basic," "intermediate," or "advanced"
- Add new questions that involve data modeling-style inference, e.g. finding the latest event per user (see the sketch after this list)
- Pull in synthetic questions from real-world sources (e.g., dbt test suites, Looker queries, or Dagster pipelines)
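To give a flavor of that last item, here is one shape the "latest event per user" question could take against the same `github_events` schema. It's only a sketch of a candidate question, not a committed addition:

```sql
-- A possible "advanced" tier answer: the most recent event per user,
-- found with a window function over the benchmark's github_events table.
SELECT actor_login, event_type, created_at
FROM
(
    SELECT
        actor_login,
        event_type,
        created_at,
        row_number() OVER (PARTITION BY actor_login ORDER BY created_at DESC) AS rn
    FROM github_events
)
WHERE rn = 1
```

An equally valid answer would use `argMax(event_type, created_at)` grouped by `actor_login`; that kind of structural divergence is exactly what the scoring changes above need to tolerate.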
Using a clean, known dataset makes the task easier
We used a curated version of the GitHub Archive dataset to keep things controlled: same schema, same data, same prompt across all models. This eliminates confounding variables, but it also makes the task easier than most real-world ones, where schema ambiguity and incomplete documentation are common.
One critique was that LLMs struggle more when the schema is large, inconsistently named, or lacks clear table relationships—which is often the case in internal data warehouses.
What we can do:
- Design a second benchmark track using an unfamiliar or minimally documented schema (possibly public healthcare or ecommerce data)
- Add a mode where only column names and table names are given, with no full schema explanations, to simulate a "developer dropped into a new repo" scenario (see the sketch after this list)
- Log which models ask follow-up clarifications (if prompted) or show error-handling capabilities
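In that mode, the model's only schema context would be something like the output of the query below (the exact prompt format is still open):

```sql
-- Everything the model would see in the minimal-context mode: table and
-- column names from the current database, with no types, comments, or
-- relationship hints.
SELECT table, name AS column_name
FROM system.columns
WHERE database = currentDatabase()
ORDER BY table, position
```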
The benchmark punishes certain choices unfairly
We saw feedback (notably from Izzy Miller, thanks!) that some models were penalized for choosing SQL functions that guaranteed exactness at the expense of performance, or that produced results diverging from the human output.
For example, consider one question: "Top 10 repositories with the most people who have push access to the main branch."
The human query:
```sql
SELECT
    repo_name,
    uniqIf(actor_login, (event_type = 'PushEvent') AND match(ref, '/(main|master)$')) AS u
FROM github_events
WHERE (event_type IN ('PushEvent', 'WatchEvent'))
GROUP BY repo_name
ORDER BY u DESC
LIMIT 10
```
Izzy pointed out that the human chose the `uniqIf()` function to perform a distinct count, but `uniqIf()` is not exact; it uses an HLL sketch to approximate the count. A model that chose `count(DISTINCT)` would produce a technically more accurate result, but at a cost in performance.
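To put a number on that tradeoff, one could compare the two approaches directly on the benchmark dataset. This is just a sanity check we could run, not part of the scoring:

```sql
-- How far the sketch-based uniq drifts from an exact distinct count of
-- pushing users; uniqExact is precise but slower and heavier on memory.
SELECT
    uniq(actor_login) AS approximate,
    uniqExact(actor_login) AS exact,
    abs(toInt64(approximate) - toInt64(exact)) / exact AS relative_error
FROM github_events
WHERE event_type = 'PushEvent'
```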
Additionally, the human had the context to assume that "main branch" could mean `main` or `master`, whereas an LLM that interprets the question literally filters only on the `main` ref, as in Claude 3.7 Sonnet's query:
```sql
SELECT
    repo_name,
    COUNT(DISTINCT actor_login) AS push_access_count
FROM github_events
WHERE event_type = 'PushEvent'
  AND ref = 'refs/heads/main'
GROUP BY repo_name
ORDER BY push_access_count DESC
LIMIT 10
```
Technically, the LLM produced the more exact query (Claude 3.7 Sonnet also chose an exact `COUNT(DISTINCT)`), but it was penalized for not including `master` branches.
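For reference, a query that captures both readings, the exact distinct count and the broader main-or-master interpretation, might look like the sketch below. We're not claiming it as the new ground truth; it just shows that the two penalized choices aren't mutually exclusive.

```sql
-- Exact distinct count of pushing users, accepting either main or master
-- as the "main branch": the human's branch reading combined with Claude's
-- exact counting.
SELECT
    repo_name,
    COUNT(DISTINCT actor_login) AS push_access_count
FROM github_events
WHERE event_type = 'PushEvent'
  AND match(ref, '/(main|master)$')
GROUP BY repo_name
ORDER BY push_access_count DESC
LIMIT 10
```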
This is a limitation of our current comparison method. It favors the human result, even when the human takes minor shortcuts or uses imprecise inference to arrive at a query. It reveals a deeper issue: there's no universally "correct" SQL; only correct intent, and intent can be difficult to measure quantitatively.
What we can do:
- Add syntax dialect tolerance to scoring logic (ClickHouse-specific vs ANSI SQL)
- Document preferred dialect explicitly in each prompt
- Provide side-by-side comparisons where scores are contested, so the reader can judge
This is a best-case benchmark, not a real-world simulator
Several commenters rightly noted that our benchmark tests a best-case setup: well-documented schema, single-dataset context, isolated prompt. It doesn't simulate the day-to-day mess of data work: mismatched schemas, evolving table structures, partial documentation, ambiguous KPIs.
We agree—and that's by design. We wanted a reproducible baseline first. But we don't intend to stop here.
What we can do:
- Launch a community-sourced benchmark where users submit difficult prompts and schemas
- Allow users to plug in their own ClickHouse instance and test models against their actual data
- Log errors and hallucinations separately from invalid syntax, to differentiate between "dumb" and "dangerously confident" queries
Some things we got right (but didn't explain well enough)
While most of the feedback pointed to areas for improvement, several commenters also called out things we did get right—but perhaps failed to communicate clearly.
One major example is how we treated performance. Our benchmark already included efficiency metrics like rows read and execution latency, and these did influence model scoring. However, the terminology—"efficiency" and "exactness"—wasn't always intuitive. Some users assumed efficiency referred only to runtime speed, when in fact we penalized heavily for unnecessary reads, especially in a column-store context like ClickHouse.
To clarify: a model that returned the right result but scanned 10x more rows than needed did not get full marks. Efficiency scores factored in both raw performance and proportional data scanned per column.
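Both signals come straight from ClickHouse's own instrumentation. As a sketch of where those numbers live (the `log_comment` tag shown here is hypothetical, not what our harness actually uses):

```sql
-- Latency and data-read metrics for one tagged benchmark run.
SELECT
    query_duration_ms,
    read_rows,
    read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
  AND log_comment = 'llm-sql-benchmark:q01:gpt-4o-mini'  -- hypothetical tag
ORDER BY event_time DESC
LIMIT 1
```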
We explain this more thoroughly in the original benchmark post, but we'll make that connection more explicit in the UI next time.
What's next
This benchmark isn't meant to be final or authoritative. It's a snapshot of how well LLMs handle structured analytical queries today under ideal conditions. We learned a lot—not just about the models, but about what people expect from LLMs writing SQL.
If you want to:
- Contribute queries or schemas for Round 2
- Run the benchmark against your own models
- Help improve the scoring methodology
...we're all ears. Just email us, open a PR on GitHub, or ping us on Twitter. Thanks again for the feedback. It helps us build sharper tools.