---
title: "We graded 19 LLMs on SQL. You graded us."
excerpt: "We tested 19 LLMs on SQL. The results surprised us. Then you challenged our methodology. Here's the updated truth."
authors: "Victor Ramirez"
categories: "AI x Data"
createdOn: "2025-05-15 00:00:00"
publishedOn: "2025-05-16 00:00:00"
updatedOn: "2025-05-16 00:00:00"
status: "published"
---
Two weeks ago, we [published a benchmark](https://www.tinybird.co/blog-posts/which-llm-writes-the-best-sql) comparing the ability of 19 large language models (and one human) to generate correct and efficient analytical SQL queries based on natural language prompts. The response across Reddit, Hacker News, Twitter, and elsewhere has been generous, thoughtful, and sometimes brutally honest.

This post summarizes the most recurring feedback and shares how we're thinking about incorporating it.

## "Exactness" is a flawed scoring method

[This came up repeatedly](https://x.com/isidoremiller/status/1922007738479202614): models often generated SQL that produced similar results as a human baseline, but with a different structure or presentation, and thus were penalized unfairly. For example, one of the benchmark questions was, "How has the total number of stars changed over time by year?"

```sql
SELECT
    toYear(created_at) AS year,
    count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY year
ORDER BY year
```

The human-generated SQL simply filtered to `WatchEvent` entries and aggregated them by year with `count()`, returning only the two years that had data.

| `year` | `stars` |
| :---- | :---- |
| `2016` | `814207` |
| `2017` | `6490905` |

GPT-4o-mini produced a slightly different but still valid version: it used `countIf(event_type = 'WatchEvent')` instead of a `WHERE` filter, grouped by year, and returned 10 rows, padding years with no matching data with zeros.

```sql
SELECT
    toYear(created_at) AS year,
    countIf(event_type = 'WatchEvent') AS total_stars
FROM github_events
GROUP BY year
ORDER BY year
LIMIT 10
```

Despite both outputs conveying the same analytical meaning, GPT-4o-mini received an exactness score of just 20/100. This was because our comparison was based, among other criteria, on normalized RMSE over numerical values, combined with result set similarity (column alignment, row count, sorting). That method fails in corner cases like this where padding with zeroes causes large relative differences.

| `year` | `stars` |
| :---- | :---- |
| `2011` | `0` |
| `2014` | `0` |
| `2015` | `0` |
| `2016` | `814207` |
| `2017` | `6490905` |
| `2018` | `0` |
| `2019` | `0` |
| `2020` | `0` |

We used exactness heuristically, paired with manual review of outputs, to identify correctness. But it's true that exact string or row-level comparison does not capture the semantic validity of the result.

**What we can do:**

* Add a second grading pass based on **result equivalence** (i.e., do the two queries return semantically equivalent result sets?)
* Consider a **model consensus metric**: give the question, query, and result to other models and ask "how correct is this?" to reduce human bias in the ground truth (this needs infra for self-reflection loops, possibly via evaluation chains)
* Use ClickHouse{% sup %}®{% /sup %}'s `EXCEPT` and `INTERSECT` features or hashes of sorted result sets to compare outputs
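The hash idea in that last bullet is cheap to implement. A minimal sketch (the function name is ours, and numeric tolerance is deliberately ignored; a real version would round floats before hashing):

```python
import hashlib

def result_fingerprint(rows):
    """Order-insensitive fingerprint of a query result set.

    Each row is stringified, rows are sorted, and the canonical text
    is hashed, so two queries that return the same rows in a
    different order produce the same fingerprint.
    """
    canonical = "\n".join(sorted("|".join(map(str, row)) for row in rows))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = [(2016, 814207), (2017, 6490905)]
b = [(2017, 6490905), (2016, 814207)]  # same rows, different order
matches = result_fingerprint(a) == result_fingerprint(b)
```

Note that this is still strictly exact matching: it would reject GPT-4o-mini's zero-padded result from the earlier example, so it complements rather than replaces a semantic-equivalence pass.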

## Query complexity was too low

[Several commenters](https://news.ycombinator.com/item?id=43927627) argued that most queries in the benchmark were too simple: single-table `SELECT`, basic filters, `GROUP BY`, and straightforward aggregations. These are not representative of the kinds of SQL queries used in actual analytical workflows.

In particular, people wanted to see:

* CTE-heavy or nested subqueries
* Multi-table joins with join type variation (LEFT, INNER, anti-joins)
* Time bucketing and window functions
* Query rewriting for performance (e.g. minimize table scans)
* Schema inference and ambiguity handling

**What we can do:**

* Introduce a **tiered difficulty system** and flag queries as "basic," "intermediate," or "advanced"
* Add new questions that involve **data modeling-style inference**, e.g. finding the latest event per user
* Pull in synthetic questions from real-world sources (e.g., dbt test suites, Looker queries, or Dagster pipelines)

## Using a clean, known dataset makes the task easier

We used a curated version of the GitHub Archive dataset to keep things controlled: same schema, same data, same prompt across all models. This eliminates confounding variables, but also makes it easier than most real-world tasks where schema ambiguity and incomplete documentation are common.

[One critique was](https://www.reddit.com/r/dataengineering/comments/1khsiwd/comment/mrbq36k/) that LLMs struggle more when the schema is large, inconsistently named, or lacks clear table relationships—which is often the case in internal data warehouses.

**What we can do:**

* Design a **second benchmark track** using an unfamiliar or minimally documented schema (possibly public healthcare or ecommerce data)
* Add a mode where only **column names and table names are given**, no full schema explanations (simulate "developer dropped into a new repo" scenario)
* Log which models ask follow-up clarifications (if prompted) or show error-handling capabilities

## The benchmark punishes certain choices unfairly

We saw feedback (notably from [Izzy Miller](https://x.com/isidoremiller/status/1922007454956847522), thanks!) that some models were penalized for choosing SQL functions that favored exactness at the cost of performance, or that produced results dissimilar from the human output.

For example, consider [one question](https://llm-benchmark.tinybird.live/questions/pipe_27.pipe): `top 10 Repositories with the most people who have push access to the main branch`.

The human query:

```sql
SELECT
    repo_name,
    uniqIf(actor_login, (event_type = 'PushEvent') AND match(ref, '/(main|master)$')) AS u
FROM github_events
WHERE (event_type IN ('PushEvent', 'WatchEvent'))
GROUP BY repo_name
ORDER BY u DESC
LIMIT 10
```

Izzy pointed out that the human chose the `uniqIf()` function to perform a distinct count, but `uniqIf()` is not an exact count; it uses an HLL sketch to approximate the count. A model that chose `COUNT(DISTINCT)` would produce a technically more accurate result, but at a cost in performance.

Additionally, the human had the context to assume that "main branch" could mean `main` or `master`, whereas an LLM interpreting the question literally filters only on the `main` ref, as in Claude 3.7 Sonnet's query:

```sql
SELECT
    repo_name,
    COUNT(DISTINCT actor_login) AS push_access_count
FROM github_events
WHERE event_type = 'PushEvent'
    AND ref = 'refs/heads/main'
GROUP BY repo_name
ORDER BY push_access_count DESC
LIMIT 10
```

Technically, the LLM's query is more exact—Claude 3.7 Sonnet also chose an exact `COUNT(DISTINCT)`—but it was penalized for not including `master` branches.

This is a limitation of our current comparison method. It favors the human result, even when the human takes minor shortcuts or uses imprecise inference to arrive at a query. It reveals a deeper issue: there's no universally "correct" SQL, only correct *intent*, and intent is difficult to measure quantitatively.

**What we can do:**

* Add **syntax dialect tolerance** to scoring logic (ClickHouse{% sup %}®{% /sup %}-specific vs ANSI SQL)
* Document preferred dialect explicitly in each prompt
* Provide side-by-side comparisons where scores are contested, so the reader can judge

## This is a best-case benchmark, not a real-world simulator

Several commenters rightly noted that our benchmark tests a best-case setup: well-documented schema, single-dataset context, isolated prompt. It doesn't simulate the day-to-day mess of data work: mismatched schemas, evolving table structures, partial documentation, ambiguous KPIs.

We agree—and that's by design. We wanted a reproducible baseline first. But we don't intend to stop here.

**What we can do:**

* Launch a **community-sourced benchmark** where users submit difficult prompts and schemas
* Allow users to plug in their own ClickHouse{% sup %}®{% /sup %} instance and test models against their actual data
* Log errors and hallucinations separately from invalid syntax, to differentiate between "dumb" and "dangerously confident" queries

## Some things we got right (but didn't explain well enough)

While most of the feedback pointed to areas for improvement, several commenters also called out things we did get right—but perhaps failed to communicate clearly.

One major example is [**how we treated performance**](https://www.reddit.com/r/dataengineering/comments/1khsiwd/comment/mr9kemp/). Our benchmark already included efficiency metrics like rows read and execution latency, and these did influence model scoring. However, the terminology—"efficiency" and "exactness"—wasn't always intuitive. Some users assumed efficiency referred only to runtime speed, when in fact we penalized heavily for unnecessary reads, especially in a column-store context like ClickHouse{% sup %}®{% /sup %}.

To clarify: a model that returned the right result but scanned 10x more rows than needed did not get full marks. Efficiency scores factored in both raw performance and proportional data scanned per column.
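As a rough illustration of that scoring intuition (the weights, names, and formula here are invented for this example, not our actual benchmark code), an efficiency score can combine a rows-read ratio with a latency ratio, each relative to the best-performing correct query:

```python
# Illustrative only: the real benchmark's weights and formula differ.
def efficiency_score(rows_read, best_rows_read, latency_ms, best_latency_ms,
                     read_weight=0.6, latency_weight=0.4):
    # Each term is the best observed value divided by this query's value,
    # so 1.0 means "as efficient as the best query" and the score decays
    # as a query scans more data or runs longer than necessary.
    read_ratio = best_rows_read / rows_read
    latency_ratio = best_latency_ms / latency_ms
    return 100 * (read_weight * read_ratio + latency_weight * latency_ratio)

# A query that scans 10x the necessary rows at equal latency loses
# most of its read-weighted score despite being "fast":
score = efficiency_score(rows_read=1_000_000, best_rows_read=100_000,
                         latency_ms=50, best_latency_ms=50)
```

Weighting reads above latency reflects the column-store context: in ClickHouse, unnecessary scans are the dominant cost and the clearest signal that a model didn't understand the data layout.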

We explain this more thoroughly in the [original benchmark post](https://www.tinybird.co/blog-posts/which-llm-writes-the-best-sql), but we'll make that connection more explicit in the UI next time.

---

## What's next

This benchmark isn't meant to be final or authoritative. It's a snapshot of how well LLMs handle structured analytical queries today under ideal conditions. We learned a lot—not just about the models, but about what people expect from LLMs writing SQL.

If you want to:

* Contribute queries or schemas for Round 2
* Run the benchmark against your own models
* Help improve the scoring methodology

...we're all ears. Just email us, [open a PR on GitHub](https://github.com/tinybirdco/llm-benchmark), or [ping us on Twitter](https://x.com/tinybirdco). Thanks again for the feedback. It helps us build sharper tools.

→ [See the benchmark](https://llm-benchmark.tinybird.live)
