May 16, 2025

We graded 19 LLMs on SQL. You graded us.

We benchmarked 19 LLMs on analytical SQL and the internet had thoughts. Here's a breakdown of your feedback, what we got wrong, what we got right (but didn’t explain), and how we’re improving the benchmark for round two.
Victor Ramirez, Software Engineer

Two weeks ago, we published a benchmark comparing the ability of 19 large language models (and one human) to generate correct and efficient analytical SQL queries based on natural language prompts. The response across Reddit, Hacker News, Twitter, and elsewhere has been generous, thoughtful, and sometimes brutally honest.

This post summarizes the most recurring feedback and shares how we're thinking about incorporating it.

"Exactness" is a flawed scoring method

This came up repeatedly: models often generated SQL that produced results similar to the human baseline, but with a different structure or presentation, and were penalized unfairly for it. For example, one of the benchmark questions was, "How has the total number of stars changed over time by year?"

SELECT 
    toYear(created_at) AS year, 
    count() AS stars 
FROM github_events
WHERE event_type = 'WatchEvent' 
GROUP BY year 
ORDER BY year

The human-generated SQL simply aggregated WatchEvent entries by year using count() on filtered rows, returning only the two years that actually have WatchEvent data.

year    stars
2016    814207
2017    6490905

GPT-4o-mini produced a slightly different but still valid version: it used countIf(event_type = 'WatchEvent') without a WHERE filter, grouped by year, and returned 10 rows, some of them zeros for years with no WatchEvent data.

SELECT
    toYear(created_at) AS year,
    countIf(event_type = 'WatchEvent') AS total_stars
FROM github_events
GROUP BY year
ORDER BY year
LIMIT 10

Despite both outputs conveying the same analytical meaning, GPT-4o-mini received an exactness score of just 20/100. This was because our comparison was based, among other criteria, on normalized RMSE over numerical values, combined with result set similarity (column alignment, row count, sorting). That method fails in corner cases like this where padding with zeroes causes large relative differences.

year    stars
2011    0
2014    0
2015    0
2016    814207
2017    6490905
2018    0
2019    0
2020    0

We used exactness heuristically, paired with manual review of outputs, to identify correctness. But it's true that exact string or row-level comparison does not capture the semantic validity of the result.

What we can do:

  • Add a second grading pass based on result equivalence (i.e., does the query return the same output?)
  • Consider a model consensus metric: give the question, query, and result to other models and ask "how correct is this?" to reduce human bias in the ground truth (this needs infra for self-reflection loops, possibly via evaluation chains)
  • Use ClickHouse's EXCEPT and INTERSECT features or hashes of sorted result sets to compare outputs
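
To make that last idea concrete, here's a minimal sketch of a row-level equivalence check with EXCEPT. The human_result and model_result relations are hypothetical placeholders for the two outputs being compared; zero rows in both directions means the result sets match regardless of ordering.

-- Minimal sketch: rows present in the human output but missing from the
-- model output. `human_result` and `model_result` are hypothetical
-- placeholders for the two result sets; swap the operands to check the
-- other direction.
SELECT year, stars FROM human_result
EXCEPT
SELECT year, stars FROM model_result

Hashing sorted result sets (e.g., cityHash64 over a groupArray of the ordered rows) would give a cheaper pass/fail signal, at the cost of losing the diff.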

Query complexity was too low

Several commenters argued that most queries in the benchmark were too simple: single-table SELECT, basic filters, GROUP BY, and straightforward aggregations. These are not representative of the kinds of SQL queries used in actual analytical workflows.

In particular, people wanted to see:

  • CTE-heavy or nested subqueries
  • Multi-table joins with join type variation (LEFT, INNER, anti-joins)
  • Time bucketing and window functions
  • Query rewriting for performance (e.g. minimize table scans)
  • Schema inference and ambiguity handling

What we can do:

  • Introduce a tiered difficulty system and flag queries as "basic," "intermediate," or "advanced"
  • Add new questions that involve data modeling-style inference, e.g. finding the latest event per user (see the sketch after this list)
  • Pull in synthetic questions from real-world sources (e.g., dbt test suites, Looker queries, or Dagster pipelines)
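
To give a flavor of that "latest event per user" style of question, here is one way it could be answered over the same github_events table using a window function. This is a sketch, not a new benchmark question yet.

-- Sketch: latest event per actor via a window function over github_events.
SELECT
    actor_login,
    event_type,
    created_at
FROM
(
    SELECT
        actor_login,
        event_type,
        created_at,
        row_number() OVER (PARTITION BY actor_login ORDER BY created_at DESC) AS rn
    FROM github_events
)
WHERE rn = 1

The ClickHouse-idiomatic rewrite would use argMax(event_type, created_at) with a GROUP BY actor_login, which is exactly the kind of dialect-versus-ANSI choice a tiered difficulty system could probe.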

Using a clean, known dataset makes the task easier

We used a curated version of the GitHub Archive dataset to keep things controlled: same schema, same data, same prompt across all models. This eliminates confounding variables, but also makes it easier than most real-world tasks where schema ambiguity and incomplete documentation are common.

One critique was that LLMs struggle more when the schema is large, inconsistently named, or lacks clear table relationships—which is often the case in internal data warehouses.

What we can do:

  • Design a second benchmark track using an unfamiliar or minimally documented schema (possibly public healthcare or ecommerce data)
  • Add a mode where only table and column names are given, with no schema documentation, simulating a "developer dropped into a new repo" scenario (see the sketch after this list)
  • Log which models ask follow-up clarifications (if prompted) or show error-handling capabilities
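
As a rough sketch of what that stripped-down context could look like, the prompt would carry nothing but a bare DDL. The column list below is illustrative and trimmed to the fields referenced in this post.

-- Hypothetical names-only prompt context: no column comments, no sample
-- rows, no documentation. Columns shown are only the ones used in this post.
CREATE TABLE github_events
(
    event_type String,
    actor_login String,
    repo_name String,
    ref String,
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY created_at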

The benchmark punishes certain choices unfairly

We saw feedback (notably from Izzy Miller, thanks!) that some models were penalized for choosing SQL functions that guaranteed exactness but either hurt performance or produced results that differed from the human output.

For example, consider one of the questions: "top 10 repositories with the most people who have push access to the main branch."

The human query:

SELECT
    repo_name,
    uniqIf(actor_login, (event_type = 'PushEvent') AND match(ref, '/(main|master)$')) AS u
FROM github_events
WHERE (event_type IN ('PushEvent', 'WatchEvent'))
GROUP BY repo_name
ORDER BY u DESC
LIMIT 10

Izzy pointed out that the human chose the uniqIf() function to perform a distinct count, but uniqIf() is not exact: it approximates the count using a HyperLogLog (HLL) sketch. A model that chose count(DISTINCT) instead would produce a technically more accurate result, but at a cost in performance.

Additionally, the human had context to assume that "main branch" could mean main or master, whereas an LLM interprets the question literally and filters only on the main ref, as in Claude 3.7 Sonnet's query:

SELECT 
    repo_name,
    COUNT(DISTINCT actor_login) AS push_access_count
FROM github_events
WHERE event_type = 'PushEvent'
    AND ref = 'refs/heads/main'
GROUP BY repo_name
ORDER BY push_access_count DESC
LIMIT 10

Technically, the LLM produces a more exact query—Claude 3.7 Sonnet also chose an exact COUNT(DISTINCT)—but was penalized for not including master branches.

This is a limitation of our current comparison method. It favors the human result, even when the human takes minor shortcuts or uses imprecise inference to arrive at a query. It reveals a deeper issue: there's no universally "correct" SQL, only correct intent, and intent is difficult to measure quantitatively.
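
For illustration only, a query can satisfy both readings: keep the main-or-master assumption and still compute an exact distinct count via the uniqExactIf combinator. This is a sketch, not the benchmark's reference answer, and it trades query-time memory and speed for exactness.

-- Sketch: exact distinct count of pushers, matching both main and master refs.
SELECT
    repo_name,
    uniqExactIf(actor_login, match(ref, '/(main|master)$')) AS pushers
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY repo_name
ORDER BY pushers DESC
LIMIT 10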

What we can do:

  • Add syntax dialect tolerance to scoring logic (ClickHouse-specific vs ANSI SQL)
  • Document preferred dialect explicitly in each prompt
  • Provide side-by-side comparisons where scores are contested, so the reader can judge

This is a best-case benchmark, not a real-world simulator

Several commenters rightly noted that our benchmark tests a best-case setup: well-documented schema, single-dataset context, isolated prompt. It doesn't simulate the day-to-day mess of data work: mismatched schemas, evolving table structures, partial documentation, ambiguous KPIs.

We agree—and that's by design. We wanted a reproducible baseline first. But we don't intend to stop here.

What we can do:

  • Launch a community-sourced benchmark where users submit difficult prompts and schemas
  • Allow users to plug in their own ClickHouse instance and test models against their actual data
  • Log errors and hallucinations separately from invalid syntax, to differentiate between "dumb" and "dangerously confident" queries

Some things we got right (but didn't explain well enough)

While most of the feedback pointed to areas for improvement, several commenters also called out things we did get right—but perhaps failed to communicate clearly.

One major example is how we treated performance. Our benchmark already included efficiency metrics like rows read and execution latency, and these did influence model scoring. However, the terminology—"efficiency" and "exactness"—wasn't always intuitive. Some users assumed efficiency referred only to runtime speed, when in fact we penalized heavily for unnecessary reads, especially in a column-store context like ClickHouse.

To clarify: a model that returned the right result but scanned 10x more rows than needed did not get full marks. Efficiency scores factored in both raw performance and proportional data scanned per column.
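
One way to ground those metrics in ClickHouse is system.query_log, which records rows read, bytes read, and latency per query. The sketch below assumes query logging is enabled and that each run is tagged via the log_comment setting; the tag shown is a hypothetical labeling convention, not necessarily what our harness does.

-- Sketch: fetch efficiency metrics for a tagged benchmark run.
-- Assumes the query was executed with
-- SETTINGS log_comment = 'benchmark:q01:gpt-4o-mini' (hypothetical tag).
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND log_comment = 'benchmark:q01:gpt-4o-mini'
ORDER BY event_time DESC
LIMIT 1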

We explain this more thoroughly in the original benchmark post, but we'll make that connection more explicit in the UI next time.



What's next

This benchmark isn't meant to be final or authoritative. It's a snapshot of how well LLMs handle structured analytical queries today under ideal conditions. We learned a lot—not just about the models, but about what people expect from LLMs writing SQL.

If you want to:

  • Contribute queries or schemas for Round 2
  • Run the benchmark against your own models
  • Help improve the scoring methodology

...we're all ears. Just email us, open a PR on GitHub, or ping us on Twitter. Thanks again for the feedback. It helps us build sharper tools.

See the benchmark
