Two weeks ago, we published a benchmark comparing the ability of 19 large language models (and one human) to generate correct and efficient analytical SQL queries based on natural language prompts. The response across Reddit, Hacker News, Twitter, and elsewhere has been generous, thoughtful, and sometimes brutally honest.
This post summarizes the most recurring feedback and shares how we're thinking about incorporating it.
"Exactness" is a flawed scoring method
This came up repeatedly: models often generated SQL that produced results similar to the human baseline, but with a different structure or presentation, and were penalized unfairly for it. For example, one of the benchmark questions was, "How has the total number of stars changed over time by year?"
```sql
SELECT
    toYear(created_at) AS year,
    count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY year
ORDER BY year
```
The human-generated SQL simply filtered to `WatchEvent` entries and aggregated them by year with `count()`, returning only the two years that actually contain star data.
| year | stars   |
|------|---------|
| 2016 | 814207  |
| 2017 | 6490905 |
GPT-4o-mini produced a slightly different but still valid version: it used `countIf(event_type = 'WatchEvent')` instead of a `WHERE` filter, grouped by every year in the table, and applied a `LIMIT 10`, which produced rows of zeros for years with no star data.
```sql
SELECT
    toYear(created_at) AS year,
    countIf(event_type = 'WatchEvent') AS total_stars
FROM github_events
GROUP BY year
ORDER BY year
LIMIT 10
```
Despite both outputs conveying the same analytical meaning, GPT-4o-mini received an exactness score of just 20/100. This was because our comparison was based, among other criteria, on normalized RMSE over numerical values, combined with result set similarity (column alignment, row count, sorting). That method fails in corner cases like this where padding with zeroes causes large relative differences.
| year | stars   |
|------|---------|
| 2011 | 0       |
| 2014 | 0       |
| 2015 | 0       |
| 2016 | 814207  |
| 2017 | 6490905 |
| 2018 | 0       |
| 2019 | 0       |
| 2020 | 0       |
We used exactness heuristically, paired with manual review of outputs, to identify correctness. But it's true that exact string or row-level comparison does not capture the semantic validity of the result.
What we can do:
- Add a second grading pass based on result equivalence (i.e., does the query return the same output?)
- Consider a model consensus metric: give the question, query, and result to other models and ask "how correct is this?" to reduce human bias in the ground truth (this needs infra for self-reflection loops, possibly via evaluation chains)
- Use ClickHouse's `EXCEPT` and `INTERSECT` features, or hashes of sorted result sets, to compare outputs (see the sketch below)
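The `EXCEPT` route, for instance, can run entirely inside ClickHouse. Below is a minimal sketch that reuses the two star-count queries from the example above; each direction of the check lists rows that appear in one result set but not the other, so two empty results would mean the outputs match row for row.

```sql
-- Rows produced by the human query but not by the model query.
-- Running the same statement with the two halves swapped covers the other
-- direction; if both checks come back empty, the result sets are identical.
SELECT year, stars
FROM
(
    SELECT toYear(created_at) AS year, count() AS stars
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY year
)
EXCEPT
SELECT year, total_stars
FROM
(
    SELECT toYear(created_at) AS year, countIf(event_type = 'WatchEvent') AS total_stars
    FROM github_events
    GROUP BY year
    ORDER BY year
    LIMIT 10
)
```

A cheaper variant compares a single fingerprint per result set, e.g. `cityHash64(arraySort(groupArray((year, stars))))`, instead of the full row sets.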
Query complexity was too low
Several commenters argued that most queries in the benchmark were too simple: single-table `SELECT`s, basic filters, `GROUP BY` clauses, and straightforward aggregations. These are not representative of the kinds of SQL queries used in actual analytical workflows.
In particular, people wanted to see:
- CTE-heavy or nested subqueries
- Multi-table joins with join type variation (LEFT, INNER, anti-joins)
- Time bucketing and window functions
- Query rewriting for performance (e.g. minimize table scans)
- Schema inference and ambiguity handling
What we can do:
- Introduce a tiered difficulty system and flag queries as "basic," "intermediate," or "advanced"
- Add new questions that involve data modeling-style inference, e.g. finding the latest event per user (see the sketch after this list)
- Pull in synthetic questions from real-world sources (e.g., dbt test suites, Looker queries, or Dagster pipelines)
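To give a flavor of that last item, here is one shape the "latest event per user" question could take against the same `github_events` schema. It's only a sketch of a candidate question, not a committed addition:

```sql
-- A possible "advanced" tier answer: the most recent event per user,
-- found with a window function over the benchmark's github_events table.
SELECT actor_login, event_type, created_at
FROM
(
    SELECT
        actor_login,
        event_type,
        created_at,
        row_number() OVER (PARTITION BY actor_login ORDER BY created_at DESC) AS rn
    FROM github_events
)
WHERE rn = 1
```

An equally valid answer would use `argMax(event_type, created_at)` grouped by `actor_login`; that kind of structural divergence is exactly what the scoring changes above need to tolerate.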
Using a clean, known dataset makes the task easier
We used a curated version of the GitHub Archive dataset to keep things controlled: same schema, same data, same prompt across all models. This eliminates confounding variables, but it also makes the task easier than most real-world ones, where schema ambiguity and incomplete documentation are common.
One critique was that LLMs struggle more when the schema is large, inconsistently named, or lacks clear table relationships—which is often the case in internal data warehouses.
What we can do:
- Design a second benchmark track using an unfamiliar or minimally documented schema (possibly public healthcare or ecommerce data)
- Add a mode where only column names and table names are given, with no full schema explanations, to simulate a "developer dropped into a new repo" scenario (see the sketch after this list)
- Log which models ask follow-up clarifications (if prompted) or show error-handling capabilities
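In that mode, the model's only schema context would be something like the output of the query below (the exact prompt format is still open):

```sql
-- Everything the model would see in the minimal-context mode: table and
-- column names from the current database, with no types, comments, or
-- relationship hints.
SELECT table, name AS column_name
FROM system.columns
WHERE database = currentDatabase()
ORDER BY table, position
```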
The benchmark punishes certain choices unfairly
We saw feedback (notably from Izzy Miller, thanks!) that some models were penalized for choosing SQL functions that guaranteed exactness at the expense of performance, or that produced results diverging from the human output.
For example, consider one question: "Top 10 repositories with the most people who have push access to the main branch."
The human query:
```sql
SELECT
    repo_name,
    uniqIf(actor_login, (event_type = 'PushEvent') AND match(ref, '/(main|master)$')) AS u
FROM github_events
WHERE (event_type IN ('PushEvent', 'WatchEvent'))
GROUP BY repo_name
ORDER BY u DESC
LIMIT 10
```
Izzy pointed out that the human chose the `uniqIf()` function to perform a distinct count, but `uniqIf()` is not exact; it uses an HLL sketch to approximate the count. A model that chose `count(DISTINCT)` would produce a technically more accurate result, but at a cost in performance.
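To put a number on that tradeoff, one could compare the two approaches directly on the benchmark dataset. This is just a sanity check we could run, not part of the scoring:

```sql
-- How far the sketch-based uniq drifts from an exact distinct count of
-- pushing users; uniqExact is precise but slower and heavier on memory.
SELECT
    uniq(actor_login) AS approximate,
    uniqExact(actor_login) AS exact,
    abs(toInt64(approximate) - toInt64(exact)) / exact AS relative_error
FROM github_events
WHERE event_type = 'PushEvent'
```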
Additionally, the human had the context to assume that "main branch" could mean `main` or `master`, whereas an LLM that interprets the question literally filters only on the `main` ref, as in Claude 3.7 Sonnet's query:
```sql
SELECT
    repo_name,
    COUNT(DISTINCT actor_login) AS push_access_count
FROM github_events
WHERE event_type = 'PushEvent'
  AND ref = 'refs/heads/main'
GROUP BY repo_name
ORDER BY push_access_count DESC
LIMIT 10
```
Technically, the LLM produced the more exact query (Claude 3.7 Sonnet also chose an exact `COUNT(DISTINCT)`), but it was penalized for not including `master` branches.
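For reference, a query that captures both readings, the exact distinct count and the broader main-or-master interpretation, might look like the sketch below. We're not claiming it as the new ground truth; it just shows that the two penalized choices aren't mutually exclusive.

```sql
-- Exact distinct count of pushing users, accepting either main or master
-- as the "main branch": the human's branch reading combined with Claude's
-- exact counting.
SELECT
    repo_name,
    COUNT(DISTINCT actor_login) AS push_access_count
FROM github_events
WHERE event_type = 'PushEvent'
  AND match(ref, '/(main|master)$')
GROUP BY repo_name
ORDER BY push_access_count DESC
LIMIT 10
```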
This is a limitation of our current comparison method. It favors the human result, even when the human takes minor shortcuts or uses imprecise inference to arrive at a query. It reveals a deeper issue: there's no universally "correct" SQL; only correct intent, and intent can be difficult to measure quantitatively.
What we can do:
- Add syntax dialect tolerance to scoring logic (ClickHouse-specific vs ANSI SQL)
- Document preferred dialect explicitly in each prompt
- Provide side-by-side comparisons where scores are contested, so the reader can judge
This is a best-case benchmark, not a real-world simulator
Several commenters rightly noted that our benchmark tests a best-case setup: well-documented schema, single-dataset context, isolated prompt. It doesn't simulate the day-to-day mess of data work: mismatched schemas, evolving table structures, partial documentation, ambiguous KPIs.
We agree—and that's by design. We wanted a reproducible baseline first. But we don't intend to stop here.
What we can do:
- Launch a community-sourced benchmark where users submit difficult prompts and schemas
- Allow users to plug in their own ClickHouse instance and test models against their actual data
- Log errors and hallucinations separately from invalid syntax, to differentiate between "dumb" and "dangerously confident" queries
Some things we got right (but didn't explain well enough)
While most of the feedback pointed to areas for improvement, several commenters also called out things we did get right—but perhaps failed to communicate clearly.
One major example is how we treated performance. Our benchmark already included efficiency metrics like rows read and execution latency, and these did influence model scoring. However, the terminology—"efficiency" and "exactness"—wasn't always intuitive. Some users assumed efficiency referred only to runtime speed, when in fact we penalized heavily for unnecessary reads, especially in a column-store context like ClickHouse.
To clarify: a model that returned the right result but scanned 10x more rows than needed did not get full marks. Efficiency scores factored in both raw performance and proportional data scanned per column.
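Both signals come straight from ClickHouse's own instrumentation. As a sketch of where those numbers live (the `log_comment` tag shown here is hypothetical, not what our harness actually uses):

```sql
-- Latency and data-read metrics for one tagged benchmark run.
SELECT
    query_duration_ms,
    read_rows,
    read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
  AND log_comment = 'llm-sql-benchmark:q01:gpt-4o-mini'  -- hypothetical tag
ORDER BY event_time DESC
LIMIT 1
```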
We explain this more thoroughly in the original benchmark post, but we'll make that connection more explicit in the UI next time.
What's next
This benchmark isn't meant to be final or authoritative. It's a snapshot of how well LLMs handle structured analytical queries today under ideal conditions. We learned a lot—not just about the models, but about what people expect from LLMs writing SQL.
If you want to:
- Contribute queries or schemas for Round 2
- Run the benchmark against your own models
- Help improve the scoring methodology
...we're all ears. Just email us, open a PR on GitHub, or ping us on Twitter. Thanks again for the feedback. It helps us build sharper tools.