Text-to-SQL Finally Gets Real: DySQL-Bench, BibSQL, DLBench Fix the 'Perfect Query' Myth

DySQL-Bench

Finally, a Text-to-SQL test that doesn’t pretend users talk in single sentences. DySQL-Bench simulates real database chats, with a user, a model, and a database all interacting. It tests whether LLMs can actually handle multi-turn CRUD operations (not just SELECTs), the way finance analysts work. It contains 1,072 real-world tasks across 13 domains (sports, business, entertainment, and more), and UPDATEs make up 50% of operations, because real DB work isn’t just querying.

Dataset: https://github.com/Aurora-slz/Real-World-SQL-Bench
Paper: https://arxiv.org/abs/2510.26495

Paper Intent

Most Text-to-SQL benchmarks (Spider, BIRD) ignore how real users actually talk to databases. Users don’t get the request right on the first try; they iterate. So this paper builds a three-way system: a simulated user (played by an LLM), the model under test, and an executable database. No more “perfect query” fantasy.
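To make the three-way setup concrete, here is a minimal sketch of the loop, not the paper's harness. The names `run_dialogue`, `scripted_user`, and `scripted_model` are hypothetical; in the benchmark both roles are played by LLMs, while here they are hard-coded scripts and the database is an in-memory SQLite instance.

```python
import sqlite3
from typing import Callable, List

# Hypothetical stand-ins: each role maps the dialogue history to its next message.
Role = Callable[[List[str]], str]

def run_dialogue(conn: sqlite3.Connection, user: Role, model: Role,
                 max_turns: int = 5) -> List[str]:
    """Alternate user -> model -> database until the user says DONE."""
    history: List[str] = []
    for _ in range(max_turns):
        message = user(history)
        if message == "DONE":
            break
        history.append(f"user: {message}")
        sql = model(history)
        history.append(f"model: {sql}")
        try:
            rows = conn.execute(sql).fetchall()
            conn.commit()
            feedback = f"ok rows={rows}"
        except sqlite3.Error as exc:
            # Execution errors are fed back so the next turn can react to them.
            feedback = f"error: {exc}"
        history.append(f"db: {feedback}")
    return history

def scripted_user(history: List[str]) -> str:
    script = ["Add a player named Ana with 10 points.",
              "Actually, make that 12 points."]
    turn = sum(1 for h in history if h.startswith("user:"))
    return script[turn] if turn < len(script) else "DONE"

def scripted_model(history: List[str]) -> str:
    answers = ["INSERT INTO players(name, points) VALUES ('Ana', 10)",
               "UPDATE players SET points = 12 WHERE name = 'Ana'"]
    turn = sum(1 for h in history if h.startswith("model:"))
    return answers[turn]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players(name TEXT, points INTEGER)")
history = run_dialogue(conn, scripted_user, scripted_model)
final_state = conn.execute("SELECT name, points FROM players").fetchall()
```

The point of the sketch is that success is judged on the final database state after several turns, including a correction, rather than on a single query string.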

Dataset Analysis

1,072 multi-turn tasks across 13 domains, split into short (<3 steps) and long (≥3 steps).
50% of operations are UPDATEs, because real DBs need changes, not just reads.
Built via a two-step pipeline:
1. Convert DB tables to tree-structured JSON that LLMs can read.
2. Triple-check the generated tasks: LLM validators + SQL execution tests + human experts.
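Step 1 of the pipeline above can be sketched roughly as follows. The exact tree layout the authors use is not described here, so the `database -> table -> rows` nesting and the function name `db_to_tree` are assumptions for illustration only.

```python
import json
import sqlite3

def db_to_tree(conn: sqlite3.Connection) -> dict:
    """Serialize every table into a nested dict: table -> columns and rows.

    The nesting is an assumed layout, not the paper's actual schema.
    """
    tree = {}
    table_names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in table_names:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [desc[0] for desc in cursor.description]
        tree[table] = {
            "columns": columns,
            "rows": [dict(zip(columns, row)) for row in cursor.fetchall()],
        }
    return tree

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams(id INTEGER, name TEXT)")
conn.execute("CREATE TABLE players(id INTEGER, team_id INTEGER, name TEXT)")
conn.execute("INSERT INTO teams VALUES (1, 'Lions')")
conn.execute("INSERT INTO players VALUES (1, 1, 'Ana')")

tree = db_to_tree(conn)
tree_json = json.dumps(tree, indent=2)  # text that could be placed in an LLM prompt
```

A nested JSON view like this puts each row's values next to their column names, which is easier for an LLM to read than raw tuples.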

Summary

GPT-4o? Only 58.34% accurate on dynamic SQL. Pass@5 (long-term consistency)? 23.81%. This isn’t just another benchmark: it shows that current LLMs are still far from being reliable AI data analysts. Finally, a test that matches how humans actually work with databases.

BibSQL

BibSQL is the first Chinese Text-to-SQL dataset for academic search. 1,190 “question-SQL” pairs, built from Nanjing University Library data + Douban + knowledge graphs. No more guessing “how to find papers on quantum computing” with broken SQL.

Paper Intent

Old library systems suck at complex queries (“Show me papers on quantum computing in journals from 2020-2023 with high citation counts”). The paper tackles this by combining BibSQL with retrieval-augmented generation (RAG) and PoT prompting, where the model writes Python pseudocode first and SQL second. The dataset is built for real academic search.

Dataset Analysis

1,190 question-SQL pairs across 119 types (26 single-hop, 17 multi-hop, 76 complex).
Training uses human-written examples; testing uses LLM-generated questions, chosen for naturalness.
Built from 100k records (Nanjing University Library + Douban + knowledge graphs).

Summary

PoT (Python first, then SQL) lifted accuracy from 74.8% to 82.9%, and the full BibSQL + SoftSimMatch + PoT stack reached 96.6%. Real talk: it’s a system that actually gets what you mean when searching academic papers. No more “I don’t know how to write SQL for this.”

DLBench

DLBench is the first benchmark for cross-dialect SQL translation—testing if LLMs can move queries between databases (MySQL ↔ PostgreSQL, etc.). 9,320 dialect variants, 6,402 translation tasks across 7 DBMSs.

Leaderboard: https://dlbenchll.github.io/leaderboard.html
Paper: https://matafeiyanll.github.io/paper/ASE25.pdf

Paper Intent

SQL dialects are a mess. Existing translation tests only check whether the translated code runs, not whether it means the same thing. DLBench fixes this: a translation must be semantically equivalent to the source, syntactically correct in the target dialect, and must preserve schema constraints.
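The gap between “runs” and “means the same thing” is easy to show with division: MySQL's `/` returns 3.5 for `7 / 2`, while integer division in PostgreSQL or SQLite returns 3. The sketch below is an illustration, not DLBench's actual checker. It uses SQLite as a stand-in for both sides, and `same_results` is a hypothetical helper that compares result rows as multisets.

```python
import sqlite3
from collections import Counter

def same_results(conn_a: sqlite3.Connection, sql_a: str,
                 conn_b: sqlite3.Connection, sql_b: str) -> bool:
    """Compare two queries by their result rows, ignoring row order."""
    rows_a = Counter(conn_a.execute(sql_a).fetchall())
    rows_b = Counter(conn_b.execute(sql_b).fetchall())
    return rows_a == rows_b

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders(total INTEGER, n INTEGER)")
conn.execute("INSERT INTO orders VALUES (7, 2)")

# Reference: what the source query means under decimal division (3.5).
reference = "SELECT CAST(total AS REAL) / n FROM orders"
# Naive translation: copied verbatim, runs fine, but divides as integers (3).
naive = "SELECT total / n FROM orders"
# Careful translation: forces floating-point division in the target dialect.
careful = "SELECT total * 1.0 / n FROM orders"

naive_ok = same_results(conn, reference, conn, naive)
careful_ok = same_results(conn, reference, conn, careful)
```

Here the naive translation executes without error yet fails the comparison, while the careful one passes. Matching results on one database instance is only a necessary condition for equivalence, which is why a benchmark needs curated test data and human validation on top.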

Dataset Analysis

BIRD subset: 3,206 tasks (4,669 dialect features), long complex queries.
BT subset: 3,196 tasks (4,651 dialect features), covers DQL/DDL/DML/DCL.
Built via:
1. Collect high-quality DBs + SQLs.
2. Clean via SQL-92 checks + dialect parsers.
3. GPT-4o-mini + 3 human experts for translation + validation.

Summary

GPT-4o? Only 70% accurate on translation. Even the best model still makes syntax and logic errors. DLBench isn’t just a test: it’s the first real standard for cross-database SQL translation. Finally, a benchmark that doesn’t ignore the messy reality of SQL dialects.

Paper: https://aclanthology.org/2025.findings-emnlp.1312.pdf
Data: https://github.com/AlexJJJChen/DeKeyNLU

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.