Text-to-SQL Finally Gets Real: DySQL-Bench, BibSQL, DLBench Fix the 'Perfect Query' Myth


Finally, a Text-to-SQL test that doesn’t pretend users talk in single sentences. DySQL-Bench simulates real database chats, with the user, the model, and the database all interacting. It tests whether LLMs can handle multi-turn CRUD operations (not just SELECTs), the way finance analysts actually work. It contains 1,072 real-world tasks across 13 domains (including sports, business, and entertainment), and UPDATEs make up 50% of the operations, because real DB work isn’t just querying.

Most Text-to-SQL tests (Spider, BIRD) ignore how real users talk to databases. Users don’t get the query right on the first try; they iterate. So this paper built a three-way system: a simulated user (an LLM), the model under test, and an executable database. No more “perfect query” fantasy.
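To make the three-way setup concrete, here is a minimal sketch of the loop, not DySQL-Bench's actual harness. Everything in it is an assumption made for illustration: the `accounts` schema, the scripted `USER_TURNS` standing in for the LLM-simulated user, and the canned `model_under_test` standing in for a real model. SQLite plays the executable database.

```python
import sqlite3

def make_db():
    """Create a tiny in-memory database (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "alice", 100.0), (2, "bob", 50.0)],
    )
    conn.commit()
    return conn

# Scripted stand-in for the LLM-simulated user: one request per turn.
USER_TURNS = [
    "What is Alice's balance?",
    "Add 25 to it.",
    "Show me the new balance.",
]

# Canned stand-in for the model under test: maps each request to SQL.
CANNED_SQL = {
    "What is Alice's balance?":
        "SELECT balance FROM accounts WHERE owner = 'alice'",
    "Add 25 to it.":
        "UPDATE accounts SET balance = balance + 25 WHERE owner = 'alice'",
    "Show me the new balance.":
        "SELECT balance FROM accounts WHERE owner = 'alice'",
}

def model_under_test(history, request):
    """Hypothetical model: a real system would prompt an LLM with the history."""
    return CANNED_SQL[request]

def run_dialogue(conn, turns):
    """Run each turn: user request -> model SQL -> database execution."""
    history = []
    for request in turns:
        sql = model_under_test(history, request)
        try:
            rows = conn.execute(sql).fetchall()  # [] for non-SELECT statements
            conn.commit()
            feedback = rows
        except sqlite3.Error as exc:
            feedback = f"error: {exc}"  # errors would be fed back to the user
        history.append((request, sql, feedback))
    return history

def final_state(conn):
    """Snapshot the table so the end state can be compared with an expected one."""
    return conn.execute(
        "SELECT id, owner, balance FROM accounts ORDER BY id"
    ).fetchall()

conn = make_db()
history = run_dialogue(conn, USER_TURNS)
state = final_state(conn)
```

Because the second turn is an UPDATE, checking the returned rows alone would miss it; that is why this sketch also snapshots the final table contents. How the benchmark itself scores a dialogue is not described here, so treat the snapshot comparison as one plausible choice.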

GPT-4o? Only 58.34% accurate on dynamic SQL. Pass@5 (long-term consistency)? 23.81%. This isn’t just another benchmark: it shows that current LLMs still fall well short of being real AI data analysts, on a test that matches how humans actually work with databases.

BibSQL is the first Chinese Text-to-SQL dataset for academic search. It has 1,190 “question-SQL” pairs, built from Nanjing University Library data, Douban, and knowledge graphs. No more guessing “how to find papers on quantum computing” with broken SQL.

Old library systems are bad at complex queries (“Show me papers on quantum computing in journals from 2020-2023 with high citation counts”). BibSQL combines the dataset with RAG and PoT (the model writes Python pseudocode first, then SQL) to handle them.

BibSQL + SoftSimMatch + PoT boosted accuracy to 96.6%. PoT (Python first, then SQL) lifted accuracy from 74.8% to 82.9%. Real talk: it’s the first system that actually gets what you mean when searching academic papers. No more “I don’t know how to write SQL for this.”
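The "Python first, then SQL" idea can be illustrated with a small sketch. This is not BibSQL's prompt or pipeline, which the summary above does not describe; the `papers` schema, the sample rows, and the `MIN_CITATIONS` threshold (one reading of "high citation counts") are all assumptions invented for the example. The point is only that spelling the logic out as explicit Python steps makes the SQL easier to derive and to check.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (title TEXT, topic TEXT, year INTEGER, citations INTEGER)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?)",
    [
        ("Paper A", "quantum computing", 2021, 120),
        ("Paper B", "quantum computing", 2019, 300),
        ("Paper C", "quantum computing", 2022, 15),
        ("Paper D", "databases", 2021, 200),
    ],
)

MIN_CITATIONS = 100  # assumed meaning of "high citation counts"

# Step 1: the Python "thought". Each condition in the question becomes
# one explicit filter, followed by a sort on title.
def plan(rows):
    kept = [
        r for r in rows
        if r[1] == "quantum computing"      # topic filter
        and 2020 <= r[2] <= 2023            # year range
        and r[3] >= MIN_CITATIONS           # citation threshold
    ]
    return sorted(r[0] for r in kept)

# Step 2: the SQL written from that plan, clause by clause.
SQL = """
SELECT title FROM papers
WHERE topic = ? AND year BETWEEN 2020 AND 2023 AND citations >= ?
ORDER BY title
"""

all_rows = conn.execute(
    "SELECT title, topic, year, citations FROM papers"
).fetchall()
from_plan = plan(all_rows)
from_sql = [row[0] for row in conn.execute(SQL, ("quantum computing", MIN_CITATIONS))]
```

On this toy data both `from_plan` and `from_sql` contain only "Paper A", so the plan and the query agree.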

DLBench is the first benchmark for cross-dialect SQL translation. It tests whether LLMs can move queries between databases (MySQL ↔ PostgreSQL, etc.), with 9,320 dialect variants and 6,402 translation tasks across 7 DBMSs.

SQL dialects are a mess. Existing translation tests only check whether the translated code runs, not whether it means the same thing. DLBench fixes this: it requires translations to be semantically equivalent, syntactically correct, and faithful to the schema constraints.
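A quick sketch shows why "it runs" is a weaker test than "it means the same thing". This is not DLBench's evaluation code. It compares result sets on one small table, which is only a proxy for semantic equivalence, and it uses SQLite as a stand-in for both engines because SQLite happens to accept both the MySQL-style `LIMIT offset, count` form and the `LIMIT count OFFSET offset` form. The table `t` and the helper `same_results` are hypothetical.

```python
import sqlite3
from collections import Counter

def same_results(conn, sql_a, sql_b, ordered=True):
    """Execution-based check: do two queries return the same rows?"""
    rows_a = conn.execute(sql_a).fetchall()
    rows_b = conn.execute(sql_b).fetchall()
    if ordered:
        return rows_a == rows_b
    # Without ORDER BY, compare as multisets so row order does not matter.
    return Counter(rows_a) == Counter(rows_b)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 11)]
)

# Source query in the MySQL-style form: skip 2 rows, then take 3.
mysql_style = "SELECT id FROM t ORDER BY id LIMIT 2, 3"

# Faithful translation: take 3 rows after skipping 2.
good_translation = "SELECT id FROM t ORDER BY id LIMIT 3 OFFSET 2"

# Plausible-looking mistake: the two numbers are swapped. It still executes.
bad_translation = "SELECT id FROM t ORDER BY id LIMIT 2 OFFSET 3"

good_ok = same_results(conn, mysql_style, good_translation)
bad_ok = same_results(conn, mysql_style, bad_translation)
```

Here `good_ok` is `True` and `bad_ok` is `False`: the swapped version executes without error but returns ids 4 and 5 instead of 3, 4, and 5. A check that only asks "does it run?" would accept both translations.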

GPT-4o? Only 70% accurate on translation. Even the best model still makes syntax and logic errors. DLBench isn’t just a test: it sets the first real standard for cross-database SQL translation, one that doesn’t ignore the messy reality of SQL dialects.