GeoSQL-Eval: Finally, a PostGIS Benchmark That Doesn’t Make Me Scream

GeoSQL-Eval / GeoSQL-Bench

Finally—a PostGIS test that doesn’t make me want to throw my laptop. GeoSQL-Eval checks if LLMs actually get spatial queries, not just vomit syntactically valid but useless SQL. They dropped GeoSQL-Bench: 14,178 real tasks, 340 PostGIS functions covered, 82 legit spatial DBs (land use, transport networks—you name it).

Paper Intent

Let’s be real: old NL2SQL benchmarks skip the messy spatial stuff—geometry types, CRS, PostGIS quirks. So models hallucinate ST_Buffer when they need ST_Distance. GeoSQL-Bench + GeoSQL-Eval fix that. Built with spatial DB folks, not just theorists. Tests if models handle real client queries, not textbook examples.
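
A minimal sketch of what that misuse looks like, assuming hypothetical schools and hospitals tables with a geom column (illustrative only, not from the benchmark):

    -- Hallucinated pattern: buffer-then-intersect, which breaks if geom is in degrees
    SELECT s.name
    FROM schools s, hospitals h
    WHERE ST_Intersects(s.geom, ST_Buffer(h.geom, 500));

    -- What “schools within 500 m of a hospital” actually wants: a distance predicate
    SELECT s.name
    FROM schools s
    JOIN hospitals h ON ST_DWithin(s.geom::geography, h.geom::geography, 500);

Same intent, different function family; a benchmark that never touches geometry types or CRS can’t tell these apart.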

Dataset Analysis

  • 2,380 MCQs/T-F: Straight from PostGIS 3.5 docs—tests if models know what functions do, not just syntax.
  • 3,744 SQL gen tasks: Mix of clear prompts (“add column age”) and vague ones (“add a field”)—forces type guessing (VARCHAR? INT? You decide). See the sketch after this list.
  • 2,155 schema tasks: Built on UN GGIM + ISO 19115 databases. Models must navigate actual table relationships.
    All GPT-4o drafted → triple-checked by human spatial experts. No lazy labeling.
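
Here’s roughly what that clear-vs-vague split looks like in practice, on a hypothetical residents table (an assumption for illustration, not lifted from the dataset):

    -- Clear prompt: “add column age” fixes the name; the model still has to commit to a type
    ALTER TABLE residents ADD COLUMN age integer;

    -- Vague prompt: “add a field” leaves both the name and the type to the model
    ALTER TABLE residents ADD COLUMN notes varchar(255);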

Summary

Tested 24 models. GPT-5/o4-mini crushed geometry-heavy queries. But 70% of errors? Still function misuse. Schema tasks (multi-table joins) = hardest. This isn’t “another benchmark”—it’s the first real test for spatial SQL. Period.
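
For a feel of why the schema tasks bite, here’s a hedged sketch of a typical multi-table spatial join, using made-up land_use_parcels and roads tables rather than anything from GeoSQL-Bench:

    -- “Total road length inside each land-use parcel” needs a join plus two geometry functions
    SELECT p.parcel_id,
           SUM(ST_Length(ST_Intersection(r.geom, p.geom))) AS road_length
    FROM land_use_parcels p
    JOIN roads r ON ST_Intersects(r.geom, p.geom)
    GROUP BY p.parcel_id;

Get the join predicate or the intersection wrong and the answer is silently garbage, which is exactly the failure mode the schema tasks probe.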

DeKeyNLU

DeKeyNLU fixes the quiet killer in NL2SQL: LLMs failing to break down “Show me Q3 sales in APAC” into actual DB steps. They built a dataset where humans actually verified task splits and keywords—then baked it into DeKeySQL’s pipeline.
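
To make the decomposition concrete, here’s the kind of target such a question has to land on, against an invented orders/regions schema (the table and column names are assumptions, not DeKeyNLU’s):

    -- Main task: aggregate sales
    -- Sub-tasks: restrict to Q3, restrict to the APAC region
    -- Keywords: “Q3” -> orders.quarter, “APAC” -> regions.name
    SELECT SUM(o.amount) AS q3_apac_sales
    FROM orders o
    JOIN regions r ON o.region_id = r.region_id
    WHERE o.quarter = 'Q3' AND r.name = 'APAC';

Miss a sub-task (say, the region filter) or map “Q3” to the wrong column, and the SQL comes out fluent but wrong; that is the exact failure DeKeyNLU targets.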

Paper Intent

RAG/CoT pipelines keep choking on task decomposition and keyword extraction. Existing datasets? Fragmented or missing domain keywords (“fiscal year,” “student cohort”). DeKeyNLU drops a clean fix: a new dataset + DeKeySQL’s 3-module flow—user question understanding → entity retrieval → SQL generation. They fine-tuned only the understanding module… and accuracy jumped hard.

Dataset Analysis

  • 1,500 QA pairs, pulled from BIRD benchmark (finance, education, real DB scenarios).
  • Split 7:2:1—train/val/test, no weird ratios.
  • Workflow: GPT-4o drafted task splits (main/sub) + keywords (objects/implementation) → three experts cross-checked three times. Painstaking? Yes. Worth it? Absolutely.

Summary

Fine-tuning “user question understanding” with DeKeyNLU pushed BIRD dev accuracy from 62.31% → 69.10%, Spider from 84.2% → 88.7%. Plot twist? Entity retrieval is the make-or-break step (not understanding), followed by question parsing. Proves: targeted dataset design + smart pipeline tweaks > throwing more data at the problem. Finally—NL2SQL that gets what you mean.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Powered by AI models, SQLFlash accurately pinpoints SQL performance bottlenecks and optimizes query performance, freeing you from the grind of manual SQL tuning so you can focus fully on building your business logic.

How do I use SQLFlash with my database?

Ready to elevate your SQL performance?

Join us and experience the power of SQLFlash today!