GeoSQL-Eval / GeoSQL-Bench
Finally—a PostGIS test that doesn’t make me want to throw my laptop. GeoSQL-Eval checks if LLMs actually get spatial queries, not just vomit syntactically valid but useless SQL. They dropped GeoSQL-Bench: 14,178 real tasks, 340 PostGIS functions covered, 82 legit spatial DBs (land use, transport networks—you name it).
- Leaderboard: https://haoyuejiao.github.io/GeoSQL-Eval-Leaderboard/
- Paper: https://arxiv.org/pdf/2509.25264
Paper Intent
Let’s be real: old NL2SQL benchmarks skip the messy spatial stuff—geometry types, CRS, PostGIS quirks. So models hallucinate ST_Buffer when they need ST_Distance. GeoSQL-Bench + GeoSQL-Eval fix that. Built with spatial DB folks, not just theorists. Tests if models handle real client queries, not textbook examples.
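To make that failure mode concrete, here's a minimal sketch of the pattern a spatial benchmark has to catch. The table and column names (schools, transit_stops, geom) and the "within 500 m" task are my own invention for illustration, not items from GeoSQL-Bench.

```python
# Hypothetical task: "find schools within 500 m of a transit stop".
# Table/column names are invented for illustration.

# What a model often hallucinates: buffer-then-intersect on raw geometry.
# On an EPSG:4326 geometry column the 500 is in degrees, not metres,
# so this runs fine and returns garbage.
naive_sql = """
SELECT DISTINCT s.name
FROM schools s, transit_stops t
WHERE ST_Intersects(ST_Buffer(s.geom, 500), t.geom);
"""

# What the task actually needs: a metre-based distance test on geography
# (ST_DWithin(geog, geog, 500) is the index-friendly equivalent).
correct_sql = """
SELECT DISTINCT s.name
FROM schools s
JOIN transit_stops t
  ON ST_Distance(s.geom::geography, t.geom::geography) < 500;
"""

print(correct_sql)
```

Both queries execute without error; only the second answers the question in metres, and that gap is exactly the "syntactically valid but useless SQL" the benchmark is built to expose.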
Dataset Analysis
- 2,380 MCQs/T-F: Straight from PostGIS 3.5 docs—tests if models know what functions do, not just syntax.
- 3,744 SQL gen tasks: Mix of clear prompts ("add column age") and vague ones ("add a field") that force type guessing (VARCHAR? INT? You decide). A quick sketch of both flavors follows this list.
- 2,155 schema tasks: Built on UN GGIM + ISO 19115 databases. Models must navigate actual table relationships. All GPT-4o drafted → triple-checked by human spatial experts. No lazy labeling.
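Here's the sketch promised above: what a clear vs. vague generation item might look like. The dict keys and the residents table are hypothetical placeholders; check the benchmark release for the actual item format.

```python
# Hypothetical generation items; key names and the "residents" table are
# invented for illustration, not the released GeoSQL-Bench format.
clear_task = {
    "prompt": "Add a column named age of type INT to the residents table.",
    "reference_sql": "ALTER TABLE residents ADD COLUMN age INT;",
}

vague_task = {
    "prompt": "Add a field for age to residents.",
    # No type is stated, so a grader has to accept more than one answer;
    # this ambiguity is what the vague prompts are probing.
    "acceptable_sql": [
        "ALTER TABLE residents ADD COLUMN age INT;",
        "ALTER TABLE residents ADD COLUMN age VARCHAR(8);",
    ],
}

print(clear_task["reference_sql"])
print(vague_task["acceptable_sql"][0])
```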
Summary
Tested 24 models. GPT-5/o4-mini crushed geometry-heavy queries. But 70% of errors? Still function misuse. Schema tasks (multi-table joins) = hardest. This isn’t "another benchmark"—it’s the first real test for spatial SQL. Period.
DeKeyNLU
DeKeyNLU fixes the quiet killer in NL2SQL: LLMs failing to break down "Show me Q3 sales in APAC" into actual DB steps. They built a dataset where humans actually verified task splits and keywords—then baked it into DeKeySQL’s pipeline.
- Paper: https://aclanthology.org/2025.findings-emnlp.1312.pdf
- Data: https://github.com/AlexJJJChen/DeKeyNLU
Paper Intent
RAG/CoT pipelines keep choking on task decomposition and keyword extraction. Existing datasets? Fragmented or missing domain keywords ("fiscal year," "student cohort"). DeKeyNLU drops a clean fix: a new dataset + DeKeySQL’s 3-module flow—user question understanding → entity retrieval → SQL generation. They fine-tuned only the understanding module... and accuracy jumped hard.
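As a rough mental model of that three-module flow (not the authors' code; every function name, signature, and toy output below is a placeholder), the pipeline looks something like this:

```python
# Placeholder sketch of DeKeySQL's three-stage flow; function names, signatures,
# and the toy outputs are mine, not the authors' implementation.

def understand_question(question: str) -> dict:
    """Stage 1 (the module fine-tuned on DeKeyNLU): split the question into
    main/sub tasks and pull out object/implementation keywords."""
    return {
        "main_task": "report Q3 sales",
        "sub_tasks": ["restrict to region APAC", "aggregate over the quarter"],
        "keywords": {"objects": ["sales", "region"], "implementation": ["Q3", "APAC"]},
    }

def retrieve_entities(parsed: dict, schema: dict) -> dict:
    """Stage 2: map the extracted keywords onto concrete tables, columns, and values."""
    return {"table": "sales", "columns": ["amount", "region", "quarter"]}

def generate_sql(parsed: dict, entities: dict) -> str:
    """Stage 3: compose SQL from the decomposition plus the grounded schema entities."""
    return ("SELECT SUM(amount) FROM sales "
            "WHERE region = 'APAC' AND quarter = 'Q3';")

parsed = understand_question("Show me Q3 sales in APAC")
entities = retrieve_entities(parsed, schema={})
print(generate_sql(parsed, entities))
```

Per the paper, only the first stage gets fine-tuned on DeKeyNLU; the retrieval and generation stages are left as-is.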
Dataset Analysis
- 1,500 QA pairs, pulled from BIRD benchmark (finance, education, real DB scenarios).
- Split 7:2:1—train/val/test, no weird ratios.
- Workflow: GPT-4o drafted task splits (main/sub) + keywords (objects/implementation) → three experts cross-checked *three times*. Painstaking? Yes. Worth it? Absolutely.
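For orientation, a single annotated item plausibly looks like the record below. The JSON keys are my reconstruction of the paper's description (main/sub task split, object vs. implementation keywords), not the repo's exact schema.

```python
# One hypothetical DeKeyNLU record; keys reconstructed from the paper's
# description, not copied from the repository.
example_item = {
    "question": "List the names of students who enrolled after 2020.",
    "decomposition": {
        "main_task": "list student names",
        "sub_tasks": ["filter by enrollment year > 2020"],
    },
    "keywords": {
        "objects": ["students", "name", "enrollment year"],
        "implementation": ["after 2020"],
    },
    "sql": "SELECT name FROM students WHERE enroll_year > 2020;",
}

print(example_item["decomposition"]["main_task"])
```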
Summary
Fine-tuning the "user question understanding" module on DeKeyNLU pushed BIRD dev accuracy from 62.31% → 69.10% and Spider from 84.2% → 88.7%. Plot twist? Entity retrieval turns out to be the make-or-break stage (not understanding), with question parsing next. Proves: targeted dataset design + smart pipeline tweaks > throwing more data at the problem. Finally: NL2SQL that gets what you mean.

