AI-Driven SQL Dataset Optimization, August 2025: SQLStorm & CogniSQL

In the past two months, several new datasets have been released in the NL2SQL field. Based on publicly available online materials, four papers—SQLStorm, CogniSQL, RubikSQL, and FinStat2SQL—have mentioned dataset releases. Among these, RubikSQL has not yet open-sourced its code or dataset, and FinStat2SQL has not explicitly stated whether its dataset will be publicly available.
Therefore, this article will focus on introducing the currently accessible SQLStorm and CogniSQL datasets.
SQLStorm v1.0 is a large-scale benchmark built on real-world data, encompassing three different data sizes (1 GB, 12 GB, and 220 GB) and covering over 18,000 queries. The benchmark pioneers the use of artificial intelligence to generate query workloads, producing a large volume of SQL text (22 MB) that closely mimics real-world scenarios at very low cost (roughly $15).
It significantly expands the coverage of SQL functionality and query structures. In contrast, traditional manually crafted benchmarks like TPC-H, TPC-DS, and JOB fall short in query diversity and complexity.
SQLStorm can be used in the scenarios described below.
SQLStorm employs large language models (LLMs) to generate SQL statements for database performance testing, aiming to address the shortcomings of traditional datasets such as TPC-H in terms of SQL feature coverage. The benchmark is compatible with mainstream database systems such as PostgreSQL, Umbra, and DuckDB. The data is partially based on real databases provided by StackOverflow and includes a set of schemas plus data at the three sizes mentioned above (1 GB, 12 GB, and 220 GB).
The query generation process uses an LLM to produce the SQL workload directly from the benchmark schema; a rough sketch of what such a pipeline can look like is shown below.
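The following is only an illustrative sketch, not SQLStorm's published pipeline: it assumes an OpenAI-compatible client, a generic chat model, an invented prompt, and a DuckDB planning check to filter out statements that are invalid for the schema.

```python
# Hypothetical sketch of LLM-based SQL workload generation; not SQLStorm's actual pipeline.
import duckdb
from openai import OpenAI  # assumes an OpenAI-compatible endpoint is configured

client = OpenAI()

def generate_queries(schema_ddl: str, n: int = 20, model: str = "gpt-4o-mini") -> list[str]:
    """Ask an LLM for diverse analytical queries over the given schema."""
    prompt = (
        "Here is a database schema:\n"
        f"{schema_ddl}\n\n"
        f"Write {n} diverse analytical SQL queries that exercise joins, aggregation, "
        "window functions, CTEs, and subqueries. Terminate each query with ';'."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature -> more structural variety
    )
    text = resp.choices[0].message.content
    # Naive split on ';' is enough for a sketch; a real pipeline would parse properly.
    return [q.strip() for q in text.split(";") if q.strip()]

def keep_plannable(queries: list[str], db_path: str) -> list[str]:
    """Discard statements the target engine cannot even plan (here: DuckDB)."""
    con = duckdb.connect(db_path, read_only=True)
    valid = []
    for q in queries:
        try:
            con.execute("EXPLAIN " + q)  # planning only, no full execution
            valid.append(q)
        except duckdb.Error:
            pass  # syntactically or semantically invalid against this schema
    return valid
```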
SQLStorm can generate diverse SQL queries at low cost, effectively revealing performance bottlenecks and errors in database systems. For example, after SQLStorm was introduced, Umbra quickly improved its Crash, Error, and Timeout + OOM results on the benchmark by addressing the issues it uncovered. The paper's results also show that SQLStorm surfaced performance regressions that were not evident under TPC-H, and these were identified and resolved (note: Umbra and SQLStorm appear to come from the same team).
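As an illustration of how such a workload can be replayed and bucketed into those outcome categories, here is a rough harness against PostgreSQL. The use of psycopg2, the `statement_timeout` value, and the classification rules are assumptions, not the authors' tooling.

```python
# Hypothetical replay harness that buckets query outcomes into the categories the
# SQLStorm results report (ok / error / timeout / crash); not the authors' tooling.
from collections import Counter
import psycopg2
import psycopg2.errors

def replay(queries: list[str], dsn: str, timeout_ms: int = 60_000) -> Counter:
    outcomes: Counter = Counter()
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    for sql in queries:
        try:
            with conn.cursor() as cur:
                cur.execute(f"SET statement_timeout = {timeout_ms}")
                cur.execute(sql)
                cur.fetchall()
            outcomes["ok"] += 1
        except psycopg2.errors.QueryCanceled:
            outcomes["timeout"] += 1   # exceeded statement_timeout
        except psycopg2.OperationalError:
            outcomes["crash"] += 1     # backend or connection died; reconnect and continue
            conn = psycopg2.connect(dsn)
            conn.autocommit = True
        except psycopg2.Error:
            outcomes["error"] += 1     # planner/executor rejected the query
    return outcomes
```

In PostgreSQL, an out-of-memory kill typically surfaces as a lost backend connection, so in this sketch it would land in the crash bucket rather than in a separate OOM count.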
SQLStorm can also be used to evaluate tasks such as SQL optimization. With the complete databases and query samples that SQLStorm provides, performance differences before and after SQL optimization can be measured effectively. The full database contents likewise make it possible to evaluate the runtime quality of SQL generated by different NL2SQL systems.
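A minimal sketch of that kind of evaluation might look as follows, assuming the full database is available as a DuckDB file; the function name and the result-equivalence check are illustrative, not part of SQLStorm.

```python
# Minimal sketch (not an official SQLStorm utility): check that a rewritten query still
# returns the same rows as the original and compare wall-clock latency on DuckDB.
import time
import duckdb

def compare(db_path: str, original_sql: str, optimized_sql: str) -> dict:
    con = duckdb.connect(db_path, read_only=True)

    def run(sql: str):
        start = time.perf_counter()
        rows = con.execute(sql).fetchall()
        return rows, time.perf_counter() - start

    rows_a, t_a = run(original_sql)
    rows_b, t_b = run(optimized_sql)
    return {
        # Sorting row reprs makes the check order-insensitive (skip this if ORDER BY matters).
        "equivalent": sorted(map(repr, rows_a)) == sorted(map(repr, rows_b)),
        "original_s": round(t_a, 3),
        "optimized_s": round(t_b, 3),
        "speedup": round(t_a / t_b, 2) if t_b > 0 else float("inf"),
    }
```

Because SQLStorm ships the complete databases, both statements can be timed against realistic data volumes rather than toy samples.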
CogniSQL has released two curated datasets that significantly advance research in execution-aligned, scalable text-to-SQL generation. With these resources open-sourced, the community can directly leverage high-precision SQL samples and clear reasoning paths to support lightweight reinforcement learning and reasoning-enhanced text-to-SQL model training under limited computational resources.
Reasoning Traces (5,024 entries):
Generated by Qwen QWQ 32B, this dataset provides fixed reasoning paths, enhancing interpretability and enabling transparent observation of the SQL generation process.
Positive Sample Corpus (36,356 entries):
Generated by Qwen2.5-7B-Coder, each original training example produces six distinct training instances comprising both reasoning processes and outcomes, thereby expanding the diversity of reasoning paths.
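The paper's exact filtering pipeline is not reproduced here, but a common way to build an execution-aligned positive corpus is to sample several candidate generations per training question and keep only those whose execution results match the gold SQL. In the sketch below, the record fields (`question`, `gold_sql`, `db_path`), the `generate()` callable, and the `SQL:` delimiter are assumptions for illustration.

```python
# Hypothetical construction of an execution-aligned positive corpus; the field names
# (question, gold_sql, db_path) and the generate() callable are illustrative only.
from typing import Callable
import duckdb

def execute(db_path: str, sql: str):
    """Return result rows, or None if the query fails."""
    try:
        return duckdb.connect(db_path, read_only=True).execute(sql).fetchall()
    except duckdb.Error:
        return None

def build_positive_corpus(examples: list[dict],
                          generate: Callable[[str], str],
                          samples_per_example: int = 6) -> list[dict]:
    corpus = []
    for ex in examples:
        gold_rows = execute(ex["db_path"], ex["gold_sql"])
        if gold_rows is None:
            continue  # skip examples whose gold query does not run
        for _ in range(samples_per_example):
            completion = generate(ex["question"])    # reasoning text followed by SQL
            pred_sql = completion.split("SQL:")[-1]  # assumes a "SQL:" delimiter in the output
            if execute(ex["db_path"], pred_sql) == gold_rows:
                corpus.append({"question": ex["question"], "completion": completion})
    return corpus
```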
These datasets were used in the study's supervised fine-tuning (SFT) phase.
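For orientation, a bare-bones version of such an SFT step is sketched below, assuming the traces are stored as question/reasoning-trace pairs; the Hugging Face model id, field names, and hyperparameters are placeholders rather than the paper's actual configuration.

```python
# Bare-bones SFT sketch over reasoning-trace pairs; the model id, field names, and
# hyperparameters are placeholders, not the paper's actual training configuration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"   # stand-in for the paper's base model
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def collate(batch):
    # Concatenate question and reasoning trace; standard causal-LM loss on the sequence.
    texts = [ex["question"] + "\n" + ex["reasoning_trace"] for ex in batch]
    enc = tok(texts, padding=True, truncation=True, max_length=2048, return_tensors="pt")
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

def sft(dataset, epochs: int = 1, lr: float = 1e-5):
    loader = DataLoader(dataset, batch_size=1, shuffle=True, collate_fn=collate)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
```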
Beyond the datasets, the paper’s core contribution is the CogniSQL-R1-Zero framework. After experimenting with long AI agent workflows, SFT, and GRPO, the authors ultimately used GRPO for reinforcement learning to enhance NL2SQL performance. Experiments showed that CogniSQL-R1-Zero based on Qwen2.5-7B-Coder achieved 59% execution accuracy on Bird-dev, outperforming larger baseline models like DeepSeek-Coder (236B) and Mistral (123B).
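The GRPO training details are in the paper; the sketch below only illustrates the kind of execution-aligned reward such a setup optimizes, where a completion earns full reward only when its SQL executes and matches the gold query's results. The partial scores, the `SQL:` extraction convention, and the use of DuckDB are assumptions, not CogniSQL-R1-Zero's exact reward function.

```python
# Hypothetical execution-aligned reward of the kind a GRPO setup would optimize; the
# scoring values and SQL extraction are illustrative, not CogniSQL-R1-Zero's exact reward.
import duckdb

def extract_sql(completion: str) -> str | None:
    # Assumes the prompt asks the model to put its final query after a "SQL:" marker.
    _, sep, tail = completion.rpartition("SQL:")
    return tail.strip() if sep else None

def reward(completion: str, gold_sql: str, db_path: str) -> float:
    pred_sql = extract_sql(completion)
    if pred_sql is None:
        return 0.0                      # malformed output, no SQL found
    con = duckdb.connect(db_path, read_only=True)
    try:
        pred_rows = con.execute(pred_sql).fetchall()
    except duckdb.Error:
        return 0.1                      # well-formed output, but the SQL does not execute
    gold_rows = con.execute(gold_sql).fetchall()
    # Full reward only when execution results match the gold query's results.
    return 1.0 if sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows)) else 0.3
```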
CogniSQL’s key contribution lies in producing high-quality reasoning traces and positive samples aligned with small-model reasoning through a generative approach, addressing the prior lack of datasets tailored to how smaller models reason. This enables small models to generalize effectively in low-resource environments via SFT. The paper also summarizes critical insights for large-model training.
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
Join us and experience the power of SQLFlash today!