AI-Driven SQL Dataset Optimization 202509: REEF&text2SQL4PM


This month, I came across two papers on NL2SQL datasets: REEF and text2SQL4PM. Both investigate the application of NL2SQL to data analysis.
REEF comprises 18 interconnected tables (e.g., Product, Order, User). Its data distributions are annotated and encode specific causal relationships between variables, which can be used to construct realistic causal graphs. The dataset is designed to evaluate the capabilities of large language models on end-to-end causal analysis tasks.

Dataset URL: https://github.com/ChaemyungLim/ORCA/tree/main/REEF
Paper URL: https://arxiv.org/html/2508.21304
The paper proposes ORCA (Orchestrating Causal Agent), an LLM agentic system designed to address end-to-end causal analysis tasks in data analysis. End-to-end causal analysis refers to questions such as: “Do coupons increase the probability of users purchasing a certain product?” ORCA automates routine workflows in relational databases while preserving expert oversight through human-AI interaction. The system covers the entire data analysis pipeline: interpreting natural language queries, navigating database table structures, generating correct SQL code, preprocessing data, and configuring causal inference models. Domain experts keep control over the analysis through iterative interactions with ORCA, enabling robust data-driven decision-making without requiring deep statistical expertise.
REEF is a synthetic e-commerce database that simulates business logic and causal relationships based on industry knowledge. Variable generation combines rule-based logic and probabilistic sampling, implemented in JavaScript using Faker.js. Some variables are generated to depend on others to ensure experimental reproducibility and suitability for causal analysis evaluation. The generation methods are primarily divided into two categories. As one example of a dependent variable, the user's active flag (is_active) is influenced by the time since registration (signup_days_ago), with a sigmoid function scaling the probability to simulate the trend of “the longer since registration, the lower the activity.”
Despite REEF’s complex structure and closer resemblance to real business environments, ORCA achieved an execution accuracy of 60.00% on this dataset, while GPT-4o mini scored far lower at only 6.67%. However, ORCA does not automatically discover general causal relationships; the fields that correspond to causal variables still have to be organized manually. The paper notes that different domains require different knowledge, so the 60% accuracy depends on manually organizing the potential causal mappings.
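The sigmoid-scaled dependency of is_active on signup_days_ago can be sketched as follows. This is a minimal Python illustration, not REEF's generator (which the paper describes as JavaScript with Faker.js); the function names and the midpoint and scale parameters are assumptions chosen only to show the shape of the idea.

```python
import math
import random


def active_probability(signup_days_ago, midpoint=365.0, scale=90.0):
    """Probability that a user is active, decreasing with account age.

    midpoint and scale are hypothetical parameters, not values from REEF.
    """
    # Logistic (sigmoid) curve, flipped so older accounts get a lower probability.
    return 1.0 / (1.0 + math.exp((signup_days_ago - midpoint) / scale))


def sample_is_active(signup_days_ago, rng):
    """Draw a Boolean is_active flag from the sigmoid-scaled probability."""
    return rng.random() < active_probability(signup_days_ago)


rng = random.Random(42)  # fixed seed keeps the synthetic data reproducible
users = [
    {"signup_days_ago": days, "is_active": sample_is_active(days, rng)}
    for days in (10, 200, 365, 700, 1500)
]
```

Because is_active is sampled from a known function of signup_days_ago, the true dependency is available as ground truth when a causal analysis of the generated data is evaluated.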
text2SQL4PM is a bilingual (Portuguese-English) text-to-SQL benchmark dataset designed for the process mining domain. It addresses challenges specific to process mining, covering specialized vocabulary and single-table relational structures based on event logs. The dataset contains 1,655 natural language statements (including manual paraphrases), 205 SQL queries, and 10 qualifiers. Its construction integrates expert manual curation, professional translation, and detailed annotation, supporting in-depth analysis of task complexity.

Dataset URL: https://github.com/pm-usp/text-2-sql
Paper URL: https://arxiv.org/html/2509.09684
Process mining is a form of data analysis that utilizes data from system event logs to reconstruct and analyze the actual execution of business processes. For example, enterprise systems (such as ERP, CRM, OA, ticketing systems, etc.) record operational logs. Process mining reads these logs, automatically generates a “flowchart” of actual user operations, displays how the process is actually executed, and identifies bottlenecks, deviations, inefficiencies, and other issues in the process.
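The “flowchart” idea can be illustrated with a directly-follows graph, one of the simplest process mining artifacts: for each case, count how often one activity is immediately followed by another. The sketch below uses a made-up ticketing log; the field layout and activity names are assumptions, and real tools apply far more elaborate discovery algorithms.

```python
from collections import Counter, defaultdict

# Hypothetical event log rows: (case_id, activity, timestamp)
events = [
    ("c1", "create ticket", 1),
    ("c1", "assign", 2),
    ("c1", "resolve", 3),
    ("c2", "create ticket", 1),
    ("c2", "assign", 2),
    ("c2", "reassign", 3),
    ("c2", "resolve", 5),
]


def directly_follows(events):
    """Count how often activity a is immediately followed by b within a case."""
    traces = defaultdict(list)
    # Order events by case, then by time, to rebuild each case's trace.
    for case_id, activity, _ts in sorted(events, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    edges = Counter()
    for trace in traces.values():
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges


edges = directly_follows(events)
for (a, b), count in sorted(edges.items()):
    print(f"{a} -> {b}: {count}")
```

Edges with low counts, such as the path through "reassign" here, are the kind of deviation an analyst would then inspect.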
In the paper, the authors aim to combine NL2SQL with domain analysis to improve work efficiency. The most common standard for storing event logs in process mining is XES (eXtensible Event Stream). When converted for use in a relational database, it generates a single non-normalized table. The lack of normalization, combined with the specialized vocabulary and unique information needs of process mining, often results in text-to-SQL strategies underperforming in this domain compared to traditional domains.
Although retrieving information from a single table in a relational database via SQL may seem straightforward, exploratory studies have shown that querying information in such a context can be quite challenging.
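To see why a single event table is harder to query than it looks, consider a question like “What is the average case duration?” Each case spans many rows, so the query must first aggregate per case and only then average. The sketch below uses Python's built-in sqlite3 module; the table name event_log and the columns case_id, activity, and ts are hypothetical and are not the text2SQL4PM schema.

```python
import sqlite3

# Hypothetical denormalized event log: one row per event, many rows per case.
rows = [
    ("c1", "create ticket", "2025-01-01 09:00:00"),
    ("c1", "resolve", "2025-01-01 11:00:00"),
    ("c2", "create ticket", "2025-01-02 09:00:00"),
    ("c2", "assign", "2025-01-02 10:00:00"),
    ("c2", "resolve", "2025-01-02 13:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT, ts TEXT)")
conn.executemany("INSERT INTO event_log VALUES (?, ?, ?)", rows)

# Step 1 (inner query): duration of each case from its first to last event.
# Step 2 (outer query): average of those per-case durations.
query = """
SELECT AVG(duration_hours)
FROM (
    SELECT case_id,
           (julianday(MAX(ts)) - julianday(MIN(ts))) * 24.0 AS duration_hours
    FROM event_log
    GROUP BY case_id
)
"""
avg_hours = conn.execute(query).fetchone()[0]
print(f"average case duration: {avg_hours:.2f} hours")
conn.close()
```

Averaging over events instead of over cases would give a different, incorrect figure; a text-to-SQL model has to infer the grouping from the word “case” even though no case table exists.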
The dataset was annotated primarily by hand and built in three stages.
In the evaluation of text2SQL4PM, GPT-3.5 Turbo achieved accuracy rates of only 30%–40% in both English and Portuguese, indicating that NL2SQL still has significant room for improvement in the field of process mining. With its bilingual nature, rich paraphrasing resources, and high-quality annotations jointly constructed by native speakers and experts, this dataset can serve as an important benchmark for semantic parsing and may also be useful for other natural language processing tasks such as machine translation and paraphrase generation. In process mining in particular, paraphrases carefully crafted by researchers with deep domain expertise are an extremely valuable corpus resource.