AI-Driven SQL Dataset Optimization 202509: REEF&text2SQL4PM

This month I came across two papers that introduce NL2SQL datasets: REEF and text2SQL4PM. Both investigate the application of NL2SQL to data analysis.

REEF

REEF comprises 18 interconnected tables (e.g., Product, Order, User). Its data distributions are annotated and encode specific causal relationships between variables, which can be used to construct realistic causal graphs. The dataset is designed to evaluate the capabilities of large language models on end-to-end causal analysis tasks.

Dataset URL: https://github.com/ChaemyungLim/ORCA/tree/main/REEF

Paper URL: https://arxiv.org/html/2508.21304

Paper Intent

The paper proposes ORCA (Orchestrating Causal Agent), an LLM-based agentic system designed to address end-to-end causal analysis tasks in data analysis. End-to-end causal analysis refers to questions such as: “Do coupons increase the probability of users purchasing a certain product?” ORCA automates routine workflows in relational databases while preserving expert oversight through human-AI interaction. The system covers the entire data analysis pipeline: interpreting natural language queries, navigating database table structures, generating correct SQL code, preprocessing data, and configuring causal inference models. Domain experts keep control over the analysis through iterative interactions with ORCA, which enables robust data-driven decision-making without requiring deep statistical expertise.

Dataset Analysis

REEF is a synthetic e-commerce database that simulates business logic and causal relationships based on industry knowledge. Variable generation combines rule-based logic and probabilistic sampling, implemented in JavaScript using Faker.js. Some variables are generated to depend on others to ensure experimental reproducibility and suitability for causal analysis evaluation. The generation methods are primarily divided into two categories:

Randomly Sampled Variables: For example, product prices are randomly generated within the range of [5, 500].
Causally-Driven Variables: For instance, user activity status (is_active) is influenced by the time since registration (signup_days_ago). A sigmoid function scales the probability to simulate the trend of “the longer since registration, the lower the activity.”
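
To make the two categories concrete, here is a minimal Python sketch. REEF itself is generated in JavaScript with Faker.js, so this is not the authors' code. The uniform distribution for prices, the midpoint and scale of the sigmoid, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def sample_price(rng):
    """Randomly sampled variable: a price in [5, 500].

    A uniform draw is an assumption; the source only gives the range.
    """
    return round(rng.uniform(5, 500), 2)


def active_probability(signup_days_ago, midpoint=365.0, scale=90.0):
    """Causally-driven variable: probability that a user is active.

    A decreasing sigmoid of account age. midpoint and scale are
    illustrative values, not the ones used in REEF.
    """
    return 1.0 / (1.0 + math.exp((signup_days_ago - midpoint) / scale))


def sample_user(rng):
    """Draw signup_days_ago first, then is_active conditional on it."""
    signup_days_ago = rng.randint(0, 1000)
    is_active = rng.random() < active_probability(signup_days_ago)
    return {"signup_days_ago": signup_days_ago, "is_active": is_active}


rng = random.Random(42)  # fixed seed so the sample is reproducible
users = [sample_user(rng) for _ in range(5)]
price = sample_price(rng)
```

Because is_active is sampled conditionally on signup_days_ago, the dependency between the two columns is known by construction, which is what makes a synthetic table usable as ground truth for causal analysis.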

Summary

REEF has a complex structure that closely resembles real business environments. On this dataset, ORCA achieved an execution accuracy of 60.00%, far above GPT-4o mini’s 6.67%. However, ORCA does not automatically discover general causal relationships; the fields that correspond to causes and effects must still be organized manually. The paper notes that different domains require different knowledge, so the 60% accuracy depends on this manual organization of potential causal mappings.
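
For readers unfamiliar with the metric, the sketch below shows one common way to compute execution accuracy: run the predicted and the reference SQL against the same database and count a prediction as correct when both return the same rows. The paper's exact definition may differ. The orders table, the query pairs, and the helper names are hypothetical and not taken from REEF.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """pairs is a list of (predicted_sql, gold_sql) tuples."""
    correct = 0
    for predicted, gold in pairs:
        expected = run(conn, gold)
        actual = run(conn, predicted)
        # Row order is ignored; a failing prediction never matches.
        if expected is not None and actual == expected:
            correct += 1
    return correct / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 20.0), (2, 10, 35.5), (3, 11, 5.0)],
)

pairs = [
    # Different text, same result: counted as correct.
    ("SELECT COUNT(*) FROM orders WHERE user_id = 10",
     "SELECT COUNT(id) FROM orders WHERE user_id = 10"),
    # Runs, but returns a different value: incorrect.
    ("SELECT SUM(total) FROM orders",
     "SELECT SUM(total) FROM orders WHERE user_id = 10"),
    # Misspelled column, so execution fails: incorrect.
    ("SELECT totl FROM orders",
     "SELECT total FROM orders"),
]
score = execution_accuracy(conn, pairs)
```

Comparing results rather than query text means that syntactically different but equivalent queries are not penalized, which is why the metric is popular for NL2SQL evaluation.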

text2SQL4PM

text2SQL4PM is a bilingual (Portuguese-English) text-to-SQL benchmark dataset designed for the process mining domain. It addresses challenges specific to process mining, covering specialized vocabulary and single-table relational structures based on event logs. The dataset contains 1,655 natural language statements (including manual paraphrases), 205 SQL queries, and 10 qualifiers. Its construction integrates expert manual curation, professional translation, and detailed annotation, supporting in-depth analysis of task complexity.

Dataset URL: https://github.com/pm-usp/text-2-sql

Paper URL: https://arxiv.org/html/2509.09684

Paper Intent

Process mining is a form of data analysis that utilizes data from system event logs to reconstruct and analyze the actual execution of business processes. For example, enterprise systems (such as ERP, CRM, OA, ticketing systems, etc.) record operational logs. Process mining reads these logs, automatically generates a “flowchart” of actual user operations, displays how the process is actually executed, and identifies bottlenecks, deviations, inefficiencies, and other issues in the process.
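
As a rough illustration of how a “flowchart” can be derived from logs, the following Python sketch counts directly-follows relations, one of the basic building blocks of process discovery. The event log, its field layout, and the function name are hypothetical; real process mining tools do considerably more than this.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case_id, activity, ISO timestamp).
events = [
    ("c1", "create ticket", "2025-01-01T09:00"),
    ("c1", "assign", "2025-01-01T09:30"),
    ("c1", "resolve", "2025-01-02T10:00"),
    ("c2", "create ticket", "2025-01-03T08:00"),
    ("c2", "assign", "2025-01-03T08:10"),
    ("c2", "assign", "2025-01-04T11:00"),
    ("c2", "resolve", "2025-01-05T15:00"),
]


def directly_follows(events):
    """Count how often activity a is immediately followed by b in a case."""
    traces = defaultdict(list)
    # ISO timestamps sort correctly as strings, so a tuple sort is enough.
    for case_id, activity, _ts in sorted(events, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts


dfg = directly_follows(events)
for (source, target), n in sorted(dfg.items()):
    print(f"{source} -> {target}: {n}")
```

Each counted pair becomes an edge in the process graph. The repeated "assign" step in case c2 shows up as a self-loop, which is the kind of deviation an analyst would want to investigate.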

In the paper, the authors aim to apply NL2SQL to process mining so that analysts can query event logs more efficiently. The most common standard for storing event logs in process mining is XES (eXtensible Event Stream). When an XES log is converted for use in a relational database, the result is a single non-normalized table. This lack of normalization, combined with the specialized vocabulary and unique information needs of process mining, often causes text-to-SQL strategies to perform worse in this domain than in traditional ones.

Although retrieving information from a single table in a relational database via SQL may seem straightforward, exploratory studies have shown that querying information in such a context can be quite challenging.
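
The sketch below hints at why a single table is not necessarily easy to query. It uses Python's built-in sqlite3 module with a hypothetical event_log table; the column names and rows are assumptions and are not taken from text2SQL4PM. Each row is an event, but most questions are about cases, so queries need grouping and conditional aggregation, and case-level attributes repeated on every event row are easy to double-count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_log (
        case_id TEXT, activity TEXT, ts TEXT, resource TEXT, case_amount REAL
    )
""")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "create order", "2025-01-01 09:00:00", "ana", 120.0),
        ("c1", "approve", "2025-01-01 12:00:00", "bruno", 120.0),
        ("c1", "ship", "2025-01-03 09:00:00", "carla", 120.0),
        ("c2", "create order", "2025-01-02 10:00:00", "ana", 80.0),
        ("c2", "ship", "2025-01-02 22:00:00", "carla", 80.0),
    ],
)

# "Which cases were shipped without approval, and how long did they take?"
result = conn.execute("""
    SELECT case_id,
           ROUND((julianday(MAX(ts)) - julianday(MIN(ts))) * 24, 1) AS hours
    FROM event_log
    GROUP BY case_id
    HAVING SUM(activity = 'ship') > 0 AND SUM(activity = 'approve') = 0
""").fetchall()

# Pitfall: case_amount is a case-level attribute repeated on every event
# row, so a plain SUM counts it once per event instead of once per case.
naive_total = conn.execute(
    "SELECT SUM(case_amount) FROM event_log"
).fetchone()[0]
true_total = conn.execute("""
    SELECT SUM(amount)
    FROM (SELECT MAX(case_amount) AS amount FROM event_log GROUP BY case_id)
""").fetchone()[0]
```

Here the naive total is more than double the per-case total, an error that a text-to-SQL model can make silently because the query still executes without complaint.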

Dataset Analysis

The dataset was built largely through manual annotation, in three stages:

Data Collection: 29 undergraduate students and 13 graduate students, who had taken courses focused on process mining and had prior SQL knowledge, were invited to participate in an exercise. A total of 237 utterance-SQL pairs were generated as the initial dataset content.
Dataset Content Improvement: Three process mining experts validated the dataset, performed semantic replacements, and annotated it across eight dimensions.
Dataset Expansion: The original dataset was in Portuguese. A professional translator who is a native English speaker was hired to translate all utterances into English, and process mining experts ensured that the English version remained semantically consistent with the original.

Summary

In the evaluation of text2SQL4PM, GPT-3.5 Turbo achieved accuracy rates of only 30%–40% in both English and Portuguese, indicating that NL2SQL still has significant room for improvement in the field of process mining. With its bilingual nature, rich paraphrasing resources, and high-quality annotations jointly constructed by native speakers and experts, this dataset can serve not only as an important benchmark for semantic parsing tasks but also has potential applications in natural language processing tasks such as machine translation and paraphrase generation. Particularly in the field of process mining, the carefully crafted paraphrases by researchers with deep expertise represent an extremely valuable corpus resource.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
