
Securing the Pipeline: Analyzing the Latest Trends in NL2SQL Datasets and LLM Vulnerabilities

Rebooter.S

SQL Injection Dataset

This paper evaluates the SQL injection risks of SQL generated by Large Language Models (LLMs). It covers 14 models (including web and API versions), three databases (SQLite, MySQL, and PostgreSQL), and four detection technologies (Regular Expressions, Decision Trees, CNNs, and RoBERTa). The results show that web-based models refuse malicious prompts at a significantly higher rate than API-based models. MySQL is the most vulnerable to the generated attacks, whereas PostgreSQL offers the strongest protection. Existing detectors suffer a sharp performance drop on LLM-generated queries, with the deep learning models failing almost completely, exposing major security blind spots. The authors also released a public evaluation framework alongside the paper.

Paper Intent

This paper aims to bridge the gap in security research regarding LLM-generated SQL, systematically quantify SQL injection risks, and evaluate the effectiveness of current defense mechanisms. Through cross-model and cross-database adversarial testing, it highlights the severe failure of traditional detection technologies when faced with LLM outputs. This proves that current defense systems relying on deterministic rules can no longer handle the new threats introduced by LLMs, calling for a shift towards a multi-layered, adaptive “LLM-aware” security architecture.

Dataset Analysis

The research constructed a comprehensive dataset containing 100 attack prompts across three levels of complexity, 14 LLMs, four detectors, and three databases. The attack prompts cover techniques such as authentication bypass, encoding obfuscation, and time-based blind SQL injection. The models span web interfaces and APIs, while the detectors include Regular Expressions, Decision Trees, CNNs, and RoBERTa. A total of 1,400 prompt-response pairs were generated, of which 1,110 queries were successfully formulated to serve as the core analysis subjects.
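Of the four detector families, the regular-expression detector is the simplest to picture. A minimal sketch of such a pattern-based classifier is shown below; the signatures are illustrative stand-ins covering the attack techniques the paper names (authentication bypass, time-based blind injection), not the study's actual rule set.

```python
import re

# Illustrative injection signatures -- assumptions, not the paper's rules.
INJECTION_PATTERNS = [
    re.compile(r"\bor\b\s+['\"]?\d+['\"]?\s*=\s*['\"]?\d+", re.IGNORECASE),  # auth bypass: OR 1=1
    re.compile(r"\bunion\b\s+select\b", re.IGNORECASE),                      # UNION-based extraction
    re.compile(r"\b(sleep|pg_sleep|benchmark)\s*\(", re.IGNORECASE),         # time-based blind injection
    re.compile(r"--|#|/\*"),                                                 # comment-based truncation
]

def flag_query(sql: str) -> bool:
    """Return True if any known injection signature matches the query."""
    return any(p.search(sql) for p in INJECTION_PATTERNS)

queries = [
    "SELECT * FROM users WHERE name = 'admin' OR 1=1 --",
    "SELECT id FROM orders WHERE id = 42",
]
print([flag_query(q) for q in queries])  # [True, False]
```

As the paper's results suggest, detectors of this kind only fire on queries that resemble classic injection patterns, which is exactly the assumption LLM-generated SQL breaks.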

The dataset was built layer-by-layer along the entire LLM-to-database pipeline: attack prompts were sequentially submitted to each model, and the generated SQL queries were collected. All queries were manually classified by security experts into three categories: critical, vulnerable, and safe. These were then evaluated through the four detectors, and finally executed in the three databases to record the results. The final step involved cross-referencing to identify penetrating queries that successfully bypassed all detections and executed successfully.
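The final cross-referencing step above can be sketched as follows. The throwaway in-memory schema, the placeholder detector, and the sample queries are invented for illustration; the paper's actual detectors, schema, and query set are not reproduced here.

```python
import sqlite3

def executes_successfully(sql: str) -> bool:
    """Run the query against a fresh in-memory SQLite schema,
    resetting the environment per query as the paper describes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE users (id INTEGER, name TEXT);"
        "INSERT INTO users VALUES (1, 'alice');"
    )
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

def naive_detector(sql: str) -> bool:
    """Placeholder detector: flags only the literal 'OR 1=1' pattern."""
    return "or 1=1" in sql.lower()

generated = [
    "SELECT * FROM users WHERE name = '' OR 1=1",                   # caught by the detector
    "SELECT * FROM users WHERE id = (SELECT MAX(id) FROM users)",   # bypasses it
]

# "Penetrating" queries bypass every detector AND execute successfully.
penetrating = [q for q in generated
               if not naive_detector(q) and executes_successfully(q)]
print(len(penetrating))  # 1
```

The study applies the same intersection logic at scale, across four detectors and three database engines, to arrive at its set of penetrating queries.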

All queries were annotated by a security expert with five years of experience, and two researchers conducted independent cross-validation, achieving a 92.5% agreement rate. The database execution was performed on a unified schema and test data, with the environment reset each time to ensure reproducibility. The LLMs used fixed decoding parameters and random seeds to guarantee reproducible generation results. Ultimately, 124 penetrating queries were identified, and all data was open-sourced alongside the paper.
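The 92.5% figure reads as plain percent agreement between the two researchers' independent labels over the three categories. A quick sketch of that computation (the labels below are hypothetical):

```python
# Hypothetical labels from two independent annotators over the same queries.
annotator_a = ["critical", "vulnerable", "safe", "safe",
               "critical", "vulnerable", "safe", "critical"]
annotator_b = ["critical", "vulnerable", "safe", "vulnerable",
               "critical", "vulnerable", "safe", "critical"]

# Percent agreement: fraction of items where both annotators chose the same label.
matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
agreement = matches / len(annotator_a)
print(f"{agreement:.1%}")  # 87.5%
```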

Conclusion

This paper represents the first systematic evaluation of SQL injection risks in LLM-generated SQL, revealing three major blind spots in current defense systems:

  1. Input protection varies by deployment mode: Web models rejected an average of 38.4% of malicious prompts, whereas API models (like OpenChat) had a generation rate of up to 98.6%—a 3.1x security gap. Furthermore, even safety-aligned models could be bypassed using prompt obfuscation.
  2. Failure of traditional detection mechanisms: The F1 score of RoBERTa plummeted from a benchmark of 0.98 down to 0.03, as LLM outputs significantly deviate from classic injection patterns in both syntax and logic.
  3. Real-world attacks are highly feasible: 13.3% of the generated queries were executed successfully, with a penetration rate of 15.7% (OpenChat-3.5), and even PostgreSQL was not immune (with a 20% penetration rate from GPT-4). The study found that LLMs can autonomously combine multi-layered encoding and nested subqueries to generate novel queries that are “syntactically valid but semantically malicious.” This probabilistic generation fundamentally challenges the deterministic assumptions of existing defenses, underscoring the urgent need to transition to an LLM-aware security architecture.
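The "syntactically valid but semantically malicious" pattern can be illustrated with a simple encoding trick. The keyword detector and payload below are toy examples of the encoding-obfuscation technique, not artifacts from the paper: no literal `users` token appears in the obfuscated query, yet MySQL would resolve `CHAR(117,115,101,114,115)` back to the string `'users'`.

```python
import re

# Naive keyword detector looking for a plaintext sensitive table name.
blocklist = re.compile(r"\busers\b", re.IGNORECASE)

plain = "SELECT * FROM users WHERE '1'='1'"
# Same target reached via CHAR()-style construction, so the rule never fires.
obfuscated = ("SELECT * FROM information_schema.tables "
              "WHERE table_name = CHAR(117,115,101,114,115)")

print(bool(blocklist.search(plain)))       # True  -- caught
print(bool(blocklist.search(obfuscated)))  # False -- slips through
```

Because an LLM can compose such encodings and nested subqueries freely, any fixed signature set only covers a shrinking slice of the attack surface.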

BIRDTurk

This paper introduces BIRDTurk, the first Turkish version of the BIRD benchmark, aiming to fill the evaluation gap for cross-lingual and low-resource languages. Through a controlled translation process, it preserves the original database logic while localizing natural language questions and schema identifiers into Turkish.

Paper Intent

This paper aims to address the current lack of evaluation benchmarks in the Text-to-SQL domain for morphologically rich and low-resource languages. Specifically, while existing English benchmarks (such as Spider and BIRD) have made significant progress, their performance on languages like Turkish—characterized by agglutinative morphology and Subject-Object-Verb (SOV) word order—has not yet been systematically investigated. Existing Turkish datasets are small in scale and have simple schemas, failing to reflect the “dirty data” and complex reasoning requirements found in real-world enterprise environments.

Dataset Analysis

The BIRDTurk dataset is built upon the training and development sets of the English BIRD benchmark. It retains the original database content and SQL execution logic, only localizing the natural language interface into Turkish.

The dataset curation employs a three-stage controlled translation pipeline:

  1. Schema mapping: establish an English-to-Turkish schema mapping, translating table and column names into snake_case identifiers (e.g., movie_popularity -> film_populerligi).
  2. Evidence standardization: normalize the evidence fields by uniformly replacing backtick-quoted identifiers.
  3. AST-based rewriting: parse each SQL query into an Abstract Syntax Tree (AST) and replace only the identifiers, so the execution semantics remain entirely unchanged.

To guarantee data quality, the authors determined the sampling size using the Central Limit Theorem (CLT) and conducted manual evaluation at a 95% confidence level, achieving a translation accuracy of 98.15%.
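The identifier-substitution idea can be approximated as below. The paper operates on a parsed AST to guarantee unchanged semantics; this sketch instead uses simple word-boundary replacement with an invented mapping, which is only safe when identifiers never collide with keywords or literals.

```python
import re

# Hypothetical English -> Turkish schema mapping (snake_case preserved).
SCHEMA_MAP = {
    "movie_popularity": "film_populerligi",
    "movies": "filmler",
    "title": "baslik",
}

def localize_identifiers(sql: str) -> str:
    """Replace whole-word schema identifiers, leaving SQL keywords,
    literals, and execution logic untouched."""
    for en, tr in SCHEMA_MAP.items():
        sql = re.sub(rf"\b{re.escape(en)}\b", tr, sql)
    return sql

src = "SELECT title FROM movies ORDER BY movie_popularity DESC"
print(localize_identifiers(src))
# SELECT baslik FROM filmler ORDER BY film_populerligi DESC
```

An AST-based rewrite, as used in the paper, avoids the failure modes of textual substitution (e.g., renaming a string literal that happens to match a column name).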

Conclusion

This paper introduces BIRDTurk—the first Turkish version of the BIRD benchmark—filling the void in cross-lingual evaluations for complex scenarios. Experiments demonstrate that migrating English Text-to-SQL to Turkish (an SOV and agglutinative language) results in a significant performance drop, attributed to structural language divergences and insufficient representation in LLM pre-training corpora. Methodologically, supervised fine-tuning remains challenging, but instruction-tuned models show greater potential; multi-stage agent reasoning demonstrates stronger cross-lingual robustness. BIRDTurk provides a real-world evaluation platform for end-to-end SQL research in low-resource languages.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

How to use SQLFlash in a database?

Ready to elevate your SQL performance?

Join us and experience the power of SQLFlash today!