Securing the Pipeline: Analyzing the Latest Trends in NL2SQL Datasets and LLM Vulnerabilities


This paper evaluates the SQL injection risks of SQL generated by Large Language Models (LLMs). It covers 14 models (including web and API versions), three database systems (SQLite, MySQL, and PostgreSQL), and four detection techniques (regular expressions, decision trees, CNNs, and RoBERTa). The results show that web-based models refuse malicious prompts at a significantly higher rate than API-based models. MySQL is the most vulnerable to attacks, whereas PostgreSQL offers the best protection. Existing detectors perform sharply worse on LLM-generated queries, with the deep learning models failing completely, exposing major security blind spots. The authors also release a public evaluation framework alongside the paper.

This paper aims to bridge the gap in security research regarding LLM-generated SQL, systematically quantify SQL injection risks, and evaluate the effectiveness of current defense mechanisms. Through cross-model and cross-database adversarial testing, it highlights the severe failure of traditional detection technologies when faced with LLM outputs. This proves that current defense systems relying on deterministic rules can no longer handle the new threats introduced by LLMs, calling for a shift towards a multi-layered, adaptive “LLM-aware” security architecture.
The research constructed a comprehensive dataset containing 100 attack prompts across three levels of complexity, 14 LLMs, four detectors, and three databases. The attack prompts cover techniques such as authentication bypass, encoding obfuscation, and time-based blind SQL injection. The models span web interfaces and APIs, while the detectors include Regular Expressions, Decision Trees, CNNs, and RoBERTa. A total of 1,400 prompt-response pairs were generated, of which 1,110 queries were successfully formulated to serve as the core analysis subjects.
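As a concrete illustration of one technique in the attack-prompt set, here is a minimal sketch (our own, not drawn from the paper's dataset) of how string-built SQL enables authentication bypass, with a parameterized query as the contrast:

```python
import sqlite3

# Toy in-memory database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def login_unsafe(name, password):
    # Vulnerable: user input is concatenated directly into the SQL string.
    query = (f"SELECT * FROM users WHERE name = '{name}' "
             f"AND password = '{password}'")
    return conn.execute(query).fetchall()

def login_safe(name, password):
    # Parameterized: the driver treats input as data, never as SQL.
    query = "SELECT * FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchall()

payload = "' OR '1'='1"
print(len(login_unsafe("alice", payload)))  # 1 row: bypass succeeds
print(len(login_safe("alice", payload)))    # 0 rows: bypass blocked
```

The tautology payload turns the unsafe query's WHERE clause into `... OR '1'='1'`, which matches every row; the parameterized version returns nothing because the quote characters remain literal data.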
The dataset was built layer-by-layer along the entire LLM-to-database pipeline: attack prompts were sequentially submitted to each model, and the generated SQL queries were collected. All queries were manually classified by security experts into three categories: critical, vulnerable, and safe. These were then evaluated through the four detectors, and finally executed in the three databases to record the results. The final step involved cross-referencing to identify penetrating queries that successfully bypassed all detections and executed successfully.
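To make the detector stage concrete, here is an illustrative signature-based regex filter of the kind the study evaluates (our own sketch, not the paper's implementation), together with a lightly obfuscated payload of the sort an LLM can emit that slips past it:

```python
import re

# Hypothetical signature list; real rule sets are larger but share the idea.
SIGNATURES = [
    r"(?i)\bor\b\s+'?1'?\s*=\s*'?1",   # classic tautology: OR 1=1
    r"(?i)\bunion\b\s+select\b",       # UNION SELECT exfiltration
    r"--",                             # inline comment terminator
]

def regex_detect(query: str) -> bool:
    """Flag a query if any known signature matches."""
    return any(re.search(sig, query) is not None for sig in SIGNATURES)

classic = "SELECT * FROM users WHERE id = 1 OR 1=1"
# Obfuscated: a comment splits the keywords and the tautology avoids 1=1.
obfuscated = "SELECT * FROM users WHERE id = 1 OR/**/2>1"

print(regex_detect(classic))     # True: signature matches
print(regex_detect(obfuscated))  # False: rewritten payload evades the rules
```

Because signatures are deterministic string patterns, trivial rewrites (comment insertion, alternate tautologies, encoding) defeat them, which is consistent with the sharp detector degradation the paper reports.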
All queries were annotated by a security expert with five years of experience, and two researchers conducted independent cross-validation, achieving a 92.5% agreement rate. The database execution was performed on a unified schema and test data, with the environment reset each time to ensure reproducibility. The LLMs used fixed decoding parameters and random seeds to guarantee reproducible generation results. Ultimately, 124 penetrating queries were identified, and all data was open-sourced alongside the paper.
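The raw agreement rate reported above is simply the fraction of queries on which the two independent annotators assign the same label; a toy sketch (labels invented for illustration):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two annotators agree."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

ann1 = ["critical", "safe", "vulnerable", "safe"]
ann2 = ["critical", "safe", "critical", "safe"]
print(agreement_rate(ann1, ann2))  # 0.75
```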
This paper represents the first systematic evaluation of SQL injection risks in LLM-generated SQL, revealing major blind spots in current defense systems.
This paper introduces BIRDTurk, the first Turkish version of the BIRD benchmark, aiming to fill the evaluation gap for cross-lingual and low-resource languages. Through a controlled translation process, it preserves the original database logic while localizing natural language questions and schema identifiers into Turkish.

This paper aims to address the lack of Text-to-SQL evaluation benchmarks for morphologically rich and low-resource languages. While existing English benchmarks (such as Spider and BIRD) have driven significant progress, Text-to-SQL performance on languages like Turkish, characterized by agglutinative morphology and Subject-Object-Verb (SOV) word order, has not yet been systematically investigated. Existing Turkish datasets are small in scale and have simple schemas, failing to reflect the "dirty data" and complex reasoning requirements found in real-world enterprise environments.
The BIRDTurk dataset is built upon the training and development sets of the English BIRD benchmark. It retains the original database content and SQL execution logic, only localizing the natural language interface into Turkish.
This paper introduces BIRDTurk—the first Turkish version of the BIRD benchmark—filling the void in cross-lingual evaluations for complex scenarios. Experiments demonstrate that migrating English Text-to-SQL to Turkish (an SOV and agglutinative language) results in a significant performance drop, attributed to structural language divergences and insufficient representation in LLM pre-training corpora. Methodologically, supervised fine-tuning remains challenging, but instruction-tuned models show greater potential; multi-stage agent reasoning demonstrates stronger cross-lingual robustness. BIRDTurk provides a real-world evaluation platform for end-to-end SQL research in low-resource languages.