DAComp isn’t your usual NL2SQL test
It’s a real-world benchmark for data AI agents with 210 tasks covering the entire data lifecycle—from grabbing data to making actual business decisions. Forget the old “perfect query” nonsense; DAComp throws agents into real enterprise workflows: cleaning messy datasets, exploring patterns, building models, visualizing results, and even suggesting next steps. No more pretending models understand databases when they’re still guessing how to handle real data.
- Leaderboard: https://da-comp.github.io/
- Paper: https://arxiv.org/html/2512.04324

Paper Intent
Let’s be real: most NL2SQL tests (Spider, BIRD) cover a single step: translate a question into SQL. Real analysts don’t stop there. They grab data, clean it, build models, and decide what to do next. DAComp drops LLMs into that full enterprise workflow: handling messy files, picking the right Python library, recovering from errors, and even drafting business recommendations.
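To make the "full workflow" point concrete, here is a minimal sketch (not DAComp's actual harness) of the kind of end-to-end pass the benchmark scores: ingest messy rows, clean them, compute a metric, and turn the numbers into a recommendation. All column names, values, and the 20% drop threshold are invented for illustration.

```python
# Illustrative only: a toy ingest -> clean -> analyze -> recommend pass.
# Data and thresholds are made up; DAComp's real tasks are far larger.
raw_rows = [
    {"region": "EU",   "q2_sales": "120", "q3_sales": "84"},
    {"region": "US",   "q2_sales": "200", "q3_sales": "190"},
    {"region": "APAC", "q2_sales": "",    "q3_sales": "90"},  # messy: missing value
]

def clean(rows):
    """Drop rows with missing sales figures and cast strings to floats."""
    out = []
    for r in rows:
        if r["q2_sales"] and r["q3_sales"]:
            out.append({"region": r["region"],
                        "q2": float(r["q2_sales"]),
                        "q3": float(r["q3_sales"])})
    return out

def recommend(rows, drop_threshold=0.2):
    """Flag regions whose quarter-over-quarter sales fell past the threshold."""
    return [r["region"] for r in rows
            if (r["q2"] - r["q3"]) / r["q2"] > drop_threshold]

cleaned = clean(raw_rows)
print(recommend(cleaned))  # EU fell 30% quarter-over-quarter: ['EU']
```

The point isn't the code, which is trivial; it's that every step (parsing, null handling, metric choice, threshold) is a decision the agent must make itself, and any one mistake sinks the final recommendation.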
Dataset Analysis
Built by 8 professional data engineers (humans, not an LLM pipeline), DAComp has two parts:
DE (Data Engineering)
73 enterprise SaaS setups with 400+ columns each, filled with synthetic but realistic data. Split into three phases:
- DE-Arch: Engineers draft 5 business needs per project → pick the most painful one (e.g., “Sync CRM data with inventory in real-time”).
- DE-Impl: Reverse-engineer the full workflow (DAG + logic) from the chosen need.
- DE-Evol: Senior engineers write new real-world requirements (e.g., “Handle 10x traffic spikes during Black Friday”).
DA (Data Analysis)
100 complex live databases plus the analysis layers built in DE. For each table, annotators draft 8 open-ended questions, then a 5-person voting panel picks the top 2 that would make a real analyst sweat (e.g., “Why did sales drop 30% in Q3?”).
Summary
DAComp isn’t a toy test. Even GPT-4o struggles with the engineering tasks: roughly a 20% success rate on DE, and far lower on strategy-level decisions. This isn’t about “getting SQL right.” It’s about building agents that actually work in the wild. Finally, a benchmark that stops pretending.
DP-Bench: Finally, a Text-to-SQL Test That Doesn’t Ignore the Real Work
This isn’t just another NL2SQL benchmark. DP-Bench is the first test for data product generation systems—where “data product” means real business value, like predicting customer churn before they cancel, so support teams can actually do something about it.
Forget “just generate SQL” nonsense. DP-Bench forces models to:
- Find the right tables from messy databases
- Pick the right columns (not just dump everything)
- Most importantly—generate and validate derived columns (like “total sales”) with actual SQL provenance
No “perfect query” fantasy here: every metric in DP-Bench carries a traceable SQL statement behind it, so you can see exactly how the model derived it.
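A hedged sketch of what "derived column with SQL provenance" means in practice: the derived metric is stored alongside the exact SQL that produced it, so the number is auditable. This is an illustration of the idea, not DP-Bench's actual storage format; the table and column names are invented.

```python
import sqlite3

# Toy warehouse: an in-memory table with made-up orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'EU', 5.0), (3, 'US', 7.5);
""")

# The derived metric ("total_sales") and its provenance travel together:
# whoever reads the number can re-run the SQL that produced it.
provenance_sql = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
    ORDER BY region
"""
derived = {
    "metric": "total_sales",
    "sql": provenance_sql,                         # the provenance
    "rows": conn.execute(provenance_sql).fetchall(),  # the values it yields
}
print(derived["rows"])  # [('EU', 15.0), ('US', 7.5)]
```

The design choice worth noticing: because the SQL is stored, validation is just "execute the provenance and compare", which is exactly what makes the metric checkable instead of a black-box model output.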

- Dataset: https://huggingface.co/datasets/ibm-research/dp-bench
- Paper: https://arxiv.org/html/2512.15798
Paper Intent
Let’s be honest: most Text-to-SQL tests (BIRD, Spider) only care about “translate to SQL.” But real data work? It’s messy. You need to clean data, derive metrics, track where they came from. DP-Bench makes models actually do the whole thing—starting from a business request (DPR), finding relevant tables, selecting columns, and proving how they built each derived metric.
No more pretending LLMs understand databases when they’re still guessing how to handle real business needs.
Dataset Analysis
Built from BIRD’s real database schemas + ELT-Bench’s transformation pipelines—no more “fake” data.
- 234 business-ready data product requests (DPRs) from 78 BIRD databases → 383 final result tables.
- Reverse-engineered labels:
  - Start from what the output should look like (e.g., “total sales by region”)
  - Let Llama-3.3-70B generate the business request (DPR)
  - Have 5 experts run two rounds of checks to make sure every request matches a real business need (no “just SQL” shortcuts)
- Hard subset: 30 tables with >50 columns each—designed to break LLMs in long-context chaos.
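Why the wide-table subset bites: before writing any SQL, the model has to narrow 50+ columns down to the handful a request actually needs, and that selection degrades in long contexts. The toy selector below just keyword-matches column names against the request; it is a naive stand-in for illustration, not DP-Bench's method, and the schema and request are invented.

```python
# Invented wide schema: 48 filler columns plus 4 that actually matter.
wide_schema = [f"col_{i}" for i in range(48)] + [
    "customer_id", "churn_flag", "last_login_date", "support_tickets",
]

def select_columns(request, schema, min_len=4):
    """Naive column pruning: keep columns whose name contains any request
    token (short stopwords skipped, trailing 's'/'?' stripped as crude
    stemming). Real systems need far more than string matching."""
    tokens = {w.rstrip("s?") for w in request.lower().split() if len(w) >= min_len}
    return [c for c in schema if any(t in c.lower() for t in tokens)]

request = "Which customers are likely to churn?"
print(select_columns(request, wide_schema))  # ['customer_id', 'churn_flag']
```

Even this toy version shows the failure mode the hard subset targets: the signal columns are a tiny fraction of the schema, so any selector that drifts in long contexts drags irrelevant columns into the query.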
Summary
DP-Bench isn’t just another benchmark. It finally tests whether models can handle real business data work, not just spit out SQL. But here’s the catch: 71% of initial requests needed zero tweaks to work, which suggests we’re still not testing enough messy business edge cases.
It’s not the finish line—it’s the first step toward Data Mesh. Think of it as NL2SQL’s awkward cousin who actually tries to understand business.
Recommended reading
- AI-Driven SQL Dataset Optimization 202602-2
- AI-Driven SQL Dataset Optimization 202602-1
- AI-Driven SQL Dataset Optimization 202601
- AI-Driven SQL Dataset Optimization 202510
- AI-Driven SQL Dataset Optimization 202509
- AI-Driven SQL Dataset Optimization 202508
- AI-Driven SQL Dataset Optimization 202507
- AI-Driven SQL Dataset Optimization 202506
- AI-Driven SQL Dataset Optimization 202505
- SQL Optimization Course
- SQL optimization knowledge
- Database industry knowledge
What is SQLFlash?
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
