A systematic evaluation of large language models on the Text-to-SQL task, comparing multiple model architectures across different prompting strategies, with execution-based result verification and an interactive results dashboard.
For live Dashboard : Streamlit Dashboard
- Project Overview
- Dataset
- Models Evaluated
- Prompting Strategies
- Evaluation Pipeline
- Tech Stack
- Results & Tracking
- Dashboard
- Threats to Validity
- Future Work
- How to run
- Project Structute
This project benchmarks the capability of large language models (LLMs) to translate natural language questions into executable SQL queries — a task known as Text-to-SQL. Rather than relying solely on string-level comparison, the evaluation uses execution-based assessment to determine whether a generated query retrieves the correct answer from the database, accounting for the fact that multiple syntactically different queries can be semantically equivalent.
The study covers four modls across three prompting strategies, evaluated under identical conditions for a rigorous and fair comparison.
Each sample in the benchmark consists of three components:
- A database schema describing table and column structure
- A natural language question posed by a user
- A ground-truth SQL query as the reference answer
Text-to-SQL is a non-trivial task. The core difficulties include:
- Interpeting the user's intent from loosely worded questions
- Mapping natural language terms to the correct schema attributes
- Selecting the right tables and columns among potentially many candidates
- Generating syntactically valid SQL
- Handling aggregations, filters, joins, and nested queries correctly
- Recognizing that multiple SQL formulations can produce identical results
Because of the last point, execution-based evaluation is used alongside Exact Match — a generated query us considered correct if it retrieves the same result set as the ground-truth query, even if the SQL text differs.
All inference was served through the Groq API, which provides optimized low-latency infrastructure for LLM serving. The following models were selected to span a wide range of parameter scales and architectures:
| Model | Parameters |
|---|---|
llama-3.1-8b-instant |
8 Billion |
llama-3.3-70b-versatile |
70 Billion |
openai/gpt-oss-20b |
20 Billion |
openai/gpt-oss-120b |
120 Billion |
All models were evaluated under identical prompt conditions and using the same evaluation pipeline.
All metrics are reported as percentages. Latency is the average inference time per query in milliseconds.
| Model | Strategy | Exact Match (%) | Token F1 (%) | Execution Accuracy (%) | Latency (ms) |
|---|---|---|---|---|---|
| llama-3.1-8b-instant | Zero Shot | 15.0 | 76.3 | 97.0 | 1786 |
| llama-3.1-8b-instant | Few Shot | 27.0 | 77.3 | 95.0 | 2463 |
| llama-3.1-8b-instant | Chain of Thought | 13.0 | 78.3 | 95.0 | 2476 |
| llama-3.3-70b-versatile | Zero Shot | 16.0 | 75.5 | 98.0 | 1831 |
| llama-3.3-70b-versatile | Few Shot | 28.0 | 83.0 | 98.0 | 2487 |
| llama-3.3-70b-versatile | Chain of Thought | 16.0 | 72.5 | 98.0 | 2487 |
| openai/gpt-oss-20b | Zero Shot | 8.0 | 65.9 | 98.0 | 2150 |
| openai/gpt-oss-20b | Few Shot | 15.0 | 69.9 | 96.0 | 2903 |
| openai/gpt-oss-20b | Chain of Thought | 4.0 | 65.7 | 96.0 | 2783 |
| openai/gpt-oss-120b | Zero Shot | 2.0 | 65.7 | 98.0 | 2272 |
| openai/gpt-oss-120b | Few Shot | 19.0 | 68.8 | 98.0 | 2915 |
| openai/gpt-oss-120b | Chain of Thought | 1.0 | 66.1 | 98.0 | 2919 |
Note: Execution Accuracy is the primary correctness signal. Exact Match is a strict string comparison and consistently underestimates true model capability across all models and strategies.
Three distinct strategies were tested to understand how prompt design affects SQL generation quality.
The model receives only the database schema and the user question — no examples. Performance here reflects the model's raw, pre-trained capabiity for SQL reasoning.
A set of example Question–SQL pairs is prepended to the prompt before the target question. This gives the model in-context demonstrations of SQL syntax patterns, schema interpetation, and query structure without any fine-tuning.
The model is instructed to reason through intermediate steps — understanding the question, identifing relevant tables and columns, and planning the query — before producing the final SQL. This strategy targets improvement in logical reasoning and schema understanding.
For every combination of model and prompting strategy, the following steps were executed:
- Provide the input question and schema to the model via the Groq API
- Receive the generated SQL query
- Execute both the generated query and the ground-truth query against the database using SQLite3
- Compare execution results to determine correctness
- Compute Exact Match and Execution Accuracy metrics
- Log all results and aggregate statistics
The pipeline was kept identical across all experiments to ensure comparability.
| Component | Tool |
|---|---|
| Inference API | Groq |
| Database Engine | SQLite3 |
| Experiment Tracking | Weights & Biases (wandb) |
| Results Dashboard | Streamlit |
| Language | Python |
- Groq — chosen for its high-throughput, low-latency inference, essential when running evaluations across many model-prompt combinations
- SQLite3 — used to execute both reference and generated SQL queries and compare result sets directly, enabling execution-based evaluation
- Weights & Biases (wandb) — tracks all experiment runs, logs per-sample results, metrics, and aggregates, providing full reproducibility and run comparison
- Streamlit — powers an interactive dashboard where users can explore model outputs, compare SQL generations, and view results for every question across all models
All runs are tracked with Weights & Biases. Each logged run includes:
- Model name and prompting strategy
- Per-sample generated SQL and ground-truth SQL
- Execution match result per sample
- Exact Match result per sample
- Aggregate accuracy metrics across the dataset
This makes it straightforward to reproduce any run or audit individual predictions.
The Streamlit application provides an interactive interface to explore the benchmark results. Users can:
- Browse every question in the benchmark
- View the generated SQL from all four models side by side
- See the ground-truth SQL and the execution result comparison
- Filter by prompting strategy or model
- Identify where models agree or diverge
This makes the evaluation transoarent and acessible without needing to inspect raw logs or code.
The following limitations should be kept in mind when interpreting results.
The benchmark may not capture the full diversity of real-world database systems. A larger, more varied dataset could surface additional model strengths and weaknesses.
LLMs are sensitive to prompt phrasing, schema formatting, and example selection. Results are specific to the prompts used in this study and may shift with minor modifications.
Exact Match is a strict string-level comparison that can underestimate true model capability. Many generated queries that fail Exact Match still produce correct execution results. Execution Accuracy is the more reliable signal.
Evaluation reflects the model versions available on Groq at the time of the study. Future updates to these models may change performance characteristics.
All experiments were conducted on one dataset. Generalization to different domains, schemas, or database scales has not been assessed.
Several directions are identified for extending this study:
- Advanced Prompting — explore Self-Consistency, ReAct, Program-of-Thought, and Tree-of-Thought prompting
- Larger Benchmarks — evaluate on Spider, Spider 2.0, BIRD, and WikiSQL for comparability with published research
- Error Taxonomy — categorize failure modes such as wrong column selection, incorrect aggregation, missing filters, and join errors
- Cross-Model Ensembles — combine outputs from multiple models to improve robustness
- Production Database Evaluation — test on real-world schemas with complex relationships, large table counts, and domain-specific terminology
1. Clone the repo
git clone https://github.com/MLbyTharun/text2sql-benchmark.git
cd text2sql-benchmark2. Install dependencies
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt3. Add your API keys
# Create a .env file
GROQ_API_KEY=your_key_here
WANDB_API_KEY=your_key_here4. Run the evaluation
python experiments/run_eval.py5. Launch the dashboard
streamlit run dashboard/app.pytext2sql-benchmark/
├── data/ # Dataset loading
├── evals/ # Metrics and scoring
├── experiments/ # Eval runner and prompt strategies
├── models/ # Model abstractions
├── dashboard/ # Streamlit UI
├── requirements.txt
└── results/ #json file of actual results
└── README.md
All runs are tracked on Weights & Biases with per-example logging of metrics, latency, and model outputs.