Benchmarking Agentic LLMs on SQL Generation

I built a benchmark to evaluate how well agentic LLMs handle SQL generation tasks. It has 25 text questions of varying difficulty, each of which an LLM needs to turn into a SQL query, with an agentic debugging loop that lets it correct its own mistakes.
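To make the "agentic debugging loop" idea concrete, here is a minimal sketch of what such a loop can look like. It is not the benchmark's actual harness: the function names (`run_with_debug_loop`, `fake_model`, `demo`), the retry limit, the SQLite backend, and the toy `orders` table are all assumptions made for illustration. The model is stubbed out so the example runs without any API access.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: (question, feedback_from_last_failure) -> SQL string.
SqlGenerator = Callable[[str, Optional[str]], str]


def run_with_debug_loop(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: SqlGenerator,
    max_attempts: int = 3,  # assumed limit, not taken from the benchmark
) -> Tuple[str, List[tuple], int]:
    """Ask for a query, run it, and feed any database error back to the model."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows, attempt
        except sqlite3.Error as exc:
            # The error text is what the model sees on its next attempt.
            feedback = f"Query {sql!r} failed with: {exc}"
    raise RuntimeError(f"No working query after {max_attempts} attempts: {feedback}")


def fake_model(question: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: wrong column name first, corrected after feedback."""
    if feedback is None:
        return "SELECT SUM(amount) FROM orders"  # 'amount' does not exist
    return "SELECT SUM(total) FROM orders"


def demo() -> Tuple[str, List[tuple], int]:
    # Toy in-memory database, invented for this example.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])
    return run_with_debug_loop(conn, "What is the total order value?", fake_model)


if __name__ == "__main__":
    sql, rows, attempts = demo()
    print(f"{sql} -> {rows} after {attempts} attempt(s)")
```

In this sketch the only signal fed back is the database error message. A real harness would presumably also need to check whether a query that runs successfully returns the right answer, which this example does not attempt.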

Rather than duplicate the write-up here, see the full results and methodology at sql-benchmark.nicklothian.com.

Source code: github.com/nlothian/llm-sql-benchmark
