Kaggle AGI Progress 2026: Optimizer Benchmark
The competition
Measuring Progress Toward AGI - Cognitive Abilities was a featured hackathon run by Google DeepMind and Kaggle. The premise: today’s general benchmarks conflate recall with reasoning, so it’s hard to tell whether a frontier model is genuinely solving a novel problem or pattern-matching against its training data. Entrants were asked to build benchmarks targeting one of five cognitive faculties drawn from DeepMind’s paper Measuring progress toward AGI: A cognitive framework — learning, metacognition, attention, executive functions, or social cognition — so that progress toward AGI becomes something you can actually measure rather than argue about.
I entered it — my first Kaggle competition in a while — on the Executive Functions track. My benchmark frames LLM planning as employee shift scheduling: a constraint-optimization problem where OR-Tools can compute a verifiably optimal answer, so a model’s output can be scored on a continuous 0–100 scale against ground truth instead of a binary pass/fail. Because instances are generated programmatically and scale cleanly from 105 to 3,360 assignment slots, the benchmark isolates planning from general reasoning and shows where each model’s planning budget runs out — which is exactly the kind of cognitive profile the competition asked for.
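To make the "verifiably optimal answer" idea concrete, here is a toy sketch of the setup. The real benchmark generates instances programmatically and solves them with OR-Tools CP-SAT; this version invents a tiny hypothetical instance (employee names, shifts, and costs are all illustrative) and brute-forces the optimum so the whole thing fits in a few lines.

```python
from itertools import product

# Hypothetical toy instance. cost[e][s] is the penalty for assigning
# employee e to shift s (0 = preferred, higher = disliked).
EMPLOYEES = ["ana", "ben", "chi"]
SHIFTS = ["mon_am", "mon_pm", "tue_am", "tue_pm"]
COST = {
    "ana": {"mon_am": 0, "mon_pm": 2, "tue_am": 1, "tue_pm": 3},
    "ben": {"mon_am": 2, "mon_pm": 0, "tue_am": 3, "tue_pm": 1},
    "chi": {"mon_am": 1, "mon_pm": 1, "tue_am": 0, "tue_pm": 0},
}
MAX_SHIFTS_PER_EMPLOYEE = 2  # hard constraint

def is_feasible(assignment):
    """assignment maps each shift to exactly one employee."""
    loads = {e: 0 for e in EMPLOYEES}
    for emp in assignment.values():
        loads[emp] += 1
    return all(n <= MAX_SHIFTS_PER_EMPLOYEE for n in loads.values())

def cost_of(assignment):
    return sum(COST[emp][shift] for shift, emp in assignment.items())

def optimal():
    """Exhaustively search every feasible assignment (3**4 = 81 here)."""
    best = None
    for combo in product(EMPLOYEES, repeat=len(SHIFTS)):
        assignment = dict(zip(SHIFTS, combo))
        if not is_feasible(assignment):
            continue
        if best is None or cost_of(assignment) < cost_of(best):
            best = assignment
    return best
```

Brute force obviously stops scaling long before 3,360 assignment slots, which is exactly why the real benchmark leans on CP-SAT for ground truth while the model under test only ever sees the instance description.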
- Deployed site
- Kaggle write-up
- Source: /Users/nlothian/dev/fifthvertex/optimizer-benchmark-kaggle
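The continuous 0–100 scale mentioned above can be sketched as a scoring function. The exact formula the benchmark uses is an assumption here; one natural choice, shown below, interpolates linearly between the OR-Tools-verified optimum (100) and the worst feasible plan (0), with infeasible or constraint-violating output scored 0 outright.

```python
def score(candidate_cost, optimal_cost, worst_cost, feasible=True):
    """Map a model's plan onto a continuous 0-100 scale.

    100 = matches the verified optimum; 0 = infeasible, or no better
    than the worst feasible plan. This formula is an illustrative
    assumption, not necessarily the benchmark's exact scoring rule.
    """
    if not feasible:
        return 0.0
    if worst_cost == optimal_cost:  # degenerate instance: any plan is optimal
        return 100.0
    frac = (worst_cost - candidate_cost) / (worst_cost - optimal_cost)
    return round(100.0 * max(0.0, min(1.0, frac)), 1)
```

A scheme like this is what lets a model earn partial credit for a near-optimal schedule instead of the binary pass/fail that plagues exact-match benchmarks.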
Results by tier
| Tier | Mean score across all models (0–100) |
|---|---|
| Small | 81.5 |
| Medium | 58.3 |
| Large | 28.6 |
Top of the leaderboard
| # | Model | Avg | Small | Medium | Large |
|---|---|---|---|---|---|
| 1 | gemini-3.1-pro-preview | 84.2 | 100.0 | 98.9 | 53.8 |
| 2 | gpt-5.4-2026-03-05 | 71.4 | 97.7 | 85.4 | 31.2 |
| 3 | gemma-4-31b-it | 71.1 | 99.9 | 73.9 | 39.4 |
| 4 | claude-opus-4-6 | 63.1 | 88.1 | 53.6 | 47.6 |