Kaggle AGI Progress 2026: Optimizer Benchmark
The competition
Measuring Progress Toward AGI - Cognitive Abilities was a featured hackathon run by Google DeepMind and Kaggle. The premise: today’s general benchmarks conflate recall with reasoning, so it’s hard to tell whether a frontier model is genuinely solving a novel problem or pattern-matching against its training data. Entrants were asked to build benchmarks targeting one of five cognitive faculties drawn from DeepMind’s paper Measuring progress toward AGI: A cognitive framework — learning, metacognition, attention, executive functions, or social cognition — so that progress toward AGI becomes something you can actually measure rather than argue about.
I entered it — my first Kaggle competition in a while — on the Executive Functions track. My benchmark frames LLM planning as employee shift scheduling: a constraint-optimization problem where OR-Tools can compute a verifiably optimal answer, so a model’s output can be scored on a continuous 0–100 scale against ground truth instead of a binary pass/fail. Because instances are generated programmatically and scale cleanly from 105 to 3,360 assignment slots, the benchmark isolates planning from general reasoning and shows where each model’s planning budget runs out — which is exactly the kind of cognitive profile the competition asked for.
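To make the "verifiably optimal answer" idea concrete, here is a toy sketch of the setup. The real benchmark generates instances programmatically and solves them with OR-Tools CP-SAT; this version invents a tiny hypothetical instance (employee names, shifts, and costs are all illustrative) and brute-forces the optimum so the whole thing fits in a few lines.

```python
from itertools import product

# Hypothetical toy instance. cost[e][s] is the penalty for assigning
# employee e to shift s (0 = preferred, higher = disliked).
EMPLOYEES = ["ana", "ben", "chi"]
SHIFTS = ["mon_am", "mon_pm", "tue_am", "tue_pm"]
COST = {
    "ana": {"mon_am": 0, "mon_pm": 2, "tue_am": 1, "tue_pm": 3},
    "ben": {"mon_am": 2, "mon_pm": 0, "tue_am": 3, "tue_pm": 1},
    "chi": {"mon_am": 1, "mon_pm": 1, "tue_am": 0, "tue_pm": 0},
}
MAX_SHIFTS_PER_EMPLOYEE = 2  # hard constraint

def is_feasible(assignment):
    """assignment maps each shift to exactly one employee."""
    loads = {e: 0 for e in EMPLOYEES}
    for emp in assignment.values():
        loads[emp] += 1
    return all(n <= MAX_SHIFTS_PER_EMPLOYEE for n in loads.values())

def cost_of(assignment):
    return sum(COST[emp][shift] for shift, emp in assignment.items())

def optimal():
    """Exhaustively search every feasible assignment (3**4 = 81 here)."""
    best = None
    for combo in product(EMPLOYEES, repeat=len(SHIFTS)):
        assignment = dict(zip(SHIFTS, combo))
        if not is_feasible(assignment):
            continue
        if best is None or cost_of(assignment) < cost_of(best):
            best = assignment
    return best
```

Brute force obviously stops scaling long before 3,360 assignment slots, which is exactly why the real benchmark leans on CP-SAT for ground truth while the model under test only ever sees the instance description.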
- Deployed site
- Kaggle write-up
- Source: /Users/nlothian/dev/fifthvertex/optimizer-benchmark-kaggle
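The continuous 0–100 scale mentioned above can be sketched as a scoring function. The exact formula the benchmark uses is an assumption here; one natural choice, shown below, interpolates linearly between the OR-Tools-verified optimum (100) and the worst feasible plan (0), with infeasible or constraint-violating output scored 0 outright.

```python
def score(candidate_cost, optimal_cost, worst_cost, feasible=True):
    """Map a model's plan onto a continuous 0-100 scale.

    100 = matches the verified optimum; 0 = infeasible, or no better
    than the worst feasible plan. This formula is an illustrative
    assumption, not necessarily the benchmark's exact scoring rule.
    """
    if not feasible:
        return 0.0
    if worst_cost == optimal_cost:  # degenerate instance: any plan is optimal
        return 100.0
    frac = (worst_cost - candidate_cost) / (worst_cost - optimal_cost)
    return round(100.0 * max(0.0, min(1.0, frac)), 1)
```

A scheme like this is what lets a model earn partial credit for a near-optimal schedule instead of the binary pass/fail that plagues exact-match benchmarks.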
Results by tier
| Tier | Mean score across all models (0–100) |
|---|---|
| Small | 81.5 |
| Medium | 58.3 |
| Large | 28.6 |
Top of the leaderboard
| # | Model | Avg | Small | Medium | Large |
|---|---|---|---|---|---|
| 1 | gemini-3.1-pro-preview | 84.2 | 100.0 | 98.9 | 53.8 |
| 2 | gpt-5.4-2026-03-05 | 71.4 | 97.7 | 85.4 | 31.2 |
| 3 | gemma-4-31b-it | 71.1 | 99.9 | 73.9 | 39.4 |
| 4 | claude-opus-4-6 | 63.1 | 88.1 | 53.6 | 47.6 |