4mo ago

LLMs average <5% on 2025 Math Olympiad; award each other 20x points

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

"Notably, O3-MINI, despite being one of the best reasoning models, frequently skipped essential proof steps by labeling them as "trivial", even when their validity was crucial."