LLMs average <5% on the 2025 USA Math Olympiad; asked to grade, they award each other up to 20x the deserved points
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (arxiv.org)

Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these bench...


"Notably, O3-MINI, despite being one of the best reasoning models, frequently skipped essential proof steps by labeling them as "trivial", even when their validity was crucial."

slop_as_a_service @awful.systems