From Silver to Gold: How DeepMind’s AI Conquered the Math Olympiad

DeepMind’s AI has made remarkable progress in mathematical reasoning within the span of just one year. After earning a silver medal at the International Mathematical Olympiad (IMO) in 2024, its AI system claimed a gold medal in 2025. This rapid advancement highlights the growing capability of artificial intelligence to tackle complex, abstract problems that demand human-like creativity and insight. This article walks through how DeepMind achieved this transformation, the technical and strategic choices behind it, and the broader implications of these advancements.

The Significance of the IMO

The International Mathematical Olympiad, established in 1959, is recognized globally as the premier mathematics competition for high school students. Each year, top students from around the world face six daunting problems across algebra, geometry, number theory, and combinatorics. Solving these problems requires much more than computation; participants must show real mathematical creativity, rigorous logical thinking, and the ability to construct elegant proofs.

For artificial intelligence, the IMO presents a unique challenge. While AI has mastered pattern recognition, data analysis, and even complex games like Go and chess, Olympiad mathematics demands creative, abstract reasoning and the synthesis of new ideas, skills traditionally considered hallmarks of human intelligence. As a result, the IMO has become a natural testbed for evaluating how close AI is to achieving truly human-like reasoning.

The Silver Medal Breakthrough of 2024

In 2024, DeepMind introduced two AI systems to tackle IMO-level problems: AlphaProof and AlphaGeometry 2. Both systems are examples of “neuro-symbolic” AI, combining the strengths of large language models (LLMs) with the rigor of symbolic logic.

AlphaProof was designed to prove mathematical statements using Lean, a formal mathematical language. It combined Gemini, DeepMind’s large language model, with AlphaZero, a reinforcement learning engine known for its success in board games. In this setting, Gemini’s role was to translate natural language problems into Lean and to attempt proofs by generating logical steps. AlphaProof was trained on millions of sample problems spanning different mathematical disciplines and difficulty levels. The system improved itself by trying to prove increasingly complex statements, much as AlphaZero learned by playing games against itself.
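To make the formal side concrete, here is a toy Lean 4 statement and proof of the kind AlphaProof manipulates. The lemma is purely illustrative; real IMO statements are vastly harder.

```lean
import Mathlib.Tactic

-- Toy example: the sum of two even integers is even.
-- AlphaProof works with Lean statements of this shape, but at IMO
-- difficulty, searching over tactic steps much as AlphaZero searched
-- over game moves.
theorem even_add_even (a b : ℤ) (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n := by
  obtain ⟨k, hk⟩ := ha   -- unpack the witness for a
  obtain ⟨m, hm⟩ := hb   -- unpack the witness for b
  exact ⟨k + m, by rw [hk, hm]; ring⟩  -- exhibit the witness k + m
```

Because Lean mechanically checks every step, a completed proof is guaranteed correct, which is what made the self-improvement loop possible: the verifier supplies an unambiguous training signal.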

AlphaGeometry 2 was designed to solve geometry problems. Here, Gemini’s language understanding enabled the AI to predict helpful auxiliary constructions, while a symbolic reasoning engine handled the logical deductions. This hybrid approach allowed AlphaGeometry 2 to tackle geometric problems well beyond the scope of traditional machine reasoning.
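The division of labor can be pictured as a simple loop: the symbolic engine deduces everything it can, and when it stalls, the language model proposes a new auxiliary object to unlock further deductions. The sketch below is a minimal illustration of that loop under heavy simplification; `propose_construction` stands in for the language model and is entirely hypothetical, and the forward-chaining engine is a toy.

```python
# Minimal sketch of a neuro-symbolic geometry loop in the spirit of
# AlphaGeometry 2. Facts and rules are plain strings; a real engine
# would use a rich geometric representation.

def deduce(facts: set[str], rules: list[tuple[frozenset, str]]) -> set[str]:
    """Forward-chain: apply every rule whose premises are all known."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

def propose_construction(facts: set[str]) -> str:
    """Placeholder for the LLM step: suggest an auxiliary object."""
    return "midpoint M of segment AB"  # a fixed toy suggestion

def solve(goal: str, facts: set[str], rules, max_rounds: int = 5) -> bool:
    """Alternate exhaustive deduction with LLM-proposed constructions."""
    for _ in range(max_rounds):
        facts = deduce(facts, rules)
        if goal in facts:
            return True
        facts.add(propose_construction(facts))  # widen the search space
    return goal in deduce(facts, rules)
```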

Together, these systems solved four of the six IMO problems: two in algebra, one in number theory, and one in geometry, achieving a score of 28 out of 42. This performance was a significant milestone, as it was the first time an AI had reached the silver medal level at the IMO. However, the success relied heavily on human experts to translate the problems into formal mathematical languages, and the systems required massive computational resources, with some problems taking days of processing time.

Technical Innovations Behind the Gold Medal

DeepMind’s transition from a silver to a gold medal performance was driven by several significant technical improvements.

1. Natural Language as the Medium for Proofs

The most significant change was the shift from systems that required expert translation into formal languages to treating natural language itself as the medium for proofs. This shift was achieved through an enhanced version of Gemini equipped with Deep Think capabilities. Rather than converting problems into Lean, the model processes the text directly, generates informal sketches, internally formalizes critical steps, and produces a refined English proof. Reinforcement learning from human feedback (RLHF) was used to reward solutions that were logically consistent, concise, and clearly presented.
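That reward can be pictured as a weighted combination of quality signals. The function below is a deliberately simplified illustration; the component scores, weights, and target length are all invented, since DeepMind has not published its exact reward model.

```python
# Illustrative reward shaping for proof quality. All weights and the
# soft length target are assumptions for the sake of the example.

def proof_reward(consistency: float, clarity: float, length_tokens: int,
                 target_len: int = 800) -> float:
    """Combine logical consistency, presentation, and brevity into one scalar.

    consistency, clarity: scores in [0, 1] from hypothetical verifier and
    preference models; proofs past the soft target lose brevity credit.
    """
    brevity = min(1.0, target_len / max(length_tokens, 1))
    return 0.6 * consistency + 0.2 * clarity + 0.2 * brevity
```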

Gemini Deep Think differs from the public version of Gemini in two main ways. First, it allocates longer context windows and more computation per query, which enables the model to maintain multi-page chains of thought. Second, it uses parallel reasoning, in which hundreds of speculative threads are generated for different potential solutions. A lightweight supervisor then ranks and promotes the most promising paths, borrowing concepts from Monte Carlo tree search but applying them to text. This approach mimics how human teams brainstorm, discard unproductive ideas, and converge on elegant solutions.
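A toy version of that rank-and-promote loop is sketched below. The `extend` and `score` callables stand in for the thread generator and the lightweight supervisor; both are hypothetical, and a real system would run the threads in parallel rather than in this sequential beam.

```python
import heapq

# Toy supervisor over speculative reasoning threads, loosely in the
# spirit of the parallel search described above.

def best_first_reasoning(problem: str, extend, score,
                         beam: int = 4, steps: int = 10) -> str:
    """Grow chains of thought, keeping only the `beam` best each round.

    extend(problem, chain) -> list of continuation strings (hypothetical LLM)
    score(problem, chain)  -> float, higher is better (hypothetical ranker)
    """
    # Store (-score, chain) so heapq's min-ordering yields the best first.
    frontier = [(-score(problem, ""), "")]
    for _ in range(steps):
        survivors = heapq.nsmallest(beam, frontier)      # best `beam` threads
        frontier = []
        for neg, chain in survivors:
            for continuation in extend(problem, chain):  # branch each thread
                new_chain = chain + continuation
                heapq.heappush(frontier, (-score(problem, new_chain), new_chain))
        if not frontier:       # nothing extended; keep the survivors
            frontier = survivors
    return min(frontier)[1]    # highest-scoring chain of thought
```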

2. Training and Reinforcement Learning

Training Gemini Deep Think involved fine-tuning the model to predict next steps rather than final answers. For this purpose, a corpus of 100,000 high-quality Olympiad and undergraduate contest solutions was compiled, drawn mainly from public math forums, arXiv preprints, and college problem sets. Human mentors reviewed the training examples to filter out illogical or incomplete proofs. Reinforcement learning then refined the model, nudging it toward concise and precise proofs: early versions were overly verbose, but penalties on redundant phrasing helped trim the output.

Conventional fine-tuning often struggles with sparse rewards, where feedback is binary: the proof is either correct or it is not. DeepMind instead implemented a stepwise reward system in which each verified sub-lemma contributed to the overall score. This mechanism guided Gemini even when complete proofs were infrequent. The training process spanned three months and consumed approximately 25 million TPU-hours.
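The stepwise scheme amounts to granting partial credit per verified sub-lemma instead of a single pass/fail signal. A minimal sketch, with `verify` as a placeholder for a formal or model-based checker:

```python
# Dense, stepwise reward: partial credit per verified sub-lemma, plus a
# bonus when the whole chain checks out. `verify` is a placeholder.

def stepwise_reward(sub_lemmas: list[str], verify,
                    completion_bonus: float = 1.0) -> float:
    """Return the fraction of verified steps, plus a full-proof bonus."""
    checks = [verify(lemma) for lemma in sub_lemmas]
    partial = sum(checks) / max(len(checks), 1)  # fraction of steps verified
    return partial + (completion_bonus if all(checks) else 0.0)
```

Dense credit of this kind keeps a learning signal flowing early in training, when the model almost never completes a full proof.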

3. Massive Parallelization

Parallelization also played a critical role in DeepMind’s advance from silver to gold. Each problem spawned multiple reasoning branches in parallel, with resources dynamically shifting to more promising avenues when others stalled. This dynamic scheduling was particularly beneficial for combinatorics problems, which have large solution spaces; the approach resembles how a human solver tests auxiliary inequalities before committing to a full induction. While the technique was computationally expensive, it was manageable on DeepMind’s TPU v5 clusters.
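One simple way to picture such dynamic scheduling is an epsilon-greedy allocator that feeds compute to the branch showing the most progress while occasionally revisiting stalled ones. The rule below is an illustrative simplification, not DeepMind’s published scheduler.

```python
import random

# Toy dynamic scheduler: shift compute toward reasoning branches that
# report progress. The epsilon-greedy rule is an illustrative
# simplification, not DeepMind's actual scheduling policy.

def schedule(progress: dict[str, float], budget: int,
             epsilon: float = 0.1) -> dict[str, int]:
    """Allocate `budget` compute slices across branches by progress score."""
    allocation = {name: 0 for name in progress}
    for _ in range(budget):
        if random.random() < epsilon:   # occasionally revisit stalled branches
            pick = random.choice(list(progress))
        else:                           # exploit the most promising branch
            pick = max(progress, key=progress.get)
        allocation[pick] += 1
    return allocation

# Example: most of a 100-slice budget flows to the strongest branch.
print(schedule({"induction": 0.7, "invariant": 0.4, "extremal": 0.1}, 100))
```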

DeepMind at the IMO 2025

To maintain the integrity of the competition, DeepMind froze the model’s weights three weeks before the IMO to prevent any leakage of the official problems into the training set. They also filtered the training data to remove solutions to unpublished Olympiad questions.

During the competition, Gemini Deep Think was given the six official problems in plain text, with no internet access. The system ran on a cluster configured so that each process simulated the computational power of a standard laptop. The entire problem-solving process was completed in under three hours, well within the competition’s time constraints, and the generated proofs were submitted to the IMO coordinators without alteration.

Gemini Deep Think earned perfect scores on the first five problems. The final question, a challenging combinatorics puzzle, stumped both the AI and 94% of the human participants. Even so, the AI finished with a total score of 35/42, securing a gold medal and beating the previous year’s silver performance by seven points. Observers later described the AI’s proofs as ‘diligent’ and ‘complete,’ noting that they met the rigorous standards of justification expected of human contestants.

Implications for AI and Mathematics

DeepMind’s achievement is a significant milestone for both AI and mathematics. For AI, mastering the IMO is a step towards artificial general intelligence (AGI), where systems can perform any intellectual task that a human can. Solving complex mathematical problems requires reasoning and understanding, which are fundamental components of general intelligence. This success indicates that AI is making strides toward more human-like cognitive abilities.

For mathematics, AI systems like Gemini Deep Think can become invaluable tools for mathematicians. They can assist in exploring new areas, verifying conjectures, and even discovering new theorems. By automating the more tedious aspects of proof construction, AI frees human mathematicians to focus on higher-level conceptual work. Additionally, the techniques developed for these AI systems could inspire new methods in mathematical research that may not be possible through human effort alone.

However, the progress of AI in mathematics also raises questions about the role of AI in educational settings and competitions. As AI’s capabilities continue to grow, there will be debates about how its involvement might alter the nature of mathematical education and competition.

Looking Forward

Winning IMO gold is a significant milestone, but many mathematical challenges remain out of reach for current AI systems. Still, the rapid advance from silver to gold in just one year highlights the accelerating pace of AI innovation. If that pace continues, AI systems may soon tackle some of mathematics’ most famous unsolved problems. While the question of whether AI will replace or enhance human creativity remains unresolved, the 2025 IMO is a clear indication that artificial intelligence has made significant strides in logical reasoning.

Dr. Tehseen Zia is a Tenured Associate Professor at COMSATS University Islamabad, holding a PhD in AI from Vienna University of Technology, Austria. Specializing in Artificial Intelligence, Machine Learning, Data Science, and Computer Vision, he has made significant contributions with publications in reputable scientific journals. Dr. Tehseen has also led various industrial projects as the Principal Investigator and served as an AI Consultant.