Gemini 3.1 Pro Hits Record Reasoning Gains

Google released Gemini 3.1 Pro on February 19, an update to its flagship AI model that more than doubles reasoning performance while keeping pricing identical to its predecessor.

The most striking number: on ARC-AGI-2, a benchmark that tests whether models can solve entirely novel logic patterns rather than recalling training data, Gemini 3.1 Pro scores 77.1%. Gemini 3 Pro scored 31.1%. That 46-percentage-point jump is the largest single-generation reasoning gain in any frontier model family.

The model is available immediately across Google’s consumer and developer platforms. Gemini app users on AI Pro and AI Ultra plans get access with higher usage limits, while developers can access 3.1 Pro through the Gemini API in AI Studio, Vertex AI, Gemini CLI, Antigravity, and Android Studio. NotebookLM also gains the upgrade for Pro and Ultra subscribers.

Pricing holds at $2 per million input tokens for prompts under 200,000 tokens, rising to $4 for longer contexts. Output costs $12 per million tokens. For anyone already using Gemini 3 Pro through the API, the upgrade is free.
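Those rates are easy to turn into a per-request estimate. The sketch below assumes, as is typical for tiered token pricing, that the higher $4 rate applies to the entire prompt once it crosses the 200,000-token threshold; the function name and structure are illustrative, not part of any Google SDK.

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a Gemini 3.1 Pro API call's cost in dollars from published rates."""
    # Input: $2 per million tokens under 200k, $4 per million at or above it.
    input_rate = 2.0 if input_tokens < 200_000 else 4.0
    output_rate = 12.0  # $12 per million output tokens
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# A 100k-token prompt with a 5k-token response:
print(gemini_31_pro_cost(100_000, 5_000))  # 0.26 (i.e., 26 cents)
```

At these rates, even a long-context call with a 500,000-token prompt and 20,000 output tokens comes to about $2.24, which is the kind of arithmetic behind the article's point that pricing now favors Gemini for comparable capability.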

Benchmark Performance Across the Board

The model card shows Gemini 3.1 Pro claiming first place on 12 of 18 tracked benchmarks. Beyond ARC-AGI-2, the standouts include 94.3% on GPQA Diamond, a graduate-level science reasoning test, and 2,887 Elo on LiveCodeBench Pro, the highest score across all frontier models for competitive programming.

On Humanity’s Last Exam—a benchmark drawn from crowdsourced expert questions across academic disciplines—3.1 Pro reaches 44.4%, up from 37.5% for Gemini 3 Pro and ahead of GPT-5.2’s 34.5%. The multilingual MMLU benchmark shows 92.6%, and long-context accuracy at 128,000 tokens holds at 84.9%.

The model retains a 1 million token input context window and generates up to 64,000 output tokens, specifications suited to AI coding tools that need to ingest entire codebases and produce substantial code blocks in a single session.

Where 3.1 Pro doesn’t lead is also telling. On SWE-Bench Verified, a test of real-world software engineering tasks, it scores 80.6%—just behind Anthropic’s Claude Opus 4.6 at 80.8%. The gap is marginal, but it shows Anthropic retains a narrow edge in the practical coding tasks that drive enterprise adoption.

What Dynamic Thinking Changes

Gemini 3.1 Pro uses dynamic thinking by default, an approach where the model adjusts how much internal reasoning it applies based on the complexity of each prompt. Simple questions get fast answers. Complex multi-step problems trigger deeper processing chains before the model generates its response.

Developers can control this behavior through a thinking_level parameter in the API, setting the maximum depth of internal reasoning. This addresses a tension in reasoning models: extended thinking improves accuracy on hard problems but adds latency and cost for straightforward queries. Dynamic thinking attempts to automate that tradeoff.

The feature reflects a broader industry shift. OpenAI’s o-series models introduced chain-of-thought reasoning as a selectable mode. Anthropic’s Claude uses extended thinking as an opt-in feature. Google’s approach of making it the default—with variable intensity—bets that most users would rather let the model decide how hard to think than manage that decision themselves.

The Competitive Field Tightens

Gemini 3.1 Pro arrives in a market where benchmark leadership changes hands monthly. Google’s Gemini 3 triggered a “code red” at OpenAI that produced GPT-5.2 in under a month. Anthropic has been shipping Claude updates at an accelerating pace. Each release narrows the gap between models, making the choice between platforms increasingly dependent on ecosystem and pricing rather than raw capability.

Google’s advantage remains distribution. Gemini 3.1 Pro slots directly into products used by hundreds of millions of people: Gmail, Docs, Search, and the Personal Intelligence features that connect the model to users’ personal data. The model also powers Gemini Enterprise and Gemini CLI, giving developers and businesses access through tools they already use.

For developers choosing between frontier models, the pricing decision has gotten easier. At $2 per million input tokens, Gemini 3.1 Pro undercuts both OpenAI’s and Anthropic’s flagship pricing for comparable capability. The no-cost upgrade from 3 Pro removes any migration friction for existing users.

The reasoning gains matter most for agentic applications—AI systems that plan, execute multi-step tasks, and use tools autonomously. ARC-AGI-2 specifically tests the kind of novel pattern recognition that agents need when encountering problems their training data didn’t cover. A model that scores 77.1% on that test handles unfamiliar situations far more reliably than one scoring 31.1%.

Whether these benchmark gains translate to proportional real-world improvements is the question Google will need to answer over the coming weeks. Benchmarks capture specific capabilities under controlled conditions; actual user experience depends on how the model performs across the unpredictable range of tasks people throw at it. The ARC-AGI-2 jump suggests 3.1 Pro handles novelty better than any model before it. What users do with that capability will determine whether the numbers matter.

Alex McFarland is an AI journalist and writer exploring the latest developments in artificial intelligence. He has collaborated with numerous AI startups and publications worldwide.