Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77.1 Percent ARC-AGI-2 Reasoning for AI Agents

Google has officially shifted the Gemini era into high gear with the release of Gemini 3.1 Pro, the first version update in the Gemini 3 series. This release is not just a minor patch; it is a targeted strike at the ‘agentic’ AI market, focusing on reasoning stability, software engineering, and tool-use reliability.

For developers, this update signals a transition. We are moving from models that simply ‘chat’ to models that ‘work.’ Gemini 3.1 Pro is designed to be the core engine for autonomous agents that can navigate file systems, execute code, and reason through scientific problems with a success rate that now rivals, and in some cases exceeds, that of the industry’s leading frontier models.

Massive Context, Precise Output

One of the most immediate technical upgrades is the handling of scale. Gemini 3.1 Pro Preview maintains a massive 1M token input context window. To put this in perspective for software engineers: you can now feed the model an entire medium-sized code repository, and it will have enough ‘memory’ to understand the cross-file dependencies without losing the plot.

However, the real news is the 65k token output limit, a significant jump for developers building long-form generators. Whether you are generating a 100-page technical manual or a complex, multi-module Python application, the model is far more likely to finish the job in a single turn without hitting an abrupt ‘max token’ wall.
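
As a rough illustration of the ‘does my repository fit’ question, here is a minimal Python sketch. It assumes a heuristic of about 4 characters per token, which is only an approximation and not the model’s real tokenizer; the function names and the file extensions are hypothetical choices for this example.

```python
import os

CONTEXT_LIMIT_TOKENS = 1_000_000  # input window quoted in the article
CHARS_PER_TOKEN = 4               # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_repo_tokens(root: str, extensions=(".py", ".md", ".toml")) -> int:
    """Sum the estimated token count of every matching file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits_in_context(token_count: int, reserve: int = 0) -> bool:
    """True if the estimate, plus a reserve for instructions, fits the window."""
    return token_count + reserve <= CONTEXT_LIMIT_TOKENS
```

For a real budget check, the estimate should come from the API’s own token counting rather than a character heuristic; the sketch is only meant to show the shape of the calculation.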

Doubling Down on Reasoning

If Gemini 3.0 was about introducing ‘Deep Thinking,’ Gemini 3.1 is about making that thinking efficient. The performance jumps on rigorous benchmarks are notable:

Benchmark	Score	What it measures
ARC-AGI-2	77.1%	Ability to solve entirely new logic patterns
GPQA Diamond	94.1%	Graduate-level scientific reasoning
SciCode	58.9%	Python programming for scientific computing
Terminal-Bench Hard	53.8%	Agentic coding and terminal use
Humanity’s Last Exam (HLE)	44.7%	Expert-level questions across a broad range of academic subjects

The 77.1% on ARC-AGI-2 is the headline figure here. Google claims this represents more than double the reasoning performance of the original Gemini 3 Pro. If the result holds up in practice, it suggests the model is less likely to rely on pattern matching from its training data and more capable of ‘figuring it out’ when faced with a novel edge case in a dataset.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

The Agentic Toolkit: Custom Tools and ‘Antigravity’

Google is making a clear play for the developer’s terminal. Along with the main model, it launched a specialized endpoint: gemini-3.1-pro-preview-customtools.

This endpoint is optimized for developers who mix bash commands with custom functions. In previous versions, models often struggled to prioritize which tool to use, sometimes hallucinating a search when a local file read would have sufficed. The customtools variant is specifically tuned to prioritize tools like view_file or search_code, making it a more reliable backbone for autonomous coding agents.
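
To make the idea of custom tools concrete, here is a minimal client-side sketch in Python. Only the tool names view_file and search_code come from the article; their signatures, the shape of the tool call ({'name': ..., 'args': {...}}), and the dispatch helper are assumptions for illustration, not the Gemini API’s actual schema.

```python
from pathlib import Path
from typing import Callable, Dict, List


def view_file(path: str, max_chars: int = 2000) -> str:
    """Return the first max_chars characters of a text file."""
    return Path(path).read_text(encoding="utf-8", errors="ignore")[:max_chars]


def search_code(root: str, needle: str) -> List[str]:
    """Return 'path:line_number' for every .py line under root containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append(f"{path}:{number}")
    return hits


# Registry of local tools the agent is allowed to run.
TOOLS: Dict[str, Callable] = {"view_file": view_file, "search_code": search_code}


def dispatch(call: dict):
    """Run one tool call of the hypothetical form {'name': ..., 'args': {...}}."""
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("args", {}))
```

Keeping an explicit registry means a tool name the model invents is rejected instead of being executed, which is one simple way to guard against the mis-prioritized or hallucinated tool calls described above.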

This release also integrates deeply with Google Antigravity, the company’s new agentic development platform. Developers can now utilize a new ‘medium’ thinking level. This allows you to toggle the ‘reasoning budget’—using high-depth thinking for complex debugging while dropping to medium or low for standard API calls to save on latency and cost.
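
One way to think about the reasoning budget is as a per-request policy. The Python sketch below picks a level from the three named in the article (low, medium, high); the task categories, the mapping, and the function name are hypothetical, and how the chosen level is actually passed to the API is not shown here.

```python
THINKING_LEVELS = ("low", "medium", "high")  # ordered from cheapest to deepest

# Hypothetical mapping from task type to a default thinking level.
DEFAULT_LEVEL_BY_TASK = {
    "classification": "low",
    "api_call": "medium",
    "debugging": "high",
}


def pick_thinking_level(task_type: str, latency_sensitive: bool = False) -> str:
    """Choose a thinking level, stepping down one level when latency matters."""
    level = DEFAULT_LEVEL_BY_TASK.get(task_type, "medium")
    if latency_sensitive and level != "low":
        level = THINKING_LEVELS[THINKING_LEVELS.index(level) - 1]
    return level
```

With this policy, a debugging request defaults to high, while a latency-sensitive API call drops from medium to low.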

API Breaking Changes and New File Methods

For those already building on the Gemini API, there is a small but critical breaking change. In the Interactions API v1beta, the field total_reasoning_tokens has been renamed to total_thought_tokens. This change aligns with the ‘thought signatures’ introduced in the Gemini 3 family—encrypted representations of the model’s internal reasoning that must be passed back to the model to maintain context in multi-turn agentic workflows.
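
Code that reads this field can tolerate both names during a migration. The Python sketch below assumes the usage data is available as a plain dict; where the field sits in the real response object is not stated in the article, so the helper name and the dict shape are assumptions.

```python
def thought_token_count(usage: dict) -> int:
    """Read the thought-token count, preferring the new field name."""
    if "total_thought_tokens" in usage:
        return usage["total_thought_tokens"]
    # Fall back to the pre-rename field so older payloads still work.
    return usage.get("total_reasoning_tokens", 0)
```

Preferring the new name and falling back to the old one lets the same code handle logged responses from before the rename.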

The model’s appetite for data has also grown. Key updates to file handling include:

100MB File Limit: The previous 20MB cap for API uploads has been quintupled to 100MB.
Direct YouTube Support: You can now pass a YouTube URL directly as a media source. The model ‘watches’ the video via the URL rather than requiring a manual upload.
Cloud Integration: Cloud Storage buckets and pre-signed URLs for private databases are now supported as direct data sources.
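
The file-handling options above can be sketched as a small routing helper in Python. The category names, the list of YouTube hostnames, and the reading of ‘100MB’ as 100 MiB are all assumptions made for this example; the article does not specify them.

```python
from urllib.parse import urlparse

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(source: str) -> str:
    """Label a media source so the caller can pick the right input method."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        if parsed.hostname in YOUTUBE_HOSTS:
            return "youtube_url"
        return "https_url"  # e.g. a pre-signed URL
    if parsed.scheme == "gs":
        return "cloud_storage"
    return "local_file"


def can_upload(size_bytes: int) -> bool:
    """True if a local file is within the upload limit quoted in the article."""
    return 0 <= size_bytes <= MAX_UPLOAD_BYTES
```

A caller would check can_upload only for sources classified as local_file; URL-based sources are passed by reference rather than uploaded.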

The Economics of Intelligence

Pricing for Gemini 3.1 Pro Preview remains aggressive. For prompts under 200k tokens, input costs $2 per 1 million tokens and output costs $12 per 1 million tokens. For contexts exceeding 200k tokens, the price rises to $4 per 1 million input tokens and $18 per 1 million output tokens.
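
The two pricing tiers can be turned into a quick cost estimate. The Python sketch below uses the rates quoted above and makes two assumptions the article does not confirm: the tier is chosen by prompt size alone and applies to the whole request, and a prompt of exactly 200k tokens is billed at the lower tier.

```python
TIER_THRESHOLD_TOKENS = 200_000  # prompt size at which the higher rates apply


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD from the quoted per-million rates."""
    if input_tokens <= TIER_THRESHOLD_TOKENS:
        input_rate, output_rate = 2.0, 12.0   # USD per 1M tokens
    else:
        input_rate, output_rate = 4.0, 18.0   # USD per 1M tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
```

For example, a 100k-token prompt with 10k output tokens comes to about $0.32, while a 500k-token prompt with the full 65k output comes to about $3.17 under these assumptions.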

Compared with competitors like Claude Opus 4.6 or GPT-5.2, Google is positioning Gemini 3.1 Pro as the ‘efficiency leader.’ According to data from Artificial Analysis, Gemini 3.1 Pro now holds the top spot on their Intelligence Index while costing roughly half as much to run as its nearest frontier peers.

Key Takeaways

1M Input Context, 65k Output: The model maintains a 1M token input window for large-scale data and repositories, while raising the output limit to 65k tokens for long-form code and document generation.
A Leap in Logic and Reasoning: The model scored 77.1% on the ARC-AGI-2 benchmark, which Google says is more than double the reasoning performance of the original Gemini 3 Pro. It also achieved 94.1% on GPQA Diamond for graduate-level science tasks.
Dedicated Agentic Endpoints: Google introduced a specialized gemini-3.1-pro-preview-customtools endpoint. It is tuned for agents that mix bash commands with custom functions, and it prioritizes tools like view_file and search_code for more reliable autonomous agents.
API Breaking Change: Developers must update their codebases as the field total_reasoning_tokens has been renamed to total_thought_tokens in the v1beta Interactions API to better align with the model’s internal “thought” processing.
Enhanced File and Media Handling: The API file size limit has increased from 20MB to 100MB. Additionally, developers can now pass YouTube URLs directly into the prompt, allowing the model to analyze video content without needing to download or re-upload files.