• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Tuesday, June 2, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale

Josh by Josh
April 18, 2026
in Al, Analytics and Automation
0
Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale


If you have ever stared at thousands of lines of integration test logs wondering which of the sixteen log files actually contains your bug, you are not alone — and Google now has data to prove it.

A team of Google researchers introduced Auto-Diagnose, an LLM-powered tool that automatically reads the failure logs from a broken integration test, finds the root cause, and posts a concise diagnosis directly into the code review where the failure showed up. On a manual evaluation of 71 real-world failures spanning 39 distinct teams, the tool correctly identified the root cause 90.14% of the time. It has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes authored by 22,962 distinct developers, with a ‘Not helpful’ rate of just 5.8% on the feedback received.

https://arxiv.org/pdf/2604.12108

The problem: integration tests are a debugging tax

Integration tests verify that multiple components of a distributed system actually communicate to each other correctly. The tests Auto-Diagnose targets are hermetic functional integration tests: tests where an entire system under test (SUT) — typically a graph of communicating servers — is brought up inside an isolated environment by a test driver, and exercised against business logic. A separate Google survey of 239 respondents found that 78% of integration tests at Google are functional, which is what motivated the scope.

Diagnosing integration test failures showed up as one of the top five complaints in EngSat, a Google-wide survey of 6,059 developers. A follow-up survey of 116 developers found that 38.4% of integration test failures take more than an hour to diagnose, and 8.9% take more than a day — versus 2.7% and 0% for unit tests.

The root cause is structural. Test driver logs usually surface only a generic symptom (a timeout, an assertion). The actual error lives somewhere inside one of the SUT component logs, often buried under recoverable warnings and ERROR-level lines that are not actually the cause.

https://arxiv.org/pdf/2604.12108

How Auto-Diagnose works

When an integration test fails, a pub/sub event triggers Auto-Diagnose. The system collects all test driver and SUT component logs at level INFO and above — across data centers, processes, and threads — then joins and sorts them by timestamp into a single log stream. That stream is dropped into a prompt template along with component metadata.

The model is Gemini 2.5 Flash, called with temperature = 0.1 (for near-deterministic, debuggable outputs) and topp = 0.8. Gemini was not fine-tuned on Google’s integration test data; this is pure prompt engineering on a general-purpose model.

The prompt itself is the most instructive part of this research. It walks the model through an explicit step-by-step protocol: scan log sections, read component context, locate the failure, summarize errors, and only then attempt a conclusion. Critically, it includes hard negative constraints — for example: if the logs do not contain lines from the component that failed, do not draw any conclusion.

The model’s response is post-processed into a markdown finding with ==Conclusion==, ==Investigation Steps==, and ==Most Relevant Log Lines== sections, then posted as a comment in Critique, Google’s internal code review system. Each cited log line is rendered as a clickable link.

Numbers from production

Auto-Diagnose averages 110,617 input tokens and 5,962 output tokens per execution, and posts findings with a p50 latency of 56 seconds and p90 of 346 seconds — fast enough that developers see the diagnosis before they have switched contexts.

Critique exposes three feedback buttons on a finding: Please fix (used by reviewers), Helpful, and Not helpful (both used by authors). Across 517 total feedback reports from 437 distinct developers, 436 (84.3%) were “Please fix” from 370 reviewers — by far the dominant interaction, and a sign that reviewers are actively asking authors to act on the diagnoses. Among dev-side feedback, the helpfulness ratio (H / (H + N)) is 62.96%, and the “Not helpful” rate (N / (PF + H + N)) is 5.8% — well under Google’s 10% threshold for keeping a tool live. Across 370 tools that post findings to Critique, Auto-Diagnose ranks #14 in helpfulness, putting it in the top 3.78%.

The manual evaluation also surfaced a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs were not properly saved on crash, and three were because SUT component logs were not saved when the component crashed — both real infrastructure bugs, reported back to the relevant teams. In production, around 20 ‘more information is needed‘ diagnoses have similarly helped surface infrastructure issues.

Key Takeaways

  • Auto-Diagnose hit 90.14% root-cause accuracy on a manual evaluation of 71 real-world integration test failures spanning 39 teams at Google, addressing a problem 6,059 developers ranked among their top five complaints in the EngSat survey.
  • The system runs on Gemini 2.5 Flash with no fine-tuning — just prompt engineering. A pub/sub trigger collects logs across data centers and processes, joins them by timestamp, and sends them to the model at temperature 0.1 and topp 0.8.
  • The prompt is engineered to refuse rather than guess. Hard negative constraints force the model to respond with “more information is needed” when evidence is missing — a deliberate trade-off that prevents hallucinated root causes and even helped surface real infrastructure bugs in Google’s logging pipeline.
  • In production since May 2025, Auto-Diagnose has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes from 22,962 developers, posting findings in a p50 of 56 seconds — fast enough that engineers see the diagnosis before switching contexts.

Check out the Pre-Print Paper here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us




Source_link

READ ALSO

JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

Meet Memory OS: A 6-Layer Open-Source Memory Stack Built on Top of Hermes Agent

Related Posts

JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines
Al, Analytics and Automation

JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

June 2, 2026
Meet Memory OS: A 6-Layer Open-Source Memory Stack Built on Top of Hermes Agent
Al, Analytics and Automation

Meet Memory OS: A 6-Layer Open-Source Memory Stack Built on Top of Hermes Agent

June 1, 2026
Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch
Al, Analytics and Automation

Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch

June 1, 2026
An Implementation of the Microsoft Agent Governance Toolkit for Safe AI Agent Tool Use with Policies, Approvals, Audit Logs, and Risk Controls
Al, Analytics and Automation

An Implementation of the Microsoft Agent Governance Toolkit for Safe AI Agent Tool Use with Policies, Approvals, Audit Logs, and Risk Controls

May 31, 2026
Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain
Al, Analytics and Automation

Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain

May 31, 2026
Genesis AI Releases Nyx, Quadrants, and Genesis World 1.0 Physics Platform for Scalable Robotics Foundation Model Evaluation
Al, Analytics and Automation

Genesis AI Releases Nyx, Quadrants, and Genesis World 1.0 Physics Platform for Scalable Robotics Foundation Model Evaluation

May 30, 2026
Next Post
Cyberpunk platformers, gallivanting geckos and other new indie games worth checking out

Cyberpunk platformers, gallivanting geckos and other new indie games worth checking out

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

From Amazon Port to Motel Yahoo

From Amazon Port to Motel Yahoo

June 21, 2025
Animated film made with Google AI

Animated film made with Google AI

January 26, 2026
Exposing Power of Artificial Creativity

Exposing Power of Artificial Creativity

November 25, 2025
How to Do the Octo-bonk Again Event (Deja Bonk) in Goat Simulator 3

How to Do the Octo-bonk Again Event (Deja Bonk) in Goat Simulator 3

April 22, 2026

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Partiful Is Putting Ticket Payments on Its Platform
  • The State of Brand Intelligence in 2026 (Based on G2 Data)
  • Our Guide to the Summer 2026 Issue
  • Testing Google’s Gemini Spark AI agent: it’s incredible, and creepy
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions