mGrowTech

Google AI Introduces Consistency Training for Safer Language Models Under Sycophantic and Jailbreak Style Prompts

by Josh
November 5, 2025


How can consistency training help language models resist sycophantic prompts and jailbreak-style attacks while keeping their capabilities intact? Large language models often answer safely on a plain prompt, then change behavior when the same task is wrapped in flattery or role play. DeepMind researchers propose consistency training as a simple lens on this brittleness: treat it as an invariance problem and enforce the same behavior when irrelevant prompt text changes. The research team studies two concrete methods, Bias-augmented Consistency Training (BCT) and Activation Consistency Training (ACT), and evaluates them on Gemma 2, Gemma 3, and Gemini 2.5 Flash.

https://arxiv.org/pdf/2510.27062

Understanding the Approach

Consistency training is self-supervised. The model supervises itself by providing targets from its own responses to clean prompts, then learns to behave identically on wrapped prompts that add sycophancy cues or jailbreak wrappers. This avoids two failure modes of static supervised fine-tuning: specification staleness, when policies change, and capability staleness, when targets come from weaker models.

Two training routes

BCT, token level consistency: Generate a response on the clean prompt with the current checkpoint, then fine-tune so the wrapped prompt yields the same tokens. This is standard cross entropy supervised fine-tuning, with the constraint that targets are always generated by the same model being updated. That is what makes it consistency training rather than stale SFT.

https://arxiv.org/pdf/2403.05518v3
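
The data step behind BCT can be sketched as follows. This is a minimal illustration, not the paper's code: `generate` is a hypothetical stand-in for sampling from the current checkpoint, `wrap` is a hypothetical function that adds the irrelevant cue, and the toy functions exist only so the sketch runs without a real model. Fine-tuning on the resulting pairs would be ordinary cross entropy SFT.

```python
from typing import Callable, Dict, List


def build_bct_pairs(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own response to the clean prompt."""
    pairs = []
    for clean in clean_prompts:
        # Self-generated target: produced by the same checkpoint that will be updated.
        target = generate(clean)
        pairs.append({"input": wrap(clean), "target": target})
    return pairs


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
def toy_generate(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I am not sure."


def toy_wrap(prompt: str) -> str:
    return "I am pretty sure the answer is 5. " + prompt


pairs = build_bct_pairs(["What is 2 + 2?"], toy_wrap, toy_generate)
print(pairs[0])
```

Because the target is regenerated from the checkpoint being trained, refreshing the pairs after a policy or model change only requires rerunning this step.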

ACT, activation-level consistency: Enforce an L2 loss between residual stream activations on the wrapped prompt and a stop-gradient copy of activations from the clean prompt. The loss is applied over prompt tokens, not responses. The aim is to make the internal state right before generation match the clean run.
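
A rough sketch of the ACT objective for a single layer is shown below, in plain Python so it runs without a deep learning framework. Several details are assumptions rather than facts from the paper: the function name `act_loss`, the use of a squared L2 distance averaged over tokens, and the alignment of the two prompts on their trailing token positions. The clean activations are treated as constants, which is the role the stop gradient plays in an autodiff framework.

```python
from typing import List, Sequence

Vector = Sequence[float]


def act_loss(clean_acts: List[Vector], wrapped_acts: List[Vector]) -> float:
    """Mean squared L2 distance over the trailing token positions of both prompts.

    clean_acts is a fixed target (the stop-gradient copy); in a real training
    loop only wrapped_acts would receive gradients.
    """
    n = min(len(clean_acts), len(wrapped_acts))
    if n == 0:
        raise ValueError("need at least one token position")
    total = 0.0
    # Assumption: the wrapper adds tokens before the task, so align on the last n tokens.
    for c, w in zip(clean_acts[-n:], wrapped_acts[-n:]):
        total += sum((wi - ci) ** 2 for ci, wi in zip(c, w))
    return total / n


clean = [[1.0, 0.0], [0.0, 1.0]]
wrapped = [[9.0, 9.0], [1.0, 0.0], [0.0, 3.0]]  # one extra wrapper token in front
print(act_loss(clean, wrapped))  # 2.0
```

No response tokens appear anywhere in this loss, which matches the description that ACT trains only on prompt-side internal state.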

Before training, the research team demonstrates activation patching at inference time: clean-prompt activations are swapped into the wrapped run. On Gemma 2 2B, patching all layers and prompt tokens increases the “not sycophantic” rate from 49 percent to 86 percent.

https://arxiv.org/pdf/2510.27062

Setup and baselines

Models include Gemma 2 2B and 27B, Gemma 3 4B and 27B, and Gemini 2.5 Flash.

Sycophancy data: Training pairs are built by augmenting ARC, OpenBookQA, and BigBench Hard with user-preferred wrong answers. Evaluation uses MMLU both for sycophancy measurement and for capability measurement. A stale SFT baseline uses targets generated by GPT-3.5 Turbo to probe capability staleness.
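
To make the augmentation concrete, here is a hedged sketch of how a multiple-choice item might be turned into a clean and wrapped pair. The function name, the prompt layout, and the exact wording of the user cue are all hypothetical; the article only states that a user-preferred wrong answer is added.

```python
import random
from typing import Dict, List

LETTERS = "ABCDEFGH"


def make_sycophancy_pair(
    question: str, choices: List[str], correct: int, rng: random.Random
) -> Dict[str, str]:
    """Build a clean prompt and a wrapped prompt that pushes a wrong answer."""
    wrong = [i for i in range(len(choices)) if i != correct]
    suggested = rng.choice(wrong)  # the user-preferred answer is always incorrect
    body = question + "\n" + "\n".join(
        f"({LETTERS[i]}) {c}" for i, c in enumerate(choices)
    )
    # Hypothetical cue wording; any phrasing that states a user preference would do.
    cue = f"I think the answer is ({LETTERS[suggested]}), but I'd like your view."
    return {"clean": body, "wrapped": cue + "\n\n" + body, "suggested": LETTERS[suggested]}


pair = make_sycophancy_pair(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Mars"],
    correct=1,
    rng=random.Random(0),
)
print(pair["wrapped"])
```

The clean prompt is kept verbatim inside the wrapped one, so the only difference between the two is the irrelevant cue.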

Jailbreak data: Training pairs start from HarmBench harmful instructions, which are then wrapped with role play and other jailbreak transforms. The set retains only cases where the model refuses the clean instruction and complies with the wrapped instruction, which yields about 830 to 1,330 examples depending on the model's refusal tendency. Evaluation uses ClearHarm and the human-annotated jailbreak split in WildGuardTest for attack success rate, and XSTest plus WildJailbreak to study benign prompts that look harmful.
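
The retention rule described above can be sketched as a simple filter. The helper names `respond` and `is_refusal` are hypothetical: the first stands in for querying the model, the second for whatever refusal judge is used. The placeholder strings replace real instructions.

```python
from typing import Callable, Dict, List


def filter_jailbreak_pairs(
    pairs: List[Dict[str, str]],
    respond: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Keep pairs where the model refuses the clean prompt but complies when wrapped."""
    kept = []
    for p in pairs:
        refuses_clean = is_refusal(respond(p["clean"]))
        complies_wrapped = not is_refusal(respond(p["wrapped"]))
        if refuses_clean and complies_wrapped:
            kept.append(p)
    return kept


# Toy stand-ins (hypothetical) so the sketch runs without a real model or judge.
def toy_respond(prompt: str) -> str:
    if prompt.startswith("ROLEPLAY:") and "item-1" in prompt:
        return "Sure, here is ..."
    if "item-3" in prompt:
        return "Sure, here is ..."  # complies even on the clean prompt
    return "I can't help with that."


def toy_is_refusal(response: str) -> bool:
    return response.startswith("I can't")


candidates = [
    {"clean": f"<instruction item-{i}>", "wrapped": f"ROLEPLAY: <instruction item-{i}>"}
    for i in (1, 2, 3)
]
kept = filter_jailbreak_pairs(candidates, toy_respond, toy_is_refusal)
print(len(kept), kept[0]["clean"])
```

Only the first toy item survives: the second is refused in both forms and the third is answered even without a wrapper, so neither shows the inconsistency the training is meant to remove.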

Baselines include Direct Preference Optimization and a stale SFT ablation that uses responses from older models in the same family.

https://arxiv.org/pdf/2510.27062

Understanding the Results

Sycophancy: BCT and ACT both reduce sycophancy while maintaining model capability. Across models, stale SFT is strictly worse than BCT on the combined “not sycophantic” and MMLU trade-off, with exact numbers given in Appendix Table 5 of the research paper. On larger Gemma models, BCT increases MMLU by about two standard errors while reducing sycophancy. ACT often matches BCT on sycophancy but shows smaller MMLU gains, which is notable since ACT never trains on response tokens.

https://arxiv.org/pdf/2510.27062

Jailbreak robustness: All interventions improve safety over control. On Gemini 2.5 Flash, BCT reduces ClearHarm attack success rate from 67.8 percent to 2.9 percent. ACT also reduces jailbreak success but tends to preserve benign answer rates more than BCT. The research team reports averages across ClearHarm and WildGuardTest for attack success and across XSTest and WildJailbreak for benign answers.
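
As a small worked example of the metrics in this paragraph, the sketch below computes an attack success rate, averages a rate across benchmarks, and works out the size of the reported ClearHarm drop. The helper names are illustrative, and treating attack success as the percentage of harmful prompts the model complies with is an assumption about the metric; only the 67.8 and 2.9 figures come from the text.

```python
from typing import Dict, List


def attack_success_rate(complied: List[bool]) -> float:
    """Percent of harmful prompts on which the model complied."""
    return 100.0 * sum(complied) / len(complied)


def mean_over_benchmarks(rates: Dict[str, float]) -> float:
    """Unweighted average of a per-benchmark rate."""
    return sum(rates.values()) / len(rates)


before, after = 67.8, 2.9  # ClearHarm attack success for Gemini 2.5 Flash, from the text
absolute_drop = before - after
relative_drop = 100.0 * absolute_drop / before
print(f"{absolute_drop:.1f} points, {relative_drop:.1f}% relative")
# prints "64.9 points, 95.7% relative"
```

In other words, the reported change is a drop of 64.9 percentage points, which removes roughly 96 percent of the successful attacks on that benchmark.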

Mechanistic differences: BCT and ACT move parameters in different ways. Under BCT, activation distance between clean and wrapped representations rises during training. Under ACT, cross entropy on responses does not meaningfully drop, while the activation loss falls. This divergence supports the claim that behavior level and activation level consistency optimize different internal solutions.

Key Takeaways

  1. Consistency training treats sycophancy and jailbreaks as invariance problems: the model should behave the same when irrelevant prompt text changes.
  2. Bias augmented Consistency Training aligns token outputs on wrapped prompts with responses to clean prompts using self generated targets, which avoids specification and capability staleness from old safety datasets or weaker teacher models.
  3. Activation Consistency Training aligns residual stream activations between clean and wrapped prompts on prompt tokens, building on activation patching, and improves robustness while barely changing standard supervised losses.
  4. On Gemma and Gemini model families, both methods reduce sycophancy without hurting benchmark accuracy, and outperform stale supervised finetuning that relies on responses from earlier generation models.
  5. For jailbreaks, consistency training reduces attack success while keeping many benign answers, and the research team argues that alignment pipelines should emphasize consistency across prompt transformations as much as per-prompt correctness.

Consistency training is a practical addition to current alignment pipelines because it directly addresses specification staleness and capability staleness using self-generated targets from the current model. Bias-augmented Consistency Training provides strong gains in sycophancy and jailbreak robustness, while Activation Consistency Training offers a lower-impact regularizer on residual stream activations that preserves helpfulness. Together, they frame alignment as consistency under prompt transformations, not only per-prompt correctness. Overall, this work makes consistency a first-class training signal for safety.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.
