• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Thursday, November 13, 2025
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

Anthropic’s New Research Shows Claude can Detect Injected Concepts, but only in Controlled Layers

Josh by Josh
November 1, 2025
in Al, Analytics and Automation
0
0
SHARES
1
VIEWS
Share on FacebookShare on Twitter


How do you tell whether a model is actually noticing its own internal state instead of just repeating what training data said about thinking? In a latest Anthropic’s research study ‘Emergent Introspective Awareness in Large Language Models‘ asks whether current Claude models can do more than talk about their abilities, it asks whether they can notice real changes inside their network. To remove guesswork, the research team does not test on text alone, they directly edit the model’s internal activations and then ask the model what happened. This lets them tell apart genuine introspection from fluent self description.

Method, concept injection as activation steering

The core method is concept injection, described in the Transformer Circuits write up as an application of activation steering. The researchers first capture an activation pattern that corresponds to a concept, for example an all caps style or a concrete noun, then they add that vector into the activations of a later layer while the model is answering. If the model then says, there is an injected thought that matches X, that answer is causally grounded in the current state, not in prior internet text. Anthropic research team reports that this works best in later layers and with tuned strength.

https://transformer-circuits.pub/2025/introspection/index.html

Main result, about 20 percent success with zero false positives in controls

Claude Opus 4 and Claude Opus 4.1 show the clearest effect. When the injection is done in the correct layer band and with the right scale, the models correctly report the injected concept in about 20 percent of trials. On control runs with no injection, production models do not falsely claim to detect an injected thought over 100 runs, which makes the 20 percent signal meaningful.

Separating internal concepts from user text

A natural objection is that the model could be importing the injected word into the text channel. Anthropic researchers tests this. The model receives a normal sentence, the researchers inject an unrelated concept such as bread on the same tokens, and then they ask the model to name the concept and to repeat the sentence. The stronger Claude models can do both, they keep the user text intact and they name the injected thought, which shows that internal concept state can be reported separately from the visible input stream. For agent style systems, this is the interesting part, because it shows that a model can talk about the extra state that tool calls or agents may depend on.

Prefill, using introspection to tell what was intended

Another experiment targets an evaluation problem. Anthropic prefilled the assistant message with content the model did not plan. By default Claude says that the output was not intended. When the researchers retroactively inject the matching concept into earlier activations, the model now accepts the prefilled output as its own and can justify it. This shows that the model is consulting an internal record of its previous state to decide authorship, not only the final text. That is a concrete use of introspection.

Key Takeaways

  1. Concept injection gives causal evidence of introspection: Anthropic shows that if you take a known activation pattern, inject it into Claude’s hidden layers, and then ask the model what is happening, advanced Claude variants can sometimes name the injected concept. This separates real introspection from fluent roleplay.
  2. Best models succeed only in a narrow regime: Claude Opus 4 and 4.1 detect injected concepts only when the vector is added in the right layer band and with tuned strength, and the reported success rate is around the same scale Anthropic stated, while production runs show 0 false positives in controls, so the signal is real but small.
  3. Models can keep text and internal ‘thoughts’ separate: In experiments where an unrelated concept is injected on top of normal input text, the model can both repeat the user sentence and report the injected concept, which means the internal concept stream is not just leaking into the text channel.
  4. Introspection supports authorship checks: When Anthropic prefilled outputs that the model did not intend, the model disavowed them, but if the matching concept was retroactively injected, the model accepted the output as its own. This shows the model can consult past activations to decide whether it meant to say something.
  5. This is a measurement tool, not a consciousness claim: The research team frame the work as functional, limited introspective awareness that could feed future transparency and safety evaluations, including ones about evaluation awareness, but they do not claim general self awareness or stable access to all internal features.

Anthropic’s ‘Emergent Introspective Awareness in LLMs‘ research is a useful measurement advance, not a grand metaphysical claim. The setup is clean, inject a known concept into hidden activations using activation steering, then query the model for a grounded self report. Claude variants sometimes detect and name the injected concept, and they can keep injected ‘thoughts’ distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also shows limited intentional control of internal states. Constraints remain strong, effects are narrow, and reliability is modest, so downstream use should be evaluative, not safety critical.


Check out the Paper and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

🙌 Follow MARKTECHPOST: Add us as a preferred source on Google.



Source_link

READ ALSO

Talk to Your TV — Bitmovin’s Agentic AI Hub Quietly Redefines How We Watch

How to Build a Fully Functional Custom GPT-style Conversational AI Locally Using Hugging Face Transformers

Related Posts

Talk to Your TV — Bitmovin’s Agentic AI Hub Quietly Redefines How We Watch
Al, Analytics and Automation

Talk to Your TV — Bitmovin’s Agentic AI Hub Quietly Redefines How We Watch

November 13, 2025
How to Build a Fully Functional Custom GPT-style Conversational AI Locally Using Hugging Face Transformers
Al, Analytics and Automation

How to Build a Fully Functional Custom GPT-style Conversational AI Locally Using Hugging Face Transformers

November 13, 2025
Datasets for Training a Language Model
Al, Analytics and Automation

Datasets for Training a Language Model

November 13, 2025
PR Newswire via Morningstar PR Newswire Introduces AI-Led Platform Redefining the Future of Public Relations
Al, Analytics and Automation

PR Newswire via Morningstar PR Newswire Introduces AI-Led Platform Redefining the Future of Public Relations

November 12, 2025
How to Build an End-to-End Interactive Analytics Dashboard Using PyGWalker Features for Insightful Data Exploration
Al, Analytics and Automation

How to Build an End-to-End Interactive Analytics Dashboard Using PyGWalker Features for Insightful Data Exploration

November 12, 2025
The AI Image Model That Could Redefine Visual Creativity
Al, Analytics and Automation

The AI Image Model That Could Redefine Visual Creativity

November 12, 2025
Next Post
AI researchers ’embodied’ an LLM into a robot – and it started channeling Robin Williams

AI researchers 'embodied' an LLM into a robot – and it started channeling Robin Williams

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

POPULAR NEWS

Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
7 Best EOR Platforms for Software Companies in 2025

7 Best EOR Platforms for Software Companies in 2025

June 21, 2025

EDITOR'S PICK

Marketing Works By Following Trends, Not Setting Them

Marketing Works By Following Trends, Not Setting Them

July 9, 2025
The AI Boom Faces Its Reckoning: Bank of England Sounds Alarm on a Bubble Ready to Pop

The AI Boom Faces Its Reckoning: Bank of England Sounds Alarm on a Bubble Ready to Pop

October 9, 2025
Delinea Released an MCP Server to Put Guardrails Around AI Agents Credential Access

Delinea Released an MCP Server to Put Guardrails Around AI Agents Credential Access

September 30, 2025
EQ Bank launches Business Banking designed to drive small business owners’ success in Canada

EQ Bank launches Business Banking designed to drive small business owners’ success in Canada

October 11, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • How to Compare Your AI Visibility Against Your Competitors
  • Patient Waitlist Management Software Development: key Features
  • TAG Bulletin: Q3 2025
  • How to Stand Out in the Promotions Tab With Gmail Annotations
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions

Are you sure want to unlock this post?
Unlock left : 0
Are you sure want to cancel subscription?