
OpenAI Releases Privacy Filter: A 1.5B-Parameter Open-Source PII Redaction Model with 50M Active Parameters

by Josh
April 29, 2026
in AI, Analytics and Automation


OpenAI just quietly dropped something worth paying close attention to. Released on Hugging Face under an Apache 2.0 license, Privacy Filter is an open, bidirectional token-classification model purpose-built for detecting and redacting personally identifiable information (PII) in text. It is small enough to run in a web browser or on a laptop and fast enough for high-throughput data sanitization pipelines.

What It Does

Privacy Filter is a Named Entity Recognition (NER) model, but one tuned specifically for the privacy use case. It detects eight categories of sensitive spans: account_number, private_address, private_email, private_person, private_phone, private_url, private_date, and secret. The secret category covers credential formats, project-specific token patterns, and high-entropy strings — the model card explicitly calls out missed detection of ‘novel credential formats’ and ‘secrets split across surrounding syntax’ as known failure modes, which signals what the category is trained to target.

The intended use case is clear: dev teams that need to clean datasets, scrub logs, or pre-process user-generated content before it enters a training pipeline or gets stored in a data warehouse. Because it runs on-premises and on commodity hardware, it fits squarely into the growing set of edge-deployable AI tools that organizations can adopt without routing sensitive data to a third-party API.
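To make the redaction workflow concrete, here is a minimal sketch of the scrubbing step. The span format and the `redact` helper are illustrative assumptions, not the model's actual output schema or API:

```python
# Minimal redaction sketch. Assumes detected spans arrive as
# (start, end, label) character offsets -- an illustrative format,
# not the model's actual output schema.

def redact(text, spans):
    """Replace each detected span with a [LABEL] placeholder.

    Spans are (start, end, label) with end exclusive; processed
    right-to-left so earlier offsets stay valid after replacement.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text

sample = "Contact Jane Doe at jane@example.com before Friday."
spans = [
    (8, 16, "private_person"),   # "Jane Doe"
    (20, 36, "private_email"),   # "jane@example.com"
]
print(redact(sample, spans))
# -> Contact [PRIVATE_PERSON] at [PRIVATE_EMAIL] before Friday.
```

In a real pipeline this step would run after the model's span detection, on each log line or dataset record before storage.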

The Architecture is the Real Story

Privacy Filter has 1.5 billion total parameters but only 50 million active at inference time. That roughly 30x gap is explained entirely by the model’s sparse mixture-of-experts (MoE) feed-forward design.

Architecturally, the model is ‘similar to gpt-oss, albeit of a smaller size.’ It is built on 8 pre-norm transformer blocks with a residual stream width (d_model) of 640. Attention uses grouped-query attention (GQA) with rotary positional embeddings (RoPE) — 14 query heads over 2 KV heads, meaning 7 query heads share each KV head — which reduces the memory footprint of the key-value cache significantly compared to standard multi-head attention. RoPE is also what enables the model’s 128,000-token context window. The feed-forward layers use sparse MoE with 128 total experts and top-4 routing per token: for each token, 4 of the 128 experts are activated, and all other expert parameters remain dormant. This is exactly the mechanism that produces the 30x gap between total and active parameter counts.
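The total-versus-active arithmetic is easy to sanity-check. A back-of-the-envelope sketch — only the expert count, routing, and headline parameter counts come from the release; the assumption that nearly all parameters sit in the experts is mine:

```python
# Back-of-the-envelope check of the total-vs-active parameter gap.
# 128 experts with top-4 routing come from the model description;
# the idealized split (all params in experts) is an assumption.

TOTAL_PARAMS = 1.5e9
ACTIVE_PARAMS = 50e6
NUM_EXPERTS = 128
TOP_K = 4

# Fraction of expert parameters touched per token under top-4 routing.
active_expert_fraction = TOP_K / NUM_EXPERTS   # 4/128 = 1/32

# If nearly all parameters lived in the experts, active params would be:
idealized_active = TOTAL_PARAMS * active_expert_fraction  # ~46.9M

print(f"expert fraction active per token: {active_expert_fraction:.4f}")
print(f"idealized active params: {idealized_active / 1e6:.1f}M "
      f"(reported: {ACTIVE_PARAMS / 1e6:.0f}M)")
print(f"total/active ratio: {TOTAL_PARAMS / ACTIVE_PARAMS:.0f}x")
```

The idealized ~46.9M lands close to the reported 50M, with the remainder plausibly accounted for by shared (always-active) components such as attention and embeddings.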

A Three-Phase Training Pipeline

What makes this model architecturally unusual is not just its size, but how it was built. Privacy Filter was produced in three distinct phases.

First, it was pretrained autoregressively as a standard next-token prediction language model — in the tradition of GPT-style decoders. Second, that checkpoint was architecturally converted: the language-model head was replaced with a token-classification head over the privacy label taxonomy, and the attention mechanism was switched from causal (unidirectional) to bidirectional banded attention with a band size of 128, giving each token an effective context window of 257 tokens (the token itself plus 128 on each side). Third, the converted model was post-trained with a supervised classification loss — a distinct fine-tuning phase using labeled PII data, separate from the architectural conversion step.
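The banded attention pattern from the conversion step can be sketched as a simple mask, assuming the straightforward reading that token i attends to all tokens within the band on either side (a small band is used here so the mask is easy to inspect):

```python
import numpy as np

# Sketch of a bidirectional banded attention mask. With band size B,
# token i attends to tokens j with |i - j| <= B: itself plus B tokens
# on each side, an effective window of 2*B + 1 tokens.
# (Band size 128 -> 257-token window, as described in the article.)

def banded_mask(seq_len: int, band: int) -> np.ndarray:
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= band

mask = banded_mask(seq_len=8, band=2)
print(mask.astype(int))
# Each row has at most 2*2 + 1 = 5 ones, centered on the diagonal;
# unlike a causal mask, the matrix is symmetric (bidirectional).
```

Note the contrast with the causal mask used during pretraining, which would zero out everything above the diagonal.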

The autoregressive pretraining gives the model rich language representations learned from far more data and compute than any task-specific budget would support. The architectural conversion enables bidirectional context, which is essential for NER — in ‘Alice Smith called’, the right context (‘Smith called’) makes ‘Alice’ unambiguously a person name, but with only left context available it could be missed. The supervised post-training then specializes those representations for the privacy detection task.

Compared to classical masked-language-model approaches like BERT, this is a post-training conversion of an autoregressive model rather than a native masked-LM setup — a meaningful distinction in how the base representations were formed.

Constrained Viterbi Decoding Instead of Argmax

The label scheme Privacy Filter uses is BIOES — Begin, Inside, Outside, End, Single. Each of the 8 privacy categories gets four boundary-tagged token classes (B-, I-, E-, S-) plus the background class O, yielding 33 total output classes per token. For a sequence of length T, the output logits have shape [T, 33].
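The 33-class count falls straight out of the scheme, as a quick enumeration shows (label string formatting here is illustrative; the model's exact label names may differ):

```python
# Enumerate the BIOES label space: four boundary-tagged classes per
# category (B-, I-, E-, S-) plus the single background class O.

CATEGORIES = [
    "account_number", "private_address", "private_email",
    "private_person", "private_phone", "private_url",
    "private_date", "secret",
]

LABELS = ["O"] + [
    f"{prefix}-{cat}" for cat in CATEGORIES for prefix in "BIES"
]

print(len(LABELS))   # 8 categories * 4 boundary tags + 1 = 33
print(LABELS[:5])
```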

Rather than taking a per-token argmax over those 33 logits, which could produce incoherent label sequences like B- followed immediately by S-, the model runs a constrained Viterbi decoder at inference time. The decoder uses linear-chain transition scoring and enforces valid BIOES boundary transitions. It scores complete label paths using start, transition, and end terms, along with six transition-bias parameters that control behaviors such as background persistence, span entry, span continuation, span closure, and boundary-to-boundary handoff. This global path optimization improves span coherence and boundary stability by making each token decision depend on sequence-level structure, not just local logits — particularly valuable in noisy or mixed-format text.

Those six transition-bias parameters are also user-tunable at runtime. This lets developers push toward broader, more contiguous masking for higher recall, or tighten boundaries for higher precision, without retraining the model.
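A minimal sketch of constrained BIOES decoding, using a toy one-category label set and hand-picked logits; the real model's start/end/transition scoring terms and tunable biases are omitted, so this only illustrates the hard-constraint mechanism:

```python
import numpy as np

# Sketch of constrained Viterbi decoding over a BIOES label space.
# Invalid BIOES transitions (e.g. B- followed directly by S-) score
# -inf, so the best path is always a coherent sequence of spans.

def allowed(prev: str, nxt: str) -> bool:
    pt, nt = prev[0], nxt[0]                 # tag letters
    pc = prev[2:] if pt != "O" else None     # category of each label
    nc = nxt[2:] if nt != "O" else None
    if pt in ("O", "E", "S"):                # outside any open span
        return nt in ("O", "B", "S")
    # pt in ("B", "I"): inside a span -- must continue or close it
    return nt in ("I", "E") and nc == pc

def viterbi(logits: np.ndarray, labels: list) -> list:
    T, L = logits.shape
    trans = np.full((L, L), -np.inf)
    for i, p in enumerate(labels):
        for j, n in enumerate(labels):
            if allowed(p, n):
                trans[i, j] = 0.0
    # Valid starts: O, B-, S-; valid ends: O, E-, S-.
    start = np.array([0.0 if l[0] in "OBS" else -np.inf for l in labels])
    end = np.array([0.0 if l[0] in "OES" else -np.inf for l in labels])
    score = start + logits[0]
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans        # [prev, next]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logits[t]
    path = [int((score + end).argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return [labels[i] for i in reversed(path)]

labels = ["O", "B-secret", "I-secret", "E-secret", "S-secret"]
logits = np.array([
    [0.0, 2.0, 0.0, 0.0, 1.9],   # B- narrowly beats S- here
    [0.0, 0.0, 0.1, 0.0, 3.0],   # argmax picks S-: invalid after B-
    [1.0, 0.0, 0.0, 0.5, 0.0],
])
print(viterbi(logits, labels))
# -> ['S-secret', 'S-secret', 'O']
```

Per-token argmax on these logits would emit the incoherent sequence B-, S-, O; the constrained decoder instead settles on two valid single-token spans.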

Key Takeaways

  • OpenAI released Privacy Filter, an open-source PII redaction model under Apache 2.0, capable of detecting eight sensitive span categories including account_number, private_person, secret, and more — deployable on-premises without routing data to an external API.
  • The model has 1.5B total parameters but only 50M active at inference, thanks to a sparse MoE feed-forward design with 128 experts and top-4 routing per token — making it lightweight enough to run in a browser or on a laptop.
  • The backbone is architecturally similar to gpt-oss: 8 pre-norm transformer blocks, d_model=640, grouped-query attention with RoPE, and a sparse MoE FFN — first pretrained autoregressively, then converted to a bidirectional banded attention encoder, then post-trained with a supervised classification loss.
  • At inference, it runs constrained Viterbi decoding over a BIOES label scheme rather than per-token argmax, producing coherent span boundaries with six tunable transition-bias parameters that let engineers adjust the precision/recall tradeoff at runtime without retraining.

Check out the Model Weights.
