• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Thursday, July 3, 2025
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

Getting Started with Microsoft’s Presidio: A Step-by-Step Guide to Detecting and Anonymizing Personally Identifiable Information PII in Text

Josh by Josh
June 24, 2025
in Al, Analytics and Automation
0
Getting Started with Microsoft’s Presidio: A Step-by-Step Guide to Detecting and Anonymizing Personally Identifiable Information PII in Text
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter


In this tutorial, we will explore how to use Microsoft’s Presidio, an open-source framework designed for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.

We will cover how to:

  • Set up and install the necessary Presidio packages
  • Detect common PII entities such as names, phone numbers, and credit card details
  • Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
  • Create and register custom anonymizers (like hashing or pseudonymization)
  • Reuse anonymization mappings for consistent re-anonymization

Installing the libraries

To get started with Presidio, you’ll need to install the following key libraries:

  • presidio-analyzer: This is the core library responsible for detecting PII entities in text using built-in and custom recognizers.
  • presidio-anonymizer: This library provides tools to anonymize (e.g., redact, replace, hash) the detected PII using configurable operators.
  • spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for natural language processing tasks like named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for English-language PII detection.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

You might need to restart the session to install the libraries, if you are using Jupyter/Colab.

Presidio Analyzer

Basic PII Detection

In this block, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number from a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.

The AnalyzerEngine loads spaCy’s NLP pipeline and predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.

import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)

Creating a Custom PII Recognizer with a Deny List (Academic Titles)

This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., “Dr.”, “Prof.”). The recognizer is added to Presidio’s registry and used by the analyzer to scan input text.
While this tutorial covers only the deny list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For those advanced methods, refer to the official docs: Adding Custom Recognizers.

Presidio Analyzer

Basic PII Detection

In this block, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number from a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.

The AnalyzerEngine loads spaCy’s NLP pipeline and predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.

import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)

Creating a Custom PII Recognizer with a Deny List (Academic Titles)

This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., “Dr.”, “Prof.”). The recognizer is added to Presidio’s registry and used by the analyzer to scan input text.
While this tutorial covers only the deny list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For those advanced methods, refer to the official docs: Adding Custom Recognizers.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")

for result in results:
    print(result)

Presidio Anonymizer

This code block demonstrates how to use the Presidio Anonymizer Engine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating output from the Presidio Analyzer. These entities represent the names “Bond” and “James Bond” in the sample text.

We use the “replace” operator to substitute both names with a placeholder value (“BIP”), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization strategy (replace) to the AnonymizerEngine.

This pattern can be easily extended to apply other built-in operations like “redact”, “hash”, or custom pseudonymization strategies.

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine:
engine = AnonymizerEngine()

# Invoke the anonymize function with the text, 
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)

print(result)

Custom Entity Recognition, Hash-Based Anonymization, and Consistent Re-Anonymization with Presidio

In this example, we take Presidio a step further by demonstrating:

  • ✅ Defining custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers
  • 🔐 Anonymizing sensitive data using a custom hash-based operator (ReAnonymizer)
  • ♻️ Re-anonymizing the same values consistently across multiple texts by maintaining a mapping of original → hashed values

We implement a custom ReAnonymizer operator that checks if a given value has already been hashed and reuses the same output to preserve consistency. This is particularly useful when anonymized data needs to retain some utility — for example, linking records by pseudonymous IDs.

Define a Custom Hash-Based Anonymizer (ReAnonymizer)

This block defines a custom Operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities and ensures the same input always gets the same anonymized output by storing hashes in a shared mapping.

from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict

class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")

        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")

        # Check if already hashed
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]

        # Hash and store
        hashed = "<HASH_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

Define Custom PII Recognizers for PAN and Aadhaar Numbers

We define two custom regex-based PatternRecognizers — one for Indian PAN numbers and one for Aadhaar numbers. These will detect custom PII entities in your text.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)

Set Up Analyzer and Anonymizer Engines

Here we set up the Presidio AnalyzerEngine, register the custom recognizers, and add the custom anonymizer to the AnonymizerEngine.

from presidio_anonymizer import AnonymizerEngine, OperatorConfig

# Initialize analyzer and register custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)

# Initialize anonymizer and add custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)

# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}

Analyze and Anonymize Input Texts

We analyze two separate texts that both include the same PAN and Aadhaar values. The custom operator ensures they’re anonymized consistently across both inputs.

from pprint import pprint

# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# Analyze and anonymize second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

View Anonymization Results and Mapping

Finally, we print both anonymized outputs and inspect the mapping used internally to maintain consistent hashes across values.

print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)

print("\n📦 Mapping used:")
pprint(entity_mapping)

Check out the Codes. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.



Source_link

READ ALSO

DeepSeek R1T2 Chimera: 200% Faster Than R1-0528 With Improved Reasoning and Compact Output

Confronting the AI/energy conundrum

Related Posts

DeepSeek R1T2 Chimera: 200% Faster Than R1-0528 With Improved Reasoning and Compact Output
Al, Analytics and Automation

DeepSeek R1T2 Chimera: 200% Faster Than R1-0528 With Improved Reasoning and Compact Output

July 3, 2025
Confronting the AI/energy conundrum
Al, Analytics and Automation

Confronting the AI/energy conundrum

July 3, 2025
Baidu Open Sources ERNIE 4.5: LLM Series Scaling from 0.3B to 424B Parameters
Al, Analytics and Automation

Baidu Open Sources ERNIE 4.5: LLM Series Scaling from 0.3B to 424B Parameters

July 2, 2025
Novel method detects microbial contamination in cell cultures | MIT News
Al, Analytics and Automation

Novel method detects microbial contamination in cell cultures | MIT News

July 2, 2025
Baidu Researchers Propose AI Search Paradigm: A Multi-Agent Framework for Smarter Information Retrieval
Al, Analytics and Automation

Baidu Researchers Propose AI Search Paradigm: A Multi-Agent Framework for Smarter Information Retrieval

July 2, 2025
Merging design and computer science in creative ways | MIT News
Al, Analytics and Automation

Merging design and computer science in creative ways | MIT News

July 1, 2025
Next Post
A New Way to Reach the Decision-Makers That Matter Most

A New Way to Reach the Decision-Makers That Matter Most

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

POPULAR NEWS

Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
7 Best EOR Platforms for Software Companies in 2025

7 Best EOR Platforms for Software Companies in 2025

June 21, 2025
Eating Bugs – MetaDevo

Eating Bugs – MetaDevo

May 29, 2025
Top B2B & Marketing Podcasts to Lead You to Succeed in 2025 – TopRank® Marketing

Top B2B & Marketing Podcasts to Lead You to Succeed in 2025 – TopRank® Marketing

May 30, 2025
Entries For The Elektra Awards 2025 Are Now Open!

Entries For The Elektra Awards 2025 Are Now Open!

May 30, 2025

EDITOR'S PICK

Signature Gala Returns to Raise Funds to Invest in Youth Education

Signature Gala Returns to Raise Funds to Invest in Youth Education

June 2, 2025
12 Christmas Images for the Holidays (+ 12 FREE Christmas Templates)

12 Christmas Images for the Holidays (+ 12 FREE Christmas Templates)

June 10, 2025
Still Marketing to Americans? Here are Direct Mail Trends in 2025

Still Marketing to Americans? Here are Direct Mail Trends in 2025

May 30, 2025
ChatGPT and Claude privacy: Why AI makes surveillance everyone’s issue

ChatGPT and Claude privacy: Why AI makes surveillance everyone’s issue

June 18, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Squid Game X Script (No Key, Auto Win, Glass Marker)
  • DeepSeek R1T2 Chimera: 200% Faster Than R1-0528 With Improved Reasoning and Compact Output
  • Google’s customizable Gemini chatbots are now in Docs, Sheets, and Gmail
  • 24 Effective Ways to Drive Website Traffic in 2025 (Complete Guide)
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions

Are you sure want to unlock this post?
Unlock left : 0
Are you sure want to cancel subscription?