Monday, June 29, 2026
mGrowTech

OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF/A Files with Sidecar Text Extraction and Batch Processing

by Josh
June 29, 2026
in AI, Analytics and Automation


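import logging
import os
import re
import subprocess
import sys
import textwrap
import time
from pathlib import Path

def sh(cmd):
   # Assumed helper: this excerpt calls sh() without defining it, so a
   # minimal shell runner is sketched here; the original may differ.
   return subprocess.run(cmd, shell=True, check=True)

# Drop cached modules so the next import reloads them from disk.
def _purge(*prefixes):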
   for name in [m for m in list(sys.modules)
                if any(m == p or m.startswith(p + ".") for p in prefixes)]:
       del sys.modules[name]
def _load_ocrmypdf():
   _purge("PIL", "ocrmypdf")
   import ocrmypdf
   return ocrmypdf
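# If the import fails because the session holds a stale Pillow build,
# reinstall Pillow and retry the import once before asking for a restart.
try: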
   ocrmypdf = _load_ocrmypdf()
except ImportError as e:
   if "_Ink" in str(e) or "PIL" in str(e):
       print("Repairing an incompatible Pillow (reinstalling pillow<12)...")
       sh(f'"{sys.executable}" -m pip install -q --force-reinstall "pillow<12"')
       try:
           ocrmypdf = _load_ocrmypdf()
           print("Pillow repaired — continuing without a restart.")
       except Exception:
           raise RuntimeError(
               "Pillow is still incompatible in this session. Use the Colab menu: "
               "Runtime > Restart session, then run this cell again."
           )
   else:
       raise
from ocrmypdf.exceptions import (
   ExitCode,
   PriorOcrFoundError,
   EncryptedPdfError,
   MissingDependencyError,
   TaggedPDFError,
   DigitalSignatureError,
   DpiError,
   InputFileError,
   UnsupportedImageFormatError,
)
from ocrmypdf.helpers import check_pdf
from ocrmypdf.pdfa import file_claims_pdfa
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ImageFilter
logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")
logging.getLogger("ocrmypdf").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.ERROR)
logging.getLogger("PIL").setLevel(logging.WARNING)
SAMPLE_TEXT_PAGES = [
   "Optical Character Recognition, commonly abbreviated as OCR, is the "
   "process of converting images of typed or printed text into machine "
   "encoded text. This page was generated as a synthetic scan so that the "
   "OCRmyPDF pipeline has something realistic to recognize and search.",
   "On 14 March 2026 the archive contained 1,482 pages across 37 folders. "
   "Roughly 92 percent of those pages were scanned at 200 to 300 dots per "
   "inch. The remaining 8 percent were skewed and required deskewing before "
   "any reliable recognition was possible.",
   "After OCRmyPDF finishes, the output is a searchable PDF/A file. You can "
   "select text, copy it, and run full text search across thousands of "
   "documents. The original image resolution is preserved while a hidden "
   "text layer is placed accurately underneath the page image.",
]
def _find_font():
   for cand in (
       "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
       "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf",
   ):
       if os.path.exists(cand):
           return cand
   return None
_FONT_PATH = _find_font()
FONT = ImageFont.truetype(_FONT_PATH, 40) if _FONT_PATH else ImageFont.load_default()
def _add_speckle(img, n=6000, dark=60):
   """Sprinkle sparse dark specks to imitate scanner noise (motivates --clean)."""
   import random
   px = img.load()
   w, h = img.size
   for _ in range(n):
       px[random.randint(0, w - 1), random.randint(0, h - 1)] = random.randint(0, dark)
   return img
def render_page(text, skew=False):
   """Render one A4 page (1654x2339 px ≈ 200 DPI) of dark text on white."""
   W, H = 1654, 2339
   img = Image.new("L", (W, H), 255)
   draw = ImageDraw.Draw(img)
   draw.multiline_text((150, 180), textwrap.fill(text, width=58),
                       fill=25, font=FONT, spacing=18)
   if skew:
       img = img.rotate(6, resample=Image.BICUBIC, expand=False, fillcolor=255)
       img = img.filter(ImageFilter.GaussianBlur(0.6))
       img = _add_speckle(img)
   return img
def build_scanned_pdf(pdf_path: Path, pages_text, skew_index=1):
   """Render pages to PNGs and wrap them losslessly into an image-only PDF."""
   pngs = []
   for i, text in enumerate(pages_text):
       img = render_page(text, skew=(i == skew_index))
       p = pdf_path.parent / f"_pg_{pdf_path.stem}_{i}.png"
       img.save(p, format="PNG", dpi=(200, 200))
       pngs.append(str(p))
   with open(pdf_path, "wb") as f:
       f.write(img2pdf.convert(pngs))
   for p in pngs:
       os.remove(p)
   return pdf_path
def do_ocr(input_file, output_file, **kw):
   """Wrapper around ocrmypdf.ocr() that disables the progress bar and times it."""
   kw.setdefault("progress_bar", False)
   t0 = time.perf_counter()
   rc = ocrmypdf.ocr(input_file, output_file, **kw)
   return rc, time.perf_counter() - t0
def tokens(s: str):
   return re.findall(r"[a-z0-9]+", s.lower())
def kb(path) -> str:
   return f"{Path(path).stat().st_size / 1024:,.1f} KB"
def banner(title: str):
   line = "─" * 74
   print(f"\n{line}\n  {title}\n{line}")
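
The `_purge` helper near the top of the listing works by deleting entries from `sys.modules`, so the next `import` statement reloads the package instead of reusing a cached module. Here is a minimal sketch of that idea in isolation. The module names `fakepkg`, `fakepkg.sub`, and `fakepkg_extra` are hypothetical stand-ins, and the returned list of removed names is an addition for illustration that the original helper does not have.

```python
import sys
import types


def purge(*prefixes):
    """Remove cached modules whose name equals or falls under any prefix."""
    doomed = [m for m in list(sys.modules)
              if any(m == p or m.startswith(p + ".") for p in prefixes)]
    for name in doomed:
        del sys.modules[name]
    return doomed


# Register two stand-in modules (hypothetical names) plus a look-alike
# that shares the prefix text but belongs to a different package.
for name in ("fakepkg", "fakepkg.sub", "fakepkg_extra"):
    sys.modules[name] = types.ModuleType(name)

removed = purge("fakepkg")
print(sorted(removed))                 # the package and its submodule
print("fakepkg_extra" in sys.modules)  # the look-alike is left alone
```

Matching on `p + "."` rather than on a bare prefix is what keeps unrelated packages with similar names in the cache.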

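
The `tokens()` helper suggests that recognized text is later compared against the known `SAMPLE_TEXT_PAGES` strings. One possible way to score such a comparison is sketched below, assuming a multiset token recall metric. The `token_recall` function and the sample "OCR" string are hypothetical and do not come from the original tutorial.

```python
import re
from collections import Counter


def tokens(s: str):
    # Same normalization as the tutorial helper: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", s.lower())


def token_recall(truth: str, recognized: str) -> float:
    """Fraction of ground-truth tokens that also appear in the OCR text.

    Hypothetical metric for illustration. Counts are multiset-aware, so a
    word that occurs twice in the truth must be recognized twice.
    """
    want = Counter(tokens(truth))
    got = Counter(tokens(recognized))
    total = sum(want.values())
    if total == 0:
        return 1.0
    matched = sum((want & got).values())  # per-token minimum of both counts
    return matched / total


truth = "Roughly 92 percent of those pages were scanned at 200 to 300 dots per inch."
# Made-up OCR output with two typical misreads ("0f" and "3OO").
ocr_text = "Roughly 92 percent 0f those pages were scanned at 200 to 3OO dots per inch."
print(f"recall = {token_recall(truth, ocr_text):.2%}")
```

A token-level score like this ignores punctuation and case, which makes it tolerant of layout differences but blind to word order.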

