Arsalan Younus.

OCR Post-Processing (BeamSearch + LM)

Context-aware OCR error correction that lifts word-level accuracy on noisy documents, improving downstream extraction and search quality.

The Business Problem

Raw OCR output contained spelling and context errors that hurt downstream extraction and search. Simple spellcheckers were not domain-aware and either over-corrected or missed medical and insurance terminology.

Low post-OCR quality led to wrong entities and failed lookups in downstream systems.

The Technical Solution

I built a post-processing stage that runs beam search over candidate corrections and scores them with a language model. The system applies context-aware corrections and adjusts confidence scores dynamically: high-confidence OCR tokens are preserved, and low-confidence ones are corrected using the LM.
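As a rough illustration of the idea, the sketch below runs a small beam search over OCR tokens. It is not the production code: the names `beam_search_correct`, `candidates_for`, `lm_logprob`, `keep_threshold`, and `edit_penalty` are all hypothetical, and a toy bigram table stands in for the real language model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Token = Tuple[str, float]  # (OCR word, OCR confidence in [0, 1])


def beam_search_correct(
    tokens: Sequence[Token],
    candidates_for: Callable[[str], List[str]],
    lm_logprob: Callable[[List[str], str], float],
    beam_width: int = 3,
    keep_threshold: float = 0.9,   # assumed cutoff, not from the project
    edit_penalty: float = 1.0,     # assumed cost of overriding the OCR
) -> List[str]:
    """Return the best-scoring word sequence under LM score minus edit cost."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for word, conf in tokens:
        if conf >= keep_threshold:
            # High-confidence tokens are preserved as recognized.
            options = [word]
        else:
            # Low-confidence tokens compete with their correction candidates.
            options = [word] + [c for c in candidates_for(word) if c != word]
        expanded = []
        for history, score in beams:
            for option in options:
                new_score = score + lm_logprob(history, option)
                if option != word:
                    # Changing a token costs more the surer the OCR was.
                    new_score -= edit_penalty * conf
                expanded.append((history + [option], new_score))
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]


# Toy bigram model standing in for the real LM (illustrative values only).
BIGRAMS: Dict[Tuple[str, str], float] = {
    ("patient", "claim"): 0.5,
    ("claim", "denied"): 0.6,
}


def toy_lm_logprob(history: List[str], word: str) -> float:
    prev = history[-1] if history else "<s>"
    return math.log(BIGRAMS.get((prev, word), 1e-4))


def toy_candidates(word: str) -> List[str]:
    return {"clairn": ["claim"]}.get(word, [])


corrected = beam_search_correct(
    [("patient", 0.98), ("clairn", 0.40), ("denied", 0.95)],
    toy_candidates,
    toy_lm_logprob,
)
print(corrected)
```

In this sketch the confidence-scaled penalty is what keeps the LM from over-correcting: a replacement has to gain more language-model score than it loses by overriding the recognizer.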

NLTK and Transformers provide the language model backbone; the pipeline slots in after recognition and before codification.

The Scalability Factor

The corrector runs as a Python service inside the OCR pipeline, with no external API dependencies. Because processing is stateless, document batches can be corrected in parallel at production volume.
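A minimal sketch of what stateless parallel processing can look like, assuming each document is corrected by a pure function. `correct_document` and `correct_batch` are hypothetical names, and the lookup table is a placeholder for the real corrector.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def correct_document(text: str) -> str:
    """Placeholder corrector: depends only on its input, keeps no state."""
    fixes = {"clairn": "claim", "pat1ent": "patient"}
    return " ".join(fixes.get(word, word) for word in text.split())


def correct_batch(
    documents: Sequence[str],
    correct: Callable[[str], str] = correct_document,
    executor_cls=ThreadPoolExecutor,
    max_workers: int = 4,
) -> List[str]:
    """Correct documents concurrently; results keep the input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(correct, documents))


results = correct_batch(["pat1ent clairn denied", "no errors here"])
print(results)
```

Because no document depends on another, the executor is interchangeable: threads are used here for simplicity, and for CPU-bound model scoring `concurrent.futures.ProcessPoolExecutor` could be passed as `executor_cls` instead.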

Business Impact

Word-level accuracy on noisy documents improved from 94% to 96%, cutting the share of incorrect words from 6% to 4%, a one-third reduction. Extraction and search quality improved as well.
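How the accuracy figure was measured is not stated here. One common approach, sketched below as an assumption, aligns the corrected output against a ground-truth transcript and counts the reference words that survive intact. `word_accuracy` is a hypothetical helper built on the standard library's `difflib`.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words matched, in order, by the hypothesis."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


score = word_accuracy(
    "patient claim denied today",
    "patient clairn denied today",
)
print(score)
```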

Reduced manual correction load across the document processing pipeline.

Built with

Beam Search
Language Models
NLP
NLTK
Transformers
Python