End-to-End Document Intelligence Pipeline
A production platform that turns messy scanned medical and insurance documents into clean, structured data, reaching 84% end-to-end accuracy and replacing hours of manual extraction per batch.
The Business Problem
Thousands of medical and insurance documents arrived as poor-quality scans mixing printed forms, handwriting, checkboxes, and multi-page records. No off-the-shelf OCR could handle the full journey from noisy image to validated, structured data.
Each failure mode (blur, shadows, handwriting variety, OCR typos, inconsistent formats, cross-page conflicts) needed a dedicated solution. The client needed all stages working together as one reliable production system, not a collection of disconnected scripts.
The Technical Solution
I designed the platform as eight specialized stages. A CNN noise classifier routes each scan to the matching U-Net denoising pipeline. DBNET/EAST models localize handwriting, and a MobileNetV3 detector finds handcheck marks (checkboxes, ticks, and crosses). CRNN and trOCR models recognize printed and handwritten text. Beam search with language models corrects OCR errors. A codification engine normalizes dates, measurements, and domain codes. Finally, an interpage correlation system cross-references entities across pages and flags conflicts for review.
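To make the OCR correction stage concrete, here is a minimal sketch of beam search decoding that combines per-character OCR confidences with a language model score. It is an illustration under stated assumptions, not the production code: the character-bigram model, the tiny training vocabulary, the candidate lattice, and the names `train_char_bigram_lm`, `greedy_decode`, and `beam_search_decode` are all hypothetical, and a real system would use a much stronger language model over real recognizer output.

```python
import math


def train_char_bigram_lm(words, alpha=0.1):
    """Build a smoothed character-bigram model; '^' marks the start of a word."""
    counts = {}
    alphabet = set()
    for word in words:
        prev = "^"
        for ch in word:
            counts.setdefault(prev, {})
            counts[prev][ch] = counts[prev].get(ch, 0) + 1
            alphabet.add(ch)
            prev = ch

    def log_prob(prev, ch):
        following = counts.get(prev, {})
        total = sum(following.values())
        # Add-alpha smoothing keeps unseen pairs possible but unlikely.
        return math.log((following.get(ch, 0) + alpha) / (total + alpha * len(alphabet)))

    return log_prob


def greedy_decode(lattice):
    """Baseline: take the most confident OCR candidate at every position."""
    return "".join(max(cands, key=lambda c: c[1])[0] for cands in lattice)


def beam_search_decode(lattice, lm_log_prob, beam_width=3, lm_weight=1.0):
    """Keep the best `beam_width` partial strings, scored by OCR + LM log-probability."""
    beams = [("", 0.0)]
    for candidates in lattice:
        expanded = []
        for text, score in beams:
            prev = text[-1] if text else "^"
            for ch, ocr_prob in candidates:
                new_score = score + math.log(ocr_prob) + lm_weight * lm_log_prob(prev, ch)
                expanded.append((text + ch, new_score))
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]


# Hypothetical domain vocabulary and OCR output for the word "claim",
# where the recognizer prefers the digit "1" over the letter "l".
lm = train_char_bigram_lm(["claim", "claims", "clinic", "patient", "policy", "date"])
lattice = [
    [("c", 0.9), ("e", 0.1)],
    [("1", 0.6), ("l", 0.4)],
    [("a", 0.9), ("o", 0.1)],
    [("i", 0.9), ("l", 0.1)],
    [("m", 0.8), ("n", 0.2)],
]
greedy = greedy_decode(lattice)
corrected = beam_search_decode(lattice, lm)
print(greedy, "->", corrected)
```

On this toy input the greedy reading is "c1aim", while the beam search returns "claim": the language model penalizes the unlikely "c1" pair enough to outweigh the recognizer's slightly higher confidence in the digit.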
Every stage is independently deployable, so each one can be retrained, scaled, and improved without touching the rest of the pipeline.
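One way to picture this independence is a shared stage contract: every stage takes a document and returns a document, so any stage can be swapped without changing its neighbours. The sketch below is a simplified in-process illustration, assuming a hypothetical `Document` record and stub stages (`classify_noise`, `denoise`) in place of the real CNN and U-Net services; in production each stage runs as its own service rather than as a function call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    image: str                      # stand-in for pixel data
    noise_type: str = "unknown"     # filled in by the classifier stage
    trace: List[str] = field(default_factory=list)  # stages that changed the document


Stage = Callable[[Document], Document]


def classify_noise(doc: Document) -> Document:
    """Stub for the CNN classifier: a keyword lookup instead of a model."""
    doc.noise_type = "clean"
    for label in ("blur", "shadow"):
        if label in doc.image:
            doc.noise_type = label
            break
    doc.trace.append("classify_noise")
    return doc


# One denoiser per noise type; each entry stands in for a separate U-Net pipeline.
DENOISERS: Dict[str, Callable[[str], str]] = {
    "blur": lambda image: image.replace("blur", "sharp"),
    "shadow": lambda image: image.replace("shadow", "even-light"),
}


def denoise(doc: Document) -> Document:
    """Route to the denoiser chosen by the classifier; clean scans pass through."""
    cleaner = DENOISERS.get(doc.noise_type)
    if cleaner is not None:
        doc.image = cleaner(doc.image)
        doc.trace.append(f"denoise[{doc.noise_type}]")
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply the stages in order; each one only depends on the Document contract."""
    for stage in stages:
        doc = stage(doc)
    return doc


stages: List[Stage] = [classify_noise, denoise]
blurred = run_pipeline(Document(image="blur scan"), stages)
clean = run_pipeline(Document(image="scan"), stages)
print(blurred.image, blurred.trace)
print(clean.image, clean.trace)
```

Because the runner only knows the `Stage` signature, replacing `denoise` with a retrained version means changing one entry in the `stages` list, which mirrors how a single stage can be redeployed on its own.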
The Scalability Factor
All eight stages run as containerized services on AWS Kubernetes with Jenkins and GitHub Actions CI/CD. Each stage has its own deployment pipeline, so model updates roll out independently with zero-downtime rolling deploys.
Kubernetes auto-scaling handles daily document volume spikes. Per-stage health checks and CloudWatch monitoring catch failures before they cascade downstream. The modular design later allowed an LLM extraction layer to slot on top of the same infrastructure, raising accuracy to 88%.
Business Impact
The platform reached 84% end-to-end extraction accuracy in production, replacing hours of manual extraction per document batch.
It processes high document volumes daily on AWS Kubernetes, and its modular design let the later LLM-based extraction layer slot in on top, raising accuracy from 84% to 88%.
Pipeline Stages
Each stage is a standalone system with its own case study.
- 1. Document Noise Classification
  Intelligent routing of scanned documents to the right denoising pipeline, improving OCR accuracy on poor-quality scans without wasted preprocessing.
  CNN, Image Classification, PyTorch
- 2. Document Denoising Pipeline
  Deep learning preprocessing that cleans noisy scans while preserving text clarity, significantly improving downstream OCR accuracy on poor-quality documents.
  U-Net, Deep Learning, PyTorch
- 3. Handwriting Localization (DBNET)
  Automated detection of handwritten text regions in mixed print-and-handwriting documents, feeding accurate regions to downstream OCR instead of whole-page guesses.
  DBNET, EAST, PyTorch
- 4. Handcheck Detection
  High-volume automated detection of checkboxes, ticks, and crosses in forms, replacing manual review of handcheck fields at scale.
  MobileNetV3, Object Detection, PyTorch
- 5. OCR Text Recognition
  Domain-tuned text recognition for both printed and handwritten content, removing recognition accuracy as a bottleneck for the end-to-end pipeline.
  CRNN, trOCR, Transformers
- 6. OCR Post-Processing (BeamSearch + LM)
  Context-aware OCR error correction that lifts word-level accuracy on noisy documents, improving downstream extraction and search quality.
  Beam Search, Language Models, NLP
- 7. OCR Value Codification
  Centralized normalization engine that converts raw OCR text into standardized dates, measurements, and domain codes, eliminating scattered parsing logic across the pipeline.
  NLP, Regex, Python
- 8. Interpage Document Correlation
  Automated cross-page entity validation for multi-page medical and insurance forms, catching inconsistencies before they reach downstream systems.
  NLP, Entity Resolution, Python
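The last two stages work as a pair: values are codified into one standard form first, which makes cross-page comparison a simple equality check. The sketch below shows that idea under stated assumptions. The field names, the accepted date formats (day-first for slash-separated dates), the weight units, and the sample pages are all made up for illustration, and the functions `normalize_date`, `normalize_weight_kg`, `codify_page`, and `find_conflicts` are hypothetical rather than the engine's actual API.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed input formats; slash-separated dates are treated as day-first.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")
WEIGHT = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*(kg|lbs|lb|g)\s*$", re.IGNORECASE)
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "lbs": 0.45359237}


def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_weight_kg(raw):
    """Return the weight in kilograms rounded to 2 decimals, or None if unparseable."""
    match = WEIGHT.match(raw)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    return round(value * TO_KG[match.group(2).lower()], 2)


NORMALIZERS = {"date_of_birth": normalize_date, "weight": normalize_weight_kg}


def codify_page(raw_fields):
    """Normalize every known field on one page; unknown fields are dropped."""
    return {
        name: NORMALIZERS[name](value)
        for name, value in raw_fields.items()
        if name in NORMALIZERS
    }


def find_conflicts(pages):
    """Map each field with more than one distinct value to {value: [page numbers]}."""
    seen = defaultdict(lambda: defaultdict(list))
    for page_no, fields in enumerate(pages, start=1):
        for name, value in fields.items():
            if value is not None:
                seen[name][value].append(page_no)
    return {name: dict(values) for name, values in seen.items() if len(values) > 1}


# Made-up three-page record: page 3 disagrees on the date of birth.
raw_pages = [
    {"date_of_birth": "03/07/1985", "weight": "72,5 kg"},
    {"date_of_birth": "1985-07-03", "weight": "72.5kg"},
    {"date_of_birth": "03/08/1985"},
]
pages = [codify_page(page) for page in raw_pages]
conflicts = find_conflicts(pages)
print(conflicts)
```

Pages 1 and 2 write the same date and weight differently, so they agree once codified. Only page 3's date of birth is reported, together with the page numbers a reviewer would need to check.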