Arsalan Younus.

End-to-End Document Intelligence Pipeline

A production platform that turns messy scanned medical and insurance documents into clean, structured data, reaching 84% end-to-end accuracy and replacing hours of manual extraction per batch.

The Business Problem

Thousands of medical and insurance documents arrived as poor-quality scans mixing printed forms, handwriting, checkboxes, and multi-page records. No off-the-shelf OCR could handle the full journey from noisy image to validated, structured data.

Each failure mode (blur, shadows, handwriting variety, OCR typos, inconsistent formats, cross-page conflicts) needed a dedicated solution. The client needed all stages working together as one reliable production system, not a collection of disconnected scripts.

The Technical Solution

I designed the platform as eight specialized stages: a CNN noise classifier routes each scan to the matching U-Net denoising pipeline; DBNet and EAST models localize handwriting; a MobileNetV3 detector finds handcheck marks; CRNN and TrOCR models recognize printed and handwritten text; beam search with language models corrects OCR errors; a codification engine normalizes dates, measurements, and domain codes; and an interpage correlation system cross-references entities and flags conflicts for review.

Every stage is independently deployable, so each one can be retrained, scaled, and improved without touching the rest of the pipeline.
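As a rough illustration of how the stages chain together, here is a minimal in-process sketch. It is not the production code: the real stages run as separate services, and every name below (`Document`, `make_stage`, `DENOISERS`, `run_pipeline`, the `blur`/`shadow` labels) is a hypothetical placeholder. It only shows the ordering of the eight stages and the classifier-driven routing to a denoiser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical record handed from stage to stage."""
    doc_id: str
    noise_type: str = "unknown"                     # set by the classifier stage
    trace: List[str] = field(default_factory=list)  # names of stages applied, in order


Stage = Callable[[Document], Document]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def run(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return run


def classify_noise(doc: Document) -> Document:
    # Stand-in for the CNN classifier: a real model would inspect pixels,
    # here the label is faked from the document id (assumption for the demo).
    doc.noise_type = "blur" if "blur" in doc.doc_id else "shadow"
    doc.trace.append("classify_noise")
    return doc


# One denoiser per noise class (stand-ins for the U-Net pipelines).
DENOISERS: Dict[str, Stage] = {
    "blur": make_stage("denoise_blur"),
    "shadow": make_stage("denoise_shadow"),
}


def route_denoise(doc: Document) -> Document:
    """Send the scan to the denoiser that matches its predicted noise type."""
    return DENOISERS[doc.noise_type](doc)


PIPELINE: List[Stage] = [
    classify_noise,
    route_denoise,
    make_stage("localize_handwriting"),
    make_stage("detect_handchecks"),
    make_stage("recognize_text"),
    make_stage("correct_text"),
    make_stage("codify_values"),
    make_stage("correlate_pages"),
]


def run_pipeline(doc: Document, stages: List[Stage] = PIPELINE) -> Document:
    """Apply each stage in order; any stage can be swapped without touching the others."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    print(run_pipeline(Document("scan_blur_001")).trace)
```

Because each stage shares the same `Document -> Document` shape in this sketch, replacing one entry in `PIPELINE` leaves the others untouched, which mirrors the independent-deployment idea described above.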

The Scalability Factor

All eight stages run as containerized services on Kubernetes on AWS, with CI/CD through Jenkins and GitHub Actions. Each stage has its own deployment pipeline, so model updates roll out independently with zero-downtime rolling deploys.

Kubernetes auto-scaling handles daily document volume spikes. Per-stage health checks and CloudWatch monitoring catch failures before they cascade downstream. The modular design later allowed an LLM extraction layer to slot on top of the same infrastructure, raising accuracy to 88%.

Business Impact

Reached 84% end-to-end extraction accuracy in production, replacing hours of manual extraction per document batch.

The platform processes high document volumes daily on AWS Kubernetes; its modular design allowed the LLM-based extraction layer to slot in on top and raise accuracy to 88%.

Built with

PyTorch
OpenCV
Transformers
CRNN
DBNet
U-Net
NLP
AWS
Kubernetes
Docker

Pipeline Stages

Each stage is a standalone system with its own case study.

  1. Document Noise Classification

    Intelligent routing of scanned documents to the right denoising pipeline, improving OCR accuracy on poor-quality scans without wasted preprocessing.

    CNN
    Image Classification
    PyTorch
  2. Document Denoising Pipeline

    Deep learning preprocessing that cleans noisy scans while preserving text clarity, significantly improving downstream OCR accuracy on poor-quality documents.

    U-Net
    Deep Learning
    PyTorch
  3. Handwriting Localization (DBNet)

    Automated detection of handwritten text regions in mixed print-and-handwriting documents, feeding accurate regions to downstream OCR instead of whole-page guesses.

    DBNet
    EAST
    PyTorch
  4. Handcheck Detection

    High-volume automated detection of checkboxes, ticks, and crosses in forms, replacing manual review of handcheck fields at scale.

    MobileNetV3
    Object Detection
    PyTorch
  5. OCR Text Recognition

    Domain-tuned text recognition for both printed and handwritten content, removing recognition accuracy as a bottleneck for the end-to-end pipeline.

    CRNN
    TrOCR
    Transformers
  6. OCR Post-Processing (Beam Search + LM)

    Context-aware OCR error correction that lifts word-level accuracy on noisy documents, improving downstream extraction and search quality.

    Beam Search
    Language Models
    NLP
  7. OCR Value Codification

    Centralized normalization engine that converts raw OCR text into standardized dates, measurements, and domain codes, eliminating scattered parsing logic across the pipeline.

    NLP
    Regex
    Python
  8. Interpage Document Correlation

    Automated cross-page entity validation for multi-page medical and insurance forms, catching inconsistencies before they reach downstream systems.

    NLP
    Entity Resolution
    Python
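
To make stages 7 and 8 above more concrete, here is a small sketch of how value codification and cross-page conflict detection might fit together. It is an illustration under assumptions, not the platform's actual engine: the accepted date layouts in `DATE_FORMATS`, the `_date` field-name convention, and the function names `normalize_date` and `find_conflicts` are all hypothetical. Ambiguous layouts such as `03/07/1980` are read as month/day/year here purely for the sake of the example.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional, Set

# Hypothetical set of layouts; the real engine's formats are not given in the source.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d %b %Y", "%B %d, %Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no layout matches."""
    text = re.sub(r"\s+", " ", raw.strip())  # collapse stray OCR whitespace
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def find_conflicts(pages: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Collect normalized values per field across pages and keep the
    fields that end up with more than one distinct value."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for page in pages:
        for name, raw in page.items():
            cleaned = raw.strip().lower()
            if name.endswith("_date"):
                # Fall back to the cleaned raw text when the date cannot be parsed.
                value = normalize_date(raw) or cleaned
            else:
                value = cleaned
            seen[name].add(value)
    return {name: values for name, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    pages = [
        {"patient_name": "Jane Doe", "birth_date": "03/07/1980"},
        {"patient_name": "jane doe ", "birth_date": "1980-03-07"},
        {"patient_name": "Jane Doe", "birth_date": "March 8, 1980"},
    ]
    # Fields returned here would be flagged for human review.
    print(find_conflicts(pages))
```

Normalizing before comparing is the point of the design: the first two pages write the same birth date in different layouts and agree once codified, so only the genuinely different third value is flagged.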