Arsalan Younus.

Document Noise Classification

Intelligent routing of scanned documents to the right denoising pipeline, improving OCR accuracy on poor-quality scans without wasted preprocessing.

The Business Problem

Scanned documents exhibit different noise types (blur, shadows, wrinkles, stains, low contrast), and no single denoising strategy fits them all. Applying the wrong preprocessing either degraded the text or wasted compute.

The client needed automatic routing of each scan to the appropriate denoising pipeline for optimal preprocessing before OCR.

The Technical Solution

I built a CNN classifier that predicts noise categories per scanned document, routing pages to the appropriate denoising pipeline (shadow correction vs blur reduction, etc.).
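The routing step can be illustrated with a minimal sketch. It assumes the classifier emits one raw score (logit) per noise category and that each category maps to exactly one pipeline. The label set, the `PIPELINES` table, and the placeholder pipeline functions are hypothetical names for illustration, not the production code.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Noise categories taken from the problem statement; the exact label set
# and its order are assumptions made for this sketch.
NOISE_CLASSES = ("blur", "shadow", "wrinkle", "stain", "low_contrast")


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_noise(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most likely noise label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return NOISE_CLASSES[best], probs[best]


def make_pipeline(name: str) -> Callable[[str], str]:
    """Build a placeholder pipeline that only tags the page it receives."""
    def run(page: str) -> str:
        return f"{name}({page})"
    return run


# Hypothetical routing table: one denoising pipeline per noise label.
PIPELINES: Dict[str, Callable[[str], str]] = {
    label: make_pipeline(f"denoise_{label}") for label in NOISE_CLASSES
}


def route(page: str, logits: Sequence[float]) -> str:
    """Send a page to the pipeline that matches its predicted noise label."""
    label, _confidence = predict_noise(logits)
    return PIPELINES[label](page)
```

In this sketch, a page whose highest score falls on the second category is handed to the shadow pipeline. A real system would replace the placeholder functions with actual image-processing steps and take the scores from the trained CNN.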

Fast inference keeps classification from becoming a bottleneck in the preprocessing stage.
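One way to check that classification stays off the critical path is to time it per page. The sketch below is an assumption about how such a check could look, not the measurement used in the project: `measure_latency` and its `classify` argument are hypothetical, and `classify` stands in for whatever callable wraps the model.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable


def measure_latency(
    classify: Callable[[object], object],
    pages: Iterable[object],
    warmup: int = 3,
) -> Dict[str, float]:
    """Time `classify` on each page and report median and p95 latency in ms."""
    pages = list(pages)
    if not pages:
        raise ValueError("at least one page is required")

    # Warm-up calls are not timed, so one-off setup cost does not skew results.
    for page in pages[:warmup]:
        classify(page)

    timings_ms = []
    for page in pages:
        start = time.perf_counter()
        classify(page)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Nearest-rank 95th percentile over the sorted timings.
    p95_index = math.ceil(0.95 * len(timings_ms)) - 1
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }
```

Comparing the reported p95 against the time the denoising pipelines themselves take per page gives a rough sense of whether the classifier adds meaningful overhead.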

The Scalability Factor

The classifier is deployed on AWS and integrated as the first stage of the preprocessing pipeline. Its lightweight CNN inference runs at production volume without GPU contention.

Business Impact

Noise classification reached 94% accuracy, driving better denoising choices per document and improved OCR accuracy on poor-quality scans.

Routing each scan correctly also reduces wasted denoising effort in the production preprocessing pipeline.

Built with

CNN
Image Classification
PyTorch
OpenCV
AWS