Arsalan Younus.

LLM Provider Benchmarking (Price vs Accuracy)

Data-driven model selection that routes production traffic to the best price-accuracy trade-off, reserving expensive frontier models only for tasks that need them.

The Business Problem

The LLM landscape changes fast, and picking a production model by reputation alone risks overpaying or shipping lower accuracy. Providers differ in both pricing and performance on domain-specific tasks such as medical document extraction.

The client needed hard numbers on how each model performed on their own workloads before committing production traffic and budget.

The Technical Solution

I built an evaluation harness that ran each candidate model against the same labeled test sets from real production tasks, measuring extraction accuracy, latency, and cost per document.
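As a rough illustration of the harness idea, here is a minimal sketch in Python. It is not the production code: the names `Example`, `ModelResult`, `field_accuracy`, and `evaluate` are hypothetical, accuracy is assumed to be exact-match per labeled field, and each model is assumed to be wrapped in a callable that returns the extracted fields together with the USD cost of that call.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    document: str
    expected: dict[str, str]  # labeled ground truth: field name -> value

@dataclass
class ModelResult:
    model: str
    accuracy: float       # mean per-document field accuracy
    avg_latency_s: float  # mean wall-clock seconds per document
    cost_per_doc: float   # mean USD per document

# Hypothetical adapter signature: document text -> (extracted fields, USD cost).
Extractor = Callable[[str], tuple[dict[str, str], float]]

def field_accuracy(predicted: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of labeled fields the model got exactly right (assumed metric)."""
    if not expected:
        return 1.0
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return correct / len(expected)

def evaluate(model_name: str, extract: Extractor, examples: list[Example]) -> ModelResult:
    """Run one model over the shared test set and aggregate the three metrics."""
    if not examples:
        raise ValueError("test set is empty")
    scores, latencies, costs = [], [], []
    for example in examples:
        start = time.perf_counter()
        predicted, cost_usd = extract(example.document)
        latencies.append(time.perf_counter() - start)
        scores.append(field_accuracy(predicted, example.expected))
        costs.append(cost_usd)
    n = len(examples)
    return ModelResult(model_name, sum(scores) / n, sum(latencies) / n, sum(costs) / n)
```

Because every candidate goes through the same `evaluate` call with the same `examples`, differences in the results come from the model rather than from the test data.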

The results formed a price-vs-accuracy matrix across hosted and self-hosted models, making trade-offs explicit instead of anecdotal.
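One way to make such a matrix actionable is to drop every model that another model beats on both accuracy and cost. The sketch below assumes that approach; the function name `pareto_frontier` and the row layout are hypothetical, and the numbers in the usage example are placeholders, not benchmark results from this project.

```python
def pareto_frontier(rows: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models that no other model dominates.

    Each row is (name, accuracy, cost_per_doc). A model is dominated when
    another model is at least as accurate and at least as cheap, and
    strictly better on one of the two.
    """
    frontier = []
    for name, accuracy, cost in rows:
        dominated = any(
            other_acc >= accuracy
            and other_cost <= cost
            and (other_acc > accuracy or other_cost < cost)
            for other_name, other_acc, other_cost in rows
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Placeholder values for illustration only.
matrix = [
    ("model_a", 0.95, 0.05),
    ("model_b", 0.90, 0.01),
    ("model_c", 0.85, 0.02),
]
print(pareto_frontier(matrix))
```

In this placeholder data, `model_c` falls off the frontier because `model_b` is both more accurate and cheaper; the remaining models represent genuine trade-offs.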

The Scalability Factor

The evaluation harness is version-controlled and rerunnable as new models release, so routing decisions stay current without manual re-benchmarking from scratch.

The findings were integrated into production model routing: expensive frontier models are reserved for high-stakes tasks, while cheaper or self-hosted models handle the rest.
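A routing rule of this kind can be sketched as "pick the cheapest model that clears the accuracy floor for the task tier". The code below is an assumption about how that might look, not the production router: the tier names, the thresholds in `ACCURACY_FLOOR`, and the `Candidate` and `route` names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float      # measured on the labeled test set
    cost_per_doc: float  # USD per document

# Hypothetical tiers and thresholds; real values would come from the benchmark.
ACCURACY_FLOOR = {"high_stakes": 0.95, "routine": 0.85}

def route(task_tier: str, candidates: list[Candidate]) -> Candidate:
    """Return the cheapest candidate meeting the tier's accuracy floor."""
    if not candidates:
        raise ValueError("no candidate models")
    # Unknown tiers fall back to the strictest floor, to err on the safe side.
    floor = ACCURACY_FLOOR.get(task_tier, max(ACCURACY_FLOOR.values()))
    eligible = [c for c in candidates if c.accuracy >= floor]
    if not eligible:
        # Nothing clears the floor: use the most accurate model available.
        return max(candidates, key=lambda c: c.accuracy)
    return min(eligible, key=lambda c: c.cost_per_doc)
```

With this shape, rerunning the benchmark only updates the `Candidate` numbers; the routing logic itself stays unchanged.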

Business Impact

Clear price-accuracy matrix across all major providers and open-source alternatives, grounded in production data.

Findings drove model selection and routing in production, with expensive models reserved for tasks that need them.

Built with

AWS Bedrock
Azure OpenAI
GPT-4.1
o1
o3
Qwen
Llama 4
Evaluation
Python