Local LLM Deployment & Serving Infrastructure
Self-hosted LLMs on owned GPU hardware serving two production pipelines, cutting inference cost per document from ~35 cents to ~10 cents while keeping sensitive data in-house.
The Business Problem
Production pipelines for medical entity extraction and the underwriter's assistant depended on third-party LLM APIs, with ongoing per-token costs, rate limits, and sensitive medical and insurance data leaving the infrastructure.
Running open-source models naively on available GPUs wasted memory and throughput; the hardware could not keep up with production load without serving optimizations.
The Technical Solution
I deployed two serving instances across the GPU fleet (RTX 4090s for smaller models and the A100 80GB for larger ones), wired into the production extraction and chatbot pipelines.
Serving was optimized with KV cache management, prefix caching for shared prompt segments, and request batching. Models were sized and placed to fit GPU memory budgets while leaving headroom for long contexts.
The Scalability Factor
Dockerized serving instances with health monitoring and graceful degradation under burst load from both pipelines. Prefix caching tuned around prompt structure for high cache hit rates.
Deployment degrades gracefully when one GPU is saturated, routing traffic to the A100 instance. Model updates roll out via containerized deploys without downtime to production pipelines.
Business Impact
Two self-hosted instances serve production traffic for medical entity extraction and the underwriter's assistant.
Serving optimizations cut inference cost per document from ~35 cents to ~10 cents versus hosted APIs, increased throughput per GPU, and kept sensitive documents in-house.