Past Project

Insurance AI Automation

A 14B specialized model that outperformed GPT-4o — on hardware the client owns.

Our flagship case study for the specialized-models thesis. The client was paying ~$2,500/mo for AI on a closed-API generalist and wanted to 4x their usage. We LoRA fine-tuned Qwen2.5 14B for the specific task and it outperformed GPT-4o on their workload, eliminated their token-based spending, and gave them headroom to scale without their bill scaling with them.

Specialized > Generalist

Instead of buying more tokens from a frontier API, we LoRA fine-tuned Qwen2.5 14B for the client's specific task. On their workload, the fine-tuned 14B model outperformed GPT-4o — proof that the right small model beats a large generalist when the task is well-defined.

Self-hosted AI

We self-hosted Qwen2.5 14B for document parsing, analyses, and data extraction. The model runs at full precision on ~48GB of hardware with batching — affordable, private, and entirely under the client's control.

Scalability

We calculated batching requirements for the client's current workload and 4x scale concerns. The same hardware absorbs all of it.

Results

Eliminated a $2,500/mo recurring cost, maintained — actually improved — performance on a core feature their users loved, and 4x'd usage with zero increase in spend.

Fine Tuning

We LoRA trained the model to know the task without explicit instruction. This increased speed, reduced prompt overhead, and simplified the API. Worth noting: fine-tuning cost is significant and with modern models it's rarely necessary — but when the task warrants it, the gains compound.

Tech Stack

TensorRT LLMOpenAI-CompatibleQwen2.5 14BPythonLoRA