
Cost Optimization

ModelReins cuts the cost of AI jobs by routing each one to the right provider. Here are real numbers from production deployments, along with strategies to keep your bill low.

These are actual median costs from ModelReins telemetry for typical job sizes (500–2,000 input tokens, 200–800 output tokens):

| Job Type | Provider / Model | Cost per Job | Notes |
| --- | --- | --- | --- |
| Simple classification | Claude Haiku | $0.001 | Yes/no, category, sentiment |
| Test case generation | Claude Haiku | $0.003 | Generates 5–10 test cases |
| Code review (single file) | Claude Sonnet | $0.02 | Detailed review with suggestions |
| Auth refactor plan | Claude Opus | $0.12 | Complex multi-file reasoning |
| Summarization | Gemini Flash | $0.0005 | Fastest and cheapest cloud option |
| Extraction (structured) | OpenAI GPT-4o-mini | $0.002 | JSON extraction from documents |
| Any job | Ollama / LM Studio | $0.00 | Local models (hardware cost only) |

A 4-person engineering team ran ModelReins for a month with this configuration:

  • 312 jobs/week across code review, test generation, summarization, and extraction.
  • 3 workers: 1 cloud (Claude Haiku), 1 local (Ollama on a shared dev server), 1 mixed (OpenRouter with budget routing).
  • Routing rules: simple jobs → Ollama, medium jobs → Haiku, complex jobs → Sonnet.

| Category | Jobs/Week | Provider | Weekly Cost |
| --- | --- | --- | --- |
| Test generation | 140 | Ollama (local) | $0.00 |
| Summarization | 80 | Claude Haiku | $0.08 |
| Code review | 60 | Claude Haiku | $0.18 |
| Extraction | 20 | Claude Haiku | $0.02 |
| Complex analysis | 12 | Claude Sonnet | $1.19 |
| Total | 312 | | $1.47 |

The key insight: 45% of jobs ran locally at zero cost, and 81% of the total bill went to just 12 complex Sonnet jobs per week. Routing the right jobs to local models eliminated most of the spend.
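
The arithmetic behind these figures can be checked with a short script; the job counts and costs come straight from the table above:

```python
# Recompute the case-study figures from the weekly cost table.
weekly_jobs = {
    "test_generation": (140, 0.00),  # Ollama (local)
    "summarization":   (80, 0.08),   # Claude Haiku
    "code_review":     (60, 0.18),   # Claude Haiku
    "extraction":      (20, 0.02),   # Claude Haiku
    "complex":         (12, 1.19),   # Claude Sonnet
}

total_jobs = sum(n for n, _ in weekly_jobs.values())
total_cost = sum(c for _, c in weekly_jobs.values())
local_share = weekly_jobs["test_generation"][0] / total_jobs
sonnet_share = weekly_jobs["complex"][1] / total_cost

print(total_jobs, round(total_cost, 2))  # 312 1.47
print(round(local_share * 100))          # 45  (% of jobs run locally)
print(round(sonnet_share * 100))         # 81  (% of spend on complex jobs)
```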

Use local models (Ollama, LM Studio) when:

  • The task is formulaic: classification, extraction, reformatting, templated generation.
  • Privacy matters: the data shouldn’t leave your network.
  • Volume is high: hundreds of similar jobs where per-job cost adds up.
  • Quality requirements are moderate: you need “good enough,” not “best possible.”
  • You’re iterating: rapid prompt development where you’ll run the same job dozens of times.

Use cloud models (Claude, OpenAI, Gemini) when:

  • The task requires complex reasoning: multi-step analysis, architectural decisions, subtle code bugs.
  • Output quality is critical: customer-facing content, security audits, compliance analysis.
  • The job needs a large context window: 100k+ tokens of input.
  • You need specific capabilities: vision, function calling, structured output guarantees.

Strategy 1: Tier-based routing

Route by job complexity. Set the tier field on each job:

{
  "routing": {
    "tiers": {
      "low": { "provider": "ollama", "model": "llama3.2" },
      "medium": { "provider": "claude", "model": "haiku" },
      "high": { "provider": "claude", "model": "sonnet" }
    }
  }
}
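
A job then selects its tier via the tier field. The job shape below is illustrative (the type and input keys are assumptions, not the documented schema); the point is that tier picks the provider/model pair from the config above:

```json
{
  "type": "summarization",
  "tier": "low",
  "input": "Summarize the attached incident report."
}
```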

Strategy 2: Local-first with cloud fallback


Try local first. If the local worker is busy or the job times out, fall back to cloud:

{
  "routing": {
    "strategy": "fallback",
    "chain": ["ollama", "claude"]
  }
}
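
Conceptually, a fallback chain works like the sketch below. This is an illustrative model of the behavior, not ModelReins internals; the function names are made up:

```python
# Illustrative fallback-chain routing: try each provider in order and
# move on when one is busy or times out.
def route_with_fallback(job, chain, run):
    """run(provider, job) returns a result, or raises on failure/timeout."""
    failures = []
    for provider in chain:
        try:
            return provider, run(provider, job)
        except Exception as exc:  # busy worker, timeout, etc.
            failures.append((provider, exc))
    raise RuntimeError(f"all providers failed: {failures}")

# Example: the local worker is busy, so the job falls back to cloud.
def run(provider, job):
    if provider == "ollama":
        raise TimeoutError("local worker busy")
    return "ok"

print(route_with_fallback({"type": "summary"}, ["ollama", "claude"], run))
# ('claude', 'ok')
```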

Strategy 3: Budget caps

Set a daily or weekly spend limit. Once reached, only local jobs run:

{
  "routing": {
    "budget": {
      "weekly_limit_usd": 2.00,
      "over_budget_provider": "ollama"
    }
  }
}
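
The budget gate reduces to a simple check, sketched below as a conceptual model (again, not ModelReins internals): once spend crosses the limit, every job routes to the over-budget provider, which is local and therefore free at the margin.

```python
# Conceptual budget gate: route to the over-budget provider once the
# weekly limit is hit, otherwise honor the requested provider.
def pick_provider(requested, spent_usd, weekly_limit_usd, over_budget="ollama"):
    if spent_usd >= weekly_limit_usd:
        return over_budget
    return requested

print(pick_provider("claude", spent_usd=1.50, weekly_limit_usd=2.00))  # claude
print(pick_provider("claude", spent_usd=2.10, weekly_limit_usd=2.00))  # ollama
```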

The ModelReins dashboard tracks spending in real time. You can also query costs from the CLI:

# This week's spend
modelreins cost summary --period week
# Cost breakdown by provider
modelreins cost breakdown --period month
# Set up alerts
modelreins cost alert --threshold 5.00 --period week
Tips for keeping costs low

  • Start local, add cloud incrementally. Run everything on Ollama first. Identify which jobs actually need cloud quality, and only route those.
  • Haiku is almost always enough. For tasks that need cloud providers, Haiku handles 90% of cases at 1/10th the cost of larger models.
  • Batch similar jobs. If you’re running 50 extractions, route them all to the same local worker to avoid cold-start overhead.
  • Review the dashboard weekly. Look for expensive jobs that could be downgraded or moved to local models.