
Cost Optimization

ModelReins cuts the cost of AI jobs by routing each one to the right provider. Here are real numbers from production deployments, along with strategies to keep your bill low.

These are actual median costs from ModelReins telemetry for typical job sizes (500–2,000 input tokens, 200–800 output tokens):

| Job Type | Provider / Model | Cost per Job | Notes |
| --- | --- | --- | --- |
| Simple classification | Claude Haiku | $0.001 | Yes/no, category, sentiment |
| Test case generation | Claude Haiku | $0.003 | Generates 5–10 test cases |
| Code review (single file) | Claude Sonnet | $0.02 | Detailed review with suggestions |
| Auth refactor plan | Claude Opus | $0.12 | Complex multi-file reasoning |
| Summarization | Gemini Flash | $0.0005 | Fastest and cheapest cloud option |
| Extraction (structured) | OpenAI GPT-4o-mini | $0.002 | JSON extraction from documents |
| Any job | Ollama / LM Studio | $0.00 | Local models (hardware cost only) |

A 4-person engineering team ran ModelReins for a month with this configuration:

  • 312 jobs/week across code review, test generation, summarization, and extraction.
  • 3 workers: 1 cloud (Claude Haiku), 1 local (Ollama on a shared dev server), 1 mixed (OpenRouter with budget routing).
  • Routing rules: simple jobs → Ollama, medium jobs → Haiku, complex jobs → Sonnet.

| Category | Jobs/Week | Provider | Weekly Cost |
| --- | --- | --- | --- |
| Test generation | 140 | Ollama (local) | $0.00 |
| Summarization | 80 | Claude Haiku | $0.08 |
| Code review | 60 | Claude Haiku | $0.18 |
| Extraction | 20 | Claude Haiku | $0.02 |
| Complex analysis | 12 | Claude Sonnet | $1.19 |
| Total | 312 | | $1.47 |

The key insight: 45% of jobs ran locally at zero cost, and 81% of the total bill went to just 12 complex Sonnet jobs per week. Routing the right jobs to local models eliminated most of the spend.
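
The arithmetic behind these figures can be checked with a short script; the job counts and costs come straight from the table above:

```python
# Recompute the case-study figures from the weekly cost table.
weekly_jobs = {
    "test_generation": (140, 0.00),  # Ollama (local)
    "summarization":   (80, 0.08),   # Claude Haiku
    "code_review":     (60, 0.18),   # Claude Haiku
    "extraction":      (20, 0.02),   # Claude Haiku
    "complex":         (12, 1.19),   # Claude Sonnet
}

total_jobs = sum(n for n, _ in weekly_jobs.values())
total_cost = sum(c for _, c in weekly_jobs.values())
local_share = weekly_jobs["test_generation"][0] / total_jobs
sonnet_share = weekly_jobs["complex"][1] / total_cost

print(total_jobs, round(total_cost, 2))  # 312 1.47
print(round(local_share * 100))          # 45  (% of jobs run locally)
print(round(sonnet_share * 100))         # 81  (% of spend on complex jobs)
```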

Use local models (Ollama, LM Studio) when:

  • The task is formulaic: classification, extraction, reformatting, templated generation.
  • Privacy matters: the data shouldn’t leave your network.
  • Volume is high: hundreds of similar jobs where per-job cost adds up.
  • Quality requirements are moderate: you need “good enough,” not “best possible.”
  • You’re iterating: rapid prompt development where you’ll run the same job dozens of times.

Use cloud models (Claude, OpenAI, Gemini) when:

  • The task requires complex reasoning: multi-step analysis, architectural decisions, subtle code bugs.
  • Output quality is critical: customer-facing content, security audits, compliance analysis.
  • The job needs a large context window: 100k+ tokens of input.
  • You need specific capabilities: vision, function calling, structured output guarantees.

Strategy 1: Tier-based routing

Route by job complexity. Set the tier field on each job:

{
  "routing": {
    "tiers": {
      "low": { "provider": "ollama", "model": "llama3.2" },
      "medium": { "provider": "claude", "model": "haiku" },
      "high": { "provider": "claude", "model": "sonnet" }
    }
  }
}
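
A job then selects its tier via the tier field. The job shape below is illustrative (the type and input keys are assumptions, not the documented schema); the point is that tier picks the provider/model pair from the config above:

```json
{
  "type": "summarization",
  "tier": "low",
  "input": "Summarize the attached incident report."
}
```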

Strategy 2: Local-first with cloud fallback


Try local first. If the local worker is busy or the job times out, fall back to cloud:

{
  "routing": {
    "strategy": "fallback",
    "chain": ["ollama", "claude"]
  }
}
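
Conceptually, a fallback chain works like the sketch below. This is an illustrative model of the behavior, not ModelReins internals; the function names are made up:

```python
# Illustrative fallback-chain routing: try each provider in order and
# move on when one is busy or times out.
def route_with_fallback(job, chain, run):
    """run(provider, job) returns a result, or raises on failure/timeout."""
    failures = []
    for provider in chain:
        try:
            return provider, run(provider, job)
        except Exception as exc:  # busy worker, timeout, etc.
            failures.append((provider, exc))
    raise RuntimeError(f"all providers failed: {failures}")

# Example: the local worker is busy, so the job falls back to cloud.
def run(provider, job):
    if provider == "ollama":
        raise TimeoutError("local worker busy")
    return "ok"

print(route_with_fallback({"type": "summary"}, ["ollama", "claude"], run))
# ('claude', 'ok')
```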

Strategy 3: Budget caps

Set a daily or weekly spend limit. Once reached, only local jobs run:

{
  "routing": {
    "budget": {
      "weekly_limit_usd": 2.00,
      "over_budget_provider": "ollama"
    }
  }
}
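
The budget gate reduces to a simple check, sketched below as a conceptual model (again, not ModelReins internals): once spend crosses the limit, every job routes to the over-budget provider, which is local and therefore free at the margin.

```python
# Conceptual budget gate: route to the over-budget provider once the
# weekly limit is hit, otherwise honor the requested provider.
def pick_provider(requested, spent_usd, weekly_limit_usd, over_budget="ollama"):
    if spent_usd >= weekly_limit_usd:
        return over_budget
    return requested

print(pick_provider("claude", spent_usd=1.50, weekly_limit_usd=2.00))  # claude
print(pick_provider("claude", spent_usd=2.10, weekly_limit_usd=2.00))  # ollama
```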

The ModelReins dashboard tracks spending in real time. You can also query costs from the CLI:

# This week's spend
modelreins cost summary --period week
# Cost breakdown by provider
modelreins cost breakdown --period month
# Set up alerts
modelreins cost alert --threshold 5.00 --period week
Tips for keeping costs low

  • Start local, add cloud incrementally. Run everything on Ollama first. Identify which jobs actually need cloud quality, and only route those.
  • Haiku is almost always enough. For tasks that need cloud providers, Haiku handles 90% of cases at 1/10th the cost of larger models.
  • Batch similar jobs. If you’re running 50 extractions, route them all to the same local worker to avoid cold-start overhead.
  • Review the dashboard weekly. Look for expensive jobs that could be downgraded or moved to local models.