# Cost Optimization
ModelReins makes AI jobs cheap by routing them to the right provider. Here are real numbers from production deployments and strategies to keep your bill low.
## Real costs per job

These are actual median costs from ModelReins telemetry, based on typical job sizes (500–2,000 input tokens, 200–800 output tokens):
| Job Type | Provider / Model | Cost per Job | Notes |
|---|---|---|---|
| Simple classification | Claude Haiku | $0.001 | Yes/no, category, sentiment |
| Test case generation | Claude Haiku | $0.003 | Generates 5–10 test cases |
| Code review (single file) | Claude Sonnet | $0.02 | Detailed review with suggestions |
| Auth refactor plan | Claude Opus | $0.12 | Complex multi-file reasoning |
| Summarization | Gemini Flash | $0.0005 | Fastest and cheapest cloud option |
| Extraction (structured) | OpenAI GPT-4o-mini | $0.002 | JSON extraction from documents |
| Any job | Ollama / LM Studio | $0.00 | Local models — hardware cost only |
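Per-job figures like these come from multiplying token counts by per-token prices. The sketch below shows the arithmetic; the per-million-token rates are illustrative assumptions, not ModelReins' actual rate card, and the model keys are made up for the example.

```python
# Sketch: deriving a per-job cost estimate from token counts.
# Prices are (input_usd, output_usd) per million tokens -- assumed values.
PRICE_PER_MTOK = {
    "claude-haiku":  (0.25, 1.25),
    "claude-sonnet": (3.00, 15.00),
    "ollama":        (0.00, 0.00),  # local: hardware cost only
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one job."""
    in_price, out_price = PRICE_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A typical small job: 1,500 input tokens, 500 output tokens on Haiku.
print(round(job_cost("claude-haiku", 1500, 500), 6))  # 0.001
```

With these assumed rates, a typical small Haiku job lands right around the $0.001 shown in the table.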
## The $1.47/week case study

A 4-person engineering team ran ModelReins for a month with this configuration:
- 312 jobs/week across code review, test generation, summarization, and extraction.
- 3 workers: 1 cloud (Claude Haiku), 1 local (Ollama on a shared dev server), 1 mixed (OpenRouter with budget routing).
- Routing rules: simple jobs → Ollama, medium jobs → Haiku, complex jobs → Sonnet.
### Cost breakdown

| Category | Jobs/Week | Provider | Weekly Cost |
|---|---|---|---|
| Test generation | 140 | Ollama (local) | $0.00 |
| Summarization | 80 | Claude Haiku | $0.08 |
| Code review | 60 | Claude Haiku | $0.18 |
| Extraction | 20 | Claude Haiku | $0.02 |
| Complex analysis | 12 | Claude Sonnet | $1.19 |
| **Total** | 312 | | $1.47 |
The key insight: 45% of jobs ran locally at zero cost, and roughly 80% of the cloud budget went to just 12 complex jobs per week. Routing the right jobs to local models eliminated most of the bill.
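The breakdown arithmetic can be checked directly from the table:

```python
# Verify the case-study totals and shares from the table above.
breakdown = {
    "test_generation":  (140, 0.00),  # (jobs/week, weekly cost USD)
    "summarization":    (80,  0.08),
    "code_review":      (60,  0.18),
    "extraction":       (20,  0.02),
    "complex_analysis": (12,  1.19),
}

total_jobs = sum(jobs for jobs, _ in breakdown.values())
total_cost = sum(cost for _, cost in breakdown.values())
local_share = breakdown["test_generation"][0] / total_jobs
sonnet_share = breakdown["complex_analysis"][1] / total_cost

print(total_jobs, round(total_cost, 2))  # 312 1.47
print(round(local_share * 100))          # 45 (% of jobs run locally)
print(round(sonnet_share * 100))         # 81 (% of spend on complex jobs)
```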
## When to use local vs cloud

### Use local models (Ollama, LM Studio) when:

- The task is formulaic: classification, extraction, reformatting, templated generation.
- Privacy matters: the data shouldn’t leave your network.
- Volume is high: hundreds of similar jobs where per-job cost adds up.
- Quality requirements are moderate: you need “good enough,” not “best possible.”
- You’re iterating: rapid prompt development where you’ll run the same job dozens of times.
### Use cloud models when:

- The task requires complex reasoning: multi-step analysis, architectural decisions, subtle code bugs.
- Output quality is critical: customer-facing content, security audits, compliance analysis.
- The job needs a large context window: 100k+ tokens of input.
- You need specific capabilities: vision, function calling, structured output guarantees.
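The two checklists above can be condensed into a small decision helper. This is a toy sketch: the field names and thresholds are illustrative assumptions, not part of ModelReins.

```python
# Toy decision helper reflecting the local-vs-cloud criteria above.
def choose_backend(task: dict) -> str:
    # Privacy trumps everything: the data must stay on-network.
    if task.get("private"):
        return "local"
    # Cloud criteria: complex reasoning, critical output quality,
    # a huge context window, or capabilities local models lack.
    if (task.get("complex_reasoning")
            or task.get("quality_critical")
            or task.get("input_tokens", 0) > 100_000
            or task.get("needs_vision")):
        return "cloud"
    # Formulaic, high-volume, "good enough" work defaults to local.
    return "local"

print(choose_backend({"complex_reasoning": True}))             # cloud
print(choose_backend({"private": True, "quality_critical": True}))  # local
```

Note the ordering: privacy outranks quality, so sensitive jobs stay local even when the cloud model would do better work.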
## Routing strategies for cost

### Strategy 1: Tiered routing (recommended)
Route by job complexity. Set the `tier` field on each job:

```json
{
  "routing": {
    "tiers": {
      "low": { "provider": "ollama", "model": "llama3.2" },
      "medium": { "provider": "claude", "model": "haiku" },
      "high": { "provider": "claude", "model": "sonnet" }
    }
  }
}
```
### Strategy 2: Local-first with cloud fallback

Try local first. If the local worker is busy or the job times out, fall back to cloud:

```json
{
  "routing": {
    "strategy": "fallback",
    "chain": ["ollama", "claude"]
  }
}
```
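The fallback chain walks the list in order until a provider accepts the job. In this sketch, `is_available` stands in for the health and queue-depth checks ModelReins would perform internally:

```python
# Sketch of the fallback chain: first available provider wins.
def route_with_fallback(chain, is_available):
    for provider in chain:
        if is_available(provider):
            return provider
    raise RuntimeError("no provider available for job")

# Local worker busy -> the job falls through to the cloud provider.
busy_local = {"ollama": False, "claude": True}
print(route_with_fallback(["ollama", "claude"], busy_local.get))  # claude
```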
### Strategy 3: Budget cap

Set a daily or weekly spend limit. Once reached, only local jobs run:

```json
{
  "routing": {
    "budget": {
      "weekly_limit_usd": 2.00,
      "over_budget_provider": "ollama"
    }
  }
}
```
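The cap behaves like a simple guard in front of routing: once the week's spend reaches the limit, every job is redirected to the over-budget provider. The field names below mirror the config above; the accounting logic itself is an assumption for illustration.

```python
# Sketch of the budget-cap rule from the config above.
budget = {"weekly_limit_usd": 2.00, "over_budget_provider": "ollama"}

def pick_provider(preferred: str, spent_this_week: float) -> str:
    if spent_this_week >= budget["weekly_limit_usd"]:
        return budget["over_budget_provider"]  # cap hit: local only
    return preferred

print(pick_provider("claude", spent_this_week=1.47))  # claude
print(pick_provider("claude", spent_this_week=2.10))  # ollama
```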
## Cost monitoring

The ModelReins dashboard tracks spending in real time. You can also query costs from the CLI:
```sh
# This week's spend
modelreins cost summary --period week

# Cost breakdown by provider
modelreins cost breakdown --period month

# Set up alerts
modelreins cost alert --threshold 5.00 --period week
```

## Tips

- Start local, add cloud incrementally. Run everything on Ollama first. Identify which jobs actually need cloud quality, and only route those.
- Haiku is almost always enough. For tasks that need cloud providers, Haiku handles 90% of cases at 1/10th the cost of larger models.
- Batch similar jobs. If you’re running 50 extractions, route them all to the same local worker to avoid cold-start overhead.
- Review the dashboard weekly. Look for expensive jobs that could be downgraded or moved to local models.
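The batching tip amounts to grouping queued jobs by type so each batch hits one worker and pays the model cold-start cost once. A minimal sketch, with an illustrative job shape that is not a ModelReins data structure:

```python
# Group queued jobs by type so each batch targets one worker.
from itertools import groupby

jobs = [
    {"id": 1, "type": "extraction"},
    {"id": 2, "type": "review"},
    {"id": 3, "type": "extraction"},
    {"id": 4, "type": "extraction"},
]

# Sort first: groupby only merges adjacent items.
batches = {
    job_type: [j["id"] for j in group]
    for job_type, group in groupby(
        sorted(jobs, key=lambda j: j["type"]), key=lambda j: j["type"])
}
print(batches)  # {'extraction': [1, 3, 4], 'review': [2]}
```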