# Ollama
Ollama runs open-weight models directly on your hardware. No API key, no cloud dependency, no per-token charges. Models run on your CPU or GPU and never send data off your machine.
## Why Ollama

- Zero cost — no API fees, ever. Your only cost is electricity.
- Full privacy — prompts and completions never leave your machine.
- Offline capable — works without an internet connection after model download.
- Fast iteration — no rate limits, no quotas, no waiting.
## Install Ollama

### macOS / Linux

```shell
curl -fsSL https://ollama.ai/install.sh | sh
```

### macOS (Homebrew)

```shell
brew install ollama
```

### Windows

Download the installer from ollama.ai.
## Verify installation

```shell
ollama --version
```

## Pull a model

```shell
# Recommended starter model — good balance of quality and speed
ollama pull llama3.2

# Lightweight option for fast completions
ollama pull phi3

# Larger model for better quality (needs 16GB+ RAM)
ollama pull llama3.1:70b
```
## Configure ModelReins

```shell
export MODELREINS_PROVIDER=ollama
export MODELREINS_OLLAMA_HOST=http://localhost:11434
```

Or in modelreins.config.json:

```json
{
  "provider": "ollama",
  "ollama": {
    "host": "http://localhost:11434",
    "model": "llama3.2"
  }
}
```

Start the worker:

```shell
modelreins worker start --provider ollama
```
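The `MODELREINS_OLLAMA_HOST` variable follows standard shell fallback semantics: when it is unset, the default local address should be used. A minimal sketch of that resolution (the function name is illustrative, not part of ModelReins):

```shell
# Resolve the Ollama host: prefer MODELREINS_OLLAMA_HOST, fall back
# to Ollama's default local address. Sketch only; the worker's exact
# resolution logic may differ.
resolve_ollama_host() {
  printf '%s\n' "${MODELREINS_OLLAMA_HOST:-http://localhost:11434}"
}

resolve_ollama_host
```

With the variable unset this prints `http://localhost:11434`; export it to point the worker at a remote Ollama instance instead.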
## Starter models

| Model | RAM Required | Speed | Quality | Best For |
|---|---|---|---|---|
| llama3.2 | 8 GB | Fast | Good | General tasks, summaries, extraction |
| phi3 | 4 GB | Very fast | Moderate | Quick completions, high throughput |
| codellama | 8 GB | Fast | Good (code) | Code generation, review |
| mistral | 8 GB | Fast | Good | Multilingual, instruction following |
| llama3.1:70b | 48 GB | Slow | Excellent | Complex reasoning (GPU recommended) |
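As a rough rule of thumb, the RAM thresholds in the table can be folded into a tiny helper that suggests a starter model for a given amount of memory. This is a hypothetical convenience script, not part of the ModelReins or Ollama CLIs:

```shell
# Suggest a starter model for a given RAM size in GB; the thresholds
# mirror the table above. Hypothetical helper, for illustration only.
pick_starter_model() {
  if [ "$1" -ge 48 ]; then
    echo "llama3.1:70b"     # excellent quality, needs 48 GB
  elif [ "$1" -ge 8 ]; then
    echo "llama3.2"         # good general-purpose default at 8 GB
  elif [ "$1" -ge 4 ]; then
    echo "phi3"             # lightweight option at 4 GB
  else
    echo "none"             # below 4 GB, no starter model fits
  fi
}

pick_starter_model 16
# → llama3.2
```

A 16 GB machine lands in the 8 GB tier; at that tier you could equally swap in codellama or mistral depending on the workload.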
## Selecting a model per job

Override the default model for specific jobs:

```shell
modelreins job dispatch \
  --provider ollama \
  --model codellama \
  --prompt "Review this function for bugs" \
  --input ./handler.ts
```
## GPU acceleration

Ollama automatically detects and uses NVIDIA GPUs via CUDA. For Apple Silicon, Metal acceleration is enabled by default.

Check GPU status:

```shell
ollama ps
```

If your model is running on CPU and you have a GPU available, ensure your drivers are up to date and restart Ollama.
## Troubleshooting

**"Could not connect to Ollama"** — make sure the Ollama service is running:

```shell
# Start the service
ollama serve

# Or on macOS, launch the app — it starts the server automatically
```

**Model not found** — pull the model first: `ollama pull <model-name>`. Model names are case-sensitive.

**Slow responses** — check if the model fits in RAM. If the system is swapping, use a smaller model or add more memory.
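For the connection error above, a quick way to confirm the server is listening is to probe `/api/tags`, Ollama's model-listing endpoint, which responds on any healthy instance. A small diagnostic sketch (the function name is illustrative):

```shell
# Probe an Ollama server; /api/tags lists installed models and
# answers whenever the server is up. Prints reachable/unreachable.
check_ollama() {
  host="${1:-http://localhost:11434}"
  if curl -fsS --max-time 2 "$host/api/tags" > /dev/null 2>&1; then
    echo "ollama reachable"
  else
    echo "ollama unreachable"
  fi
}

check_ollama
```

If this prints `ollama unreachable`, start the server with `ollama serve` (or launch the macOS app) and run the check again.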