
LM Studio

LM Studio provides a desktop application for downloading, managing, and serving open-weight models. It exposes an OpenAI-compatible local API that ModelReins connects to directly.

  • Free forever — no API fees, no subscriptions, no usage limits.
  • Full privacy — all inference runs locally. Nothing leaves your machine.
  • GUI model management — browse, download, and configure models through a visual interface.
  • High-volume friendly — no rate limits. Run as many jobs as your hardware can handle.
  • OpenAI-compatible API — works with any tool expecting the OpenAI API format.
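Because the server speaks the OpenAI API format, any HTTP client can target its standard endpoints. A minimal sketch using only Python's standard library (the model name and prompt are placeholders):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("http://localhost:1234", "llama-3.2", "Say hello")
# Once the server is running, send it with:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```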
  1. Download from lmstudio.ai for macOS, Windows, or Linux.
  2. Run the installer and launch LM Studio.
|      | Minimum          | Recommended                 |
| ---- | ---------------- | --------------------------- |
| RAM  | 8 GB             | 16 GB+                      |
| Disk | 10 GB free       | 50 GB+ (models are large)   |
| GPU  | None (CPU works) | 6 GB+ VRAM for acceleration |
  1. Open LM Studio and go to the Discover tab.
  2. Search for a model — small instruction-tuned models such as Llama 3.2 or Phi-3 Mini are good starting points.
  3. Click Download and wait for it to complete.
  4. Go to the Chat tab, select the model from the dropdown, and verify it responds.
  1. Click the Local Server tab (left sidebar, <-> icon).
  2. Select your loaded model from the dropdown.
  3. Click Start Server.
  4. The server runs on http://localhost:1234 by default.

Verify the server is running:

curl http://localhost:1234/v1/models

You should see your loaded model in the response.
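The response follows the standard OpenAI models-list shape, so pulling out the loaded model ids is straightforward. A sketch (the sample body below is illustrative; yours will list whichever model you loaded):

```python
import json

def loaded_models(body: str) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in json.loads(body)["data"]]

sample = '{"object": "list", "data": [{"id": "llama-3.2", "object": "model"}]}'
print(loaded_models(sample))  # → ['llama-3.2']
```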

export MODELREINS_PROVIDER=lmstudio
export MODELREINS_LMSTUDIO_HOST=http://localhost:1234

Or in modelreins.config.json:

{
  "provider": "lmstudio",
  "lmstudio": {
    "host": "http://localhost:1234",
    "model": "llama-3.2"
  }
}
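How ModelReins reconciles the environment variable with the config file is not specified here; one plausible resolution order — environment variable first, then the config file, then LM Studio's default address — could be sketched as:

```python
import json
import os

def resolve_lmstudio_host(config_path: str = "modelreins.config.json") -> str:
    """Pick the LM Studio host. The precedence (env var over config file)
    is an assumption; check the ModelReins docs for the actual rules."""
    env = os.environ.get("MODELREINS_LMSTUDIO_HOST")
    if env:
        return env
    try:
        with open(config_path) as f:
            return json.load(f)["lmstudio"]["host"]
    except (FileNotFoundError, KeyError):
        return "http://localhost:1234"  # LM Studio's default server address
```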

Start the worker:

modelreins worker start --provider lmstudio
| Model         | Download Size | Quality     | Best For                                  |
| ------------- | ------------- | ----------- | ----------------------------------------- |
| Llama 3.2 7B  | 4 GB          | Good        | General tasks, balanced speed/quality     |
| Phi-3 Mini    | 2.3 GB        | Moderate    | Fast completions, lower resource machines |
| Mistral 7B    | 4 GB          | Good        | Instruction following, multilingual       |
| CodeLlama 7B  | 4 GB          | Good (code) | Code generation, refactoring              |
| Llama 3.1 70B | 40 GB         | Excellent   | Complex tasks (needs powerful GPU)        |
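A rough way to pre-check whether a model from the table will fit on your machine: quantized weights occupy about their download size in memory, plus runtime overhead for the KV cache and buffers. The 1.5× factor below is a conservative assumption, not an LM Studio figure:

```python
def fits_in_memory(download_gb: float, available_gb: float, overhead: float = 1.5) -> bool:
    """Rough check: weights at roughly their download size, times an
    overhead factor for KV cache and runtime buffers (assumed 1.5x)."""
    return download_gb * overhead <= available_gb

print(fits_in_memory(4, 16))   # a 4 GB model on a 16 GB machine → True
print(fits_in_memory(40, 16))  # a 40 GB model on the same machine → False
```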

LM Studio can load one model at a time per server instance. To serve multiple models, run multiple instances on different ports:

{
  "provider": "lmstudio",
  "lmstudio": {
    "host": "http://localhost:1235",
    "model": "codellama-7b"
  }
}

For multi-model routing, consider using Ollama instead — it handles model switching automatically.
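If you do run several LM Studio instances, the client side needs to know which port serves which model. A minimal static routing map (ports and model names are illustrative):

```python
# Hypothetical model-to-instance map; ports and model names are illustrative.
ROUTES = {
    "llama-3.2": "http://localhost:1234",
    "codellama-7b": "http://localhost:1235",
}

def host_for(model: str) -> str:
    """Return the LM Studio instance serving the given model."""
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"no LM Studio instance configured for {model!r}") from None

print(host_for("codellama-7b"))  # → http://localhost:1235
```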

| Platform              | Status    | Notes                               |
| --------------------- | --------- | ----------------------------------- |
| macOS (Apple Silicon) | Supported | Metal acceleration, best experience |
| macOS (Intel)         | Supported | CPU only, slower                    |
| Windows               | Supported | NVIDIA CUDA acceleration available  |
| Linux                 | Supported | NVIDIA CUDA, some AMD ROCm support  |

“Connection refused” — The LM Studio server is not running. Open LM Studio, go to Local Server, and click Start Server.

“Model not found” — Make sure a model is selected and loaded in the Local Server tab. The model dropdown must show an active model.

Slow responses — Use a smaller quantization (Q4 instead of Q8) or a smaller model. Check that GPU offloading is enabled in LM Studio settings if you have a compatible GPU.
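The first two failure modes above can be told apart programmatically. A diagnostic sketch using only the standard library (the message strings are my own, not LM Studio output):

```python
import json
import urllib.request

def check_server(base_url: str = "http://localhost:1234") -> str:
    """Classify the two common failures: server down vs. no model loaded."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            data = json.load(resp)
    except OSError:  # covers URLError, connection refused, timeouts
        return "connection refused: start the server from LM Studio's Local Server tab"
    if not data.get("data"):
        return "server up but no model loaded: select one in the model dropdown"
    return "ok: " + ", ".join(entry["id"] for entry in data["data"])

print(check_server())
```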