Running LLMs locally with Ollama means an open-weight model lives on your machine — no API keys, no per-token billing, no sending your data to someone else’s servers. You install Ollama, pull a model with one command, and chat from the terminal or hit a local API. For prototyping, privacy-sensitive work, or just learning how these models behave without a credit card attached, it’s the simplest on-ramp there is.
Install and run a model
Installation is straightforward: grab the Ollama binary for macOS, Linux, or Windows from the project site and run it. Then, from the command line:
ollama run llama3.2
That pulls the model (one-time download) and drops you into an interactive chat. Other useful commands:
ollama pull mistral # download a model without chatting
ollama list # show installed models
ollama rm llama3.2 # free up disk space
Models are stored on disk and loaded into RAM when you use them — which is why memory is the thing that matters most.
Hardware: what you actually need
Models come in sizes measured in parameters (and billions thereof), and size drives RAM/VRAM needs:
- Small models (a few billion params, e.g. Phi, small Llama/Mistral variants) run on modest laptops — think 8–16 GB RAM.
- Larger models want 16 GB or more, and a GPU makes them dramatically faster.
- Quantized versions trade a little quality for much smaller memory use — usually the right call on a laptop.
[needs source]
If a model feels painfully slow, drop to a smaller or more aggressively quantized variant before blaming Ollama.
The OpenAI-compatible local API
Here’s the part that makes Ollama genuinely useful for developers: it exposes a local HTTP API at http://localhost:11434, including an OpenAI-compatible endpoint. That means an app written against the OpenAI API can be pointed at your local Ollama by changing the base URL and the model name — no other code changes.
This is why Spring AI works the same against Ollama or OpenAI: in Part 2 of the Spring AI series you literally just set base-url: http://localhost:11434 and a local model name. Develop offline against Ollama, deploy against a hosted model — identical code. The project supports many open-weight models, with new ones added regularly.
When local beats the cloud (and when it doesn’t)
Run locally when:
- Privacy matters — data never leaves your machine. Great for sensitive documents or regulated work.
- You’re prototyping — no metered billing while you experiment.
- You want to learn — poke at temperature, prompts, and model differences freely.
- You need offline — no internet dependency after the download.
Stick with the cloud when:
- You need the frontier-quality answers only the biggest hosted models give.
- You need to serve many concurrent users — your laptop is not a fleet.
- You don’t want to manage hardware, memory, and model updates.
A common, pragmatic pattern: Ollama for local dev and CI, a hosted model in production — and because the API is compatible, switching is a config change.
Common gotchas
- Out-of-memory / crawling speed. The model is too big for your RAM/VRAM — use a smaller or quantized variant.
- “Model not found.” You have to
ollama pull(orrun) a model before an app can use it. - App can’t reach Ollama. The daemon must be running and listening on
localhost:11434before your app starts. - Expecting GPT-4-class output from a 3B model. Small local models are capable but not magic — calibrate expectations to model size.
FAQ
Is Ollama free?
Can I use Ollama as a drop-in for the OpenAI API?
localhost:11434, so many OpenAI clients work by changing the base URL and model name. Verify the specific features you use are supported.
What hardware do I need?
Can I use Ollama with Spring Boot?
http://localhost:11434, set a pulled model name, and the same ChatClient code runs locally or against the cloud. See the Spring AI chat post.
Key takeaway: Ollama lets you run LLMs locally with one command — free, private, and offline. Match model size to your RAM, use its OpenAI-compatible API to develop against local models and deploy to the cloud unchanged, and reach for hosted models when you need frontier quality or real concurrency.