Running LLMs locally with Ollama means an open-weight model lives on your machine — no API keys, no per-token billing, no sending your data to someone else’s servers. You install Ollama, pull a model with one command, and chat from the terminal or hit a local API. For prototyping, privacy-sensitive work, or just learning how these models behave without a credit card attached, it’s the simplest on-ramp there is.

Install and run a model

Installation is straightforward: grab the Ollama binary for macOS, Linux, or Windows from the project site and run it. Then, from the command line:

ollama run llama3.2

That pulls the model (one-time download) and drops you into an interactive chat. Other useful commands:

ollama pull mistral      # download a model without chatting
ollama list              # show installed models
ollama rm llama3.2       # free up disk space

Models are stored on disk and loaded into RAM when you use them — which is why memory is the thing that matters most.

Hardware: what you actually need

Models come in sizes measured in parameters (and billions thereof), and size drives RAM/VRAM needs:

  • Small models (a few billion params, e.g. Phi, small Llama/Mistral variants) run on modest laptops — think 8–16 GB RAM.
  • Larger models want 16 GB or more, and a GPU makes them dramatically faster.
  • Quantized versions trade a little quality for much smaller memory use — usually the right call on a laptop. [needs source]

If a model feels painfully slow, drop to a smaller or more aggressively quantized variant before blaming Ollama.

The OpenAI-compatible local API

Here’s the part that makes Ollama genuinely useful for developers: it exposes a local HTTP API at http://localhost:11434, including an OpenAI-compatible endpoint. That means an app written against the OpenAI API can be pointed at your local Ollama by changing the base URL and the model name — no other code changes.

This is why Spring AI works the same against Ollama or OpenAI: in Part 2 of the Spring AI series you literally just set base-url: http://localhost:11434 and a local model name. Develop offline against Ollama, deploy against a hosted model — identical code. The project supports many open-weight models, with new ones added regularly.

When local beats the cloud (and when it doesn’t)

Run locally when:

  • Privacy matters — data never leaves your machine. Great for sensitive documents or regulated work.
  • You’re prototyping — no metered billing while you experiment.
  • You want to learn — poke at temperature, prompts, and model differences freely.
  • You need offline — no internet dependency after the download.

Stick with the cloud when:

  • You need the frontier-quality answers only the biggest hosted models give.
  • You need to serve many concurrent users — your laptop is not a fleet.
  • You don’t want to manage hardware, memory, and model updates.

A common, pragmatic pattern: Ollama for local dev and CI, a hosted model in production — and because the API is compatible, switching is a config change.

Common gotchas

  • Out-of-memory / crawling speed. The model is too big for your RAM/VRAM — use a smaller or quantized variant.
  • “Model not found.” You have to ollama pull (or run) a model before an app can use it.
  • App can’t reach Ollama. The daemon must be running and listening on localhost:11434 before your app starts.
  • Expecting GPT-4-class output from a 3B model. Small local models are capable but not magic — calibrate expectations to model size.

FAQ

Is Ollama free?
Yes — Ollama is free and runs open-weight models locally. There are no API or token charges; your only cost is the hardware it runs on.
Can I use Ollama as a drop-in for the OpenAI API?
Largely, yes. It exposes an OpenAI-compatible endpoint at localhost:11434, so many OpenAI clients work by changing the base URL and model name. Verify the specific features you use are supported.
What hardware do I need?
Small/quantized models run on 8–16 GB RAM; larger models want 16 GB+ and benefit a lot from a GPU. Pick a model size that fits your machine.
Can I use Ollama with Spring Boot?
Yes — Spring AI has an Ollama starter. Point it at http://localhost:11434, set a pulled model name, and the same ChatClient code runs locally or against the cloud. See the Spring AI chat post.

Key takeaway: Ollama lets you run LLMs locally with one command — free, private, and offline. Match model size to your RAM, use its OpenAI-compatible API to develop against local models and deploy to the cloud unchanged, and reach for hosted models when you need frontier quality or real concurrency.