Taming the AI Beast on Your Own Laptop with Ollama

Running LLMs locally with Ollama means an open-weight model lives on your machine — no API keys, no per-token billing, no sending your data to someone else’s servers. You install Ollama, pull a model with one command, and chat from the terminal or hit a local API. For prototyping, privacy-sensitive work, or just learning how these models behave without a credit card attached, it’s the simplest on-ramp there is.

Install and run a model

Installation is straightforward: grab the Ollama binary for macOS, Linux, or Windows from the project site and run it. Then, from the command line:

ollama run llama3.2

That pulls the model (one-time download) and drops you into an interactive chat. Other useful commands:

ollama pull mistral      # download a model without chatting
ollama list              # show installed models
ollama rm llama3.2       # free up disk space

Models are stored on disk and loaded into RAM when you use them — which is why memory is the thing that matters most.

Hardware: what you actually need

Models come in sizes measured in parameters (and billions thereof), and size drives RAM/VRAM needs:

Small models (a few billion params, e.g. Phi, small Llama/Mistral variants) run on modest laptops — think 8–16 GB RAM.
Larger models want 16 GB or more, and a GPU makes them dramatically faster.
Quantized versions trade a little quality for much smaller memory use — usually the right call on a laptop. [needs source]

If a model feels painfully slow, drop to a smaller or more aggressively quantized variant before blaming Ollama.

The OpenAI-compatible local API

Here’s the part that makes Ollama genuinely useful for developers: it exposes a local HTTP API at http://localhost:11434, including an OpenAI-compatible endpoint. That means an app written against the OpenAI API can be pointed at your local Ollama by changing the base URL and the model name — no other code changes.

This is why Spring AI works the same against Ollama or OpenAI: in Part 2 of the Spring AI series you literally just set base-url: http://localhost:11434 and a local model name. Develop offline against Ollama, deploy against a hosted model — identical code. The project supports many open-weight models, with new ones added regularly.

When local beats the cloud (and when it doesn’t)

Run locally when:

Privacy matters — data never leaves your machine. Great for sensitive documents or regulated work.
You’re prototyping — no metered billing while you experiment.
You want to learn — poke at temperature, prompts, and model differences freely.
You need offline — no internet dependency after the download.

Stick with the cloud when:

You need the frontier-quality answers only the biggest hosted models give.
You need to serve many concurrent users — your laptop is not a fleet.
You don’t want to manage hardware, memory, and model updates.

A common, pragmatic pattern: Ollama for local dev and CI, a hosted model in production — and because the API is compatible, switching is a config change.

Common gotchas

Out-of-memory / crawling speed. The model is too big for your RAM/VRAM — use a smaller or quantized variant.
“Model not found.” You have to ollama pull (or run) a model before an app can use it.
App can’t reach Ollama. The daemon must be running and listening on localhost:11434 before your app starts.
Expecting GPT-4-class output from a 3B model. Small local models are capable but not magic — calibrate expectations to model size.

FAQ

Is Ollama free?

Yes — Ollama is free and runs open-weight models locally. There are no API or token charges; your only cost is the hardware it runs on.

Can I use Ollama as a drop-in for the OpenAI API?

Largely, yes. It exposes an OpenAI-compatible endpoint at localhost:11434, so many OpenAI clients work by changing the base URL and model name. Verify the specific features you use are supported.

What hardware do I need?

Small/quantized models run on 8–16 GB RAM; larger models want 16 GB+ and benefit a lot from a GPU. Pick a model size that fits your machine.

Can I use Ollama with Spring Boot?

Yes — Spring AI has an Ollama starter. Point it at http://localhost:11434, set a pulled model name, and the same ChatClient code runs locally or against the cloud. See the Spring AI chat post.

Key takeaway: Ollama lets you run LLMs locally with one command — free, private, and offline. Match model size to your RAM, use its OpenAI-compatible API to develop against local models and deploy to the cloud unchanged, and reach for hosted models when you need frontier quality or real concurrency.

Install and run a model#

Hardware: what you actually need#

The OpenAI-compatible local API#

When local beats the cloud (and when it doesn’t)#

Common gotchas#

FAQ#

Install and run a model

Hardware: what you actually need

The OpenAI-compatible local API

When local beats the cloud (and when it doesn’t)

Common gotchas

FAQ