With Spring AI on the classpath, calling a chat model comes down to three things: add the right starter, set an API key or base URL in config, and inject ChatClient. The same Java code then works whether you’re hitting OpenAI in the cloud or a local Ollama model on your laptop — you swap the dependency and the config, not the logic. This is Part 2 of the Spring AI series.

1. Dependencies

For OpenAI:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

For Ollama (local, no API key):

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

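Version heads-up: starter artifact IDs have changed across Spring AI versions. Milestone releases used the -spring-boot-starter suffix shown here, while the 1.0 line renamed the starters to spring-ai-starter-model-openai and spring-ai-starter-model-ollama. Check the exact coordinates for the version you pinned in Part 1.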

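You can even include both starters and choose at runtime, which is handy for “Ollama in dev, OpenAI in prod.” One caveat: with both on the classpath there are two chat model beans, so the single auto-configured ChatClient.Builder can no longer pick one for you. Expect to select the model explicitly, for example by building a ChatClient from the specific model bean under each profile.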

2. Configuration

OpenAI in application.yml — note the key comes from an environment variable, never hardcoded:

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini

Ollama (default base URL is http://localhost:11434):

spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2

The options block is where you set defaults such as model and temperature. Whatever you set here applies to every request the application makes, unless a specific call overrides it in code.
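To make the default-versus-override rule concrete, here is a minimal sketch in plain Java. It is not Spring AI's implementation: the OptionsLayering class, its Options record, and the 0.7 temperature are hypothetical names and values picked for illustration. It only models the rule itself, where a value set per call wins and anything left unset falls back to the application.yml default.

```java
/** Sketch of "config sets the default, code overrides it". All names here are hypothetical. */
final class OptionsLayering {

    /** Stand-in for a provider's chat options; a null field means "not set". */
    record Options(String model, Double temperature) {

        /** Returns a copy in which every non-null field of override replaces this one's value. */
        Options overriddenBy(Options override) {
            if (override == null) {
                return this;
            }
            return new Options(
                override.model() != null ? override.model() : model,
                override.temperature() != null ? override.temperature() : temperature);
        }
    }

    public static void main(String[] args) {
        Options fromYaml = new Options("gpt-4o-mini", 0.7); // 0.7 is an illustrative default
        Options perCall = new Options(null, 0.0);           // this call only changes temperature
        System.out.println(fromYaml.overriddenBy(perCall));
    }
}
```

In this sketch the per-call options leave model unset, so the merged result keeps gpt-4o-mini from config and takes the temperature from the call.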

3. Use the client

Inject ChatClient.Builder, build once, and call:

@Service
public class ChatService {

    private final ChatClient chatClient;

    public ChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String ask(String userMessage) {
        return chatClient.prompt()
            .user(userMessage)
            .call()
            .content();
    }
}

ChatClient is provider-agnostic: this exact code runs against OpenAI or Ollama. Switch by changing the starter and config — the service doesn’t know or care which model answered.
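The reason the swap is painless is ordinary dependency inversion: the service talks to an abstraction, and something outside the service decides which implementation sits behind it. The sketch below shows that shape in plain Java. ChatBackend, the echoing fake backends, and select(...) are all hypothetical stand-ins, not Spring AI types; in the real setup the starter and your config play the role of select(...).

```java
/** Sketch of a provider-agnostic service. Every type here is a hypothetical stand-in. */
final class ProviderSwapSketch {

    /** Minimal abstraction the service depends on. */
    interface ChatBackend {
        String complete(String userMessage);
    }

    /** Depends only on the interface, so it cannot tell which provider answered. */
    static final class ChatService {
        private final ChatBackend backend;

        ChatService(ChatBackend backend) {
            this.backend = backend;
        }

        String ask(String userMessage) {
            return backend.complete(userMessage);
        }
    }

    /** Picks a backend by name. The fake backends echo the message instead of calling a model. */
    static ChatBackend select(String provider) {
        switch (provider) {
            case "openai":
                return message -> "[openai] " + message;
            case "ollama":
                return message -> "[ollama] " + message;
            default:
                throw new IllegalArgumentException("Unknown provider: " + provider);
        }
    }

    public static void main(String[] args) {
        String provider = args.length > 0 ? args[0] : "ollama";
        ChatService service = new ChatService(select(provider));
        System.out.println(service.ask("ping"));
    }
}
```

Note that ChatService never changes between the two runs; only the argument to select(...) does, which mirrors swapping the starter and config.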

System prompts and parameters

Real prompts usually set a system message (the model’s “role”) and tune parameters. The fluent API handles both:

public String ask(String userMessage) {
    return chatClient.prompt()
        .system("You are a terse senior engineer. Answer in at most three sentences.")
        .user(userMessage)
        .call()
        .content();
}

A clear system prompt is the cheapest quality lever you have — it shapes tone, format, and guardrails before the user ever types anything. For tips on writing them, see prompt engineering that actually works.
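If it helps to picture what the fluent calls amount to, chat-style APIs generally receive an ordered list of role-tagged messages, with the system message first. The following is a rough plain-Java sketch of that idea, not Spring AI's internal model: PromptMessagesSketch, Message, and build(...) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of how a system prompt and a user message line up as role-tagged messages. */
final class PromptMessagesSketch {

    /** Hypothetical message type: a role ("system" or "user") plus its text. */
    record Message(String role, String content) {}

    /** Puts the system message first (when present), followed by the user message. */
    static List<Message> build(String systemText, String userText) {
        List<Message> messages = new ArrayList<>();
        if (systemText != null && !systemText.isBlank()) {
            messages.add(new Message("system", systemText));
        }
        messages.add(new Message("user", userText));
        return List.copyOf(messages);
    }

    public static void main(String[] args) {
        build("You are a terse senior engineer.", "What is a Flux?")
            .forEach(m -> System.out.println(m.role() + ": " + m.content()));
    }
}
```

The system text rides along with every request, which is why it can shape tone and format before the user's message is even considered.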

Streaming responses

For chat UIs you don’t want to wait for the whole answer. Use stream() instead of call() and you get a reactive Flux<String> that emits the answer in small chunks as the model produces them:

public Flux<String> askStreaming(String userMessage) {
    return chatClient.prompt()
        .user(userMessage)
        .stream()
        .content();
}

Return that Flux from a controller (or push it over Server-Sent Events) and the response renders token-by-token, exactly like the chat apps you’ve used.
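For the Server-Sent Events route, it can help to see what actually goes over the wire. The sketch below frames each chunk the way the SSE format expects: one "data:" line per line of text, followed by a blank line that ends the event. SseFraming and toFrame(...) are hypothetical names, and this is plain Java for illustration only; a web framework that supports text/event-stream would normally do this framing for you.

```java
import java.util.List;

/** Sketch of Server-Sent Events framing for streamed text chunks. */
final class SseFraming {

    /**
     * Formats one chunk as an SSE event. A chunk containing line breaks becomes
     * several "data:" lines, which the client joins back together with "\n".
     */
    static String toFrame(String chunk) {
        StringBuilder frame = new StringBuilder();
        for (String line : chunk.split("\\r\\n|\\r|\\n", -1)) {
            frame.append("data: ").append(line).append('\n');
        }
        return frame.append('\n').toString(); // the blank line terminates the event
    }

    public static void main(String[] args) {
        // Illustrative chunks; a real model decides where the boundaries fall.
        for (String chunk : List.of("Hel", "lo", " world")) {
            System.out.print(toFrame(chunk));
        }
    }
}
```

The client strips only one space after "data:", so a chunk that starts with a space (common for word-boundary tokens) survives the round trip.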

Structured output

Need JSON or a typed object back instead of a string? Spring AI can map the model’s response straight onto a Java record via its structured-output support: in recent versions you end the call chain with .entity(MyRecord.class) instead of .content(), so you skip manual parsing. The method name has shifted between versions, so check the API for the one you pinned, but the capability is there. Lean on it instead of regexing model output.

Common gotchas

  • Hardcoded API keys. Use ${OPENAI_API_KEY} and an env var. A key in source is a key in your git history forever.
  • Wrong/unavailable model name. gpt-4o-mini or llama3.2 must exist for your provider/account; for Ollama you must ollama pull the model first.
  • Ollama not running. The Ollama starter expects the daemon at localhost:11434 — start it before your app.
  • Blocking on stream(). The streaming API returns a Flux; consume it reactively, don’t .block() it back into a string and lose the point.
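Two of these gotchas (a missing key and a daemon that isn't running) are cheap to catch at startup instead of on the first request. Here is a minimal preflight sketch using only the JDK. StartupPreflight and its method names are hypothetical, and it assumes Ollama's model-listing endpoint, GET /api/tags, is served under the base URL; adjust it if your setup differs.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;

/** Sketch of fail-fast startup checks. Names are hypothetical. */
final class StartupPreflight {

    /** Returns the key from the given environment, or fails with a clear message if missing or blank. */
    static String requireKey(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Environment variable " + name + " is not set");
        }
        return value;
    }

    /** Builds the model-listing URI, tolerating a trailing slash in baseUrl. */
    static URI tagsUri(String baseUrl) {
        String trimmed = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return URI.create(trimmed + "/api/tags");
    }

    /** True if the daemon answers GET /api/tags with HTTP 200 within about two seconds. */
    static boolean ollamaReachable(String baseUrl) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        HttpRequest request = HttpRequest.newBuilder(tagsUri(baseUrl))
            .timeout(Duration.ofSeconds(2))
            .GET()
            .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
        } catch (IOException e) {
            return false; // connection refused, timeout, and similar failures
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Ollama reachable: " + ollamaReachable("http://localhost:11434"));
    }
}
```

In an app you would call requireKey(System.getenv(), "OPENAI_API_KEY") only under the profile that uses OpenAI, and ollamaReachable(...) only under the one that uses Ollama.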

FAQ

Can I switch from OpenAI to Ollama without changing code?
Yes, that’s the main selling point. Swap the starter dependency and the application.yml config; your ChatClient code stays identical.

How do I set temperature or max tokens?
In application.yml under the provider’s chat.options, or per call via the prompt builder’s options. Config sets the default; code overrides it for a specific request.

How do I stream the response token by token?
Use .stream().content() instead of .call().content(). It returns a Flux<String> you can return from a controller or push over SSE for a live-typing UI.

Can the model return a typed Java object?
Yes. Spring AI’s structured-output support maps a response onto a record/class so you don’t parse JSON by hand. Check your version’s exact API.

Key takeaway: Spring AI chat completions = a starter + application.yml config + an injected ChatClient. The same code targets OpenAI or Ollama; add a system prompt for quality, use stream() for live UIs, and keep keys in env vars. Next: embeddings and vector stores, the foundation for RAG.