Part 4: Spring AI Building a RAG Pipeline
A Spring AI RAG pipeline lets an LLM answer using your documents instead of just its training data. RAG (retrieval-augmented generation) is three moves: take the user’s question, find the most relevant chunks from your data (using embeddings and a vector store), and send those chunks plus the question to the model. The model answers grounded in what you gave it. Spring AI’s ChatClient and document/vector abstractions make this take surprisingly little code. This is Part 4 of the Spring AI series and it ties the previous parts together. ...
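The third move, sending the chunks plus the question, is easier to picture with an example. The plain-Java sketch below shows roughly what an augmented prompt can look like once retrieval has already returned some chunks. It does not use Spring AI. `buildGroundedPrompt`, the prompt wording, and the sample chunks are hypothetical, and the templates Spring AI uses will differ.

```java
import java.util.List;

// Sketch of RAG's final step: combine already-retrieved chunks and the user's
// question into one grounded prompt. buildGroundedPrompt is a hypothetical
// helper for illustration, not a Spring AI API.
final class GroundedPrompt {

    private GroundedPrompt() {}

    static String buildGroundedPrompt(String question, List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        // Tell the model to stay inside the supplied context.
        sb.append("Answer the question using only the context below.\n");
        sb.append("If the answer is not in the context, say you don't know.\n\n");
        sb.append("Context:\n");
        // Number each chunk so answers can point back to their source.
        for (int i = 0; i < chunks.size(); i++) {
            sb.append('[').append(i + 1).append("] ").append(chunks.get(i)).append('\n');
        }
        sb.append("\nQuestion: ").append(question).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of(
                "Refunds are accepted within 30 days of purchase.",
                "Refunds go back to the original payment method.");
        System.out.println(buildGroundedPrompt("How long do I have to request a refund?", retrieved));
    }
}
```

The model only ever sees this assembled text. Grounding comes from what retrieval put into the context section, not from any change to the model itself.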
Part 3: Spring AI Embeddings and Vector Stores
RAG rests on two primitives: embeddings (turning text into vectors) and a vector store (saving those vectors and finding the nearest ones to a query). Spring AI gives you one interface for each — EmbeddingModel and VectorStore — and you choose the implementation with a dependency and config, exactly like the chat client in Part 2. This is Part 3 of the Spring AI series, and it’s the groundwork for the RAG pipeline in Part 4. ...
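The "find the nearest ones to a query" half can be sketched as cosine similarity over an in-memory list. This is a toy and not Spring AI's implementation. The class, the hand-written 3-dimensional vectors, and the `similaritySearch(double[], int)` signature are assumptions made for illustration. Spring AI's own `VectorStore` method signatures differ, and a real `EmbeddingModel` produces much longer vectors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy in-memory vector store: keeps (text, vector) pairs and returns the k
// texts whose vectors have the highest cosine similarity to a query vector.
// The vectors are hand-made stand-ins for real embeddings.
final class ToyVectorStore {

    record Entry(String text, double[] vector) {}

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, double[] vector) {
        entries.add(new Entry(text, vector.clone()));
    }

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Avoid dividing by zero for all-zero vectors.
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Linear scan: score every entry, sort best-first, keep the top k.
    List<String> similaritySearch(double[] query, int k) {
        return entries.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(query, e.vector())).reversed())
                .limit(k)
                .map(Entry::text)
                .toList();
    }

    public static void main(String[] args) {
        ToyVectorStore store = new ToyVectorStore();
        store.add("cats purr", new double[] {0.9, 0.1, 0.0});
        store.add("dogs bark", new double[] {0.7, 0.3, 0.0});
        store.add("stocks fell", new double[] {0.0, 0.1, 0.9});
        System.out.println(store.similaritySearch(new double[] {1.0, 0.2, 0.0}, 2));
    }
}
```

Production vector stores usually rely on approximate indexes such as HNSW instead of scoring every entry. The linear scan here is only for readability.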
To RAG or to Fine-Tune? Picking the Right Tool for the AI Job
When you need an LLM to use your knowledge or behave a specific way, two approaches dominate the conversation: RAG (retrieval-augmented generation) and fine-tuning. They sound interchangeable and they’re not: they solve different problems and have very different cost, complexity, and maintenance profiles. Getting RAG vs fine-tuning right early saves you a lot of wasted GPU budget. Here’s the honest comparison.
The one-line difference
RAG changes what the model knows right now by injecting relevant documents into the prompt at query time. The model’s weights never change. Fine-tuning changes how the model behaves by updating its weights on your examples. Knowledge problem → reach for RAG. Behavior/format/style problem → consider fine-tuning. Most “the AI doesn’t know our stuff” issues are knowledge problems. ...
Part 2: Spring AI Chat Completions with OpenAI or Ollama
With Spring AI on the classpath, calling a chat model comes down to three things: add the right starter, set an API key or base URL in config, and inject ChatClient. The same Java code then works whether you’re hitting OpenAI in the cloud or a local Ollama model on your laptop — you swap the dependency and the config, not the logic. This is Part 2 of the Spring AI series. ...
Part 1: Introduction to Spring AI
Spring AI brings AI capabilities into the Spring ecosystem as a first-class citizen. Instead of hand-rolling HTTP clients for OpenAI, Anthropic, or Ollama and wiring JSON parsing, retries, and secrets yourself, you get a consistent abstraction over chat models, embeddings, and vector stores — with the usual Spring benefits: dependency injection, configuration properties, auto-configuration, and optional observability. If you already think in @Service and application.yml, Spring AI will feel immediately familiar. This is Part 1 of a four-part series that ends with a working RAG app. ...
Taming the AI Beast on Your Own Laptop with Ollama
Running LLMs locally with Ollama means an open-weight model lives on your machine: no API keys, no per-token billing, no sending your data to someone else’s servers. You install Ollama, pull a model with one command, and chat from the terminal or hit a local API. For prototyping, privacy-sensitive work, or just learning how these models behave without a credit card attached, it’s the simplest on-ramp there is.
Install and run a model
Installation is straightforward: grab the Ollama binary for macOS, Linux, or Windows from the project site and run it. Then, from the command line: ...
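The excerpt also mentions calling a local API. Here is one way to do that from Java using only the JDK's `java.net.http.HttpClient` against Ollama's `/api/generate` endpoint. It assumes Ollama is running on its default port, 11434, and that you have already pulled a model. The model name `llama3.2` is a placeholder, and the hand-rolled JSON escaping is a demo shortcut. A real app would use a JSON library, or the Spring AI Ollama starter covered in Part 2.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of calling a local Ollama server's /api/generate endpoint with the
// JDK HttpClient. Assumes Ollama listens on its default port (11434) and the
// named model has already been pulled.
final class OllamaLocalCall {

    static final URI GENERATE_URI = URI.create("http://localhost:11434/api/generate");

    // Minimal JSON string escaping for the demo; not a full JSON library.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.append('"').toString();
    }

    // "stream": false asks Ollama for one JSON object instead of a chunk stream.
    static String requestBody(String model, String prompt) {
        return "{\"model\":" + jsonString(model)
                + ",\"prompt\":" + jsonString(prompt)
                + ",\"stream\":false}";
    }

    static HttpRequest buildRequest(String model, String prompt) {
        return HttpRequest.newBuilder(GENERATE_URI)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(model, prompt)))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // "llama3.2" is a placeholder: use whichever model you pulled.
        HttpRequest request = buildRequest("llama3.2", "Why is the sky blue?");
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // Raw JSON reply; its "response" field holds the generated text.
        System.out.println(response.body());
    }
}
```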
Stop Yelling at the AI: Prompt Engineering That Actually Works
Prompt engineering is just the craft of phrasing your request so the model gives you what you actually want: the right format, tone, and level of detail. It’s less “magic words” and more “clear communication with a literal-minded intern.” These prompt engineering tips work across most modern LLMs (GPT, Claude, Llama, and friends), and none of them require yelling, emojis, or threatening the model into compliance.
Be explicit about task and format
Vague in, vague out. “Tell me about APIs” can mean anything; the model guesses, and you get a rambling essay. Spell out task, audience, length, and format: ...
WTF is an LLM? A Human-Friendly Guide to AI Brains
A large language model (LLM) is a neural network trained on enormous amounts of text to do one deceptively simple thing: predict the next token (roughly, the next word-piece) in a sequence. Do that well enough, at billions of parameters, and something surprising falls out — the model can answer questions, summarize documents, translate, and write code. Models like GPT, Claude, and Llama are all LLMs. This is the no-hype, human-friendly explanation of what they are and why they matter to anyone building software. ...
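The "predict the next token" loop can be shown with a deliberately tiny stand-in. This sketch replaces the neural network with bigram counts and replaces real subword tokens with whitespace-split words. It therefore shows only the loop (predict, append, repeat), not how an LLM actually computes its predictions. `BigramToy` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of next-token prediction. A real LLM uses a neural network
// over subword tokens; this swaps in bigram counts over whitespace-split words
// purely to show the "predict the next token, append, repeat" loop.
final class BigramToy {

    // counts.get(prev).get(next) = how often `next` followed `prev` in training text
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void train(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
        }
    }

    // Greedy choice: the most frequent follower, ties broken alphabetically.
    // Returns null if the word never appeared with a follower.
    String predictNext(String prev) {
        Map<String, Integer> followers = counts.get(prev.toLowerCase());
        if (followers == null) return null;
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }

    // Generate up to n more words by repeatedly predicting and appending.
    String generate(String start, int n) {
        StringBuilder out = new StringBuilder(start);
        String current = start;
        for (int i = 0; i < n; i++) {
            String next = predictNext(current);
            if (next == null) break;
            out.append(' ').append(next);
            current = next;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        BigramToy model = new BigramToy();
        model.train("the cat sat on the mat the cat ate");
        System.out.println(model.generate("the", 3));
    }
}
```

Always picking the single most likely next word is only one decoding strategy. LLM APIs typically sample from the predicted distribution instead, and settings such as temperature control how adventurous that sampling is.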
Dockerizing Spring Boot (Because "It Works on My Machine" Isn’t Enough)
“It works on my machine” stops being funny the moment it has to run on someone else’s. Dockerizing a Spring Boot app gives you one artifact that runs identically on your laptop, in CI, and on Kubernetes or AWS — same JDK, same dependencies, same behavior. The trick is doing it so the image is small, secure, and cached well, not a 600 MB blob that rebuilds from scratch on every code change. Here’s the approach I actually use. ...
No Ticket, No Entry: Securing Spring Boot with JWTs
Spring Boot JWT authentication is one of the most common ways to secure a REST API today. Instead of keeping a session in server memory, the client carries a signed JSON Web Token on every request, and the server verifies it. No session table, no sticky load balancing, no “which node has my session?” That is exactly why it fits SPAs, mobile apps, and microservices. Let’s wire it up properly, and talk about the parts that bite you later. ...
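To see what "signed" buys you, here is a JDK-only sketch of HS256 signing and verification. It is educational and not a substitute for Spring Security. It checks the signature but skips the claim checks (`exp`, `iss`, `aud`) and the header validation that a real decoder performs. The class and method names are hypothetical. In a Spring Boot app, let Spring Security's OAuth2 resource server support decode and validate tokens.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Educational HS256 sketch using only the JDK. It verifies the signature but
// does NOT check claims such as "exp", "iss" or "aud", and never inspects the
// header's "alg" field. Use a vetted library in production.
final class Hs256Jwt {

    private static final Base64.Encoder B64URL = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder B64URL_DEC = Base64.getUrlDecoder();
    private static final String HEADER_JSON = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";

    private static byte[] hmacSha256(byte[] key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.US_ASCII));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 failed", e);
        }
    }

    // token = base64url(header) + "." + base64url(payload) + "." + base64url(signature)
    static String sign(String payloadJson, byte[] key) {
        String signingInput = B64URL.encodeToString(HEADER_JSON.getBytes(StandardCharsets.UTF_8))
                + "." + B64URL.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        return signingInput + "." + B64URL.encodeToString(hmacSha256(key, signingInput));
    }

    // Returns the decoded payload JSON if the signature matches, otherwise null.
    static String verify(String token, byte[] key) {
        String[] parts = token.split("\\.", -1);
        if (parts.length != 3) return null;
        try {
            byte[] expected = hmacSha256(key, parts[0] + "." + parts[1]);
            byte[] actual = B64URL_DEC.decode(parts[2]);
            // Constant-time comparison avoids leaking how many bytes matched.
            if (!MessageDigest.isEqual(expected, actual)) return null;
            return new String(B64URL_DEC.decode(parts[1]), StandardCharsets.UTF_8);
        } catch (IllegalArgumentException malformedBase64) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] key = "demo-secret-at-least-32-bytes-long!!".getBytes(StandardCharsets.UTF_8);
        String token = sign("{\"sub\":\"alice\"}", key);
        System.out.println(token);
        System.out.println(verify(token, key));
    }
}
```

The signature covers both the header and the payload. Editing either one, for example changing `sub`, makes verification fail, and that is why the server can trust the claims without a session table.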
Going Async: Painless REST APIs with Spring WebFlux
Spring WebFlux is the reactive, non-blocking sibling of Spring MVC. Instead of one thread per request blocking on I/O, WebFlux runs on a small number of event-loop threads and frees the thread while it waits for the database or another service to answer. Built on Project Reactor, it’s a great fit for high-concurrency APIs and reactive data sources, but it’s not a free upgrade, and that nuance is the most important thing to get right. ...
Surviving & Thriving on Day 1 with Spring Boot 3
Getting started with Spring Boot 3 is mostly the familiar Spring experience with a few hard requirements you can’t skip. The big two: Java 17 is the new minimum, and the whole framework moved from the javax.* namespace to jakarta.*. If you’re starting fresh, you barely notice. If you’re upgrading from Boot 2, those two facts explain 90% of the migration pain. Let’s get you running, then cover what changed. ...
Spring Boot Exception Handling That Doesn't Make You Cry
Good Spring Boot exception handling means your API fails predictably. Out of the box, an unhandled exception gives the client either a wall of HTML (the Whitelabel Error Page) or a JSON blob that leaks your package names and stack trace. Both are bad: ugly for your frontend, and a small gift to attackers. The fix is to handle errors in one place with @RestControllerAdvice and return a consistent error body. Here’s the version I actually ship. ...
Your First Spring Boot REST Controller in 5 Minutes
You want to ship an API. You don’t want to hand-configure Tomcat, write a web.xml, or fight a servlet container. A Spring Boot REST controller gets you from zero to a working HTTP endpoint in about five minutes, and this guide is the copy-paste-run version, plus just enough explanation that you actually understand what you pasted.
What you need first
One dependency: spring-boot-starter-web. Add it to your pom.xml (or build.gradle) and Spring Boot pulls in Spring MVC, Jackson (for JSON), and an embedded Tomcat. That last part is the magic: there’s no external server to install or deploy to. Your app is the server. Generate a skeleton at start.spring.io if you don’t have one (more on that in Surviving Day 1 with Spring Boot 3). ...