Part 4: Spring AI Building a RAG Pipeline

A Spring AI RAG pipeline lets an LLM answer using your documents instead of just its training data. RAG — retrieval-augmented generation — is three moves: take the user’s question, find the most relevant chunks from your data (using embeddings and a vector store), and send those chunks plus the question to the model. The model answers grounded in what you gave it. Spring AI’s ChatClient and document/vector abstractions make this surprisingly little code. This is Part 4 of the Spring AI series and it ties the previous parts together. ...
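The three moves can be sketched in plain Java without any Spring AI types. This is a minimal illustration, not the code from the post: `RagSketch`, `answer`, and `buildPrompt` are hypothetical names, the retriever and model are passed in as plain functions, and the prompt wording is an assumption.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative only: these names are hypothetical and are NOT Spring AI APIs.
class RagSketch {

    // Move 1: take the question.
    // Move 2: "retriever" maps the question to relevant chunks.
    // Move 3: "model" receives the chunks plus the question as one prompt.
    static String answer(String question,
                         Function<String, List<String>> retriever,
                         Function<String, String> model) {
        List<String> chunks = retriever.apply(question);
        return model.apply(buildPrompt(question, chunks));
    }

    // Assemble the augmented prompt: retrieved context first, then the question.
    static String buildPrompt(String question, List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        sb.append("Answer using only the context below.\n\nContext:\n");
        for (String chunk : chunks) {
            sb.append("- ").append(chunk).append('\n');
        }
        sb.append("\nQuestion: ").append(question);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-ins: a fixed retriever and a "model" that just echoes its prompt.
        Function<String, List<String>> retriever =
                q -> List.of("Refunds are accepted within 30 days.");
        Function<String, String> model = prompt -> prompt;
        System.out.println(answer("What is the refund window?", retriever, model));
    }
}
```

In a real pipeline the retriever would be backed by a vector store and the model by a chat client; keeping both behind plain functions here just makes the data flow visible.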

Part 3: Spring AI Embeddings and Vector Stores

RAG rests on two primitives: embeddings (turning text into vectors) and a vector store (saving those vectors and finding the nearest ones to a query). Spring AI gives you one interface for each — EmbeddingModel and VectorStore — and you choose the implementation with a dependency and config, exactly like the chat client in Part 2. This is Part 3 of the Spring AI series, and it’s the groundwork for the RAG pipeline in Part 4. ...
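To make "finding the nearest ones" concrete, here is a toy in-memory store in plain Java. It is a sketch under stated assumptions: `ToyVectorStore` is a hypothetical class, not Spring AI's `VectorStore`, the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce, and cosine similarity is one common choice of distance rather than the only one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy store for illustration only; NOT Spring AI's VectorStore interface.
class ToyVectorStore {

    // One stored entry: the original text plus its (hand-written) embedding.
    private static final class Entry {
        final String text;
        final double[] vector;

        Entry(String text, double[] vector) {
            this.text = text;
            this.vector = vector;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, double[] vector) {
        entries.add(new Entry(text, vector));
    }

    // Cosine similarity: dot product divided by the product of the norms.
    // Assumes equal-length, non-zero vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return the texts of the topK entries most similar to the query vector.
    List<String> nearest(double[] query, int topK) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((Entry e) -> cosine(query, e.vector)).reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, sorted.size()); i++) {
            result.add(sorted.get(i).text);
        }
        return result;
    }

    public static void main(String[] args) {
        ToyVectorStore store = new ToyVectorStore();
        store.add("How to reset a password", new double[] {0.9, 0.1, 0.0});
        store.add("Quarterly revenue report", new double[] {0.0, 0.2, 0.9});
        store.add("Account lockout policy", new double[] {0.8, 0.3, 0.1});
        // A query vector close to the two account-related entries.
        System.out.println(store.nearest(new double[] {1.0, 0.2, 0.0}, 2));
    }
}
```

A production vector store does the same job with an index instead of a full sort, which is why swapping implementations behind one interface is attractive.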

To RAG or to Fine-Tune? Picking the Right Tool for the AI Job

When you need an LLM to use your knowledge or behave a specific way, two approaches dominate the conversation: RAG (retrieval-augmented generation) and fine-tuning. They sound interchangeable and they’re not: they solve different problems and have very different cost, complexity, and maintenance profiles. Getting RAG vs fine-tuning right early saves you a lot of wasted GPU budget. Here’s the honest comparison.

The one-line difference

RAG changes what the model knows right now by injecting relevant documents into the prompt at query time. The model’s weights never change. Fine-tuning changes how the model behaves by updating its weights on your examples. Knowledge problem → reach for RAG. Behavior/format/style problem → consider fine-tuning. Most “the AI doesn’t know our stuff” issues are knowledge problems. ...

Part 2: Spring AI Chat Completions with OpenAI or Ollama

With Spring AI on the classpath, calling a chat model comes down to three things: add the right starter, set an API key or base URL in config, and inject ChatClient. The same Java code then works whether you’re hitting OpenAI in the cloud or a local Ollama model on your laptop — you swap the dependency and the config, not the logic. This is Part 2 of the Spring AI series. ...

Part 1: Introduction to Spring AI

Spring AI brings AI capabilities into the Spring ecosystem as a first-class citizen. Instead of hand-rolling HTTP clients for OpenAI, Anthropic, or Ollama and wiring JSON parsing, retries, and secrets yourself, you get a consistent abstraction over chat models, embeddings, and vector stores — with the usual Spring benefits: dependency injection, configuration properties, auto-configuration, and optional observability. If you already think in @Service and application.yml, Spring AI will feel immediately familiar. This is Part 1 of a four-part series that ends with a working RAG app. ...

Stop Yelling at the AI: Prompt Engineering That Actually Works

Prompt engineering is just the craft of phrasing your request so the model gives you what you actually want: the right format, tone, and level of detail. It’s less “magic words” and more “clear communication with a literal-minded intern.” These prompt engineering tips work across most modern LLMs (GPT, Claude, Llama, and friends), and none of them require yelling, emojis, or threatening the model into compliance.

Be explicit about task and format

Vague in, vague out. “Tell me about APIs” can mean anything; the model guesses, and you get a rambling essay. Spell out task, audience, length, and format: ...

WTF is an LLM? A Human-Friendly Guide to AI Brains

A large language model (LLM) is a neural network trained on enormous amounts of text to do one deceptively simple thing: predict the next token (roughly, the next word-piece) in a sequence. Do that well enough, at billions of parameters, and something surprising falls out — the model can answer questions, summarize documents, translate, and write code. Models like GPT, Claude, and Llama are all LLMs. This is the no-hype, human-friendly explanation of what they are and why they matter to anyone building software. ...