Building Intelligent AI Applications with Spring AI

A year ago, if you wanted to integrate AI into a Java application, you were stitching together raw REST calls to OpenAI, manually parsing JSON responses, and writing boilerplate that had nothing to do with your actual business logic. Spring AI changes that completely. It brings the same opinionated, convention-over-configuration approach that made Spring Boot successful — but applied to AI-powered applications. As a result, the cognitive distance between “I have an idea for an AI feature” and “it is running in production” shrinks dramatically.

Teams have been shipping with this framework in production for several months now, and this guide walks through what it actually looks like to build intelligent features with it. Throughout, the emphasis stays on patterns that survive contact with real traffic, real costs, and real failure modes — not toy demos.

Getting Started with Spring AI

Add the dependency for your preferred model provider. The framework supports OpenAI, Anthropic, Ollama, Azure OpenAI, Amazon Bedrock, and others through a unified API. Consequently, the choice of provider becomes a configuration concern rather than an architectural one.

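<!-- Version is managed by the Spring AI BOM (spring-ai-bom) in dependencyManagement. -->
<!-- Spring AI 1.0 renamed this starter to spring-ai-starter-model-openai. -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>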

Configure your API key:

# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.7

Switching providers is straightforward. Replace the starter dependency and update the configuration; your application code stays the same thanks to the ChatClient abstraction. In practice, teams often run Ollama locally during development and a hosted provider in production, selecting the matching starter and YAML through build and Spring profiles.

Using ChatClient

The ChatClient is the core interface. Think of it as the RestTemplate of AI — a clean, fluent API for interacting with language models.

@Service
public class AssistantService {

    private final ChatClient chatClient;

    public AssistantService(ChatClient.Builder builder) {
        this.chatClient = builder
            .defaultSystem("You are a helpful technical assistant for a software company. " +
                "Answer questions concisely and accurately.")
            .build();
    }

    public String askQuestion(String userQuestion) {
        return chatClient.prompt()
            .user(userQuestion)
            .call()
            .content();
    }

    public TechRecommendation getRecommendation(String requirements) {
        return chatClient.prompt()
            .user("Analyze these requirements and recommend a tech stack: " + requirements)
            .call()
            .entity(TechRecommendation.class);  // Automatic deserialization
    }
}

The .entity() method is particularly powerful — the framework handles the prompt engineering needed to coax structured JSON out of the model and deserializes it directly into your Java object. No manual parsing required. Moreover, when the model returns malformed output, you get a clear deserialization exception rather than a silent null.
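The article does not show TechRecommendation itself. As a hypothetical sketch, the target type could be a record whose compact constructor rejects incomplete output; the field names below are assumptions, and the validation only runs if the converter builds the object through the canonical constructor (as Jackson does for records).

```java
import java.util.List;

// Hypothetical shape for the structured answer; field names are illustrative only.
record TechRecommendation(String language, String framework, List<String> reasons) {

    // The compact constructor runs on every instantiation, so incomplete model
    // output is rejected instead of flowing into business logic.
    TechRecommendation {
        if (language == null || language.isBlank()) {
            throw new IllegalArgumentException("language is required");
        }
        if (framework == null || framework.isBlank()) {
            throw new IllegalArgumentException("framework is required");
        }
        // Treat a missing list as empty and make the stored list unmodifiable.
        reasons = reasons == null ? List.of() : List.copyOf(reasons);
    }
}
```

Validating at construction time keeps the "never trust model responses blindly" rule close to the data it protects.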

Prompt Templates

Hardcoding prompts in Java strings is messy. The framework supports prompt templates with variable substitution, so prompts live as version-controlled resources:

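import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

@Service
public class CodeReviewService {

    private final ChatClient chatClient;

    // Template file containing {code} and {language} placeholders
    @Value("classpath:prompts/code-review.st")
    private Resource codeReviewPrompt;

    public CodeReviewService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public CodeReviewResult reviewCode(String code, String language) {
        return chatClient.prompt()
            .user(u -> u
                .text(codeReviewPrompt)
                .param("code", code)
                .param("language", language))
            .call()
            .entity(CodeReviewResult.class);
    }
}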

Keeping prompts in resource files makes them testable and easy to iterate on without recompiling. Furthermore, because prompts become first-class artifacts, you can diff them in code review and roll them back like any other change.
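To make the substitution step concrete, here is a minimal plain-Java sketch of what filling {name} placeholders involves. It is an illustration only, not the framework's template engine; SimplePromptTemplate and its render method are hypothetical names.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for a template renderer; not Spring AI's implementation.
final class SimplePromptTemplate {

    // Matches placeholders such as {code} or {language}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    static String render(String template, Map<String, String> params) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = params.get(name);
            if (value == null) {
                // Failing loudly beats sending a half-filled prompt to the model.
                throw new IllegalArgumentException("Missing template variable: " + name);
            }
            // quoteReplacement keeps '$' and '\' in user-supplied code literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Because substituted values are never re-scanned, braces inside the submitted code are left alone in this sketch. Literal braces in the template text itself are a different matter and usually need escaping in a real template engine.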

RAG: Retrieval-Augmented Generation

This is where the framework gets genuinely interesting for enterprise applications. RAG lets you ground AI responses in your own data — product documentation, internal wikis, customer records — instead of relying solely on the model’s training data. The pattern has three stages: ingest documents into a vector store, retrieve relevant chunks based on the user’s query, then generate a response using those chunks as context.

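import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void ingestDocuments(List<Resource> documents) {
        // Arguments: chunk size (tokens), min chunk size (chars),
        // min chunk length to embed, max chunks per document, keep separators
        TokenTextSplitter splitter = new TokenTextSplitter(800, 350, 5, 10000, true);
        for (Resource doc : documents) {
            List<Document> parsed = new TikaDocumentReader(doc).get();
            List<Document> splitDocs = splitter.apply(parsed);
            vectorStore.add(splitDocs);
        }
    }
}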

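The setting that matters most here is chunk size. Chunks that are too large dilute relevance; chunks that are too small lose context. In the constructor above, 800 is the target chunk size in tokens and 350 is the minimum chunk size in characters, not an overlap; this constructor has no overlap parameter. A common starting point is 800-token chunks, adding overlap between neighbouring chunks only if answers are being cut off at chunk boundaries, then tuning based on retrieval quality. For deeper treatment of chunking trade-offs, see the companion piece on advanced RAG chunking strategies.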
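To see how chunk size and overlap interact, the following plain-Java sketch splits text with a sliding window over whitespace-separated words. It is a simplified illustration under that assumption, not the TokenTextSplitter algorithm, and SlidingWindowChunker is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sliding-window chunker; words stand in for model tokens.
final class SlidingWindowChunker {

    // Returns chunks of at most chunkSize words; consecutive chunks share `overlap` words.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        List<String> words = Arrays.asList(trimmed.split("\\s+"));
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + chunkSize, words.size());
            chunks.add(String.join(" ", words.subList(start, end)));
            if (end == words.size()) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }
}
```

With a chunk size of 4 and an overlap of 2, the text "a b c d e f g" becomes "a b c d", "c d e f", and "e f g". Larger overlap preserves more context across boundaries at the cost of storing and embedding more duplicated text.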

Vector Store with PgVector

If you are already running PostgreSQL, PgVector is the easiest path to a production vector store. No new infrastructure to manage, and your embeddings live alongside your transactional data with the same backup and replication story.

spring:
  ai:
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
  datasource:
    url: jdbc:postgresql://localhost:5432/myapp

The HNSW index type trades a slower build for fast, high-recall queries, which is usually the right call for read-heavy retrieval. The dimensions value must match your embedding model exactly — 1536 for OpenAI’s text-embedding-3-small, but 768 or 1024 for many open-source models. A dimension mismatch is one of the most common first-day errors, so pin it explicitly.
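One way to catch a mismatch early is to embed a short probe string at startup and compare the vector length with the configured value. The sketch below assumes the embedding is available as a float[]; EmbeddingDimensionCheck is a hypothetical helper, not part of the framework.

```java
// Hypothetical fail-fast guard; call it once at startup with a probe embedding.
final class EmbeddingDimensionCheck {

    static void requireDimensions(float[] embedding, int expected) {
        if (embedding.length != expected) {
            throw new IllegalStateException(
                "Embedding has " + embedding.length
                    + " dimensions but the vector store is configured for " + expected);
        }
    }
}
```

Failing at startup is cheaper than discovering the mismatch on the first insert into the vector table.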

Building an AI-Powered Q&A Service

Here is a complete Q&A service that combines RAG with the chat abstraction. This is the canonical pattern for internal knowledge-base queries.

@Service
public class QnAService {

    private final ChatClient chatClient;

    public QnAService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultSystem("Answer questions based on the provided context. " +
                "If the context doesn't contain enough information, say so clearly.")
            .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore,
                SearchRequest.defaults().withTopK(5).withSimilarityThreshold(0.7)))
            .build();
    }

    public Answer askAboutDocs(String question) {
        String response = chatClient.prompt()
            .user(question)
            .call()
            .content();
        return new Answer(question, response);
    }
}

The QuestionAnswerAdvisor automatically retrieves relevant documents and injects them into the prompt context. The similarityThreshold of 0.7 is a guardrail: chunks below that score are discarded, which keeps weakly related text from polluting the answer. If your assistant starts hallucinating, raising the threshold and lowering topK is usually the first lever to pull.
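The effect of topK and the similarity threshold can be sketched in plain Java. The example below scores in-memory vectors with cosine similarity, drops anything under the threshold, and keeps the best topK. It illustrates the filtering logic only; a real vector store does this inside the database, and RetrievalFilter is a hypothetical name.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative in-memory version of "threshold, then top K" retrieval.
final class RetrievalFilter {

    record Scored(String id, double score) {}

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction, so treat it as unrelated
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static List<Scored> retrieve(double[] query, Map<String, double[]> chunks,
                                 int topK, double threshold) {
        return chunks.entrySet().stream()
            .map(e -> new Scored(e.getKey(), cosine(query, e.getValue())))
            .filter(s -> s.score() >= threshold)   // discard weakly related chunks
            .sorted(Comparator.comparingDouble(Scored::score).reversed())
            .limit(topK)                           // keep only the best matches
            .toList();
    }
}
```

Raising the threshold removes marginal chunks entirely, while lowering topK caps how much context reaches the prompt even when many chunks score well.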

Function Calling / Tool Use

Function calling lets the model invoke your Java methods when it needs real-time data or to perform an action. This is the line between a text generator and an actual agent.

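import java.util.function.Function;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Description;
import org.springframework.stereotype.Service;

@Service
public class WeatherAiService {

    private final ChatClient chatClient;

    public WeatherAiService(ChatClient.Builder builder) {
        this.chatClient = builder
            // Each name must match a Function bean; weatherForecast is declared the same way (not shown)
            .defaultFunctions("currentWeather", "weatherForecast")
            .build();
    }

    public String chat(String userMessage) {
        return chatClient.prompt().user(userMessage).call().content();
    }
}

// Tool beans are declared in a configuration class, separate from the service that uses them
@Configuration
class WeatherTools {

    // WeatherApiClient (assumed type name) is your own wrapper around a weather API
    @Bean
    @Description("Get current weather for a given city")
    public Function<WeatherRequest, WeatherResponse> currentWeather(WeatherApiClient weatherApiClient) {
        return request -> weatherApiClient.getCurrentWeather(request.city());
    }
}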

The model decides when to call your functions based on the user’s query. You declare the tools, describe what they do clearly in the @Description, and the framework handles the orchestration. The description quality matters enormously — vague descriptions lead the model to call the wrong tool or skip it entirely.
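As a rough mental model of that orchestration, the sketch below keeps a registry of named tools, exposes their descriptions (what the model would see when choosing), and dispatches a requested call by name. It is a deliberately simplified illustration with string arguments, not the framework's internals; ToolRegistry is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified illustration of tool registration and dispatch.
final class ToolRegistry {

    record Tool(String description, Function<String, String> handler) {}

    private final Map<String, Tool> tools = new LinkedHashMap<>();

    ToolRegistry register(String name, String description, Function<String, String> handler) {
        tools.put(name, new Tool(description, handler));
        return this;
    }

    // The text a model would be given to decide which tool fits the request.
    String describe() {
        StringBuilder sb = new StringBuilder();
        tools.forEach((name, tool) ->
            sb.append(name).append(": ").append(tool.description()).append('\n'));
        return sb.toString();
    }

    // Invoked when the model's reply names a tool and supplies an argument.
    String invoke(String name, String argument) {
        Tool tool = tools.get(name);
        if (tool == null) {
            throw new IllegalArgumentException("Unknown tool requested by model: " + name);
        }
        return tool.handler().apply(argument);
    }
}
```

The description string is the only thing the model has to go on when choosing a tool, which is why vague wording leads to wrong or skipped calls.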

Error Handling and Rate Limiting

AI APIs fail. They rate-limit you, they time out, and occasionally they return content-policy refusals. Production code must distinguish retryable from non-retryable failures. The framework helps by splitting exceptions into TransientAiException (retry with backoff) and NonTransientAiException (fail fast, log, surface to the caller). Wrapping the client in a Resilience4j rate limiter and an exponential-backoff retry loop covers the vast majority of real incidents.
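The retry policy can be sketched without any library. In the example below, TransientFailure and PermanentFailure are local stand-ins for the framework's TransientAiException and NonTransientAiException, and BackoffRetry is a hypothetical helper; a production setup would more likely configure Resilience4j or Spring Retry than hand-roll the loop.

```java
import java.util.function.Supplier;

// Minimal exponential-backoff retry; the exception types are local stand-ins.
final class BackoffRetry {

    static class TransientFailure extends RuntimeException {
        TransientFailure(String message) { super(message); }
    }

    static class PermanentFailure extends RuntimeException {
        PermanentFailure(String message) { super(message); }
    }

    static <T> T call(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (TransientFailure e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted, surface the last failure
                }
                sleep(delay);
                delay *= 2; // double the wait before the next attempt
            }
            // PermanentFailure is not caught here, so it propagates immediately.
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted during backoff", e);
        }
    }
}
```

Adding random jitter to the delay is a common refinement so that many clients do not retry in lockstep after a shared rate-limit event.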

Production Considerations and Trade-offs

Running AI features in production is not just about making API calls. The honest picture includes several recurring concerns.

Cost control is critical. A single frontier-model call can cost a few cents, and that per-call cost multiplied across thousands of users becomes a real budget line. Cache aggressively: identical queries should hit a cache, not the API, and near-identical ones can too if the cache key is normalized first.

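// Key on the prompt text itself: hashCode() values can collide,
// which would return one prompt's cached answer for a different prompt.
@Cacheable(value = "ai-responses", key = "#prompt")
public String getCachedResponse(String prompt) {
    return chatClient.prompt().user(prompt).call().content();
}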
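An exact-match cache only helps when two prompts are byte-for-byte identical. One possible way to catch trivially different prompts is to normalize the text before deriving the key, as in the sketch below. PromptCacheKey is a hypothetical helper, and lowercasing is an assumption that suits natural-language questions but may be wrong for case-sensitive input such as code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper that turns a prompt into a stable, fixed-length cache key.
final class PromptCacheKey {

    static String of(String prompt) {
        // Normalize: trim, collapse runs of whitespace, ignore case.
        String normalized = prompt.strip().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

A SHA-256 digest keeps keys short regardless of prompt length, and unlike a 32-bit hashCode its collisions are not a practical concern.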

Latency varies wildly, from a few hundred milliseconds to tens of seconds depending on model, prompt length, and provider load. Stream responses to users when possible, run calls on virtual threads, and set aggressive timeouts so a slow upstream never ties up a request thread indefinitely.
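A time-boxed call can be sketched with CompletableFuture from the standard library. The example assumes a blocking model call wrapped in a Supplier and a caller-chosen fallback string; TimeBoxedCall is a hypothetical name. On Java 21 you could pass Executors.newVirtualThreadPerTaskExecutor() as the second argument to supplyAsync to run the call on a virtual thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of bounding how long a caller waits for a model response.
final class TimeBoxedCall {

    static String callWithTimeout(Supplier<String> modelCall, long timeoutMillis, String fallback) {
        return CompletableFuture.supplyAsync(modelCall)
            // If the call has not finished in time, complete with the fallback instead.
            .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
            // join() returns the result, or throws CompletionException if the call failed.
            .join();
    }
}
```

Note that the timeout only releases the caller; the underlying task keeps running until it finishes, so the HTTP client should carry its own connect and read timeouts as well.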

When NOT to reach for this framework. If your AI feature is a single, stable call to one provider and you never plan to swap models, add RAG, or use tool calling, the abstraction layer earns little, and a thin HTTP client may be simpler. Likewise, teams with a heavy existing Python ML platform and mature LangChain pipelines should not rewrite working infrastructure just to keep everything in the JVM. The framework shines when AI is a recurring, evolving concern across a Java codebase, not a one-off curiosity.

Concern	Strategy
Cost	Cache responses, route simple tasks to cheaper models, set budget alerts
Latency	Stream responses, use async on virtual threads, set timeouts
Quality	Version prompts, A/B test, log and review outputs
Security	Sanitize inputs, validate outputs, never trust model responses blindly
Reliability	Retry with backoff, circuit breakers, fallback responses

Final Thoughts

The framework makes the Java ecosystem a first-class citizen in the AI application space. You do not need to switch to Python to build intelligent features. The same dependency injection, the same testing patterns, and the same deployment pipelines you already know all apply directly. What excites most teams is not the technology itself but what it enables: RAG-powered knowledge bases, agents with tool use, and intelligent automation without leaving the JVM.

Key Takeaways

Code against ChatClient so the choice of provider stays a configuration concern
Keep prompts in versioned resource files and review them like any other change
Match the vector store dimensions to your embedding model, then tune chunk size, topK, and the similarity threshold against real queries
Retry transient failures with backoff, fail fast on non-transient ones, and set timeouts on every call
Cache responses and track cost and latency from the first deployment

For further reading, refer to the Spring Boot documentation and the Oracle Java documentation for comprehensive reference material. You may also find the discussion of vector databases with pgvector useful when sizing your retrieval layer.

Taken together, the patterns in this guide (ChatClient, prompt templates, RAG with PgVector, tool calling, and disciplined error handling) give a Java team a practical path from a first AI feature to one that holds up in production. Start small, measure retrieval quality, cost, and latency, and iterate from there.