OpenShift AI with vLLM and Spring AI

This article will teach you how to use OpenShift AI and vLLM to serve models consumed by a Spring AI application. To run the model on OpenShift AI, we will use a solution called KServe ModelCar, which can serve models directly from a container without using an S3 bucket. KServe is a standard, cloud-agnostic Model Inference Platform designed to serve predictive and generative AI models on Kubernetes. OpenShift AI includes a single-model serving platform based on the KServe component. We can serve models on the single-model serving platform using model-serving runtimes. OpenShift AI includes several preinstalled runtimes; however, only the vLLM runtime is compatible with the OpenAI REST API, so that is the one we will use.
Previously, I published several articles about Spring AI with examples of using different AI models, so I will not focus on an introduction to Spring AI here. For example, you can read about the integration between Spring AI and Azure AI in the following post, or refer to the following article for a quick intro to the Spring AI project.
Source Code
Feel free to use my source code if you’d like to try it out yourself. To do that, you must clone my sample GitHub repository. Then, just follow my instructions.
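For instance, assuming the sample repository is the spring-ai-showcase project referenced later in this article (the GitHub account name below is a placeholder; use the link above for the exact URL), cloning it looks like this:
$ git clone https://github.com/<github-account>/spring-ai-showcase.git
$ cd spring-ai-showcase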
Prerequisites
Create the OpenShift Cluster
For this exercise, you will need a relatively large OpenShift cluster. At least one of the cluster’s nodes must have a GPU. I created a cluster on AWS with one node on a g4dn.12xlarge machine. On OpenShift, you can achieve this by creating the MachineSet object that creates nodes using the appropriate virtual machine available on AWS.
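For reference, here is a minimal sketch of such a MachineSet with the GPU instance type. The cluster ID, zone, and provider-specific values are placeholders and should be copied from an existing worker MachineSet in your cluster:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <cluster-id>-gpu-us-east-2a
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <cluster-id>
      machine.openshift.io/cluster-api-machineset: <cluster-id>-gpu-us-east-2a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <cluster-id>
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: <cluster-id>-gpu-us-east-2a
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          # GPU-enabled AWS instance type used in this article
          instanceType: g4dn.12xlarge
          # ami, subnet, securityGroups, iamInstanceProfile, placement, etc.
          # should be copied from an existing worker MachineSet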

Install Required Operators
Next, install and configure several operators on the cluster. Begin with the “Node Feature Discovery” operator. On OpenShift, this operator enables automatic discovery of cluster nodes with features such as GPUs. After installing the operator, create the NodeFeatureDiscovery object. The default values set by the OpenShift console during object creation are sufficient.
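If you prefer the CLI over the console, a minimal sketch of that object could look as follows (the operand image is normally pre-filled by the console with a version-matched value; the one below is illustrative):
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    # the console sets the exact, version-matched operand image here
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:latest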

The operator’s task is to mark the node with the detected GPU using the appropriate label. The label is feature.node.kubernetes.io/pci-10de.present=true. After configuring the operator, verify that the node with the GPU has been detected correctly.
$ oc get node -l feature.node.kubernetes.io/pci-10de.present=true
NAME STATUS ROLES AGE VERSION
ip-10-0-45-120.us-east-2.compute.internal Ready worker 15d v1.31.6
Next, install the NVIDIA GPU Operator. This operator automatically installs, configures, and manages NVIDIA drivers and tools on nodes with NVIDIA graphics cards, which allows OpenShift to recognize the GPU as a resource that can be declared in pods. It works together with the “Node Feature Discovery” operator: the NVIDIA GPU Operator uses the feature.node.kubernetes.io/pci-10de.present=true label to determine where to install the drivers. For this to happen, the ClusterPolicy object must be created. As before, you can use the default values generated by the OpenShift console when creating this object.
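For reference, a minimal ClusterPolicy sketch with the name the console proposes could look like this (the console pre-fills driver, toolkit, device-plugin, and monitoring settings; only a small, illustrative fragment is shown here):
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  # the defaults generated by the console are sufficient for this exercise;
  # the fields below are only a subset of the full spec
  driver:
    enabled: true
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true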

The OpenShift AI feature for serving AI models requires the installation of OpenShift Serverless and OpenShift Service Mesh operators. The key solution here is KServe. KServe uses Knative to scale models on demand and integrates with Istio to secure model routing and versioning.

The final step in this phase is to install the OpenShift AI Operator and create the DataScienceCluster object. If the previous installations were successful, everything will be configured automatically after creating the DataScienceCluster object. For instance, OpenShift AI will create the Istio control plane and the Knative Serving component.

OpenShift AI creates several namespaces within a cluster. The most important is the redhat-ods-applications namespace, where most of the components comprising the entire solution run.
$ oc get pod -n redhat-ods-applications
NAME READY STATUS RESTARTS AGE
authorino-767bd64465-fq8bl 1/1 Running 0 15d
codeflare-operator-manager-5c69778b87-wxcwp 1/1 Running 0 15d
data-science-pipelines-operator-controller-manager-6686587wcmkr 1/1 Running 0 15d
etcd-549d769449-hqzwt 1/1 Running 0 15d
kserve-controller-manager-85f9b8d66f-qpxbf 1/1 Running 0 15d
kuberay-operator-8d77dcf84-qgsq5 1/1 Running 0 15d
kueue-controller-manager-7c895bd669-467nk 1/1 Running 0 6h8m
modelmesh-controller-7f9dd5f848-ljlxp 1/1 Running 0 15d
modelmesh-controller-7f9dd5f848-qqsl8 1/1 Running 0 24d
modelmesh-controller-7f9dd5f848-txlhd 1/1 Running 0 24d
notebook-controller-deployment-86f5b87585-p6nz5 1/1 Running 0 15d
odh-model-controller-574ff4657-q75gr 1/1 Running 0 15d
odh-notebook-controller-manager-9d754d5f-2ptk9 1/1 Running 0 15d
rhods-dashboard-5b96595667-79tx6 2/2 Running 0 15d
rhods-dashboard-5b96595667-8m52g 2/2 Running 0 15d
rhods-dashboard-5b96595667-kx7p4 2/2 Running 0 15d
rhods-dashboard-5b96595667-nn2cf 2/2 Running 0 15d
rhods-dashboard-5b96595667-ttcht 2/2 Running 0 15d
trustyai-service-operator-controller-manager-bd9fbdb6d-kcd57 1/1 Running 0 15d
Configure and Use OpenShift AI
After installing OpenShift AI on a cluster, you can use its graphical UI. To access it, select “Red Hat OpenShift AI” from the menu at the top of the page.

After selecting the indicated option, you will be redirected to the following page. This page allows you to configure and use OpenShift AI on a cluster. The first step is to select a namespace on the cluster for the AI project. In my case, the namespace is ai.

To run an AI model on a cluster, you must first choose how to serve it. You can choose between a single-model serving platform and a multi-model serving platform. With the former, each model is deployed on its own model server. With the latter, multiple models can be deployed on a single shared server. This article uses the first option: a single-model serving platform.

The next step is to create an accelerator profile. This profile should be created automatically after installing and configuring the NVIDIA GPU Operator. If, for some reason, it was not, you can easily create it manually. When creating this object, enter the nvidia.com/gpu value in the identifier field. You can either create the profile in the UI or apply the following YAML manifest.
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia
  namespace: redhat-ods-applications
spec:
  displayName: nvidia
  enabled: true
  identifier: nvidia.com/gpu
Serve Model on OpenShift AI with vLLM
Create ServingRuntime CRD
In the previous step, we configured OpenShift AI to deploy the model with a single-model serving platform and a GPU accelerator. We will use KServe’s ModelCar functionality to deploy the model, which allows us to serve models directly from a container. This functionality is described in an article published on the Red Hat Developer blog. The article demonstrates how to build an image containing a model downloaded from the Hugging Face Hub. In turn, we will use images that have already been built and are available in the quay.io/repository/redhat-ai-services/modelcar-catalog repository. There you can find ready-made images for AI models such as Granite and Llama.
To run a model on OpenShift AI in single-model serving mode, you must define two CRD objects: ServingRuntime and InferenceService. According to the OpenShift AI documentation, the ServingRuntime CRD creates a serving runtime, an environment for deploying and managing a model. Here’s the ServingRuntime object that creates a runtime for the Llama 3.2 AI model. The opendatahub.io/recommended-accelerators annotation sets the name of the recommended accelerator to use with the runtime. Its value should be identical to the identifier field in the AcceleratorProfile object (1). The openshift.io/display-name annotation holds the name under which the serving runtime is displayed (2). The spec.containers.image field indicates the runtime container image used by the serving runtime (3). This image differs depending on the type of accelerator used. Finally, the ServingRuntime object specifies that single-model serving is used (4) and that the vLLM model format is supported by the runtime (5).
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' # (1)
    openshift.io/display-name: vLLM ServingRuntime for KServe # (2)
  labels:
    opendatahub.io/dashboard: "true"
  name: llama-32-3b-instruct
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models
        - --served-model-name={{.Name}}
      command:
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      # (3)
      image: quay.io/modh/vllm@sha256:0d55419f3d168fd80868a36ac89815dded9e063937a8409b7edf3529771383f3
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false # (4)
  supportedModelFormats: # (5)
    - autoSelect: true
      name: vLLM
Create InferenceService CRD
The InferenceService CRD creates a server or inference service that processes inference queries, passes them to the model, and returns the inference output. Here’s the InferenceService object related to the previously created llama-32-3b-instruct runtime (1). It must define some vLLM parameters to successfully run the model on the existing infrastructure and enable tool calling support for the Llama 3.2 model (2). The InferenceService object specifies the image containing the Llama 3.2 model, published in the quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct repository (3). Alternatively, you can build your own image, publish it in a custom registry, and run it on OpenShift using the InferenceService CRD.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: llama-32-3b-instruct
    serving.knative.openshift.io/enablePassthrough: 'true'
    serving.kserve.io/deploymentMode: Serverless
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: llama-32-3b-instruct # (1)
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model: # (2)
      args:
        - '--dtype=half'
        - '--max_model_len=8192'
        - '--gpu_memory_utilization=.95'
        - '--enable-auto-tool-choice'
        - '--tool_call_parser=llama3_json'
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '8'
          memory: 10Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '4'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: llama-32-3b-instruct
      storageUri: 'oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct' # (3)
Deploy with OpenShift AI
You can also create the same configuration using the OpenShift AI UI. The diagram below shows the settings you need for Granite 3.2.

The OpenShift AI UI lists all the models running in a given AI project. You can check the endpoint where a particular model is available. In this case, two models are running in the AI project: Llama 3.2 and Granite 3.2. Both models are available internally on the cluster and externally via the Knative Route object.
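Since the vLLM runtime exposes the OpenAI-compatible REST API, you can quickly verify that a served model responds by calling its endpoint. The hostname below is a placeholder for the external address shown in the OpenShift AI UI:
# list the models served by the vLLM runtime
$ curl -s https://<model-endpoint>/v1/models
# send a simple chat completion request
$ curl -s https://<model-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-32-3b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'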

Both models are automatically scheduled on the node with the GPU. You can check the GPU resource reservations on that node using the oc describe command:
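A minimal example of such a check, using the GPU node name from the earlier output:
# show the allocated resources section, which includes nvidia.com/gpu reservations
$ oc describe node ip-10-0-45-120.us-east-2.compute.internal | grep -A 10 'Allocated resources'
# or filter only the GPU-related lines
$ oc describe node ip-10-0-45-120.us-east-2.compute.internal | grep nvidia.com/gpu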

A single-model serving platform runs AI models as Knative Services. You can use the oc get ksvc command to display the list of Knative services running in the ai namespace.
$ oc get ksvc -n ai
NAME URL LATESTCREATED LATESTREADY READY REASON
granite-32-2b-instruct-predictor https://granite-32-2b-instruct-predictor-ai.apps.piomin.ewyw.p1.openshiftapps.com granite-32-2b-instruct-predictor-00007 granite-32-2b-instruct-predictor-00007 True
llama-32-3b-instruct-predictor https://llama-32-3b-instruct-predictor-ai.apps.piomin.ewyw.p1.openshiftapps.com llama-32-3b-instruct-predictor-00002 llama-32-3b-instruct-predictor-00002 True
Integrate Spring AI with vLLM
Dependencies and Properties
The vLLM runtime is compatible with the OpenAI REST API. To integrate our sample Spring Boot application with a model running on vLLM, we must use the standard Spring AI OpenAI starter. The app in the spring-ai-showcase repository has more functionality than what is tested in this article. Below is the list of dependencies needed for the app to communicate with the OpenAI API and the model running on OpenShift AI.
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-autoconfigure-model-openai</artifactId>
</dependency>
Although the model served on OpenShift AI does not require authorization with an API key, the spring.ai.openai.api-key Spring AI parameter must be set. The endpoint address exposed by the vLLM runtime must be specified in the spring.ai.openai.chat.base-url parameter. The default name of the model must also be overridden with the name under which the model runs on OpenShift AI; for Llama 3.2, this name is llama-32-3b-instruct. Below is the list of all the Spring Boot settings required for vLLM integration, available in the application-vllm.properties file.
spring.ai.openai.api-key = ${OPENAI_API_KEY:dummy}
spring.ai.openai.chat.base-url = https://llama-32-3b-instruct-ai.apps.piomin.ewyw.p1.openshiftapps.com
spring.ai.openai.chat.options.model = llama-32-3b-instruct
Implementation with Spring AI
The code below demonstrates how a @RestController implements communication between the application and the target AI model. The @RestController class injects an auto-configured ChatClient.Builder to create an instance of ChatClient. The PersonController class implements a method that returns a list of persons from the GET /persons endpoint. The main goal is to generate a list of 10 objects with the fields defined in the Person class. The id field should be auto-incremented. The PromptTemplate object defines a message that will be sent to the chat model API. It doesn’t have to specify the exact fields that should be returned; this part is handled automatically by the Spring AI library after we invoke the entity() method on the ChatClient instance. The ParameterizedTypeReference argument passed to the entity() method tells Spring AI to generate a list of objects.
@RestController
@RequestMapping("/persons")
public class PersonController {

    private final ChatClient chatClient;

    public PersonController(ChatClient.Builder chatClientBuilder,
                            ChatMemory chatMemory) {
        this.chatClient = chatClientBuilder
                .defaultAdvisors(
                        new PromptChatMemoryAdvisor(chatMemory),
                        new SimpleLoggerAdvisor())
                .build();
    }

    @GetMapping
    List<Person> findAll() {
        PromptTemplate pt = new PromptTemplate("""
                Return a current list of 10 persons if exists or generate a new list with random values.
                Each object should contain an auto-incremented id field.
                The age value should be a random number between 18 and 99.
                Do not include any explanations or additional text.
                Return data in RFC8259 compliant JSON format.
                """);
        return this.chatClient.prompt(pt.create())
                .call()
                .entity(new ParameterizedTypeReference<>() {});
    }

    @GetMapping("/{id}")
    Person findById(@PathVariable String id) {
        PromptTemplate pt = new PromptTemplate("""
                Find and return the object with id {id} in a current list of persons.
                """);
        Prompt p = pt.create(Map.of("id", id));
        return this.chatClient.prompt(p)
                .call()
                .entity(Person.class);
    }
}
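For completeness, the Person class referenced above is a simple data holder defined in the sample repository. A minimal sketch of what it might look like is below; the field names other than id and age are assumptions:
// hypothetical sketch of the Person data class; the version in the repository may contain different fields
public record Person(Long id, String firstName, String lastName, int age) {
}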
The llama-32-3b-instruct model uses a “tool-calling” approach for API calls. You can read more about it in one of my Spring AI articles, available at this link. For instance, the class below uses the @Tool annotation to connect to the database and search it for the number of shares held in individual companies. The key to using this tool is its description in the description field, which is then interpreted by the LLM.
public class WalletTools {

    private final WalletRepository walletRepository;

    public WalletTools(WalletRepository walletRepository) {
        this.walletRepository = walletRepository;
    }

    @Tool(description = "Number of shares for each company in my wallet")
    public List<Share> getNumberOfShares() {
        return (List<Share>) walletRepository.findAll();
    }
}
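The Share entity and WalletRepository interface used above come from the sample repository. A hypothetical sketch of what they might look like, assuming a Spring Data store that supports Java records, is shown below; the actual classes may differ:
import org.springframework.data.annotation.Id;
import org.springframework.data.repository.CrudRepository;

// hypothetical data model behind WalletTools
public record Share(@Id String company, int quantity) {
}

// Spring Data repository returning all shares stored in the wallet
interface WalletRepository extends CrudRepository<Share, String> {
}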
Then, the tools are passed to the chat client when it interacts with the AI model. The AI model can call a tool’s method as required, based on the tool’s description and the content of the input prompt.
@RestController
@RequestMapping("/wallet")
public class WalletController {

    private final ChatClient chatClient;
    private final StockTools stockTools;
    private final WalletTools walletTools;

    public WalletController(ChatClient.Builder chatClientBuilder,
                            StockTools stockTools,
                            WalletTools walletTools) {
        this.chatClient = chatClientBuilder
                .defaultAdvisors(new SimpleLoggerAdvisor())
                .build();
        this.stockTools = stockTools;
        this.walletTools = walletTools;
    }

    @GetMapping("/with-tools")
    String calculateWalletValueWithTools() {
        PromptTemplate pt = new PromptTemplate("""
                What’s the current value in dollars of my wallet based on the latest stock daily prices ?
                """);
        return this.chatClient.prompt(pt.create())
                .tools(stockTools, walletTools)
                .call()
                .content();
    }
}
Run Spring Boot Application
Activate the vllm profile when launching the Spring Boot application. This will cause the application to read the settings entered in the application-vllm.properties file.
mvn spring-boot:run -Dspring-boot.run.profiles=vllm
Once the application runs, you can call the three endpoints implemented in the previously discussed code snippets. These endpoints are:
- GET /persons
- GET /persons/{id}
- GET /wallet/with-tools
Once launched, the application can be accessed locally on port 8080.
$ curl http://localhost:8080/persons
$ curl http://localhost:8080/persons/1
$ curl http://localhost:8080/wallet/with-tools
Alternatively, you can deploy the Spring Boot application on OpenShift and expose it outside the cluster with the Route object. The simplest way to achieve that is with the odo CLI tool. You can find more details about odo in the following post. To deploy the app with odo, run the following command:
odo dev
ShellSessionAfter that, the application should be deployed in the selected namespace and available for testing on the 20001
local port, thanks to the port-forwarding feature.

Here’s the example output:

Final Thoughts
This article demonstrates the simplest way to integrate a Java application with an AI model running on OpenShift via an OpenAI-compatible interface. Preparing and exposing such a model on OpenShift AI requires several steps, such as installing and configuring Kubernetes operators. However, KServe’s ModelCar approach standardizes the entire process by making AI models available as container images.