OpenShift AI with vLLM and Spring AI

This article will teach you how to use OpenShift AI and vLLM to serve models consumed by a Spring AI application. To run the model on OpenShift AI, we will use a solution called KServe ModelCar, which can serve models directly from a container without using an S3 bucket. KServe is a standard, cloud-agnostic Model Inference Platform designed to serve predictive and generative AI models on Kubernetes. OpenShift AI includes a single-model serving platform based on the KServe component. We can serve models on the single-model serving platform using model-serving runtimes. OpenShift AI includes several preinstalled runtimes; however, only the vLLM runtime is compatible with the OpenAI REST API, so that is the one we will use.
Previously, I published several articles about Spring AI with examples of using different AI models, so I will not focus on an introduction to Spring AI here. For example, you can read about the integration between Spring AI and Azure AI in the following post, or refer to the following article for a quick intro to the Spring AI project.
Source Code
Feel free to use my source code if you’d like to try it out yourself. To do that, you must clone my sample GitHub repository. Then, just follow my instructions.
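For instance, assuming the sample repository is the spring-ai-showcase project referenced later in this article (the GitHub account name below is a placeholder; use the link above for the exact URL), cloning it looks like this:
$ git clone https://github.com/<github-account>/spring-ai-showcase.git
$ cd spring-ai-showcase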
Prerequisites
Create the OpenShift Cluster
For this exercise, you will need a relatively large OpenShift cluster. At least one of the cluster’s nodes must have a GPU. I created a cluster on AWS with one node on a g4dn.12xlarge machine. On OpenShift, you can achieve this by creating the MachineSet object that creates nodes using the appropriate virtual machine available on AWS.
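For reference, here is a minimal sketch of such a MachineSet with the GPU instance type. The cluster ID, zone, and provider-specific values are placeholders and should be copied from an existing worker MachineSet in your cluster:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <cluster-id>-gpu-us-east-2a
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <cluster-id>
      machine.openshift.io/cluster-api-machineset: <cluster-id>-gpu-us-east-2a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <cluster-id>
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: <cluster-id>-gpu-us-east-2a
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          # GPU-enabled AWS instance type used in this article
          instanceType: g4dn.12xlarge
          # ami, subnet, securityGroups, iamInstanceProfile, placement, etc.
          # should be copied from an existing worker MachineSet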

Install Required Operators
Next, install and configure several operators on the cluster. Begin with the “Node Feature Discovery” operator. On OpenShift, this operator enables automatic discovery of cluster nodes with features such as GPUs. After installing the operator, create the NodeFeatureDiscovery object. The default values set by the OpenShift console during object creation are sufficient.
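If you prefer the CLI over the console, a minimal sketch of that object could look as follows (the operand image is normally pre-filled by the console with a version-matched value; the one below is illustrative):
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    # the console sets the exact, version-matched operand image here
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:latest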

The operator’s task is to mark the node with the detected GPU using the appropriate label. The label is feature.node.kubernetes.io/pci-10de.present=true. After configuring the operator, verify that the node with the GPU has been detected correctly.
$ oc get node -l feature.node.kubernetes.io/pci-10de.present=true
NAME STATUS ROLES AGE VERSION
ip-10-0-45-120.us-east-2.compute.internal Ready worker 15d v1.31.6
Next, install the NVIDIA GPU Operator. This operator automatically installs, configures, and manages NVIDIA drivers and tools on nodes with NVIDIA graphics cards, which allows OpenShift to recognize the GPU as a resource that can be declared in pods. It works together with the “Node Feature Discovery” operator: the NVIDIA GPU Operator uses the feature.node.kubernetes.io/pci-10de.present=true label to determine where to install the drivers. For this to happen, the ClusterPolicy object must be created. As before, you can use the default values generated by the OpenShift console when creating this object.
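For reference, a minimal ClusterPolicy sketch with the name the console proposes could look like this (the console pre-fills driver, toolkit, device-plugin, and monitoring settings; only a small, illustrative fragment is shown here):
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  # the defaults generated by the console are sufficient for this exercise;
  # the fields below are only a subset of the full spec
  driver:
    enabled: true
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true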

The OpenShift AI feature for serving AI models requires the installation of OpenShift Serverless and OpenShift Service Mesh operators. The key solution here is KServe. KServe uses Knative to scale models on demand and integrates with Istio to secure model routing and versioning.

The final step in this phase is to install the OpenShift AI Operator and create the DataScienceCluster object. If the previous installations were successful, everything will be configured automatically after creating the DataScienceCluster object. For instance, OpenShift AI will create the Istio control plane and the Knative Serving component.

OpenShift AI creates several namespaces within a cluster. The most important is the redhat-ods-applications namespace, where most of the components comprising the entire solution run.
$ oc get pod -n redhat-ods-applications
NAME READY STATUS RESTARTS AGE
authorino-767bd64465-fq8bl 1/1 Running 0 15d
codeflare-operator-manager-5c69778b87-wxcwp 1/1 Running 0 15d
data-science-pipelines-operator-controller-manager-6686587wcmkr 1/1 Running 0 15d
etcd-549d769449-hqzwt 1/1 Running 0 15d
kserve-controller-manager-85f9b8d66f-qpxbf 1/1 Running 0 15d
kuberay-operator-8d77dcf84-qgsq5 1/1 Running 0 15d
kueue-controller-manager-7c895bd669-467nk 1/1 Running 0 6h8m
modelmesh-controller-7f9dd5f848-ljlxp 1/1 Running 0 15d
modelmesh-controller-7f9dd5f848-qqsl8 1/1 Running 0 24d
modelmesh-controller-7f9dd5f848-txlhd 1/1 Running 0 24d
notebook-controller-deployment-86f5b87585-p6nz5 1/1 Running 0 15d
odh-model-controller-574ff4657-q75gr 1/1 Running 0 15d
odh-notebook-controller-manager-9d754d5f-2ptk9 1/1 Running 0 15d
rhods-dashboard-5b96595667-79tx6 2/2 Running 0 15d
rhods-dashboard-5b96595667-8m52g 2/2 Running 0 15d
rhods-dashboard-5b96595667-kx7p4 2/2 Running 0 15d
rhods-dashboard-5b96595667-nn2cf 2/2 Running 0 15d
rhods-dashboard-5b96595667-ttcht 2/2 Running 0 15d
trustyai-service-operator-controller-manager-bd9fbdb6d-kcd57 1/1 Running 0 15d
Configure and Use OpenShift AI
After installing OpenShift AI on a cluster, you can use its graphical UI. To access it, select “Red Hat OpenShift AI” from the menu at the top of the page.

After selecting the indicated option, you will be redirected to the following page. This page allows you to configure and use OpenShift AI on a cluster. The first step is to select a namespace on the cluster for the AI project. In my case, the namespace is ai.

To run an AI model on a cluster, you must first choose how to serve it. You can choose between a single-model serving platform and a multi-model serving platform. With the former, each model is deployed on its own model server. With the latter, multiple models can be deployed on a single shared server. This article uses the first option: a single-model serving platform.

The next step is to create an accelerator profile. This profile should be created automatically after installing and configuring the NVIDIA GPU Operator. If, for some reason, it was not, you can easily create it manually. When creating this object, enter the nvidia.com/gpu value in the identifier field. You can either create the profile in the UI or apply the following YAML manifest.
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia
  namespace: redhat-ods-applications
spec:
  displayName: nvidia
  enabled: true
  identifier: nvidia.com/gpu
Serve Model on OpenShift AI with vLLM
Create ServingRuntime CRD
In the previous step, we configured OpenShift AI to deploy the model with a single-model serving platform and a GPU accelerator. We will use KServe’s ModelCar functionality to deploy the model, which allows us to serve models directly from a container. This functionality is described in an article published on the Red Hat Developer blog. The article demonstrates how to build an image containing a model downloaded from the Hugging Face Hub. In turn, we will use images that have already been built and are available in the quay.io/repository/redhat-ai-services/modelcar-catalog repository. There you can find ready-made images for AI models such as Granite and Llama.
To run a model on OpenShift AI in single-model serving mode, you must define two CRD objects: ServingRuntime and InferenceService. According to the OpenShift AI documentation, the ServingRuntime CRD creates a serving runtime, an environment for deploying and managing a model. Here’s the ServingRuntime object that creates a runtime for the Llama 3.2 AI model. The opendatahub.io/recommended-accelerators annotation sets the name of the recommended accelerator to use with the runtime. Its value should be identical to the identifier field in the AcceleratorProfile object (1). The openshift.io/display-name annotation holds the name under which the serving runtime is displayed (2). The spec.containers.image field indicates the runtime container image used by the serving runtime (3). This image differs depending on the type of accelerator used. Finally, the ServingRuntime object specifies that single-model serving is used (4) and that the vLLM model format is supported by the runtime (5).
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' # (1)
    openshift.io/display-name: vLLM ServingRuntime for KServe # (2)
  labels:
    opendatahub.io/dashboard: "true"
  name: llama-32-3b-instruct
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models
        - --served-model-name={{.Name}}
      command:
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      # (3)
      image: quay.io/modh/vllm@sha256:0d55419f3d168fd80868a36ac89815dded9e063937a8409b7edf3529771383f3
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false # (4)
  supportedModelFormats: # (5)
    - autoSelect: true
      name: vLLM
Create InferenceService CRD
The InferenceService CRD creates a server or inference service that processes inference queries, passes them to the model, and returns the inference output. Here’s the InferenceService object related to the previously created llama-32-3b-instruct runtime (1). It must define some vLLM parameters to successfully run the model on the existing infrastructure and enable tool calling support for the Llama 3.2 model (2). The InferenceService object specifies the image containing the Llama 3.2 model, published in the quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct repository (3). Alternatively, you can build your own image, publish it in a custom registry, and run it on OpenShift using the InferenceService CRD.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: llama-32-3b-instruct
    serving.knative.openshift.io/enablePassthrough: 'true'
    serving.kserve.io/deploymentMode: Serverless
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: llama-32-3b-instruct # (1)
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model: # (2)
      args:
        - '--dtype=half'
        - '--max_model_len=8192'
        - '--gpu_memory_utilization=.95'
        - '--enable-auto-tool-choice'
        - '--tool_call_parser=llama3_json'
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '8'
          memory: 10Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '4'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: llama-32-3b-instruct
      storageUri: 'oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct' # (3)
Deploy with OpenShift AI
You can also create the same configuration using the OpenShift AI UI. The diagram below shows the settings you need for Granite 3.2.

The OpenShift AI UI lists all the models running in a given AI project. You can check the endpoint where a particular model is available. In this case, two models are running in the AI project: Llama 3.2 and Granite 3.2. Both models are available internally on the cluster and externally via the Knative Route object.
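Since the vLLM runtime exposes the OpenAI-compatible REST API, you can quickly verify that a served model responds by calling its endpoint. The hostname below is a placeholder for the external address shown in the OpenShift AI UI:
# list the models served by the vLLM runtime
$ curl -s https://<model-endpoint>/v1/models
# send a simple chat completion request
$ curl -s https://<model-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-32-3b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'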

Both models are automatically scheduled on the node with the GPU. You can check the GPU resource reservations on that node using the oc describe command:
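A minimal example of such a check, using the GPU node name from the earlier output:
# show the allocated resources section, which includes nvidia.com/gpu reservations
$ oc describe node ip-10-0-45-120.us-east-2.compute.internal | grep -A 10 'Allocated resources'
# or filter only the GPU-related lines
$ oc describe node ip-10-0-45-120.us-east-2.compute.internal | grep nvidia.com/gpu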

A single-model serving platform runs AI models as Knative Services. You can use the oc get ksvc command to display the list of Knative services running in the ai namespace.
$ oc get ksvc -n ai
NAME URL LATESTCREATED LATESTREADY READY REASON
granite-32-2b-instruct-predictor https://granite-32-2b-instruct-predictor-ai.apps.piomin.ewyw.p1.openshiftapps.com granite-32-2b-instruct-predictor-00007 granite-32-2b-instruct-predictor-00007 True
llama-32-3b-instruct-predictor https://llama-32-3b-instruct-predictor-ai.apps.piomin.ewyw.p1.openshiftapps.com llama-32-3b-instruct-predictor-00002 llama-32-3b-instruct-predictor-00002 True
Integrate Spring AI with vLLM
Dependencies and Properties
The vLLM runtime is compatible with the OpenAI REST API. To integrate our sample Spring Boot application with a model running on vLLM, we must use the standard Spring AI OpenAI starter. The app in the spring-ai-showcase repository has more functionality than what is tested in this article. Below is the list of dependencies needed for the app to communicate with the OpenAI API and the model running on OpenShift AI.
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-autoconfigure-model-openai</artifactId>
</dependency>
Although the model served on OpenShift AI does not require authorization with an API key, the spring.ai.openai.api-key Spring AI parameter must be set. The endpoint address exposed by the vLLM runtime must be specified in the spring.ai.openai.chat.base-url parameter. The default name of the model must also be overridden with the name under which the model runs on OpenShift AI; for Llama 3.2, this name is llama-32-3b-instruct. Below is the list of all the Spring Boot settings required for vLLM integration, available in the application-vllm.properties file.
spring.ai.openai.api-key = ${OPENAI_API_KEY:dummy}
spring.ai.openai.chat.base-url = https://llama-32-3b-instruct-ai.apps.piomin.ewyw.p1.openshiftapps.com
spring.ai.openai.chat.options.model = llama-32-3b-instruct
Implementation with Spring AI
The code below demonstrates how a @RestController implements communication between the application and the target AI model. The @RestController class injects an auto-configured ChatClient.Builder to create an instance of ChatClient. The PersonController class implements a method that returns a list of persons from the GET /persons endpoint. The main goal is to generate a list of 10 objects with the fields defined in the Person class. The id field should be auto-incremented. The PromptTemplate object defines a message that will be sent to the chat model API. It doesn’t have to specify the exact fields that should be returned; this part is handled automatically by the Spring AI library after we invoke the entity() method on the ChatClient instance. The ParameterizedTypeReference argument passed to the entity() method tells Spring AI to generate a list of objects.
@RestController
@RequestMapping("/persons")
public class PersonController {

    private final ChatClient chatClient;

    public PersonController(ChatClient.Builder chatClientBuilder,
                            ChatMemory chatMemory) {
        this.chatClient = chatClientBuilder
                .defaultAdvisors(
                        new PromptChatMemoryAdvisor(chatMemory),
                        new SimpleLoggerAdvisor())
                .build();
    }

    @GetMapping
    List<Person> findAll() {
        PromptTemplate pt = new PromptTemplate("""
                Return a current list of 10 persons if exists or generate a new list with random values.
                Each object should contain an auto-incremented id field.
                The age value should be a random number between 18 and 99.
                Do not include any explanations or additional text.
                Return data in RFC8259 compliant JSON format.
                """);
        return this.chatClient.prompt(pt.create())
                .call()
                .entity(new ParameterizedTypeReference<>() {});
    }

    @GetMapping("/{id}")
    Person findById(@PathVariable String id) {
        PromptTemplate pt = new PromptTemplate("""
                Find and return the object with id {id} in a current list of persons.
                """);
        Prompt p = pt.create(Map.of("id", id));
        return this.chatClient.prompt(p)
                .call()
                .entity(Person.class);
    }
}
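For completeness, the Person class referenced above is a simple data holder defined in the sample repository. A minimal sketch of what it might look like is below; the field names other than id and age are assumptions:
// hypothetical sketch of the Person data class; the version in the repository may contain different fields
public record Person(Long id, String firstName, String lastName, int age) {
}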
The llama-32-3b-instruct model uses a “tool-calling” approach for API calls. You can read more about it in one of my Spring AI articles, available at this link. For instance, the class below uses the @Tool annotation to connect to the database and search it for the number of shares held in individual companies. The key to using this tool is its description in the description field, which is then interpreted by the LLM.
public class WalletTools {

    private final WalletRepository walletRepository;

    public WalletTools(WalletRepository walletRepository) {
        this.walletRepository = walletRepository;
    }

    @Tool(description = "Number of shares for each company in my wallet")
    public List<Share> getNumberOfShares() {
        return (List<Share>) walletRepository.findAll();
    }
}
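The Share entity and WalletRepository interface used above come from the sample repository. A hypothetical sketch of what they might look like, assuming a Spring Data store that supports Java records, is shown below; the actual classes may differ:
import org.springframework.data.annotation.Id;
import org.springframework.data.repository.CrudRepository;

// hypothetical data model behind WalletTools
public record Share(@Id String company, int quantity) {
}

// Spring Data repository returning all shares stored in the wallet
interface WalletRepository extends CrudRepository<Share, String> {
}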
Then, the tools are passed to the chat client when it interacts with the AI model. The AI model can call a tool’s method as required, based on the tool’s description and the content of the input prompt.
@RestController
@RequestMapping("/wallet")
public class WalletController {

    private final ChatClient chatClient;
    private final StockTools stockTools;
    private final WalletTools walletTools;

    public WalletController(ChatClient.Builder chatClientBuilder,
                            StockTools stockTools,
                            WalletTools walletTools) {
        this.chatClient = chatClientBuilder
                .defaultAdvisors(new SimpleLoggerAdvisor())
                .build();
        this.stockTools = stockTools;
        this.walletTools = walletTools;
    }

    @GetMapping("/with-tools")
    String calculateWalletValueWithTools() {
        PromptTemplate pt = new PromptTemplate("""
                What’s the current value in dollars of my wallet based on the latest stock daily prices ?
                """);
        return this.chatClient.prompt(pt.create())
                .tools(stockTools, walletTools)
                .call()
                .content();
    }
}
Run Spring Boot Application
Activate the vllm profile when launching the Spring Boot application. This will cause the application to read the settings entered in the application-vllm.properties file.
mvn spring-boot:run -Dspring-boot.run.profiles=vllm
Once the application runs, you can call the three endpoints implemented in the previously discussed code snippets. These endpoints are:
- GET /persons
- GET /persons/{id}
- GET /wallet/with-tools
Once launched, the application can be accessed locally on port 8080.
$ curl http://localhost:8080/persons
$ curl http://localhost:8080/persons/1
$ curl http://localhost:8080/wallet/with-tools
Alternatively, you can deploy the Spring Boot application on OpenShift and expose it outside the cluster with the Route object. The simplest way to achieve that is with the odo CLI tool. You can find more details about odo in the following post. To deploy the app with odo, run the following command:
odo dev
ShellSessionAfter that, the application should be deployed in the selected namespace and available for testing on the 20001
local port, thanks to the port-forwarding feature.

Here’s the example output:

Final Thoughts
This article demonstrates the simplest way to integrate a Java application with an AI model running on OpenShift via an OpenAI-compatible interface. Preparing and exposing such a model on OpenShift AI requires several steps, such as installing and configuring Kubernetes operators. However, KServe’s ModelCar approach standardizes the entire process by making AI models available as container images.