AI Models in Containers with RamaLama


This article explains how to run AI models locally in containers with RamaLama and how to integrate a sample Java application with them. RamaLama brings AI inferencing to the container world of Podman, Docker, and Kubernetes. It automatically finds and pulls a container image optimized for your system’s GPUs, handling all dependencies and performance tweaks for you, and then uses a container engine, such as Podman or Docker, to run everything. If you want a hassle-free way to run AI models from multiple sources, using the runtime that fits your hardware, all within containers and with seamless integration into your existing workflows, RamaLama is a good choice. Let’s see how it works in practice!

You can find other articles about AI and Java on my blog. For example, if you are interested in how to use Ollama to serve models for Spring AI applications, you can read my article on that topic.

Source Code

Source code will not play a key role in this article. Nevertheless, feel free to use my source code if you’d like to try it out yourself. To do that, clone my sample GitHub repository. It contains the sample Spring Boot application we will use to interact with AI models running on RamaLama in containers. You can find that application in the spring-ai-openai-compatibility directory. Then, just follow my instructions.

Install RamaLama

You can install RamaLama on Linux or macOS with the following command:

curl -fsSL https://ramalama.ai/install.sh | bash
ShellSession

The script above uses Homebrew to install RamaLama on macOS. Alternatively, you can download the self-contained macOS installer that includes Python and all dependencies. You can find the latest .pkg installer on the releases page.

Finally, you can verify the version of the installed tool:

$ ramalama version
  ramalama version 0.17.1
ShellSession

Install and Configure Podman

RamaLama needs a container engine. I use Podman, but you can also use Docker, which works just as well. When it comes to Podman, I suggest installing Podman Desktop first. You can download it here. After installation, launch the Podman Desktop GUI and go to the “Settings” section to create a new Podman machine. Then, choose LibKrun as the default provider. This enables GPU acceleration for containers running on macOS. Such a virtual machine is managed by krunkit and libkrun, a lightweight virtual machine manager (VMM) based on Apple’s low-level Hypervisor framework. You can find a detailed explanation and performance analysis in the following article.

ramalama-containers-podman

After creation, the virtual machine should start up. You can check its status in Podman Desktop as shown below.
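If you prefer the CLI over the Podman Desktop GUI, a similar machine can be created from the terminal. The snippet below is a sketch: the provider environment variable reflects my understanding of the Podman machine documentation, and the resource values are only an example, so adjust both to your setup.

```shell
# Select the libkrun provider before creating the machine
# (CONTAINERS_MACHINE_PROVIDER is an assumption based on the Podman docs)
export CONTAINERS_MACHINE_PROVIDER=libkrun

# Create and start a machine with example resource limits
podman machine init --memory 8192 --cpus 4
podman machine start

# Verify that the machine is running
podman machine list
```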

Then run the following command to verify that Podman works.

$ podman version
  Client:        Podman Engine
  Version:       5.7.1
  API Version:   5.7.1
  Go Version:    go1.25.5
  Git Commit:    f845d14e941889ba4c071f35233d09b29d363c75
  Built:         Wed Dec 10 15:53:41 2025
  Build Origin:  pkginstaller
  OS/Arch:       darwin/arm64
ShellSession

Run a Model with RamaLama

RamaLama supports multiple AI model registries, including OCI container registries, Ollama, and HuggingFace. It defaults to the Ollama registry transport. Let’s assume we want to run the tinyllama model from the Ollama registry.

To run that model with RamaLama, execute the following command:

ramalama run tinyllama
ShellSession

By default, RamaLama tries to run the model inside the quay.io/ramalama/ramalama:latest container. The container exposes an OpenAI-compatible API on port 8080.

$ podman ps
CONTAINER ID  IMAGE                             COMMAND               CREATED        STATUS        PORTS                   NAMES
d36129d1f326  quay.io/ramalama/ramalama:latest  llama-server --ho...  5 minutes ago  Up 5 minutes  0.0.0.0:8080->8080/tcp  ramalama-OikvMye7v9
ShellSession
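Because the container runs llama-server, the API on port 8080 speaks the standard OpenAI chat completions protocol. As a quick sketch, you can query it directly with curl (the prompt below is just an example):

```shell
# Send a chat completion request to the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama",
    "messages": [
      {"role": "user", "content": "What is the capital of Spain?"}
    ]
  }'
```

This is exactly the endpoint the Spring AI application will use later in this article.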

You can interact with the model using the ramalama chat command as shown below:

ramalama chat "What's the date today?"
ShellSession

If you want to change the default model registry, for example, to HuggingFace, use the RAMALAMA_TRANSPORT environment variable.

export RAMALAMA_TRANSPORT=huggingface
ShellSession

Then you can run any GGUF model from HuggingFace.

ramalama run unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
ShellSession
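Instead of exporting an environment variable, RamaLama also accepts a transport prefix directly in the model name. The prefixes below reflect my understanding of the RamaLama documentation, so verify them against your installed version:

```shell
# Explicit transport prefixes override the default (Ollama) transport
ramalama run ollama://tinyllama
ramalama run hf://unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
```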

Instead of the podman ps command, you can list running models with the ramalama ps command. Of course, you can run several models at once. In that case, RamaLama makes them available externally on different ports.

$ ramalama ps
CONTAINER ID  IMAGE                             COMMAND               CREATED         STATUS         PORTS                   NAMES
49eddec5f2ef  quay.io/ramalama/ramalama:latest  llama-server --ho...  3 minutes ago   Up 3 minutes   0.0.0.0:8080->8080/tcp  ramalama-NWPXNpQDBt
023d05c70666  quay.io/ramalama/ramalama:latest  llama-server --ho...  39 seconds ago  Up 39 seconds  0.0.0.0:8086->8086/tcp  ramalama-ogwLbXQNDt
ShellSession

Integrate Spring AI with Models on RamaLama

To test various models with RamaLama, I created a very simple Spring Boot application. It uses Spring AI together with the OpenAI module for integration with models running in RamaLama containers. Below is a list of dependencies for this application.

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
        </dependency>
    </dependencies>
XML

The application exposes a single REST endpoint GET /simple/{country}. Depending on the parameter, it asks about the capital city of a given country and requests a brief history of that city.

@RestController
@RequestMapping("/simple")
public class SimpleController {

    private final ChatClient chatClient;

    public SimpleController(ChatClient.Builder chatClientBuilder) {
        this.chatClient = chatClientBuilder
                .defaultAdvisors(SimpleLoggerAdvisor.builder().build())
                .build();
    }

    @GetMapping("/{country}")
    public String ping(@PathVariable String country) {
        PromptTemplate pt = new PromptTemplate("""
                What's the capital of {country} ?
                Describe the history of that city briefly.
        """);

        return chatClient.prompt(pt.create(Map.of("country", country)))
                .call()
                .content();
    }
}
Java

Some of the parameters below are optional, such as the model name or the increased Spring AI logging level. The app communicates with the LLM served by RamaLama at http://localhost:8080. To avoid a port conflict, it is best to change the default port used by Spring Boot Web to 9080. The value of the API key, on the other hand, is irrelevant. You just need to set it to something other than null so that Spring AI will accept it.

spring.ai.openai.api-key = ${OPENAI_API_KEY:dummy}
spring.ai.openai.chat.base-url = http://localhost:8080
spring.ai.openai.chat.options.model = tinyllama

logging.level.org.springframework.ai.chat.client.advisor = DEBUG

server.port = 9080
Plaintext

Then, run the app with the following command:

mvn spring-boot:run
ShellSession

Finally, you can call our test REST endpoint for different values of the country parameter.

curl http://localhost:9080/simple/Germany
curl http://localhost:9080/simple/France
curl http://localhost:9080/simple/Italy
ShellSession

The following diagram illustrates GPU usage during our test calls.

ramalama-containers-gpu

The last registry supported by RamaLama that I would like to discuss in this article is its registry of ready-made images containing selected AI models. At the moment, there are slightly more than 20 images with popular models such as gpt-oss, gemma3, qwen, and llama. You can view the full list of available model images on this webpage.

ramalama-containers-oci-models

To run a container with a specific image, you must add the rlcr:// prefix to the model name. For example, you can pull and run the gemma-3-1b-it model as shown below.

ramalama-containers-gemma-image
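Based on the rlcr:// prefix rule described above, the invocation should look roughly like this. The exact image name and tag in the RamaLama registry is my assumption, so check the registry listing for the precise value:

```shell
# Pull and run a ready-made model image from the RamaLama container registry
# (the exact model name is an assumption; verify it on the registry page)
ramalama run rlcr://gemma-3-1b-it
```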

If you have already stopped the previously tested models, you don’t even have to restart our sample application. Here’s a fragment of Gemma’s answer about the Spanish capital.

Use RamaLama to Run Containers with AI Models in Kubernetes

We can use RamaLama to run AI models inside containers on Kubernetes, either on CPU or GPU nodes. In this section, I would like to use GPU acceleration on macOS, as I did earlier when running models with Podman. We will try this solution on Minikube. Krunkit is a macOS virtualization tool optimized for GPU-accelerated virtual machines and AI workloads. In the first step, we must install it using Homebrew:

$ brew tap slp/krunkit
$ brew install krunkit
ShellSession

To use the krunkit driver, we must also install vmnet-helper. Download the latest release from GitHub and install it to /opt/vmnet-helper, following the instructions in the project’s README. After installing both tools, we can create a Minikube cluster. It is best to use the following command to increase the default resources allocated to the Minikube machine:

minikube start --memory='16gb' --cpus='8' --driver krunkit --disk-size 50000mb
ShellSession

Then, we must install the Kubernetes generic-device-plugin. It enables allocating generic Linux devices, such as serial devices or video cameras, to Kubernetes Pods. In our case, it allows us to assign GPUs to Pods running AI models. The plugin is installed as a DaemonSet. The following configuration allows us to use up to 4 GPU devices in the cluster.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: dri
          groups:
          - count: 4
            paths:
            - path: /dev/dri
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate
YAML
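Assuming the manifest above is saved as generic-device-plugin.yaml (the file name is mine), you can apply it and check that the node now advertises the squat.ai/dri resource:

```shell
# Install the device plugin DaemonSet
kubectl apply -f generic-device-plugin.yaml

# The node should now report squat.ai/dri with a capacity of 4
kubectl describe node minikube | grep squat.ai/dri
```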

As before, we could use a ready-made image from the RamaLama registry. However, if we want to use the GPU support provided by the generic-device-plugin, we should mount the model as a volume into the ramalama container. First, let’s download the gemma-3-1b model from HuggingFace.

$ cd ~/models
$ curl -L -o gemma-3-1b-it-q4_0.gguf "https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf/resolve/main/gemma-3-1b-it-q4_0.gguf?download=true"
ShellSession

Then, we can mount the model from the ~/models directory to Minikube with the following command:

minikube mount ~/models:/mnt/models
ShellSession

The following Deployment uses the quay.io/ramalama/ramalama:latest image and mounts the gemma-3-1b-it-q4_0.gguf model from the /mnt/models directory into the container. The model is launched internally by llama-server. The container with the model can use 1 GPU device out of the 4 allowed across the entire cluster (squat.ai/dri: "1").

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma
  template:
    metadata:
      labels:
        app: gemma
      name: gemma
    spec:
      containers:
        - name: llama-server
          image: quay.io/ramalama/ramalama:latest
          command: [
            llama-server,
            --host, "0.0.0.0",
            --port, "8080",
            --model, /mnt/models/gemma-3-1b-it-q4_0.gguf,
            --alias, "gemma",
            --ctx-size, "4096",
            --temp, "0.7",
            --cache-reuse, "256",
            -ngl, "999",
            --threads, "8",
            --no-warmup,
            --log-colors, auto,
          ]
          resources:
            limits:
              squat.ai/dri: "1"
          volumeMounts:
            - name: models
              mountPath: /mnt/models
      volumes:
        - name: models
          hostPath:
            path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: gemma
  name: gemma
spec:
  selector:
    app: gemma
  ports:
    - name: http
      port: 8080
  type: ClusterIP
YAML
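Assuming both manifests above are saved in gemma.yaml (the file name is mine), deploy them and follow the llama-server startup logs:

```shell
# Create the Deployment and Service
kubectl apply -f gemma.yaml

# Watch llama-server load the model
kubectl logs -f deploy/gemma
```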

Let’s verify that the pod is running after it has been deployed:

$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
gemma-5466d666f7-d4wnv   1/1     Running   0          27s
ShellSession

Then, you can expose the gemma Service outside Minikube using port-forward on port 8080. Our sample application can remain configured as before. Try repeating the earlier exercise by sending a few test requests to it, and verify whether the GPU is being used during the requests.

kubectl port-forward svc/gemma 8080:8080
ShellSession

Conclusion

RamaLama makes AI model execution simple, reproducible, and container-native. It allows us to use various model registries, such as Ollama and HuggingFace, as well as ready-made images from OCI registries. It also provides an easy path to running AI models on Podman, Docker, and even Kubernetes. Thanks to RamaLama, I was able to leverage the Apple Silicon GPU when running AI models in containers.

1 COMMENT

Nir Soffer

minikube mount uses a very slow 9p mount. For large models, you want to use minikube start --mount-string instead, which uses virtiofs. See the full example at https://minikube.sigs.k8s.io/docs/tutorials/ai-playground/

Also, there is no need to increase the Minikube CPU, memory, and disk size. The model is accessed via the mount, and the work is offloaded to the host GPU using host unified memory.