AI Models in Containers with RamaLama
This article explains how to run AI models locally in containers with RamaLama and how to integrate a sample Java application with them. RamaLama brings AI inferencing to the container world of Podman, Docker, and Kubernetes. It automatically selects a container image optimized for your system’s GPUs, handling all dependencies and performance tweaks for you, and then uses a container engine, such as Podman or Docker, to pull that image and prepare everything for running. If you want a hassle-free way to run AI models from multiple sources, with the runtime that fits your hardware, all within containers and integrated with your existing workflows, RamaLama is a good choice. Let’s see how it works in practice!
You can find other articles about AI and Java on my blog. For example, if you are interested in how to use Ollama to serve models for Spring AI applications, you can read the following article.
Source Code
Source code will not play a key role in this article. Nevertheless, feel free to use my source code if you’d like to try everything out yourself. To do that, clone my sample GitHub repository. It contains the sample Spring Boot application we will use to interact with AI models running on RamaLama in containers. You can find that application in the spring-ai-openai-compatibility directory. After that, just follow my instructions.
Install RamaLama
You can install RamaLama on Linux or macOS with the following command:
curl -fsSL https://ramalama.ai/install.sh | bash

The script above uses Homebrew to install RamaLama on macOS. Alternatively, you can download the self-contained macOS installer that includes Python and all dependencies. You can find the latest .pkg installer on the releases page.
Finally, you can verify the version of the previously installed tool.
$ ramalama version
ramalama version 0.17.1

Install and Configure Podman
In this article, I use Podman as the container engine. Alternatively, you can use Docker, which I also use frequently. When it comes to Podman, I suggest installing Podman Desktop first. You can download it here. After installation, launch the Podman Desktop GUI and go to the “Settings” section to create a new Podman machine. Then, choose LibKrun as the default provider. This enables GPU acceleration for containers running on macOS. The virtual machine is managed by krunkit and libkrun, a lightweight virtual machine manager (VMM) based on Apple’s low-level Hypervisor Framework. You can find a detailed explanation and performance analysis in the following article.
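If you prefer the command line to the Podman Desktop GUI, a similar machine can be created with the Podman CLI. Below is a minimal sketch; it assumes the CONTAINERS_MACHINE_PROVIDER variable selects the machine provider on macOS, and the CPU and memory values are just examples.

# select libkrun as the machine provider (assumption: env var override)
export CONTAINERS_MACHINE_PROVIDER=libkrun
# create and start the Podman machine with example resources
podman machine init --cpus 8 --memory 16384
podman machine start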

After creation, the virtual machine should start up. You can check its status in Podman Desktop as shown below.

Then run the following command to verify that Podman works.
$ podman version
Client:        Podman Engine
Version:       5.7.1
API Version:   5.7.1
Go Version:    go1.25.5
Git Commit:    f845d14e941889ba4c071f35233d09b29d363c75
Built:         Wed Dec 10 15:53:41 2025
Build Origin:  pkginstaller
OS/Arch:       darwin/arm64

Run a Model with RamaLama
RamaLama supports multiple AI model registries, including OCI Container Registries, Ollama, and HuggingFace. RamaLama defaults to the Ollama registry transport. Let’s assume we want to run the following model from the Ollama registry:

To run that tinyllama model with ramalama, you must execute the following command:
ramalama run tinyllama

By default, RamaLama tries to run the model inside the quay.io/ramalama/ramalama:latest container. The container exposes an OpenAI-compatible API on port 8080.
$ podman ps
CONTAINER ID  IMAGE                             COMMAND               CREATED        STATUS        PORTS                   NAMES
d36129d1f326  quay.io/ramalama/ramalama:latest  llama-server --ho...  5 minutes ago  Up 5 minutes  0.0.0.0:8080->8080/tcp  ramalama-OikvMye7v9
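Since the exposed API is OpenAI-compatible, you can also query it directly over HTTP. Here is a minimal sketch assuming the standard /v1/chat/completions path served by llama-server:

# send a chat completion request to the local OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'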
You can interact with the model using the ramalama chat command, as shown below:

ramalama chat "What's the date today?"

If you want to change the default model registry, for example to HuggingFace, use the RAMALAMA_TRANSPORT environment variable.
export RAMALAMA_TRANSPORT=huggingface

Then you can run any GGUF model from HuggingFace.
ramalama run unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
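Instead of exporting the variable globally, RamaLama also lets you pick the transport per model with a prefix on the model name. As far as I know, the same HuggingFace model can be selected like this:

# transport prefix instead of RAMALAMA_TRANSPORT
ramalama run huggingface://unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF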
Instead of the podman command, you can list running models with the ramalama command. Of course, you can run several models at once. In this case, RamaLama makes them available externally on different ports.

$ ramalama ps
CONTAINER ID  IMAGE                             COMMAND               CREATED         STATUS         PORTS                   NAMES
49eddec5f2ef  quay.io/ramalama/ramalama:latest  llama-server --ho...  3 minutes ago   Up 3 minutes   0.0.0.0:8080->8080/tcp  ramalama-NWPXNpQDBt
023d05c70666  quay.io/ramalama/ramalama:latest  llama-server --ho...  39 seconds ago  Up 39 seconds  0.0.0.0:8086->8086/tcp  ramalama-ogwLbXQNDt

Integrate Spring AI with Models on RamaLama
To test various models with RamaLama, I created a very simple Spring Boot application. It uses Spring AI together with the OpenAI module for integration with models running in RamaLama containers. Below is a list of dependencies for this application.
<dependencies>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
  </dependency>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
  </dependency>
  <dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
  </dependency>
</dependencies>

The application exposes a single REST endpoint GET /simple/{country}. Depending on the parameter, it asks about the capital city of a given country and requests a brief history of that city.
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.SimpleLoggerAdvisor;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/simple")
public class SimpleController {

    private final ChatClient chatClient;

    public SimpleController(ChatClient.Builder chatClientBuilder) {
        this.chatClient = chatClientBuilder
                .defaultAdvisors(SimpleLoggerAdvisor.builder().build())
                .build();
    }

    @GetMapping("/{country}")
    public String ping(@PathVariable String country) {
        PromptTemplate pt = new PromptTemplate("""
                What's the capital of {country} ?
                Describe the history of that city briefly.
                """);
        return chatClient.prompt(pt.create(Map.of("country", country)))
                .call()
                .content();
    }
}

Some of the parameters below are optional, such as the model name or the increased Spring AI logging level. The app communicates with the LLM served by RamaLama at http://localhost:8080. To avoid a conflict between the ports used, it is best to change the default port used by Spring Boot Web to 9080. The value of the API key, on the other hand, is irrelevant. You just need to set it to something other than null so that Spring AI will accept it.
spring.ai.openai.api-key = ${OPENAI_API_KEY:dummy}
spring.ai.openai.chat.base-url = http://localhost:8080
spring.ai.openai.chat.options.model = tinyllama
logging.level.org.springframework.ai.chat.client.advisor = DEBUG
server.port = 9080

Then, run the app with the following command:
mvn spring-boot:run

Finally, you can call our test REST endpoint for different values of the country parameter.
curl http://localhost:9080/simple/Germany
curl http://localhost:9080/simple/France
curl http://localhost:9080/simple/Italy

The following diagram illustrates GPU usage during our test calls.

The last registry supported by RamaLama that I would like to discuss in this article is the registry of ready-made images containing selected AI models. At the moment, there are slightly more than 20 images with popular models such as gpt-oss, gemma3, qwen, and llama. You can view the full list of available model images on this webpage.

To run the container with a specific image, you must add the rlcr:// prefix to the model name. For example, you can pull and run the gemma-3-1b-it model as shown below.
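Based on the rlcr:// prefix and the model name mentioned above, the command should look roughly as follows:

# pull and run the ready-made gemma-3-1b-it image (reconstructed example)
ramalama run rlcr://gemma-3-1b-it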

You don’t even have to restart our sample application, as long as you have already stopped the previously tested models. Here’s a fragment of Gemma’s answer about the Spanish capital.
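You can obtain a similar answer by calling the same REST endpoint as before, just with a different country value:

curl http://localhost:9080/simple/Spain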

Use RamaLama to Run Containers with AI Models in Kubernetes
We can use RamaLama to run AI models inside containers on Kubernetes, either on CPU or GPU nodes. However, in this section, I would like to use GPU acceleration on macOS, as I did earlier when running models in Podman. We will try this solution on Minikube. Krunkit is a macOS virtualization tool optimized for GPU-accelerated virtual machines and AI workloads. In the first step, we must install it on macOS using Homebrew:
$ brew tap slp/krunkit
$ brew install krunkit

To use the krunkit driver, we must also install vmnet-helper. You can download the latest release from GitHub and install it to /opt/vmnet-helper, following the instructions in the vmnet-helper repository. After installing both tools, we can create a Minikube cluster. It is best to use the following command to increase the default resources allocated to the Minikube machine.
minikube start --memory='16gb' --cpus='8' --driver krunkit --disk-size 50000mb

Then, we must install the Kubernetes Generic Device plugin. It enables allocating generic Linux devices, such as serial devices or video cameras, to Kubernetes Pods. In our case, it will allow us to assign GPUs to Pods running AI models. The plugin is installed as a DaemonSet. The following configuration allows us to use up to 4 GPUs in Kubernetes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: dri
          groups:
            - count: 4
              paths:
                - path: /dev/dri
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate
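To deploy the plugin, apply the manifest above to the cluster. I assume here that you saved it locally as generic-device-plugin.yaml (the file name is arbitrary):

# create the DaemonSet and verify it is running in kube-system
kubectl apply -f generic-device-plugin.yaml
kubectl get ds generic-device-plugin -n kube-system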
As before, we could use a ready-made image from the RamaLama registry. However, if we want to use the GPU support provided by the generic-device-plugin, we should mount the model as a volume into the ramalama container. First, let’s download the gemma-3-1b model from HuggingFace.

$ cd ~/models
$ curl -L -o gemma-3-1b-it-q4_0.gguf "https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf/resolve/main/gemma-3-1b-it-q4_0.gguf?download=true"

Then, we can mount the model from the ~/models directory into Minikube with the following command:
minikube mount ~/models:/mnt/models

The following Deployment uses the quay.io/ramalama/ramalama:latest image and mounts the gemma-3-1b-it-q4_0.gguf model from the /mnt/models directory into the container. The model is served internally by llama-server. The container with the model can use up to 1 GPU out of the 4 allowed across the entire cluster (squat.ai/dri: "1").
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma
  template:
    metadata:
      labels:
        app: gemma
      name: gemma
    spec:
      containers:
      - name: llama-server
        image: quay.io/ramalama/ramalama:latest
        command: [
          llama-server,
          --host, "0.0.0.0",
          --port, "8080",
          --model, /mnt/models/gemma-3-1b-it-q4_0.gguf,
          --alias, "gemma",
          --ctx-size, "4096",
          --temp, "0.7",
          --cache-reuse, "256",
          -ngl, "999",
          --threads, "8",
          --no-warmup,
          --log-colors, auto,
        ]
        resources:
          limits:
            squat.ai/dri: "1"
        volumeMounts:
        - name: models
          mountPath: /mnt/models
      volumes:
      - name: models
        hostPath:
          path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: gemma
  name: gemma
spec:
  selector:
    app: gemma
  ports:
  - name: http
    port: 8080
  type: ClusterIP

Let’s verify if the pod is running after it was deployed:
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
gemma-5466d666f7-d4wnv   1/1     Running   0          27s

Then, you can expose the gemma Service outside Minikube using port-forward on port 8080. Our sample application can keep running as before. Try repeating the exercise we did earlier and send a few test requests to it. Then verify on the GPU usage graph whether the GPU is being used during the requests.
kubectl port-forward svc/gemma 8080:8080

Conclusion
RamaLama makes AI model execution simple, reproducible, and container-native. It lets us use different model registries, such as Ollama and HuggingFace, as well as ready-made model images from OCI registries. It also provides an easy path to run AI models on Podman, Docker, and even Kubernetes. Thanks to RamaLama, I was able to leverage Apple Silicon GPUs when running AI models in containers.
