Speed Up Java Startup on Kubernetes with CRaC
In this article, you will learn how to use CRaC to reduce Java startup time and how to configure it for an app running on Kubernetes. The OpenJDK Coordinated Restore at Checkpoint (CRaC) project was introduced by Azul in 2020. As you probably know, Azul is the company behind the OpenJDK distribution called Azul Zulu. Azul ships an OpenJDK 17 distribution with built-in support for CRaC. The aim of CRaC is to drastically reduce the startup time and time to peak performance of Java apps. The Micronaut and Quarkus frameworks already support CRaC, while Spring Framework has announced support for November 2023.
What’s the idea behind CRaC? In fact, it is a pretty simple concept. CRaC takes a memory snapshot of the running app and then restores it in later executions. It is based on the Linux feature called Checkpoint/Restore In Userspace (CRIU). Unfortunately, there is no CRIU equivalent for Windows or macOS, so currently you can use CRaC only on Linux. In our case, it is not a problem, since we are going to build a container from the Azul Zulu OpenJDK image and then run it on Kubernetes. However, before we do that, let’s analyze the steps required to achieve a checkpoint/restore mechanism with CRaC.
Some time ago I published the article Which JDK to Choose on Kubernetes, in which I compared the most popular JDK implementations. There were no significant differences between them in my tests on Kubernetes. So features like CRaC can make a difference for Java on Kubernetes.
How It Works
For this part of the exercise, let’s assume we have already installed Azul Zulu OpenJDK, we are on Linux, and we have an app that supports CRaC (for me the second point doesn’t hold since I use macOS :)). The first step is to run our app with the -XX:CRaCCheckpointTo parameter. It enables CRaC and indicates the location of the snapshot:
$ java -XX:CRaCCheckpointTo=/crac-files -jar target/sample-app.jar
Once our app is running, we can run the following command in another terminal:
$ jcmd target/sample-app.jar JDK.checkpoint
The jcmd command triggers the creation of an app checkpoint. After a while, our snapshot is ready. We can go to the /crac-files directory and see a list of files. The directory structure won’t tell us much, but there is a file called dump4.log containing the logs from the operation. If the command finishes successfully, we can go to the next step. To restore the image and run the app from its saved state, we run the following command:
$ java -XX:CRaCRestoreFrom=/crac-files
Your app should start much faster than before. The difference is significant. Instead of seconds, startup may take just milliseconds.
Source Code
If you would like to try it yourself, you can always take a look at my source code. In order to do that, you need to clone my GitHub repository. The sample Spring Boot app for the current exercise is available inside the callme-service directory. You can go to that directory and then just follow my instructions 🙂
Enable CRaC for Spring Boot
As I mentioned before, Spring Boot does not support CRaC yet. That will probably change in November, but for now, let’s see what it means in practice. If we run a standard Spring Boot app and then execute the jcmd command to create a checkpoint, we will see something similar to the following result:
jdk.crac.impl.CheckpointOpenSocketException: tcp6 localAddr :: localPort 8080 remoteAddr :: remotePort 0
    at java.base/jdk.crac.Core.translateJVMExceptions(Core.java:80)
    at java.base/jdk.crac.Core.checkpointRestore1(Core.java:137)
    at java.base/jdk.crac.Core.checkpointRestore(Core.java:177)
    at java.base/jdk.crac.Core.lambda$checkpointRestoreInternal$0(Core.java:194)
    at java.base/java.lang.Thread.run(Thread.java:832)
Fortunately, we can work around this problem. The Maven Central repository contains a version of Tomcat Embed that supports CRaC. We can include that dependency and exclude the default tomcat-embed-core module pulled in by Spring Web. Here’s the solution:
<dependency>
  <groupId>io.github.crac.org.apache.tomcat.embed</groupId>
  <artifactId>tomcat-embed-core</artifactId>
  <version>10.1.7</version>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.apache.tomcat.embed</groupId>
      <artifactId>tomcat-embed-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
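To get an intuition for why an open socket blocks the checkpoint and what a CRaC-aware server has to do about it, here is a minimal sketch that runs on any JDK. It is only an illustration: the SocketCheckpointSketch class, its CheckpointAware interface and the Listener class are hypothetical names made up for this example. They are not the CRaC API and not the code of the patched Tomcat. The idea is simply that the listening socket is closed before the snapshot and bound again after the restore.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

// Hypothetical illustration only: NOT the CRaC API and NOT the patched Tomcat code.
final class SocketCheckpointSketch {

    // Local stand-in for checkpoint lifecycle callbacks (an assumption of this sketch).
    interface CheckpointAware {
        void beforeCheckpoint();
        void afterRestore();
    }

    // Owns a listening socket and knows how to release and re-acquire it.
    static final class Listener implements CheckpointAware {
        private final int port;
        private ServerSocket socket;

        Listener(int port) {
            this.port = port;
            this.socket = open(port);
        }

        boolean isOpen() {
            return !socket.isClosed();
        }

        @Override
        public void beforeCheckpoint() {
            // An open listening socket cannot be stored in the snapshot,
            // so it is closed before the checkpoint is taken.
            try {
                socket.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void afterRestore() {
            // Bind again once the process has been restored.
            socket = open(port);
        }

        private static ServerSocket open(int port) {
            try {
                return new ServerSocket(port);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        Listener listener = new Listener(0); // 0 = let the OS pick any free port
        listener.beforeCheckpoint();
        System.out.println("open while checkpointing: " + listener.isOpen());
        listener.afterRestore();
        System.out.println("open after restore: " + listener.isOpen());
        listener.beforeCheckpoint(); // release the port before exiting
    }
}
```

With the CRaC-enabled tomcat-embed-core dependency shown above, this kind of bookkeeping is handled inside the server, so the app code does not have to deal with it.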
We have a pretty simple Spring Boot app. It exposes some REST endpoints, including the following one that returns the value of the VERSION environment variable.
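import java.util.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.info.BuildProperties;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/callme")
public class CallmeController {
    private static final Logger LOGGER = LoggerFactory.getLogger(CallmeController.class);

    @Autowired
    Optional<BuildProperties> buildProperties;
    // Resolved from the VERSION environment variable
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping")
    public String ping() {
        LOGGER.info("Ping: name={}, version={}", buildProperties.isPresent() ? buildProperties.get().getName() : "callme-service", version);
        return "I'm callme-service " + version;
    }
}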
Once we replace the tomcat-embed-core dependency, we should rebuild the app. In my sample app code, a custom Maven profile activates the replacement of the tomcat-embed-core dependency. So remember to enable the crac profile during the build:
$ mvn clean package -Pcrac
Java with CRaC as Container on Kubernetes
In the first step, we need to prepare the image of our Java app. In order to do that, we will create a Dockerfile in the app’s root directory. We will use the latest version of Azul Java 17 with CRaC support as a base image. Our image will contain the app uber JAR file and a single script for making the checkpoint.
FROM azul/zulu-openjdk:17-jdk-crac-latest
COPY target/callme-service-1.1.0.jar /app/callme-service-1.1.0.jar
COPY src/scripts/entrypoint.sh /app/entrypoint.sh
RUN chmod 755 /app/entrypoint.sh
Here’s the content of the entrypoint.sh script, which was copied to the target image in our Dockerfile. As you can see, we run the jcmd command after starting the Java app. There is one important thing about CRaC that we need to mention here. Here’s a fragment from the CRaC documentation: “CRaC implementation creates the checkpoint only if the whole Java instance state can be stored in the image. Resources like open files or sockets are cannot, so it is required to release them when checkpoint is made.” As a result, the jcmd command will stop our Java process, so we should not kill the container or pod right after it. Run this way, the script first creates a snapshot after the container starts and then stops the pod 10 seconds later.
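#!/bin/bash
# Start the app in the background with CRaC checkpointing enabled
java -XX:CRaCCheckpointTo=/crac -jar /app/callme-service-1.1.0.jar &
# Give the app some time to start before requesting the checkpoint
sleep 10
jcmd /app/callme-service-1.1.0.jar JDK.checkpoint
# Keep the container alive for another 10 seconds after the request
sleep 10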
Let’s build the image using the following command:
$ docker build -t callme-service:1.1.0 .
Now, let’s consider our scenario in the context of Kubernetes. First of all, we need to create the snapshot and save its state on disk. It is a one-time activity or, to be more precise, a one-time activity per release of the app. Therefore, we should perform it before creating (or updating) the Deployment. Of course, we need to provide storage and assign it to the pod that creates the snapshot and to all the pods that restore the app from it using CRaC. Let’s begin with the PersistentVolumeClaim definition:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: crac-store
  namespace: crac
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
In the next step, we will create a Kubernetes Job that performs the checkpoint operation. It will run our already built image (1), and then execute the entrypoint.sh script responsible for making the checkpoint (2). The CRaC checkpoint operation requires higher privileges, so we need to allow it in the securityContext section (3). We will also mount the crac-store PVC to the job under the /crac path (4).
apiVersion: batch/v1
kind: Job
metadata:
  name: callme-service-snapshot-job
  namespace: crac
spec:
  template:
    spec:
      containers:
        - name: callme-service
          image: callme-service:1.1.0 # (1)
          env:
            - name: VERSION
              value: "v1"
          command: ["/bin/sh","-c", "/app/entrypoint.sh"] # (2)
          volumeMounts:
            - mountPath: /crac
              name: crac
          securityContext:
            privileged: true # (3)
      volumes:
        - persistentVolumeClaim:
            claimName: crac-store # (4)
          name: crac
      restartPolicy: Never
  backoffLimit: 3
Let’s apply the job to the Kubernetes cluster:
$ kubectl apply -f job.yaml
Kubernetes starts a single pod related to the Job. Once its status changes to Completed, the checkpoint operation is finished.
$ kubectl get po -n crac
NAME READY STATUS RESTARTS AGE
callme-service-snapshot-job-j7wkz 0/1 Completed 0 43s
Now, we can proceed with our app deployment. We will run three pods (1) of the app. We will use exactly the same image as before (2), but this time we run the java -XX:CRaCRestoreFrom=/crac command (3) instead of the entrypoint.sh script. In order to measure how much time the pod requires to be ready, we will add a readinessProbe with the lowest possible periodSeconds (4). Thanks to that, we will be able to compare the startup time of the app with and without the CRaC mechanism enabled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: callme-service
spec:
  replicas: 3 # (1)
  selector:
    matchLabels:
      app: callme-service
  template:
    metadata:
      labels:
        app: callme-service
    spec:
      containers:
        - name: callme-service
          image: callme-service:1.1.0 # (2)
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: "v1"
          command: ["java"] # (3)
          args: ["-XX:CRaCRestoreFrom=/crac"]
          volumeMounts:
            - mountPath: /crac
              name: crac
          readinessProbe: # (4)
            initialDelaySeconds: 0
            periodSeconds: 1
            httpGet:
              path: /actuator/health/readiness
              port: 8080
          securityContext:
            privileged: true
          resources:
            limits:
              cpu: '1'
      volumes:
        - name: crac
          persistentVolumeClaim:
            claimName: crac-store
Let’s apply the Deployment to the Kubernetes cluster:
$ kubectl apply -f deployment-crac.yaml
Just to clarify – here’s the visualization of our scenario:
Finally, we can display a list of running callme-service pods.
$ kubectl get po -n crac
NAME READY STATUS RESTARTS AGE
callme-service-6fb68cbd5b-5wz6x 1/1 Running 0 2m38s
callme-service-6fb68cbd5b-pds8c 1/1 Running 0 3m3s
callme-service-6fb68cbd5b-zbf6h 1/1 Running 0 2m18s
Compare the Startup Time of Pods
In order to compare the startup time of our app with and without CRaC, we just need to replace the following single line in the Deployment manifest.
One thing is worth mentioning here. When I try to measure the startup of the Spring Boot app restored with CRaC, e.g. with the application.started.time metric, it always prints the value measured while the snapshot was being made. Here’s a fragment of logs from that operation performed by the callme-service-snapshot-job Job.
So now, if I restore the app with CRaC, the value returned by the GET /actuator/metrics/application.started.time endpoint is exactly the same. That is obviously not valid, but quite logical. Therefore, we will base our comparison on the times reported by Kubernetes. Since there is no direct statistic that shows the pod startup time, we need to calculate it as the difference between the time when the pod was scheduled and the time when it was reported as ready. Of course, such a calculation includes not only the app startup time but also the time required for pod initialization and the readiness probe period (1s).
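The calculation described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the PodStartupTime class is a hypothetical helper, and the two timestamps in main are made-up values. On a real cluster they would be read from the lastTransitionTime of the pod’s PodScheduled and Ready conditions, for example from the output of kubectl get pod with -o yaml.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: computes how long a pod needed to become ready.
final class PodStartupTime {

    // Both arguments are ISO-8601 timestamps, the format Kubernetes uses
    // for lastTransitionTime (e.g. 2023-09-01T10:15:00Z).
    static Duration between(String scheduledAt, String readyAt) {
        return Duration.between(Instant.parse(scheduledAt), Instant.parse(readyAt));
    }

    public static void main(String[] args) {
        // Made-up example values, not measurements from the article.
        String podScheduled = "2023-09-01T10:15:00Z";
        String podReady = "2023-09-01T10:15:03Z";
        Duration startup = between(podScheduled, podReady);
        System.out.println("Pod startup took " + startup.toSeconds() + "s");
    }
}
```

Keep in mind that the timestamps have one-second resolution, so the result is only accurate to about a second, on top of the probe period mentioned above.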
What are the results? With a CPU limit of 1 core, the pod with the standard app starts in 14s (around 11s-12s of that just for Java), while the pod restored with CRaC starts in 3s (~1s or less for the Java app).
Final Thoughts
CRaC can be treated as an alternative to the native compilation provided by GraalVM for achieving fast Java startup and warmup. GraalVM additionally solves the problem of a large memory footprint. However, it comes at a price: with GraalVM there are more constraints and a potentially more painful troubleshooting process. On the other hand, with CRaC we need to create a snapshot image and store it on a persistent volume, so we have to mount that volume to every pod running on Kubernetes. Anyway, it is better to have one more option available.
The main goal of this article is to familiarize you with the CRaC approach and show how to adapt it to Java apps running on Kubernetes. If you are also interested in native compilation with GraalVM, you can read my post about Spring Boot native microservices with Knative. There is also an article about GraalVM and virtual threads on Kubernetes.