Kubernetes Liveness and Readiness Probes in practice
The temporary 502 error
Although I had created a livenessProbe for the web app, every time I deployed the app to my Kubernetes cluster with a rolling update, the website displayed a 502 error for a few seconds. After testing the Kubernetes behaviour and reading the official docs, it turned out that a failing livenessProbe only triggers recreation of the pod; it does not stop Kubernetes from routing traffic to the pod, hence the 502 errors. The readinessProbe, on the other hand, is the probe that prevents a pod from receiving traffic until the probe succeeds. After adding a readinessProbe, the temporary 502 errors disappeared and I had a zero-downtime deployment.
The final livenessProbe and readinessProbe for the web app:
livenessProbe:
  httpGet:
    path: /api/health
    port: 80
  initialDelaySeconds: 5
  failureThreshold: 4
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /api/health
    port: 80
  initialDelaySeconds: 5
  failureThreshold: 4
  periodSeconds: 10
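To confirm the rolling update really is zero-downtime, one quick sanity check is to hit the health endpoint in a loop while the deployment rolls. A minimal sketch (the URL is a placeholder for wherever your service is exposed):

```shell
#!/bin/bash
# Poll the health endpoint once per second and report any non-200 response.
# http://example.com/api/health is a placeholder, not the real service address.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://example.com/api/health)
  if [ "$code" != "200" ]; then
    echo "$(date +%T) got HTTP $code"
  fi
  sleep 1
done
```

Run it in one terminal while the rolling update happens in another (for example via kubectl rollout restart deployment/<name>); with the readinessProbe in place, no 502 lines should appear.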
Why the livenessProbe failed
But I still didn't understand why the livenessProbe failed. I used the kubectl top pod command and reviewed the logs, and found that the CMD command in the Dockerfile was causing the issue.
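For reference, the diagnostics looked roughly like this (the label selector and pod name are placeholders):

```shell
# Per-pod CPU/memory usage (requires metrics-server in the cluster)
kubectl top pod -l app=y

# Events at the bottom of the output show probe failures and restarts
kubectl describe pod <pod-name>

# Logs from the current container, and from the one that was killed
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
```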
Here is the original entrypoint.sh that was used as the default CMD of the container:
#!/bin/bash
set -xe
rm -rf var/cache/*
bin/console cache:clear
bin/console cache:warmup
chown -R www-data:www-data var
exec supervisord -c /etc/supervisor/supervisord.conf
I used it in the Dockerfile as CMD ["./entrypoint.sh"].
These Symfony commands need a relatively large amount of memory, and running them left php-fpm unable to handle any other work. Once the commands finished and memory usage returned to normal, the app started serving traffic again. In fact, running these heavy commands at runtime doesn't make much sense; they belong at build time. So I moved them from the entrypoint.sh file to the Dockerfile, where they run when the image is built.
RUN set -xe && \
    rm -rf var/cache/* && \
    bin/console cache:clear && \
    bin/console cache:warmup && \
    chown -R www-data:www-data var
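With the cache and permission steps moved into the image build, the entrypoint has almost nothing left to do. Assuming nothing else is needed at startup, the slimmed-down entrypoint.sh would look like this:

```shell
#!/bin/bash
set -xe
# Cache clearing, warmup, and permissions are now handled at build time
# in the Dockerfile; the container only needs to start supervisord.
exec supervisord -c /etc/supervisor/supervisord.conf
```

Because the container no longer does heavy work at startup, memory usage stays flat from the first second and the probes pass immediately.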
Lifecycle hook
While investigating the 502 error, I also found some good practices that minimize downtime during deployments, especially when you run the k8s cluster with external services like Azure Application Gateway. In those cases, a preStop lifecycle hook helps minimize the chance of 502 errors. For example:
kind: Deployment
metadata:
  name: x
  labels:
    app: y
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: ctr
        ...
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "5"]
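One caveat worth noting: the preStop hook runs before the container receives SIGTERM, and the whole shutdown (hook plus application exit) has to fit within the pod's termination grace period, which defaults to 30 seconds. A 5-second sleep fits comfortably, but if you ever use a longer sleep, raise the grace period accordingly; a sketch of the relevant field:

```yaml
spec:
  template:
    spec:
      # Must exceed the preStop sleep plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
```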