Kubernetes Liveness and Readiness Probes in practice

The temporary 502 error

Although I had defined a livenessProbe for the web app, every time I deployed it to my Kubernetes cluster with a rolling update, the website returned a 502 error for a few seconds. After testing the Kubernetes behaviour and reading the official docs, I found that a failing livenessProbe only causes the container to be restarted; it does not stop the pod from receiving traffic, hence the 502 errors. The readinessProbe, on the other hand, keeps a pod from receiving traffic until the probe succeeds. After I added a readinessProbe, the temporary 502 error disappeared, and I now have a zero-downtime deployment.
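
One way to confirm this behaviour is to poll the site while a rollout is in progress and log every response that is not a success. The following is a minimal sketch, not the method used in this post: the function names, the polling count and interval, and the URL in the usage note are all hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for watching a site during a rolling update.
# Nothing here comes from the original setup; adjust to your environment.

# Print "ok" for 2xx/3xx status codes and "fail" for anything else
# (including 000, which curl reports when it cannot connect).
classify_status() {
  case "$1" in
    2??|3??) echo ok ;;
    *) echo fail ;;
  esac
}

# Fetch the URL once and print only its HTTP status code.
poll_once() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Poll the URL COUNT times, INTERVAL seconds apart, and print a
# timestamped line for every failed response.
watch_rollout() {
  local url="$1" count="${2:-60}" interval="${3:-1}" code
  for _ in $(seq "$count"); do
    code="$(poll_once "$url")"
    if [ "$(classify_status "$code")" = fail ]; then
      echo "$(date +%T) $code"
    fi
    sleep "$interval"
  done
}
```

For example, running watch_rollout https://example.com/api/health 120 1 in one terminal while deploying in another should print nothing if the rollout is free of 502s.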

The final livenessProbe and readinessProbe for the web app:

livenessProbe:
  httpGet:
    path: /api/health
    port: 80
  initialDelaySeconds: 5
  failureThreshold: 4
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /api/health
    port: 80
  initialDelaySeconds: 5
  failureThreshold: 4
  periodSeconds: 10

Why the livenessProbe failed

But I still didn't understand why the livenessProbe failed. I checked resource usage with the kubectl top pod command and reviewed the logs, and found that the CMD in the Dockerfile was causing the issue.
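
The kind of checks described above can be sketched as a small shell function. The exact commands used are not recorded in this post, so the function name, the flags, and the pod and namespace arguments are assumptions. Note that kubectl top only works when the metrics-server is installed in the cluster.

```shell
#!/bin/bash
# Hypothetical helper: gather the usual evidence for a failing probe.
# Usage: diagnose_pod <pod-name> [namespace]
diagnose_pod() {
  local pod="$1" ns="${2:-default}"

  # Current CPU and memory usage, broken down per container.
  kubectl top pod "$pod" -n "$ns" --containers

  # Pod events; failed liveness or readiness probes are reported here.
  kubectl describe pod "$pod" -n "$ns"

  # Logs of the previous container instance, i.e. the one that was
  # restarted. This returns an error if the container never restarted.
  kubectl logs "$pod" -n "$ns" --previous --tail=100
}
```

Calling diagnose_pod with the name of a pod that keeps restarting shows its memory usage next to the probe failure events, which is how a memory-heavy startup command becomes visible.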

Here is the original entrypoint.sh that was used as the default CMD of the container:

#!/bin/bash

set -xe
rm -rf var/cache/*
bin/console cache:clear
bin/console cache:warmup
chown -R www-data:www-data var

exec supervisord -c /etc/supervisor/supervisord.conf

In the Dockerfile, it was set as the default command with CMD ["./entrypoint.sh"].

These Symfony commands use a relatively large amount of memory, and running them at container start left php-fpm unable to handle additional work. Once they finished and memory usage returned to normal, the app could serve more traffic. Running such heavy commands at runtime does not make much sense anyway; they belong in the build step. In the end, I moved them from entrypoint.sh to the Dockerfile so that they run when the image is built:

RUN set -xe && \
    rm -rf var/cache/* && \
    bin/console cache:clear && \
    bin/console cache:warmup && \
    chown -R www-data:www-data var

Lifecycle hook

While investigating the 502 error, I also found some good practices for minimizing downtime during deployments, especially when the k8s cluster is used with external services such as Azure Application Gateway. In those cases, a preStop lifecycle hook helps reduce the chance of 502 errors: the hook runs before the container receives its termination signal, so a short sleep gives the external load balancer time to stop routing requests to the terminating pod.

For example:

kind: Deployment
metadata:
  name: x
  labels:
    app: y
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: ctr
        ...
        lifecycle:
          preStop:
            exec:
              command: ["sleep","5"]

Reference

  1. minimize-downtime-during-deployments