Graceful Gunicorn Pod Shutdowns

5 min read

Hunter Fernandes
Software Engineer


When developing for Kubernetes, one of the top things you must consider is what happens when your pod is terminated. Stateful or stateless, long-lived or short-lived, your pod will eventually terminate. In a typical case, this would be due to a deployment update, but you could also have a node update leading to termination.

Web servers are especially exposed to this, because they are usually in the middle of handling requests when the pod starts terminating. Killing a pod while it is serving requests leads to failed API calls and HTTP 500s.

If you have good monitoring, you will see these 500s and know something is wrong. So, let’s talk about how to handle this gracefully.

Pod Termination

Here is the general order of events for pod termination:

  1. The pod is set to be terminated. You could trigger this with kubectl delete pod or by updating the owning Deployment.
  2. The pod is marked as terminating, and the configured grace period starts. The pod is sent SIGTERM so it can start shutting down.
  3. The pod is removed from the Service endpoints, so no new requests are routed to it. While this happens quickly in the Kubernetes control plane, it may take a while for upstream load balancers to stop sending requests.
  4. After the grace period, the pod is forcefully terminated with SIGKILL.
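
For reference, here is a minimal Python sketch of what that sequence looks like from inside the container; the handler and log message are purely illustrative:

import signal
import time


def on_sigterm(signum, frame):
    # Step 2: the kubelet delivered SIGTERM and the grace period clock is running.
    print("SIGTERM received, starting graceful shutdown...", flush=True)


signal.signal(signal.SIGTERM, on_sigterm)

while True:
    # Step 4: if the process is still alive when the grace period expires,
    # the kubelet sends SIGKILL, which cannot be caught or handled.
    time.sleep(1)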

There are a few interesting things to note here.

Key Issue: Downstream Deregistration

It takes a while for downstream services to remove the pod from rotation. For example, if you use the AWS Application Load Balancer, reconciling the target group via the AWS APIs takes a few seconds. Then it is up to the AWS control plane to update the internal load balancer configuration to stop sending traffic. So there will be a period AFTER receiving SIGTERM during which you will still receive new requests.

Therefore, if your code simply “exits” when it receives SIGTERM, you will kill not only the in-flight requests but also new ones coming in.

Built-In Solution: Gunicorn Defaults Don’t Work

My preferred Python WSGI server is Gunicorn. It comes with built-in support for graceful shutdowns:

TERM: Graceful shutdown. Waits for workers to finish their current requests up to the graceful_timeout.

So Gunicorn will continue to process existing requests but not new ones. This is a good default, but it’s not enough for Kubernetes because of the downstream deregistration delay.
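
For context, graceful_timeout is a regular Gunicorn setting; a minimal gunicorn.conf.py with illustrative values looks like this:

# gunicorn.conf.py (illustrative values)
bind = "0.0.0.0:8000"
workers = 4
# Seconds workers get to finish in-flight requests after a graceful stop.
graceful_timeout = 30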

We need to keep serving new connections after SIGTERM and only shut down once they stop arriving.

Lazy Solution: Just Wait Longer

If the problem is that new connections keep coming in after the pod starts terminating… what if we just wait longer?

We can increase terminationGracePeriodSeconds in the pod spec to, say, 300 seconds, while disabling Gunicorn's graceful_timeout so that it does not shut down on its own at all. Then we simply terminate on the final SIGKILL.

This way, pod termination takes the full 300 seconds, followed by a hard shutdown. Our goal here is to wait out the deregistration delay.
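
One way to approximate the "do not shut down on SIGTERM" part is to override the signal handler from a Gunicorn server hook. This is a sketch of the idea rather than the exact configuration; the when_ready hook and signal handling here are my own illustration:

# gunicorn.conf.py -- sketch of the "just wait for SIGKILL" approach
import signal


def when_ready(server):
    # Ignore SIGTERM in the Gunicorn master so the pod keeps serving
    # until Kubernetes sends the final SIGKILL at the end of the grace period.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)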

But now pod termination takes 300 seconds, which is a long time. That’s like… forever. If your update strategy is RollingUpdate and pods are replaced one at a time, a large deployment takes a long time to roll out: 10 pods = 50 minutes of waiting.

You can tune the 300 seconds to be shorter, but you are still not addressing the underlying issue that your termination condition is not reactive to your workload.

Better Solution: Wait for a Gap in Requests

What I’ve found to work the best is to wait for a gap in requests before shutting down. The idea is that if you have a gap in requests over a certain fixed period of time, you can be confident that the upstream load balancers have removed the pod as a target.

We make two changes to Gunicorn:

  1. Record the time of the last request start.
  2. Modify the Arbiter to check if there has been a gap in requests for a certain period of time before terminating.
from datetime import datetime
from gunicorn.arbiter import Arbiter


class CustomArbiter(Arbiter):
    # Seconds to wait for a gap in requests before shutting down
    SHUTDOWN_REQUEST_GAP = 15

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.last_request_start = datetime.now()
        self.sigterm_recd = False

    def handle_request(self, worker):
        self.last_request_start = datetime.now()
        return super().handle_request(worker)

    def handle_term(self):
        """Called on SIGTERM"""

        if self.sigterm_recd:
            # Sending 2 SIGTERMs will immediately initiate shutdown
            raise StopIteration

        # Prevent immediate shutdown if no requests recently
        self.last_request_start = datetime.now()
        self.sigterm_recd = True

    def maybe_promote_master(self):
        """Runs every loop"""
        super().maybe_promote_master()
        return self.maybe_finish_termination()

    def maybe_finish_termination(self):
        """Finish the termination if needed"""

        if self.sigterm_recd:
            gap = (datetime.now() - self.last_request_start).total_seconds()
            if gap > self.SHUTDOWN_REQUEST_GAP:
                raise StopIteration
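
The stock gunicorn command instantiates the standard Arbiter, so you also need a way to substitute the custom class. One option, sketched below with my own entry-point wiring (the script and app module names are placeholders, and it assumes CustomArbiter is importable), is to subclass the WSGI application and override run():

import sys

from gunicorn.app.wsgiapp import WSGIApplication


class CustomApplication(WSGIApplication):
    def run(self):
        # Same as BaseApplication.run(), but with the custom arbiter.
        try:
            CustomArbiter(self).run()
        except RuntimeError as e:
            print(f"Error: {e}", file=sys.stderr)
            sys.exit(1)


if __name__ == "__main__":
    # Invoked like the gunicorn CLI, e.g.: python serve.py myapp.wsgi:application
    CustomApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()

One wiring detail to keep in mind: workers run in separate processes from the arbiter, so the last-request timestamp also has to reach the master somehow, for example via a pre_request server hook that records the time somewhere the arbiter can read.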

I have found that this approach works very well across a variety of workloads. After scanning our logs, a 15-second gap gives us 99.9% confidence that the pod has been removed from the upstream load balancer. In practice, I have never seen a request come in after a 15-second gap.

No more random 500s during deployments!

…well, sort of. There are still a few edge cases where this might not work. For example, if you have a long-running request that takes more than 15 seconds, you can still get a 500. To fix this, we would need to consult the OS to see whether more connections are still coming in before shutting down. You can do that with /proc/net/tcp, but that’s a topic for another day.