Detecting API Pressure in Gunicorn Backends
Hunter Fernandes
Software Engineer
Single-threaded backends like Gunicorn are notoriously tricky to autoscale. This is because they can only handle one request at a time and are often I/O-bound instead of CPU-bound. I have previously written about Gunicorn, the WSGI server that fronts our Python applications. It’s a great server, but it has a few limitations. Namely, it’s single-threaded and synchronous,1 so we experience this autoscaling pain too.
That is one of the reasons why we pay close attention to performance and resource usage: we want to ensure that we can handle the load.
But how do we measure the load? How do we know when we are at capacity?
This is a critical question for a platform, and it’s hard to answer in a distributed system.
Guesstimating Load Capacity
How do we estimate the load capacity of our backend application?
CPU Is Not a Good Metric for Gunicorn
A typical way to measure load involves looking at the CPU and memory usage of the server. If the CPU is at 100%, you are definitely at capacity. But at this point, you are already in trouble because your CPU is pegged and your response times are slow. The general industry advice is to keep your CPU usage below 80% or 90% to ensure you have some burst headroom.
This advice is good, but it’s not perfect. It breaks down catastrophically for single-threaded, I/O-bound applications like ours running under Gunicorn.
In our web application, most of the time is spent waiting for the database (or other services) to respond. This means the CPU is mostly idle, but we are still at capacity because our single thread is tied up waiting on the database. So CPU usage is not a good scaling signal for us.
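To make this concrete, here is a small, self-contained illustration (not our application code): a synchronous worker that spends most of each request blocked on I/O is fully occupied even though the process burns very little CPU. The 70ms sleep and the small compute loop are made-up stand-ins for a database call and request handling.

```python
import time

start_wall = time.monotonic()
start_cpu = time.process_time()

for _ in range(50):        # pretend to serve 50 requests back to back
    time.sleep(0.070)      # ~70ms blocked on the database (stand-in)
    sum(range(50_000))     # a few ms of actual CPU work (stand-in)

wall = time.monotonic() - start_wall
cpu = time.process_time() - start_cpu
print(f"worker was busy the whole time, but CPU utilization was only {cpu / wall:.0%}")
```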
QPS isn’t Reliable Either
If CPU is a bad metric, what about queries per second (QPS)? QPS is, broadly speaking, the number of requests your service handles in a second. We can take the mean response time, add a little buffer, and call that our capacity per thread. Then, using the CPU time per request, we can estimate the core-time to wall-time ratio and calculate the number of threads and cores we need.
First, we collect a few numbers:
- The mean response time of a request (wall time). Let’s say it’s 80ms.
- The mean CPU usage of a request (core time). Let’s say it’s 10ms.
- The peak QPS of our service. Let’s say it’s 300 QPS.
- Our desired buffer. Let’s say it’s 20%.
Our core-time to wall-time ratio is therefore 10ms / 80ms = 12.5%. Reserving a 20% buffer is the same as multiplying each request’s cost by 1 / (1 − 0.20) = 125%. Each thread can handle 1 second / (125% × 80ms) = 10 QPS. Each core can handle 1 second / (125% × 80ms × 12.5%) = 80 QPS. This translates to 8 threads per core.
So, if our service needs to handle 300 QPS, we need 300 / 80 = 3.75, rounded up to 4 cores, and 300 / 10 = 30 threads. This is a rough estimate, but it’s a good starting point.
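Here is the same arithmetic as a short, self-contained script, using the illustrative numbers above so the inputs are easy to swap for real measurements:

```python
import math

mean_wall_time_s = 0.080  # mean response time per request (wall time)
mean_cpu_time_s = 0.010   # mean CPU time per request (core time)
peak_qps = 300            # peak load we want to handle
headroom = 0.20           # keep 20% of capacity in reserve

# Reserving 20% headroom is the same as multiplying each request's cost by 1 / (1 - 0.20) = 125%.
derate = 1 / (1 - headroom)

qps_per_thread = 1 / (derate * mean_wall_time_s)   # 10 QPS
qps_per_core = 1 / (derate * mean_cpu_time_s)      # 80 QPS
threads_per_core = qps_per_core / qps_per_thread   # 8

threads_needed = math.ceil(peak_qps / qps_per_thread)  # 30
cores_needed = math.ceil(peak_qps / qps_per_core)      # 4
print(f"{threads_needed} threads across {cores_needed} cores")
```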
The issue with this method is that it’s a static calculation and highly sensitive to the mean response time and CPU ratio. If a spate of expensive requests comes in, we will be under-provisioned: the CPU will spike, and our requests will first slow down and then be dropped.
Better: TCP Accept Queue Depth
A better solution is to look at the TCP accept queue depth. Connections are placed in the accept queue after the kernel completes the TCP handshake but before they are `accept()`ed by the application. These are connections waiting for your application to service them.
The main idea is that if the accept queue starts to fill up, you are at capacity and need to scale out. We can publish the accept queue depth as a metric and alert on it (or use it as a scale-out signal). The downside is that this detects load after it has already arrived: by the time you can act on it, you are already at capacity.2 Nevertheless, this is an extremely high-value signal.
Linux exposes these connections in `/proc/self/net/tcp`. This file looks like this:
```
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 00000000:22B8 00000000:0000 0A 00000000:0000002D 00:00000000 00000000 0 0 176082260 46 ffff8881f065e000 100 0 0 10 0
1: CBF80D0A:22B8 16FD0D0A:9D32 01 00000000:0000007B 00:00000000 00000000 0 0 0 1 ffff888237d5e800 20 4 30 10 -1
2: CBF80D0A:22B8 16FD0D0A:9D0E 01 00000000:0000007B 00:00000000 00000000 0 0 0 1 ffff88841bce6800 20 4 30 10 -1
3: CBF80D0A:22B8 16FD0D0A:9C80 06 00000000:00000000 03:00001471 00000000 0 0 0 3 ffff8883910bd8b8
4: CBF80D0A:22B8 16FD0D0A:9D42 01 00000000:0000007B 00:00000000 00000000 0 0 0 1 ffff888100106800 20 4 30 10 -1
5: CBF80D0A:22B8 16FD0D0A:9D72 01 00000000:0000007B 00:00000000 00000000 0 0 0 1 ffff8881ddf21800 20 4 30 10 -1
```
Here’s a breakdown of the columns:
| Column | Description |
| --- | --- |
| sl | Connection entry number |
| local_address | Local IP address and port |
| rem_address | Remote IP address and port |
| st | Connection state |
| tx_queue | Transmit queue size |
| rx_queue | Receive queue size (for a listening socket, the accept queue depth) |
| tr | Timer active flag |
| tm->when | Timer data |
| retrnsmt | RTO timeouts |
| uid | User ID |
| timeout | Unanswered 0-window probes |
| inode | Inode number |
We parse this file to find the line that matches our local listen address and is in the `TCP_LISTEN` state (`st = 0A`, as in the first line of the example). Next, we look at `rx_queue`, which for a listening socket is the number of connections waiting to be accepted by the application. Note that in the data rows, `tx_queue` and `rx_queue` are printed as a single colon-separated field (`tx_queue:rx_queue`), even though the header lists them separately.
Here, we see that `rx_queue` is `0000002D`, which is 45 in decimal. So, there are 45 connections waiting to be accepted. A single-threaded application like Gunicorn can only accept one connection at a time, so 45 connections sitting in the queue is a sign that we are under-scaled and need more capacity.
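Before wiring this into Gunicorn, here is a small standalone sketch (separate from our production code) that deliberately fills an accept queue and reads the depth back out of `/proc/self/net/tcp`. It listens on an arbitrary loopback port and never calls `accept()`:

```python
import socket

# Listen on an ephemeral localhost port but never call accept().
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(128)
port = server.getsockname()[1]

# Complete the TCP handshake for five clients; the kernel parks them
# in the listener's accept queue because nobody accept()s them.
clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(5)]

# Find our LISTEN socket (st == 0A) and decode the rx_queue half of the
# colon-separated tx_queue:rx_queue field.
with open("/proc/self/net/tcp") as f:
    for line in f.readlines()[1:]:
        fields = line.split()
        local_address, st, queues = fields[1], fields[3], fields[4]
        if st == "0A" and int(local_address.split(":")[1], 16) == port:
            print("accept queue depth:", int(queues.split(":")[1], 16))  # typically prints 5
```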
Putting it All Together
We once again customize our Gunicorn Arbiter. The Arbiter’s main loop is a great place to emit metrics periodically.
```python
import os

from gunicorn.arbiter import Arbiter


class CustomArbiter(Arbiter):
    PROC_TCP_FILENAME = "/proc/self/net/tcp"

    def maybe_promote_master(self):
        """Runs every loop."""
        self.emit_accept_queue()
        return super().maybe_promote_master()

    def emit_accept_queue(self):
        """Emit accept queue sizes for each listening socket."""
        with open(self.PROC_TCP_FILENAME) as f:
            content = f.readlines()

        gunicorn_ports = {sock.cfg_addr[1] for sock in self.LISTENERS}

        for line in content:
            _, ladd, _, st, queues, _ = line.strip().split(maxsplit=5)
            if st != "0A":  # TCP_LISTEN
                continue
            port = int(ladd.split(":")[1], 16)
            if port not in gunicorn_ports:
                # Not a gunicorn port
                continue
            # For a LISTEN socket, the rx_queue half of the colon-separated
            # tx_queue:rx_queue field is the accept queue depth.
            accept_q = int(queues.split(":")[1], 16)
            stats.gauge(  # `stats` is our metrics client
                "api.accept_queue",
                accept_q,
                tags={
                    "port": str(port),
                    "pod": os.environ["POD_NAME"],
                },
            )
```
Now, during the normal course of operation, we can see the accept-queue depth for each of our Gunicorn backends through the `api.accept_queue` metric. When it is 0, there is no accept queue and requests are being serviced immediately. When it is non-zero, we are under load and need to scale out. We can use a Horizontal Pod Autoscaler to scale out our Gunicorn pods based on this metric.
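For completeness, here is a hedged sketch of one way to get Gunicorn to run this Arbiter subclass, assuming the service is launched through a custom application entry point instead of the stock gunicorn command. The class name and entry point below are illustrative, not necessarily how our deployment is wired:

```python
from gunicorn.app.wsgiapp import WSGIApplication

# CustomArbiter is the subclass defined above.


class CustomWSGIApplication(WSGIApplication):
    def run(self):
        # BaseApplication.run() normally constructs the stock Arbiter;
        # substitute the metric-emitting subclass instead.
        CustomArbiter(self).run()


if __name__ == "__main__":
    # e.g. python serve.py myproject.wsgi:application --workers 4
    CustomWSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
```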
Tracking the accept-queue depth has been the most reliable way to detect load in our Gunicorn backends.
Footnotes
1. Django + Gunicorn requires running Gunicorn in synchronous mode. Django is not quite ready for async yet. ↩
2. There are some tricks to scale out before you hit capacity, but they are finicky and require a lot of tuning. For example, instead of looking for connections in the accept queue, you can measure active connections and compare that against the number of threads. If the ratio is too high, you can scale out preemptively. ↩