Simulating Queue Delay

Hunter Fernandes · Software Engineer · 5 min read


I was having a discussion with an ex-coworker about queueing background work. He mentioned that when his system comes under heavy load, one of the things that increases is job wait time, which leads to fun side effects. Really, what goes screwy is everything that depends on a job being finished without any way to verify that it actually is. That’s a race condition!

This race condition is particularly nasty because under standard load patterns you will very rarely (if ever!) see it crop up. The same properties that make a queue broker desirable in the first place (fast message issuance and in-orderish delivery, to name a few) make these race conditions unlikely to surface under normal load.

One of the developer tools that I built into the stack at my company was the ability to inject queue delay into background jobs. This simulates the experience of your distributed system under heavy job load without actually having your whole system under load.

The Why

This race condition keeps creeping into our code because we frequently move work to background tasks. In some cases, we dispatch a background task that itself dispatches multiple background tasks. And that’s just within one service! Once background tasks start being created in other microservices, the call graph becomes complex.

You may find that your API used to launch a background task that did the thing directly. But now, the API launches a background task that launches another background task that does the thing: another level of indirection! Our API contract does not guarantee when a background job will complete, so this change is semantically equivalent and is not a breaking change.

However, clients who programmed against the earlier, faster behavior now have a race condition that gets exercised more often. Your bugs get aggravated.

Truthfully, the race condition always existed. We just made it happen more often.

As an aside, if the API previously did the work in a blocking fashion but moved it to be asynchronous, this would be a semantic (and therefore breaking!) change. To accommodate this without breaking the API, we would have to either:

  1. bump the API level, which is a massive pain, or
  2. add a new optional call parameter that opts in to the new asynchronous background behavior.

Because both would require client changes, often it’s easier to go with (2) so that you are only burdened with maintaining one API endpoint. In either case, you’re probably going to end up returning a job token so the client has a sync point to ensure the background work is done before proceeding.
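To make that concrete, here is a minimal sketch of option (2), assuming a Flask-style endpoint backed by a Celery task. The names here (`do_the_thing`, `job_status`, the `async` query parameter) are illustrative stand-ins, not our actual API:

```python
from celery import Celery
from flask import Flask, jsonify, request

celery_app = Celery("api", broker="sqs://")
app = Flask(__name__)


@celery_app.task
def do_the_thing(payload):
    # ... the actual work ...
    return {"done": True}


def job_status(job_token):
    # Look up task state from Celery's result backend.
    return celery_app.AsyncResult(job_token).state


@app.post("/things")
def create_thing():
    payload = request.get_json()
    if request.args.get("async") == "true":
        # Opt-in path: dispatch a background job and hand back a sync point.
        job = do_the_thing.delay(payload)
        return jsonify({"job_token": job.id}), 202
    # Default path: do the work in-line so existing clients are unaffected.
    result = do_the_thing(payload)
    return jsonify(result), 201


@app.get("/jobs/<job_token>")
def get_job(job_token):
    # Clients poll this until the background work reports completion.
    return jsonify({"state": job_status(job_token)})
```

The returned `job_token` is the client’s sync point: it polls the job endpoint until the work reports done, instead of guessing from side effects.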

The How

Simulated queue delay injection works by sticking a qdelay claim into the authentication token. The claim value is the number of milliseconds to delay jobs by. Whenever the client calls our API, the backend sees this claim and simply delays any jobs it dispatches by that many milliseconds, using a combination of the DelaySeconds parameter on sqs:SendMessage and Celery’s eta parameter.
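The dispatch side looks roughly like this. It is a hedged sketch: `claims` stands in for however your framework exposes the decoded auth token, `send_report` and `dispatch_with_qdelay` are illustrative names, and only the Celery `eta` path is shown (the real implementation also leans on SQS’s DelaySeconds):

```python
from datetime import datetime, timedelta, timezone

from celery import Celery

celery_app = Celery("worker", broker="sqs://")


@celery_app.task
def send_report(report_id):
    # ... the actual background work ...
    return report_id


def dispatch_with_qdelay(report_id, claims):
    # `claims` is the decoded authentication token for the current request.
    delay_ms = int(claims.get("qdelay", 0))

    if delay_ms > 0:
        # Push the task's earliest start time into the future by the
        # simulated queue delay.
        eta = datetime.now(timezone.utc) + timedelta(milliseconds=delay_ms)
        send_report.apply_async(args=[report_id], eta=eta)
    else:
        send_report.delay(report_id)
```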

Since microservices forward authentication tokens on interservice calls (part of our zero-trust model), downstream RPC calls also see a properly set qdelay claim.

Passing call context between microservices is underrated. We have implemented a context map that is propagated all the way down the RPC call chain, including hops across services and hops into background jobs. It was initially added for distributed tracing, but more tools have been bolted onto it since.
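A rough sketch of the context-map idea, using Python’s contextvars; the header names and fields are illustrative, not our actual wire format:

```python
import contextvars

import requests

# The per-call context map. It is populated when a request arrives and read
# whenever we make an outgoing call or enqueue a background job.
call_context: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "call_context", default={}
)


def set_context_from_request(headers):
    # Capture the fields we propagate: trace id and the auth token (which
    # carries the qdelay claim), plus whatever gets bolted on later.
    call_context.set({
        "trace_id": headers.get("X-Trace-Id", ""),
        "authorization": headers.get("Authorization", ""),
    })


def rpc_call(url, payload):
    # Forward the context map on every interservice call so downstream
    # services (and their background jobs) see the same claims.
    ctx = call_context.get()
    headers = {
        "X-Trace-Id": ctx.get("trace_id", ""),
        "Authorization": ctx.get("authorization", ""),
    }
    return requests.post(url, json=payload, headers=headers)
```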

Caught Bugs

After implementing this, my UI partner-in-crime and I immediately exercised our product with queue delays injected. And, oh boy, were things… weird. The product turned sluggish and, at times, outright broken.

We found places all over our UI where the code expected work to be completed before it actually was. In certain cases, the API did not return a sync token. The UI developers worked around this by reading related data as a heuristic indicator of the actual work being finished. These are API deficiencies! The UI should not need to do crazy hacks like that. We added sync tokens to these APIs so the UI had a proper way to poll for the expected state.

Future Work

In the future, I may add the ability to apply “jitter” to the delay and to set the delay per microservice. This will help simulate failure conditions where only one microservice is overloaded.
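The jitter piece could be as small as perturbing the configured delay per job; the ±25% bound here is an arbitrary choice for illustration:

```python
import random


def jittered_delay_ms(base_delay_ms: int, jitter: float = 0.25) -> int:
    """Spread a simulated delay uniformly within ±jitter of the base value."""
    return int(base_delay_ms * random.uniform(1.0 - jitter, 1.0 + jitter))
```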

Finally, I want to add this to our automated end-to-end UI testing framework. Covering 80% of the test cases would be pretty simple: every login gets some queue delay, and a few of the wait timeouts get raised.

Closing Thoughts

These race conditions seem simple when written down after many, many weeks of debugging using very, very sporadic and disconnected logs. But these bugs are emergent in a distributed system. You could make the argument that these should never have occurred in the first place. While I agree with that sentiment, I think these bugs are inevitable. It’s better to have systems in place for catching and debugging these problems when they arise. And this is a great case where developer tooling can help.