Who Should Cache?

• 5 min read
Hunter Fernandes

Software Engineer


In the past I have mentioned our Authorization services, which all other microservices talk to on each request to verify that the user can access the resource they are requesting. This happens for each and every request, and it’s a synchronous call. It’s a very, very hot path.

So what do you do when you have a hot path like this? You cache it.

But who caches it? The authorization service? The microservices? There are very interesting trade-offs and semantics to consider.

10k Foot View

Here’s how our services formulate an authorization request:

  1. A request comes in from a user to a microservice to access a resource. The resource can be any of hundreds of different types.
  2. The microservice determines which core Authz resource the requested resource maps to (typically this is an owned-by relation).
  3. The microservice formulates an authorization query.
  4. The microservice sends the query to the Authz service.
  5. The Authz service responds with a yes or no.
  6. The microservice either allows or denies the request.
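The steps above can be sketched roughly as follows. The names here (`CoreResource`, `AuthzQuery`, `is_allowed`) are hypothetical stand-ins for illustration, not our real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoreResource:
    """The core Authz resource a requested resource maps to (step 2)."""
    type: str
    id: str

@dataclass(frozen=True)
class AuthzQuery:
    """A normalized authorization query (step 3)."""
    user_id: str
    action: str
    resource_type: str
    resource_id: str

def check_access(authz_client, user_id: str, action: str, owner: CoreResource) -> bool:
    # Step 3: formulate the authorization query against the core resource.
    query = AuthzQuery(user_id, action, owner.type, owner.id)
    # Steps 4-5: synchronous call to the Authz service; it answers yes or no.
    return authz_client.is_allowed(query)
```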

Where to Cache?

There are two clear places we would immediately consider placing the cache set/get:

  1. Authz Service: The Authz service could cache the results of the authorization queries. Then, next time the same query comes in, it can check the cache and respond immediately.
  2. Microservice: The microservice could cache the results of the authorization queries. Next time a user requests the same resource, the microservice can omit the authorization query and respond immediately.

Now, onto the tricky parts: trade-offs and semantics.

Cache Invalidation

The first thing to consider is the notoriously hard problem of cache invalidation.

The Authz service is the source of truth. It also knows when a resource is updated and when a user’s permissions are updated. If the Authz service owns the caches, it can invalidate it immediately. This is a huge win.

Imagine if we did not have this property: a user checks whether they can access a resource and gets denied. Another user then grants them access. When the first user tries again, they are still denied because the cache is stale. That’s a really bad user experience, and it threatens our goal of read-after-write API semantics.

You could come up with a scheme to invalidate the cache, but there be dragons.1
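A minimal sketch of source-of-truth invalidation, using a plain in-process dict as a stand-in cache (hypothetical, for illustration): because the Authz service sees every permission change, it can drop the affected entries the moment a grant or revoke lands.

```python
class DecisionCache:
    """Caches (user, resource) -> allowed; owned by the Authz service."""

    def __init__(self):
        self._entries = {}

    def get(self, user_id, resource_id):
        # Returns True/False if cached, None on a miss.
        return self._entries.get((user_id, resource_id))

    def put(self, user_id, resource_id, allowed):
        self._entries[(user_id, resource_id)] = allowed

    def invalidate_resource(self, resource_id):
        # Called by the Authz service whenever permissions on the
        # resource change, so a stale denial never outlives the grant.
        stale = [k for k in self._entries if k[1] == resource_id]
        for k in stale:
            del self._entries[k]
```

Replaying the scenario above: a cached denial is wiped by the grant, so the user’s next check goes back to the source of truth.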

Traffic Amplification

If we have the microservices cache the results, we can completely avoid the authorization query. Since a user request typically turns into 1-4 requests to the Authz service, this can actually reduce load on the Authz service by a significant amount.

Factor in that the Authz check is a network call, and that avoiding it shaves a few milliseconds off every request, and it’s clear this is a very desirable property.

Shared Caching

We have a lot of microservices. Authz queries are normalized, and the normalized query determines the cache key.

For performance reasons, we would like to have a shared cache. This means that if microservice A checks if a user can access a resource, and then microservice B checks the same thing, B should get the result from the cache from the earlier call to A.
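One way to derive a shared key from the normalized query (a sketch; the field names and the `authz:` prefix are assumptions): canonical JSON with sorted keys and fixed separators means every microservice computes a byte-identical key for the same query.

```python
import hashlib
import json

def cache_key(user_id: str, action: str, resource_type: str, resource_id: str) -> str:
    # Canonical encoding: sorted keys and no incidental whitespace,
    # so service A and service B derive the same key for the same query.
    canonical = json.dumps(
        {"action": action, "resource_id": resource_id,
         "resource_type": resource_type, "user_id": user_id},
        sort_keys=True, separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"authz:{digest}"
```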

Isolation

We like our microservices to be as independent as possible. They do not share a database. More abstractly, they don’t share state at all. We would be violating this principle if we had them share a cache, which is a form of state.

Fighting Formats

If the microservices share a cache, they have to agree on the format of the cache. This is a non-trivial problem. If we change the format of the cache, we have to coordinate the change across all microservices at once.

Failing to synchronize the format would, at best, result in cache misses, which could trigger a thundering herd. This is an outage waiting to happen.

I have done many database migrations in my time and I can tell you that coordinating a change like this is a nightmare. This would not bring me zen. I try not to design things that don’t bring me zen.2

Hybrid Approach

When designing this caching mechanism… we kind of wanted all the benefits. We wanted the Authz service to own the cache to be able to invalidate it, but we also wanted the microservices to cache the results to be able to avoid the network call.

We ended up with a hybrid approach where the Authz service sets the cache, but the microservices read from the cache.

This deliberately breaks our no-shared-state rule, but we think it’s worth it in this case. Usually we allow engineers to set cache keys and values as they see fit (within some guidelines), but in this case we have a strict format that the Authz service sets and the microservices read.
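In sketch form (the names are hypothetical), a microservice’s read path under this hybrid approach looks like the following: read the shared cache that only the Authz service writes, and fall back to the network call on a miss.

```python
def is_allowed(shared_cache, authz_client, key, query) -> bool:
    # Microservices only *read* the shared cache. The Authz service is
    # the sole writer, which is what lets it invalidate immediately.
    cached = shared_cache.get(key)
    if cached is not None:
        return cached
    # Cache miss: fall back to the synchronous Authz call. The Authz
    # service populates the cache as a side effect of answering.
    return authz_client.is_allowed(query)
```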

This effectively becomes the third supported communication protocol between services. First we had HTTP REST API calls, then we added an event bus, and now we have a shared cache.

Since this is a protocol, it gets a lot of scrutiny (and love and care). The format is well specified with a schema; it’s versioned, interop-tested, and has a predefined migration path forward.
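A sketch of what a versioned cache value could look like (the envelope fields here are assumptions, not our actual wire format): readers treat an unrecognized version as a cache miss rather than an error, which is what lets old and new readers coexist during a format migration.

```python
import json

SCHEMA_VERSION = 1  # hypothetical; bumped on breaking format changes

def encode_decision(allowed: bool) -> str:
    # Written only by the Authz service.
    return json.dumps({"v": SCHEMA_VERSION, "allowed": allowed})

def decode_decision(raw: str):
    # Read by microservices. An unknown version yields None (a cache
    # miss that falls back to the Authz service), not an exception.
    doc = json.loads(raw)
    if doc.get("v") != SCHEMA_VERSION:
        return None
    return doc["allowed"]
```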

Outcome

It’s been a few years since we implemented this and it’s been rock solid. We get to have our cake and eat it too: great caching and great cache invalidation.

We have evolved the format twice and it’s been a non-event. We have also added a few more microservices without issue.

In hindsight, I am glad that I emphasized cache invalidation as much as I did. Our frontend team later used authz checks as a makeshift polling mechanism (which is another story…), and it worked because they could rely on the cache invalidating as soon as a background task finished. This was an emergent property we did not anticipate, but it meshed well with our design.

Footnotes

  1. https://martinfowler.com/bliki/TwoHardThings.html

  2. PDD: Pager Driven Development?