How we invalidate cache for resource-heavy & long-running requests

Cache invalidation

Performance

Scalability

SRE

28th June, 2024

How to refresh the cache for popular, heavy and long-running API requests

What problem did we encounter

Before I detail our cache invalidation strategy, let’s look at the problem we had at hand:

We cache some of our API responses in Redis (AWS ElastiCache).
These API responses are cached with a fixed TTL (the duration varies from one API to another depending on the business case).
Whenever a cached API response expires, the next API request:
- formulates a new response
- stores it in the cache for subsequent API requests to consume
- returns the fresh response for the API request
Some of the APIs whose responses we cache are:
- long-running (10 seconds to formulate a response when not cached)
- resource-heavy (cause a significant database load)
- popular (are frequently requested)
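
The caching flow described above can be sketched as follows. This is a minimal illustration rather than our production code: `InMemoryCache` is a hypothetical stand-in for Redis, and `get_response` and `build_response` are names made up for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class InMemoryCache:
    """Hypothetical stand-in for Redis: values are stored with an expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave as if they were never stored.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (value, self._clock() + ttl_seconds)


def get_response(cache: InMemoryCache, key: str, ttl_seconds: float,
                 build_response: Callable[[], Any]) -> Any:
    """Serve from the cache; on a miss, build, store and return a fresh response."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    fresh = build_response()  # the slow, database-heavy step
    cache.set(key, fresh, ttl_seconds)
    return fresh
```

Note that nothing here stops several requests from calling `build_response` at the same time once the entry has expired.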

Now, for long-running, resource-heavy and popular API requests, the above caching setup caused the following issue whenever the cached response expired:

Before the first API request can formulate a new response and add it to the cache, further API requests arrive.
Finding nothing in the cache, these requests also reach our database to formulate a fresh response.
Because of the large number of such requests reaching our database, the time to formulate a response grows from 10 seconds to as much as 2 minutes (due to the increased database load).
During this two-minute window, the increased database load also affects other services in our setup.
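
To get a feel for the scale of this issue, here is a rough back-of-the-envelope sketch. The arrival rate (one request every 100 ms) is a made-up figure, and the model optimistically assumes the response still takes 10 seconds to build, whereas in practice it grew to as much as 2 minutes under load.

```python
from typing import Iterable


def count_database_hits(arrival_times: Iterable[float], build_seconds: float) -> int:
    """Count requests that miss the cache after it expires.

    Every request that arrives before the first rebuild has finished
    finds the cache empty and goes to the database.
    """
    arrivals = sorted(arrival_times)
    if not arrivals:
        return 0
    cache_warm_at = arrivals[0] + build_seconds
    return sum(1 for arrival in arrivals if arrival < cache_warm_at)


# Hypothetical traffic: one request every 100 ms for 20 seconds.
arrivals = [i / 10 for i in range(200)]
hits = count_database_hits(arrivals, build_seconds=10)
```

Even under these optimistic assumptions, `hits` comes out at 100: a hundred copies of the same heavy query run concurrently instead of one.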

Given this issue, our goal was to change our cache-invalidation mechanism to avoid the two-minute window of system instability. We wanted to be able to cache a fresh response without a significant impact on our overall system.

What solutions did we evaluate

Below are some of the solutions we evaluated for refreshing the cache without overloading the system when a cached response expires:

Optimize the query

When we asked around on forums, this was one of the recommendations.

We were aware that a permanent resolution would be to optimize the query so that it runs faster and doesn’t cause a substantial database load. However, query optimization in this case didn’t just require setting up additional indexes. It required schema changes (breaking up tables and adding new tables). And this meant substantial changes to one of our legacy sub-systems.

So, while optimizing the queries was the most permanent solution, we decided not to take this route for now, given the effort involved in working around the legacy complications.

Out-of-band cache refresh

If we had a background job that repopulated our cache at a predetermined interval, our API requests would never experience an expired-cache scenario. Say our background job ran every 8 hours and repopulated cached API responses that expire every 24 hours. The actual API requests would then always find a cached response. And, with complete control over the background job, this solution could help us avoid system overload during cache refresh.

However, this approach was not suited to our scenario. This is because our API in question had a very large number of possible parameters that needed to be part of the cache key. So, the number of API responses the out-of-band job would have to cache would be large. Alternatively, we could use the out-of-band job only for the most popular parameters. But doing so would:

require us to track the popularity of the parameters for the out-of-band job to consume
make our caching strategy harder to debug (since the cache refresh mechanism would differ for different API parameters)

Due to the above complications, we could not leverage an out-of-band cache refresh mechanism.
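
To illustrate why the number of cache keys matters, here is a small sketch of what such an out-of-band job would have to do. The parameter names and value counts are invented for the example, the cache is a plain dictionary, and `build_response` stands in for the slow database query.

```python
from itertools import product
from typing import Any, Callable, Dict, List


def cache_key(params: Dict[str, Any]) -> str:
    """Build a deterministic cache key from the request parameters."""
    return "api:" + "&".join(f"{name}={params[name]}" for name in sorted(params))


def refresh_all(cache: Dict[str, Any],
                parameter_space: Dict[str, List[Any]],
                build_response: Callable[[Dict[str, Any]], Any]) -> int:
    """Rebuild the cached response for every parameter combination."""
    names = sorted(parameter_space)
    refreshed = 0
    for values in product(*(parameter_space[name] for name in names)):
        params = dict(zip(names, values))
        cache[cache_key(params)] = build_response(params)  # one heavy query each
        refreshed += 1
    return refreshed


# Hypothetical parameters: 3 regions x 4 categories x 5 pages = 60 heavy queries.
space = {"region": ["eu", "us", "in"], "category": [0, 1, 2, 3], "page": [0, 1, 2, 3, 4]}
cache: Dict[str, Any] = {}
refreshed = refresh_all(cache, space, build_response=lambda params: dict(params))
```

The number of queries per run is the product of the value counts, so each additional parameter multiplies the work the job has to do.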

Stale-while-revalidate

With this approach, we consider a cached item stale for a predetermined number of seconds / minutes / hours before its expiry. Let’s call this duration a stale window. During this stale window an API request can:

trigger the cache refresh
and serve the soon-to-expire content from the cache

While this mechanism ensures a fast API response during cache refresh, it still doesn’t solve the problem of high system load during cache refresh. This is because every API request that arrives during the stale window would trigger a cache refresh.

As a result, we could not leverage a plain stale-while-revalidate approach.
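
A minimal sketch of the plain stale-while-revalidate decision shows the problem. The `Entry` type, the `lookup` function and the 23-hour / 24-hour figures are illustrative assumptions, not a real library API.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Entry:
    value: Any
    stored_at: float  # when the response was cached (seconds)
    max_age: float    # fresh until stored_at + max_age
    ttl: float        # expired after stored_at + ttl


def lookup(entry: Optional[Entry], now: float) -> Tuple[Optional[Any], bool]:
    """Return (value_to_serve, should_refresh) for one incoming request."""
    if entry is None or now >= entry.stored_at + entry.ttl:
        return None, True  # expired: nothing to serve, must rebuild
    is_stale = now >= entry.stored_at + entry.max_age
    return entry.value, is_stale  # stale entries are served AND refreshed


HOUR = 3600
entry = Entry(value="cached response", stored_at=0.0, max_age=23 * HOUR, ttl=24 * HOUR)

# Five requests arrive in the stale window before any refresh has completed.
refreshes = sum(lookup(entry, 23 * HOUR + second)[1] for second in range(5))
```

Here `refreshes` is 5: each request is served quickly from the cache, but each one also kicks off its own heavy refresh.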

The solution that we implemented

While the stale-while-revalidate strategy looked attractive (simple to implement, easy to debug if issues arise in the future), we could not use it as is, owing to the system load issue during cache refresh. To resolve this, we used a modified version of the stale-while-revalidate strategy:

Modified stale-while-revalidate

To avoid high system load during cache refresh, we wanted an approach where only one of the incoming API requests during the stale window could trigger the cache revalidation. To achieve this, we implemented stale-while-revalidate with a lock mechanism:

Say we had an API response cached with an expiry duration (TTL) of 24 hours and a max-age of 23 hours. The final hour before expiry is then the stale window.
Every incoming API request would check if the cached object was in a stale window.
On detecting a stale cached object, it would check for the presence of a lock to determine if that API request should trigger a cache refresh.
If it does not find the lock, the API request would:
- Create a new lock item in the cache with a short TTL (say, 10 minutes)
- Trigger an asynchronous task to fetch a fresh response from the database and then store it in the cache.
- Proceed with serving the response from the cache (that is stale but not yet expired) without waiting for the cache refresh to finish.
If it finds a lock present in the cache, the API request would:
- Proceed with serving the response from the cache (that is stale but not yet expired)

With the above approach:

The lock item stored in the cache with a short TTL would prevent multiple API requests from performing a cache refresh.
In case of errors / failures during the asynchronous cache refresh, the short TTL for the lock item would expire, enabling subsequent API requests to retry the cache refresh.

In this way, the lock item would ensure that only one of the incoming API requests triggers a cache refresh during the stale-cache window, rather than all of them.
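
The flow described in this section can be sketched roughly as below. This is an illustration under stated assumptions, not our production code: `FakeRedis` is an in-memory stand-in for the cache, and `handle_request`, `build_response` and `run_async` are hypothetical names. One deliberate difference from the steps above: the sketch merges "check for the lock" and "create the lock" into a single atomic `set_if_absent` call (comparable to Redis `SET` with the `NX` and `EX` options), so two requests cannot both see the lock as missing.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple

TTL = 24 * 3600      # cached response expires after 24 hours
MAX_AGE = 23 * 3600  # response is considered stale after 23 hours
LOCK_TTL = 10 * 60   # refresh lock expires after 10 minutes


class FakeRedis:
    """In-memory stand-in for the cache; not a real Redis client."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}
        self._mutex = threading.Lock()

    def get(self, key: str) -> Any:
        with self._mutex:
            item = self._data.get(key)
            if item is None or self._clock() >= item[1]:
                self._data.pop(key, None)
                return None
            return item[0]

    def set(self, key: str, value: Any, ttl: float) -> None:
        with self._mutex:
            self._data[key] = (value, self._clock() + ttl)

    def set_if_absent(self, key: str, value: Any, ttl: float) -> bool:
        """Atomically store the key only if it is missing or expired."""
        with self._mutex:
            item = self._data.get(key)
            if item is not None and self._clock() < item[1]:
                return False
            self._data[key] = (value, self._clock() + ttl)
            return True


def start_thread(task: Callable[[], None]) -> None:
    threading.Thread(target=task, daemon=True).start()


def handle_request(cache: FakeRedis, key: str, build_response: Callable[[], Any],
                   clock: Callable[[], float] = time.time,
                   run_async: Callable[[Callable[[], None]], None] = start_thread) -> Any:
    def refresh() -> None:
        cache.set(key, {"value": build_response(), "stored_at": clock()}, TTL)

    entry = cache.get(key)
    if entry is None:
        # Expired or never cached: build synchronously.
        value = build_response()
        cache.set(key, {"value": value, "stored_at": clock()}, TTL)
        return value
    if clock() - entry["stored_at"] >= MAX_AGE:
        # Stale window: only the request that wins the lock triggers a refresh.
        if cache.set_if_absent("lock:" + key, "1", LOCK_TTL):
            run_async(refresh)
    # Serve the cached (possibly stale) response without waiting for the refresh.
    return entry["value"]
```

The lock is never deleted explicitly in this sketch. After a successful refresh the entry is fresh again, so the lock is simply not consulted until it has long expired; after a failed refresh, its short TTL lets a later request try again.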

Why the modified stale-while-revalidate approach suited our setup

This solution worked well for our setup because:

It would work well without overloading our system during the stale / expiry window.
It was simple and elegant. This meant it would not complicate debugging API issues in the future.
It would work seamlessly to cache responses for the APIs, irrespective of the number of parameter combinations (number of distinct cache keys).
It would not require setting up and maintaining a separate out-of-band job.

A downside of this cache-invalidation approach is:

If we do not receive any API requests during the stale window but receive numerous API requests after the cache expiry, we would experience the same system-overload-during-cache-expiry situation.
Having an adequate stale window can help minimize such situations, but it cannot rule them out.
The possibility of such scenarios is higher for APIs with a spiky workload (say, a ticket booking system with no requests till 10 AM and thousands of requests from then onwards once the booking window opens). Our APIs in question do not experience such a spiky workload.

Punit Sethi

My fascination with caching

I'm a big believer in caching when it comes to delivering scalability and performance. Over the years, I've tackled various caching challenges, including optimizing cache-hit ratios, designing invalidation mechanisms, and implementing policies. My experience spans CDN (caching@edge), application layer (Redis, Varnish), and database (materialized views).

If there's a caching problem you'd like to chat about (even if it’s just for a good discussion and not a paid gig), email me at punit AT tezify.com.

Or, read my other posts on Site Reliability Engineering