Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)
Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010

Caches are evil

There's an old joke in programming culture: "There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors."

Up until this year, I don't think I've fully appreciated the first issue of cache invalidation, or cache behavior in general. But over the last few months I keep running into issues caused by poorly thought out cache systems. Some of them my own fault, most of them other people's.

I'm at the point where I'm going to treat anyone who proposes a cache with suspicion, and make sure they can answer all of the following questions clearly. Depending on their answers, there might be no good reason to make a cache right now, or at least not to make the simplest cache with no control mechanisms. As one example, memoization is a neat trick to show off function decorators / macros, but it offers no control over the cache, so you need to know if you're just using it for a trick or for something useful! So, before making a cache of any sort, consider:
  1. Is the code path as fast as it can possibly be without the cache? Do you have numbers?
  2. Will the cache have a maximum size, or could it grow without bound?
  3. If it grows without bound, either because of unbounded entries or because of unbounded memory for any particular entry, under what conditions will it consume all system memory?
  4. If it has a maximum size, how does it evict things when it reaches that size?
  5. Are you trying to cache something that could change?
  6. If so, how is the cache invalidated?
  7. How can it be invalidated / evicted manually by another thread or signal?
  8. Is there a race condition possibility for concurrent stores, retrieves, and evicts?
  9. How constrained are your cache keys, that is, what does it need to know about to create one?
  10. Do they need to take into account global info?
  11. Do they need to take into account contextual information (like organization ID if your application server runs in a multi-tenant system, or user ID, or browser type, or requested-language)?
  12. Or do they only depend on direct inputs to that code path?
  13. (New) Ok you went with a cache. Are you collecting cache statistics -- hit rate, percent of system time spent in cache related code? After all is said and done, what's the overall system improvement compared to the numbers from 1?
  14. (New) Related to 7, how debuggable is the cache? How testable is it? Can you inspect it live? Do you get notifications about evictions and causes (Guava cache, Caffeine library)? Any other monitoring?
  15. (New) Is it only going to live on one machine / app instance? Or do you need to look into a distributed memcached/redis/etc. solution?
  16. (New) Does your cache bypass security?
  17. (New) For a 'big cache' proposal, like adding memcached for everything, read this blog and ask "What's the risk of these issues?"

(Edit: While most of these questions are centered around addressing the question "Do you even need a cache?", I may add more questions here as I find other objectionable cases / common bugs especially if you've answered "yes" to go with a cache. They'll be marked with New...)

(Edit 2022-02-11: added explicit item about security, which should have been here before, but also because of this. But also remembered degrading security once years ago on a bcrypt using service. Work factor was high, but to mitigate potential ddos/plus offer faster re-auth (e.g. API users passing their password up instead of a token) I cached successful logins into a sha256 map in memory. If database were compromised, bcrypt hashes would be there, but if application memory was compromised, sha256 hashes would be there.)

From the line of questioning, you can see that I generally won't have a problem for caches on things that are already fast but need the extra cache benefits, and on things that can't change, e.g. on pure functions. As long as the memory behavior is accounted for. Memoize can still be useful! But most caches that result in bugs aren't like that. Most are used to paper over slow code even before trying to make the slow code faster, and most things that become slow are impure.

"This thing is slow, let's make it fast with a cache!" is a common mentality. But profiling isn't even done every time to see what is slow about the thing, people just throw caching layers on top to avoid the harder work of profiling and real optimization analysis. I argue this should be done first.

By asking the question about what your key needs, it could be possible to avoid a bug that came about from missing something. In a cache bug I fixed several months ago that was my own fault, I had missed the fact that my keys needed to take into account organization context for some of the values. There weren't any security issues because things later in the pipeline would catch that, but there could have been mysterious errors to the end user if they had by chance named an object the same as some other tenant on the application.

The caching issue I've been battling for the last week and a half has to do with localized values not being displayed immediately. Could have been avoided if whoever designed the thing in the first place had taken into account requested-language into their cache key. Once I had a workaround on the server side, I found another cache on the client side that also wasn't evicting things when the requested language changed (or having the language as part of its keys in the first place).

This blog has a simple form of page cache. Basically when a guest user visits, I look to see if the URL has been cached, and if so, I read the cache contents from disk and serve that. If not, the request goes through my PHP logic to generate the page content, serves it, and saves it as a new cache entry. I didn't answer all these questions when I first made it, but let me try now.
  1. My PHP mini-framework is pretty fast. But I haven't done profiling. I suspect any major perf gains would involve one of rewriting into a new framework (you can see the original skeleton of the thing here) or changing databases away from MySQL since I do a number of queries per render. MySQL connection contention on my old shared host was the main reason I made a cache.

  2. The cache is limited in entry maximum size by how many unique URLs I have. I could generate a sitemap and count them, but haven't yet. Each cache is relatively small, it's exactly what I serve to the client. Right now the cache for the home page is 31k. If I wanted I could add compression, that would save a lot. Curl says that uncompressed my home page is 31234 bytes, but compressed it's only 8527 bytes.

  3. Since the cache doesn't restrict the size of the entry, it's possible I could deplete my system's file storage by making blog posts that are many megabytes / gigabytes in size.

  4. Since there is no size checking, there is no eviction step if it gets close.

  5. The things I'm caching are HTML pages, of course they can change! But the cache only applies to explicitly set GET request handlers. Some GET request handlers, like my About page, aren't explicitly set, so aren't cached.

  6. The cache is invalidated when someone creates a new comment, or I create a new blog post, edit a post, or edit/delete someone's comment. The invalidation step is pretty stupid, I just nuke every file in the cache store. It could be smarter. But being smart might be slower. The issue is it would need to know which other cached files reference data and evict only those ones. So if I edited someone's comment, if it was a recent comment, I'd need to evict every page anyway because recent comments appear on the sidebar of every page. If it was an old comment though, I could just evict the post it appeared on. I don't edit / delete comments often enough (my pseudo-captcha with a JS-derived submission endpoint keeps out spam pretty effectively) for that to be worth it. Most activity on the blog is new posts and new comments, both of which need to nuke the whole cache to update the sidebar. Changing that would require the sidebar to update asynchronously and not be served in the original response. That's where I'll go first if the blog ever becomes popular and struggles with performance, and it would allow me to go back to a simpler cache invalidation strategy.

  7. I have a cache service that other parts of the code can call into. Its methods are a helper wrapper that other service handlers can pass their context into so the cache can do the normal handle flow and cache the results itself (if this was Python it'd be a decorator), a method to get the cache file given the cache key, a method to check for existence of such a file, a method to store a new entry, a method to get the contents of a cache key's file, and a method to invalidate that just deletes all caches. Outside of the code, I can always SSH into my server and rm the cache folders manually.

  8. I'm trusting the file system to handle concurrency/race-condition problems.

  9. Cache keys are built from the URL request pattern values. If you follow the github link earlier, you'll see my routing matcher looks like 'get:/home/page/@page' => 'display_home_page'. Only routes that have 'get' as the request type are eligible for caching (in general if they lack get or post or both, both are accepted). The cache key is then built to be something like "HomeService_display_home_page_/home/page/2". I hash the params so the file system doesn't freak out.

  10. Global info for the key is not needed!

  11. Context info is not needed! At least within the cache. Whether the cache gets invoked at all depends on context (get request, not-logged-in-user) but the cache itself doesn't know about that. If I wanted to change that, and have the cache depend on the logged in user context (because my side panel is slightly different), I could, but it would be important for me to update the cache key or else I'm going to have a hard time.

  12. Thus, the cache itself only depends on the direct inputs, which are the service class, service call, and service call params.

Despite not doing this exercise explicitly when I made my cache (and I know I had some bugs with it, e.g. my first implementation of invalidate was just to invalidate the post where changes happened, but that didn't help the sidebar for other pages) I think my cache system turned out ok. I sort of wonder if there's a possibility for a race condition when trying to read a file's contents as it's getting deleted, but I'm kind of naively trusting the filesystem to do the right thing. I should test it...

Still, if every cache went through this exercise before it was implemented, even as a check to see if it should be implemented, I think systems would be a lot better designed and more robust. Plus you'd have some good documentation on the cache itself that can be referenced by future maintainers!

What's not so evil are caches themselves, caches are great, but poorly thought out and undocumented caches are demonic spawn. In our haste to get the nice benefits of caches in general, sloppiness ensues, and eventually someone has to deal with the consequences.

Posted on 2017-06-17 by Jach

Tags: programming


Trackback URL:

Back to the top

Back to the first comment

Comment using the form below

(Only if you want to be notified of further responses, never displayed.)

Your Comment:

LaTeX allowed in comments, use $$\$\$...\$\$$$ to wrap inline and $$[math]...[/math]$$ to wrap blocks.