ZpqrtBnk

Distributed Cache TLC

Posted on March 30, 2015 in umbraco and edited on April 1, 2015

Heavy in-memory caching is what gives Umbraco its good performance, but it comes at a price: in a load-balanced environment, Umbraco cannot simply rely on a central database; it also has to take care of distributing its caches.

Each node (i.e., server) in the environment maintains its own caches, and anytime a node changes something, e.g. anytime a node publishes some content, that node needs to notify all other nodes of the change, so they can invalidate their caches and refresh their content.

In the old times (by Internet standards), this was achieved by shipping the umbraco.config file, containing the site's content, to each node. It worked, to some extent, because at that time all a rendering node needed to render content was the big Xml cache. Once domains, property value converters, pre-values, macros, etc. were introduced, it could not work anymore, because they all have their own caches that also need to be taken care of.

Distributed Cache

To address that issue, Umbraco supports a distributed cache mechanism. The high-level picture looks like this: whenever some service changes something that is cached, it notifies the DistributedCache class of the change. That class uses the ServerMessengerResolver to identify the configured IServerMessenger implementation, and notifies the messenger of the change.

The messenger is responsible for:

delivering the notification to the local server, immediately
carrying the notifications to all nodes in the LB environment

On each node, notifications are received and delivered to the proper ICacheRefresher instances: each cache has its own refresher that processes the notifications and manages the cache accordingly.

Finally, the refresher triggers the CacheRefreshed event, so that developers can add their own code to the process.
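To make that chain a bit more concrete, here is a minimal, self-contained sketch of it. Every type in it (IRefresherSketch, ContentRefresherSketch, LocalMessengerSketch, DistributedCacheSketch) is a hypothetical stand-in invented for this illustration; none of them are Umbraco's real classes or signatures, and the "remote" part is just an outbox list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-ins: NOT Umbraco's real types or signatures.
public interface IRefresherSketch
{
    Guid UniqueId { get; }
    void Refresh(int id);
}

public class ContentRefresherSketch : IRefresherSketch
{
    public static readonly Guid Id = new Guid("11111111-1111-1111-1111-111111111111");

    // Raised once the cache has been dealt with, so user code can hook in.
    public event Action<int> CacheRefreshed;

    // Records the ids that were invalidated (stands in for a real cache).
    public readonly List<int> Invalidated = new List<int>();

    public Guid UniqueId { get { return Id; } }

    public void Refresh(int id)
    {
        Invalidated.Add(id); // stands in for "drop the cached item"
        var handler = CacheRefreshed;
        if (handler != null) handler(id);
    }
}

public class LocalMessengerSketch
{
    // What a real messenger would carry to the other nodes.
    public readonly List<string> Outbox = new List<string>();

    public void PerformRefresh(IRefresherSketch refresher, int id)
    {
        refresher.Refresh(id);                     // local delivery, immediately
        Outbox.Add(refresher.UniqueId + ":" + id); // remote delivery, faked here
    }
}

public class DistributedCacheSketch
{
    private readonly LocalMessengerSketch _messenger;
    private readonly Dictionary<Guid, IRefresherSketch> _refreshers =
        new Dictionary<Guid, IRefresherSketch>();

    public DistributedCacheSketch(LocalMessengerSketch messenger,
        IEnumerable<IRefresherSketch> refreshers)
    {
        _messenger = messenger;
        foreach (var refresher in refreshers)
            _refreshers[refresher.UniqueId] = refresher;
    }

    // Called by a service after it has changed something that is cached.
    public void Refresh(Guid refresherId, int id)
    {
        _messenger.PerformRefresh(_refreshers[refresherId], id);
    }
}
```

A service would call cache.Refresh(ContentRefresherSketch.Id, 1234) after saving item 1234: the refresher invalidates its entry, raises its event, and the messenger queues the notification for the other nodes.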

WebService messenger

The original default messenger would issue web service requests to each server configured in umbracoSettings.config, for each change.

Eventually, a "batched" messenger was built on top of the original messenger, which would store notifications in a batch and send them all as one message per server, at the end of the current request, in order to reduce the amount of traffic between servers.
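The batching idea can be sketched in a few lines. This is only an illustration under assumptions: BatchingMessengerSketch, its Notify and Flush methods, and the send callback are hypothetical names, and notifications are plain strings rather than whatever the real messenger ships.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of batching: NOT the real Umbraco messenger.
public class BatchingMessengerSketch
{
    private readonly IList<string> _servers;
    // (server, notifications) => sends ONE message to that server
    private readonly Action<string, IList<string>> _send;
    private readonly List<string> _batch = new List<string>();

    public BatchingMessengerSketch(IList<string> servers,
        Action<string, IList<string>> send)
    {
        _servers = servers;
        _send = send;
    }

    // Called for each change during the request: nothing is sent yet.
    public void Notify(string notification)
    {
        _batch.Add(notification);
    }

    // Called once, at the end of the current request.
    public void Flush()
    {
        if (_batch.Count == 0) return;

        var pending = _batch.ToArray(); // snapshot, so the batch can be reset
        _batch.Clear();

        foreach (var server in _servers)
            _send(server, pending);     // one message per server
    }
}
```

With two servers and three changes in a request, the unbatched approach would issue six web service requests, whereas this one issues two.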

This batched messenger is the default messenger in Umbraco today.

It has, however, some limitations:

The "master" node has to know about all "slave" nodes, i.e. they need to be in the configuration file, else they are not notified. However, if one slave is down, the master still tries to reach it, a situation which usually triggers timeouts, errors, etc.
The slaves have to be accessible from the master, using the default Umbraco authentication, which means that if you plug additional security on top of it, you'll likely break the web service. The slaves also need to have stable IPs that can go into the config file, something that is not always easy to obtain in cloud environments.
If a node is down for a few minutes, it will not receive notifications during that time. When the node comes back up, most of its caches will be refreshed entirely, except for a few caches that are persisted on the node itself, such as the Examine indexes or the Xml cache. One surely does not want to rebuild these caches each time a node restarts, yet this means that they can easily go out of sync.

Grand theory of things

There are plenty of very clever papers about the theory of distributed caches, service buses, ACID and CAP and all those interesting things, that make you want to build this shiny new architecture... We're not ignoring them. Yet at the moment we're trying a very pragmatic approach, which also takes care of backward compatibility.

And so, in order to address these issues, Morten and Shannon worked on "just" a database-based messenger, which I am going to describe here and now, as part of our new agilesque methodology: some of us write new code in a branch, and then someone else tries to understand, document, and merge the whole thing.

Database messenger

The database messenger (and its batched equivalent) writes change notifications to a database table. Then, it periodically checks that table for notifications originating from remote nodes, and delivers them to the proper cache refreshers. Each node keeps a local index of which notifications have been processed.

Nodes do not talk to each other anymore, and do not need to be configured anywhere either. The only thing that needs to be configured is the fact that the node is in an LB environment. Then, any node having access to the database automatically joins the LB environment and starts writing to, and reading from, the notifications table.

If a node goes down for a few minutes, thanks to its local index it will be able to catch up with notifications and make sure it's eventually synced with the other nodes.
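Here is a rough sketch of that write, poll and catch-up cycle. It assumes an in-memory list in place of the database table and an integer field in place of the node's local index; InstructionRow, InstructionTableSketch and DatabaseMessengerNodeSketch are hypothetical names made up for the example, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: an in-memory list stands in for the database table.
public class InstructionRow
{
    public int Id;          // auto-incremented, like an identity column
    public string Origin;   // the node that wrote the notification
    public string Payload;
}

public class InstructionTableSketch
{
    private readonly List<InstructionRow> _rows = new List<InstructionRow>();
    private readonly object _lock = new object();

    public void Insert(string origin, string payload)
    {
        lock (_lock)
        {
            _rows.Add(new InstructionRow
            {
                Id = _rows.Count + 1, Origin = origin, Payload = payload
            });
        }
    }

    public List<InstructionRow> ReadAfter(int lastId)
    {
        lock (_lock)
        {
            return _rows.Where(r => r.Id > lastId).OrderBy(r => r.Id).ToList();
        }
    }
}

public class DatabaseMessengerNodeSketch
{
    private readonly string _name;
    private readonly InstructionTableSketch _table;
    private int _lastSyncedId; // the node's local index (persisted on a real node)

    public readonly List<string> Delivered = new List<string>();

    public DatabaseMessengerNodeSketch(string name, InstructionTableSketch table)
    {
        _name = name;
        _table = table;
    }

    public void Notify(string payload)
    {
        Delivered.Add(payload);        // local delivery, immediately
        _table.Insert(_name, payload); // other nodes will pick it up later
    }

    // Called periodically; also how a node catches up after downtime.
    public void Sync()
    {
        foreach (var row in _table.ReadAfter(_lastSyncedId))
        {
            if (row.Origin != _name)   // our own notifications were already delivered
                Delivered.Add(row.Payload);
            _lastSyncedId = row.Id;
        }
    }
}
```

A node that was down simply did not call Sync for a while; its first Sync after restarting reads everything past its last processed id, in order, and calling Sync again delivers nothing twice.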

That new messenger should be part of version 7.3.0 (it's currently being merged), but will most probably not be enabled by default. In order to enable the messenger, you'll want to drop the following code (or equivalent) somewhere in your solution (or ~/App_Code):

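using System;
using Umbraco.Core;
using Umbraco.Core.Configuration;
using Umbraco.Core.Services;
using Umbraco.Core.Sync;
using Umbraco.Web;

public class ServerMessengerConfiguration : ApplicationEventHandler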
{
  protected override void ApplicationStarting(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
  {
    // replace the server messenger
    ServerMessengerResolver.Current.SetServerMessenger(new BatchedDatabaseServerMessenger(
      applicationContext,
      UmbracoConfig.For.UmbracoSettings().DistributedCall.Enabled,
      // you can customize some options by setting the options parameters here
      new DatabaseServerMessengerOptions
      {
        // these callbacks will be executed if the server has not been synced
        // (i.e. it is a new server or the lastsynced.txt file has been removed)
        InitializingCallbacks = new Action[]
        {
          // rebuild the xml cache file if the server is not synced
          () => global::umbraco.content.Instance.RefreshContentFromDatabase(),
          // rebuild indexes if the server is not synced
          // NOTE: this will rebuild ALL indexes including the members, if developers
          // want to target specific indexes then they can adjust this logic themselves.
          () => Examine.ExamineManager.Instance.RebuildIndex()
        }
      }));

    // replace the server registrar (this is optional but allows you to track 
    // active servers participating in your load balanced environment)
    ServerRegistrarResolver.Current.SetServerRegistrar(new DatabaseServerRegistrar(
      new Lazy<ServerRegistrationService>(() => applicationContext.Services.ServerRegistrationService),
      // you can customize some options by setting the options parameters here:
      new DatabaseServerRegistrarOptions()));
  }
}

Yes, it ain't pretty—yet. Consider the whole thing as beta. But it works pretty well by our tests.

Edit (Apr. 1st, 2015): the code has now been merged into 7.3.0.

The ServerRegistrar thing is there for information only. A server registrar is what provides the list of servers to the WebService messenger. The default one reads the umbracoSettings.config file. When the database-based one is activated, each server will automatically register itself—so the registrar can be used to list all currently active servers. But that list is not used anymore, since all a server needs to do is to read the notifications table.

Caveats

Does it mean that all nodes in an LB environment are equivalent?

Not entirely. Unfortunately, Umbraco's back-office still contains bits of code that rely on application-level (C# code) locks to synchronize their work, and so one and only one node needs to be dedicated to running the back-end. In the future, the plan is to switch to database-level locks so that the back-end can be distributed, too.

Edit (Apr. 1st, 2015): some sort of diagram is available here
