Of ThreadAbortException, deadlocks, and WinDBG

Thursday, April 6, 2017 4:35 PM dotnet umbraco

For some time now, users have been reporting that their entire Umbraco site would go down (as in, just return YSODs) until it was force-restarted, and the only thing that we could find in their logs was an exception similar to:

2017-04-04 14:56:56,238 [P10384/D16/T80] ERROR Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory - Failed to build models.
System.Threading.ThreadAbortException: Thread was being aborted.
   at System.Threading.Monitor.ReliableEnter(Object obj, Boolean& lockTaken)
   at System.Web.Compilation.BuildManager.GetBuildResultFromCacheInternal(String cacheKey, Boolean keyFromVPP, VirtualPath virtualPath, Int64 hashCode, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResultFromCacheInternal(VirtualPath virtualPath, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResultInternal(VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile, Boolean throwIfNotFound, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResultWithNoAssert(HttpContext context, VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile, Boolean throwIfNotFound, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResult(HttpContext context, VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetCompiledAssembly(String virtualPath)
   at Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.GetModelsAssembly(Boolean forceRebuild)
   at Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.EnsureModels()

And then, it seemed no view would compile, nothing would render—basically, everything was CompletelyBroken(TM). Which is slightly(!) annoying when the site is live.

ThreadAbortException

What is a ThreadAbortException exactly? It is an exception that is thrown in a running thread, when that thread is aborted via thread.Abort(). It is a somewhat special exception. Take the following code:

try
{
  ...
}
catch (ThreadAbortException e)
{
  // swallow
}

This does not work: the CLR re-throws the exception at the end of the try/catch block, whatever happens1 in the catch block—which makes sense since the thread is really stopping.

So, what would abort our ModelsBuilder code? The answer is: ASP.NET. More precisely: ASP.NET timeout control. Indeed, when a request runs for more than the permitted time, ASP.NET aborts the corresponding thread.

To be sure of this, we have extended Core's ILogger implementation to handle ThreadAbortException when writing the exception to Umbraco's log file. First, we had to detect that the exception is indeed caused by a timeout:

private static bool IsTimeoutThreadAbortException(Exception exception)
{
  var abort = exception as ThreadAbortException;
  if (abort == null) return false;

  if (abort.ExceptionState == null) return false;

  var stateType = abort.ExceptionState.GetType();
  if (stateType.FullName != "System.Web.HttpApplication+CancelModuleException") return false;

  var timeoutField = stateType.GetField("_timeout", BindingFlags.Instance | BindingFlags.NonPublic);
  if (timeoutField == null) return false;

  return (bool) timeoutField.GetValue(abort.ExceptionState);
}

There is a bit of reflection magic here, but it does what we want: return true if the exception is a ThreadAbortException that was caused by ASP.NET aborting the thread due to a timeout.

Whith this code in place, Umbraco's log reports:

2017-04-04 14:56:56,238 [P10384/D16/T80] ERROR Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory - Failed to build models.
The thread has been aborted, because the request has timed out.
System.Threading.ThreadAbortException: Thread was being aborted.
   at System.Threading.Monitor.ReliableEnter(Object obj, Boolean& lockTaken)
   at System.Web.Compilation.BuildManager.GetBuildResultFromCacheInternal(String cacheKey, Boolean keyFromVPP, VirtualPath virtualPath, Int64 hashCode, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResultFromCacheInternal(VirtualPath virtualPath, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResultInternal(VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile, Boolean throwIfNotFound, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResultWithNoAssert(HttpContext context, VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile, Boolean throwIfNotFound, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetVPathBuildResult(HttpContext context, VirtualPath virtualPath, Boolean noBuild, Boolean allowCrossApp, Boolean allowBuildInPrecompile, Boolean ensureIsUpToDate)
   at System.Web.Compilation.BuildManager.GetCompiledAssembly(String virtualPath)
   at Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.GetModelsAssembly(Boolean forceRebuild)
   at Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.EnsureModels()

This confirms our hypothesis: the exception actually means that the request has timed out. And the stacktrace tells us where it has timed out:

at System.Threading.Monitor.ReliableEnter(Object obj, Boolean& lockTaken)
at System.Web.Compilation.BuildManager.GetBuildResultFromCacheInternal(String cacheKey, Boolean keyFromVPP, VirtualPath virtualPath, Int64 hashCode, Boolean ensureIsUpToDate)

A quick look at BuildManager source code shows that this quite probably corresponds to the lock (this) statement in GetBuildResultFromCacheInternal. In other words, the PureLiveModelFactory is asking the BuildManager to compile its generated models, and the BuildManager tries to get a lock on itself, and waits forever until it is aborted.

Which means it cannot get the lock, and the only reason it cannot, is because another thread is holding the lock and not releasing it—and quite probably, both threads have entered a deadlock. However, after spending a few hours staring both at BuildManager and PureLiveModelFactory code, no obvious cause for a deadlock could be identified.

If only we had a dump!

At that point, what would really help is a memory dump of the w3wp.exe process at the time of the exception. With a dump, we could look into each thread and find out what is locking what. However, creating a dump at the right time means closely monitoring the server, and at "some point in time", use a tool to trigger a dump creation.

This is not something we can ask our users to do. This, and we have no idea how to identify that "point in time" anyways. Instead, we have taken a more proactive approach. First, detect if the exception is "probably" happening while waiting for a lock:

private static bool IsMonitorEnterThreadAbortException(Exception exception)
{
  var abort = exception as ThreadAbortException;
  if (abort == null) return false;

  var stacktrace = abort.StackTrace;
  return stacktrace.Contains("System.Threading.Monitor.ReliableEnter");
}

And when that is true... create a memory dump from within Umbraco:

var dumped = MiniDump.Dump(withException: true);

The MiniDump class is a bit too long to fit in a blog post, but its code is part of Core. It basically writes a minidump of the current process to ~/App_Data/MiniDump, but no more that 8 of them (just to be sure we do not flood the disk).

And then we shipped version 7.5.12.

Look: dumps!

A few days later, we had our first reports of sites crashing with the dreaded ThreadAbortException.... and dumps! Which takes us to the next step: dump analysis.

This post is in no ways a WinDBG tutorial. Let us just say that you want to open the dump file with WinDBG, analyze it with !analyze -v, load the SOS extension with .load sos, list the running threads with !threads, dump the stacktrace of every thread with ~*e!clrstak (~42e!clrstack would dump the stacktrace of thread 42), and examine locks with!syncblk. Which gives us:

0:000> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
  286 1980a87c            1         1 15b02468 9ae8  42   02b73ba8 System.Object
  428 1980a9e8            3         1 15ac58c8 9794  31   01df1200 System.Web.Compilation.BuildManager
-----------------------------
Total           458
CCW             5
RCW             5
ComClassFactory 0
Free            0

This tells us that object 01df1200, of type BuildManager, is locked (counts for 1 in MonitorHeld) by thread 31, and there is one thread waiting for the lock (counts for 2 in MonitorHeld). Hence the total value of MonitorHeld: 3. In other words, BuildManager is locked by thread 31 and another thread is waiting for it.

Thread 31 has this (abbreviated) stacktrace:

0:000> ~31e!clrstack
OS Thread Id: 0x9794 (31)
Child SP       IP Call Site
171ad5f8 76f5c7ec [HelperMethodFrame_1OBJ: 171ad5f8] System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
...
System.Threading.ReaderWriterLockSlim.TryEnterReadLock(TimeoutTracker)
171ad7b8 151f41cf Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.RazorBuildProvider_CodeGenerationStarted(System.Object, System.EventArgs)
171ad7f0 151f4167 System.Web.WebPages.Razor.RazorBuildProvider.OnCodeGenerationStarted(System.Web.WebPages.Razor.IAssemblyBuilder)
171ad7fc 151f407d System.Web.WebPages.Razor.RazorBuildProvider.GenerateCodeCore(System.Web.WebPages.Razor.IAssemblyBuilder)
171ad814 151f3ff4
...
System.Web.Compilation.BuildManager.CompileWebFile(System.Web.VirtualPath)
171ad9b4 186d12f6 System.Web.Compilation.BuildManager.GetVPathBuildResultInternal(System.Web.VirtualPath, Boolean, Boolean, Boolean, Boolean, Boolean)
...
Umbraco.Core.Profiling.ProfilingView.Render(System.Web.Mvc.ViewContext, System.IO.TextWriter)
171adac0 13c9ff3c System.Web.Mvc.ViewResultBase.ExecuteResult(System.Web.Mvc.ControllerContext)
...

This thread is rendering a view, and first it needs to compile the view. It reaches the point where we have to inject the pure live models assembly into the compilation, and tries to enter the lock which protects this assembly (so we do not use it while it is being re-generated).

Meanwhile, another thread, thread 42, has this (abbreviated) stacktrace:

0:000> ~42e!clrstack
OS Thread Id: 0x9ae8 (42)
Child SP       IP Call Site
18f5da68 76f5d04c [InlinedCallFrame: 18f5da68] 
18f5da64 1563ff4e DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, UInt32, System.Runtime.InteropServices.SafeHandle, UInt32, IntPtr, IntPtr, IntPtr)
18f5da68 1563fa04 [InlinedCallFrame: 18f5da68] Umbraco.Core.Diagnostics.MiniDump.MiniDumpWriteDump(IntPtr, UInt32, System.Runtime.InteropServices.SafeHandle, UInt32, IntPtr, IntPtr, IntPtr)
18f5dacc 1563fa04 Umbraco.Core.Diagnostics.MiniDump.Write(System.Runtime.InteropServices.SafeHandle, Option, Boolean)
18f5daf4 1563f901 Umbraco.Core.Diagnostics.MiniDump.Dump(Option, Boolean)
18f5db40 1563f452 Umbraco.Core.Logging.Logger.Error(System.Type, System.String, System.Exception)
18f5db70 1563f39e Umbraco.Core.Logging.LoggerExtensions.Error[[System.__Canon, mscorlib]](Umbraco.Core.Logging.ILogger, System.String, System.Exception)
18f5db8c 10a242e6 Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.EnsureModels()
18f5dca4 10a23fdb Umbraco.ModelsBuilder.Umbraco.PureLiveModelFactory.CreateModel(Umbraco.Core.Models.IPublishedContent)
18f5dcb8 10a23f3b Umbraco.Core.Models.PublishedContent.PublishedContentExtensionsForModels.CreateModel(Umbraco.Core.Models.IPublishedContent)
18f5dcc8 10a23ed7 Umbraco.Web.PublishedCache.XmlPublishedCache.XmlPublishedContent+c__DisplayClass8.b__7()
18f5dce0 1f8e7aa0 Umbraco.Core.Cache.DictionaryCacheProviderBase+c__DisplayClass1.b__0()
18f5dd00 10449535 System.Lazy`1[[System.__Canon, mscorlib]].CreateValue()
18f5dd2c 104491a5 System.Lazy`1[[System.__Canon, mscorlib]].LazyInitValue()
18f5dd68 10448fee System.Lazy`1[[System.__Canon, mscorlib]].get_Value()
18f5dd78 10a23cd0 Umbraco.Core.Cache.HttpRequestCacheProvider.GetCacheItem(System.String, System.Func`1)
18f5ddb0 10a23b84 Umbraco.Web.PublishedCache.XmlPublishedCache.XmlPublishedContent.Get(System.Xml.XmlNode, Boolean)
18f5ddc4 10a23acd Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedContentCache.GetById(Umbraco.Web.UmbracoContext, Boolean, Int32)
18f5dddc 10a236c6 Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedContentCache.DetermineIdByRoute(Umbraco.Web.UmbracoContext, Boolean, System.String, Boolean)
18f5de04 10a2354c Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedContentCache.GetByRoute(Umbraco.Web.UmbracoContext, Boolean, System.String, System.Nullable`1)
18f5de30 10a233c8 Umbraco.Web.Routing.ContentFinderByNiceUrl.FindContent(Umbraco.Web.Routing.PublishedContentRequest, System.String)
...

This thread is currently writing the minidump we are analyzing—because it has been aborted. Combining this with the exception stacktrace in Umbraco's log, we understand that it was aborted while trying to get BuildManager to compile models, so that a content finder could return a proper model class.

This thread has acquired the lock that protects the models assembly, and then it expects BuildManager to compile models, but in order to do so, BuildManager needs to lock itself—but it cannot, because it is already locked in thread 31, where it is waiting for... oh, deadlock indeed.

To make it short, it turns out that BuildManager also locks itself when compiling pretty much anything, including a view—something we had overlooked. Look for CompilationLock.GetLock(ref gotLock) statements in BuildManager if you are curious.

Fix it

Time to apply the usual deadlock-prevention solution: always lock things in the same order. We modify PureLiveModelsFactory so that EnsureModels first locks BuildManager before acquiring the other lock.

// always take the BuildManager lock *before* taking the _locker lock
// to avoid possible deadlock situations (see notes above)
Monitor.Enter(TheBuildManager, ref buildManagerLocked);

_locker.EnterUpgradeableReadLock();

Funny enough, the "always take the BuildManager lock before taking [...] to avoid possible deadlock situations" statement also appears in the source code for CompilationLock, with a reference to an old Whidbey issue.

Of course, this requires a reference to the BuildManager singleton instance. Which is nowhere to be found... and requires a bit more reflection:

// gets "the" build manager
private BuildManager TheBuildManager
{
  get
  {
    if (_theBuildManager != null) return _theBuildManager;
    var prop = typeof (BuildManager).GetProperty("TheBuildManager", BindingFlags.NonPublic | BindingFlags.Static);
    if (prop == null)
      throw new InvalidOperationException("Could not get BuildManager.TheBuildManager property.");
    _theBuildManager = (BuildManager) prop.GetValue(null);
    return _theBuildManager;
  }
}

And with this in place, we will always lock BuildManager before acquiring our own lock, if there is any chance that we might need BuildManager—thus preventing deadlocks.

Ship it

The fix will be part of Core 7.5.13 minor release. Enjoy!


  1. To a point. See ThreadAbortException on MSDN for more details, including using Thread.ResetAbort in order to not-abort an aborted thread. At your own risks. 

comments powered by Disqus