Friday, July 23, 2010

Generating SIGILL and Performance Improvements

This post is about two weeks late. I'll come to what I did in that period in a later post.

Until now the patching phase involved replacing the instruction to be patched with an instruction which generated a SIGTRAP. This is neither desirable nor practical for three reasons:
  • The entire mono code-base essentially becomes un-debuggable; a user generated SIGTRAP completely throws GDB off the guard, even when not explicitly using breakpoints.
  • The soft-debugger will (probably) be switching to generating a SIGTRAP instead of a SIGSEGV soon. This will become very difficult if safe points continue to use SIGTRAP.
  • The current solution is a gross, ugly hack.
A better idea was, as Kumpera pointed out, to insert code that generates a SIGILL. This can be done with the two byte instruction sequence 0x0F 0x04 (or any of the other possible invalid byte sequences).

A few trivial details, like inserting a NOP after single byte instructions which may be a safe point (which, on AMD64, is only RET) had to be taken care of.

Independently of changing the SIGTRAP to a SIGILL, I made some basic improvements in my code, the result being a small (~ 3 %) improvement in Pystones.

I've forked the mono repository on GitHub (http://github.com/sanjoy/mono) which now hosts most of my (stable) work. - I should be able to get my work done at a much faster pace now. Hopefully this will get me some public review too.

Thursday, June 24, 2010

GC Safe Points by Patching

The initial idea for safe points (the one actually in my GSoC proposal) was to implement it via polling. This method, as already outlined in the previous post, involved emitting polling code at certain points in the code. This polling code then checked a condition and suspended the current thread based on the same. However, this approach has the bad effect of imposing a runtime overhead or about 2 % in best cases. Note that this is as opposed to a collection time overhead - the polling slowed down the executing code, even when there no collections.

One solution to this problem is implementing safe points via patching. Patching incurs an slight overhead as well, but that is O(n) on the number of threads. I observed a slowdown of 0.6 % to 1.6 % (details later) with 50 threads.

The core idea is to record the places where a thread is allowed to halt and, when a thread needs to be stopped safely, patch just enough of those sites to ensure that the code executes at least one of them. The patched (modified) instructions halt the thread when executed (how that happens is described below). When the thread needs to resume execution, the instructions that were patched are 'un-patched' by replacing the new halting code with the old instructions it initially replaced.

Implementing safe points by patching is, at best, non-trivial and personally it is one of the most difficult things I've done so far. There are several issues that that complicate the situation:

  1. Multiple threads executing the same method.
  2. Variable length instructions.
  3. Deciding which site(s) to actually patch.

Combine this with the fact that debugging, as we know it, is not possible (more so because of certain implementation decisions, mentioned below) and you have nice little task to keep you busy for two weeks. My initial approach towards the third point was incorrect - I was trying to get around by patching just one site. This complicated matters since I then had to make various exceptions for methods in which execution could not safely halt (managed allocators, write barriers et al). My mentor, Mark, then showed my how things would simplify if I patched multiple sites. It did simplify things by an order of magnitude and that is the approach the current implementation uses.

Firstly, all threads need to be stopped before any patching is done - it is possible, otherwise, for a particular thread to attempt execution of instructions currently being modified. This may lead to incorrect behavior. Parallel patching may be worth looking at some point in the future, especially if it has the potential to improve performance. Nevertheless, this implementation follows the following stop-world logic:

  1. Stop all threads.
  2. If a thread is currently executing native code, halt it right away. Otherwise patch the required sites and restart the thread.
  3. The managed threads then subsequently halt themselves as they 'hit' a patched site.
This is more or less straightforward to implement - the current implementation uses POSIX signals and two semaphores for synchronization. I could not figure out a way to get this to work predictably using only one semaphore.

Coming back to the 'issues' mentioned earlier, issues (1) and (2) synergise (in a very weird use of the term) and cause a problem. Initially the patching involved emitting code that would cause a SIGSEGV and this is what triggered the problem. Upon executing such code, the thread would receive a SIGSEGV and then the signal could be caught and the thread halted. Or so I thought. Now, this instruction takes up 13 bytes (8 bytes if the page to be dereferenced is mapped to a 32 bit address, but let's keep things simple here). Suppose we have two threads, one with an IP 70, the other with 80. There is a safe point at 75, all numbers mentioned being in base 10. For the first thread, the obvious choice will be to patch the instruction at address 75 (rather to at least patch that instruction). Since the SIGSEGV takes up 13 bytes, everything from 70 to 83 is overwritten. When the execution is resumed (for the patches to actually take effect and the threads to halt), the second thread executes a malformed instruction and receives a SIGILL.

The easiest solution I could find was to insert a x86 trap instruction (0xCC). This generates a SIGTRAP instead of a SIGSEGV. Since a trap instruction is only one byte long it automatically takes care of issues (1) and (2). It does have the disadvantage of debugging difficult - I intend to ultimately remove SIGTRAPS and make the patches generate SIGSEGVs instead.

The are several issues with the current approach:

  1. AOT is not supported. While it may take a while before I can support, I might just be able to hack up support for pre-compiled assemblies. Perhaps this is something I should pursue post GSoC. Currently it might make sense to have a policy which detects if a method has safe points and then decides appropriately.
  2. Bugfixes. While I've tested the patch as much as I can, I'm sure it has bugs. Dirty, hairy, ugly, non-trivial, multi-threaded, throw-your-monitor-out-of-the-window bugs.
I will elaborate further on any progress (and more on the current state) in later posts.

Thursday, May 27, 2010

Baby Steps

You can get a quick introduction to garbage collector safe points at http://xiao-feng.blogspot.com/2008/01/gc-safe-point-and-safe-region.html. Beware of GC-specific terminology though - use http://www.memorymanagement.org/ as a reference :). You may also want to have a quick look at the SGen page at http://www.mono-project.com/Compacting_GC before starting.

SGen is a stopping garbage collector - all threads are stopped prior to a collection. The collection is performed and then the threads are restarted. The first part of my job involves modifying how the world is stopped, the idea being to ensure that as many threads as possible are parked in the safe-points before a collection. The last stack frame of threads not parked in safe-points cannot be precisely scanned. A basic patch which supports safe-points via polling is ready (http://code.google.com/p/mono-soc-2010/source/browse/trunk/gc-safe-points/safe-points.patch). I've provided a short overview of what the patch actually does below.

The above can (roughly) be broken up into two parts.

The first part involves working with the JIT. Polling code is inserted before every backward jump and return statement. This is done inside the mono_method_to_ir function (whose job is to translate the CIL instructions to the Mono Linear Intermediate Representation (http://www.mono-project.com/Linear_IL)). IR instructions are emitted which try to dereference a specially mapped page of memory, called, say, safe_point_page. This is only done for managed code.

The second part involves modifying the stop_world routine which is called by the thread running the collection to stop every other thread. SGen already has a signal based system in place. Stopping the world starts by the stopping thread sending a suspend signal to all other threads. The corresponding signal handler, on receiving the suspend signal, prepares the current thread for a garbage collection by populating a few structures with information like the current stack pointer. It then enters a loop, waiting for a restart signal. The stopping thread can then run the collection (using a semaphore to resolve threading issues). Once the collection is over threads are restarted by sending them the restart signal they are waiting for.

With safe points, this routine now changes to, firstly, changing the protection level of safe_point_page to PROT_NONE (MONO_PROT_NONE rather, we all like to be platform independent :)). All managed threads automatically segfault at the next safe point, on account of the dereference instruction that was inserted. The SIGSEGV handler then figures out that the cause of the segfault was, indeed, encountering a safe point and prepares the current thread for a GC. It then suspends the thread - exactly like the handler for the suspend signal. This is, however, not enough.

There are two situations where the above approach will not stop all threads. The first one involves one or more threads executing native code. Since no dereference instructions have been emitted into native code, native code will not seg-fault. The second occurs when a thread does not reach a safe point before the collection begins, despite running managed code. We cannot do much about threads executing native code, except perhaps falling back to the old suspend-via-signal scheme. For managed threads we may consider waiting till it encounters a safe point and segfaults. We can't wait forever, though, and we might need to fall back to using the suspend-via-signal method for them as well.

This problem is currently solved by first waiting for a small timeout (50 us tentatively, will need tuning) and sending a suspend signal to all threads. The signal handlers of threads already stopped do nothing while the ones of threads still running prepare the current thread for a GC and enter the wait loop. Depending on how a thread was stopped (through a safe point segfault or a suspend signal) a thread is marked to be parked at a safe point or otherwise. Synchronization issues are resolved using a CAS (Compare and Swap). The timeout is not required for correctness, but it allows a larger number of threads to converge to a safe point before a collection.

This method does not allow for AOT (ahead of time) compilation. The safe_point_page is allocated while JITting the CIL instructions - when the AOT image is saved, we have a random pointer embedded in the pre-compiled instructions. The next time the AOT instructions are fetched and executed, they try to dereference the same old pointer (which essentially has no meaning in the new execution context), leading to spurious segmentation faults. Currently, no polling code is emitted when AOT-compiling.

The next checkpoint would involve benchmarking and tweaking for better performance. I will also try to look into trying to implementing support for AOT-compiling, though that is likely to be slower and trickier to implement.

Wednesday, May 26, 2010

Introduction

I've been selected to work on the Mono project for the Google Summer of Code, 2010. I will be working on their new simple generational garbage collector (SGen - http://www.mono-project.com/Compacting_GC) and implement garbage collector safe points. I will then go on to implement (at least) partial precise scanning of the stack frames, starting with the locals. Mark Probst (schani) will be mentoring me in the regard, though I've been bugging pretty much everyone on #mono-dev of late.

I will use use this blog as a place to both document my progress and (informally) document whatever it is that I'm trying to do.