You can get a quick introduction to garbage collector safe points at http://xiao-feng.blogspot.com/2008/01/gc-safe-point-and-safe-region.html. Beware of GC-specific terminology though - use http://www.memorymanagement.org/ as a reference :). You may also want to have a quick look at the SGen page at http://www.mono-project.com/Compacting_GC before starting.
SGen is a stop-the-world garbage collector - all threads are stopped prior to a collection, the collection is performed, and then the threads are restarted. The first part of my job involves modifying how the world is stopped, the idea being to ensure that as many threads as possible are parked at safe points before a collection. This matters because the last stack frame of a thread that is not parked at a safe point cannot be scanned precisely. A basic patch which supports safe points via polling is ready (http://code.google.com/p/mono-soc-2010/source/browse/trunk/gc-safe-points/safe-points.patch). I've provided a short overview of what the patch actually does below.
The patch can (roughly) be broken up into two parts.
The first part involves working with the JIT. Polling code is inserted before every backward branch and every return. This is done inside the mono_method_to_ir function, whose job is to translate CIL instructions into Mono's Linear Intermediate Representation (http://www.mono-project.com/Linear_IL). The emitted IR instructions try to dereference a specially mapped page of memory called, say, safe_point_page. Since only JIT-compiled code gets these instructions, polling happens in managed code only.
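To make the polling idea concrete, here is a rough C sketch of what the emitted code amounts to. This is not the patch itself: the patch emits IR inside the JIT, while this is hand-written C, and safe_point_page_init, safe_point_poll and sum_to are names I made up for illustration. It also assumes a POSIX system with mmap.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the specially mapped page the polls touch. */
static volatile char *safe_point_page;

/* Map one readable page. Returns 0 on success, -1 on failure. */
static int safe_point_page_init(void)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page_size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    safe_point_page = p;
    return 0;
}

/* The poll is a single load. While the page is readable it costs one
 * memory access; once the page is protected, the load faults. */
static inline void safe_point_poll(void)
{
    assert(safe_point_page != NULL);
    char probe = *safe_point_page; /* volatile read, not optimized away */
    (void)probe;
}

/* What a compiled method would look like with polls inserted by hand. */
static long sum_to(long n)
{
    long total = 0;
    for (long i = 0; i < n; i++) {
        total += i;
        safe_point_poll(); /* poll before the backward branch */
    }
    safe_point_poll();     /* poll before the return */
    return total;
}
```

As long as the page stays readable, sum_to behaves exactly as it would without the polls; the only cost is the extra load per iteration.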
The second part involves modifying the stop_world routine, which is called by the thread running the collection to stop every other thread. SGen already has a signal-based system in place. Stopping the world starts with the stopping thread sending a suspend signal to all other threads. The corresponding signal handler, on receiving the suspend signal, prepares the current thread for a garbage collection by populating a few structures with information like the current stack pointer. It then enters a loop, waiting for a restart signal. The stopping thread can then run the collection (using a semaphore to resolve threading issues). Once the collection is over, the threads are restarted by sending them the restart signal they are waiting for.
With safe points, the routine changes. The stopping thread first changes the protection level of safe_point_page to PROT_NONE (MONO_PROT_NONE rather, we all like to be platform independent :)). Every managed thread then segfaults at its next safe point, on account of the dereference instruction that was inserted. The SIGSEGV handler checks that the segfault was indeed caused by a safe point, prepares the current thread for a GC and then suspends the thread - exactly like the handler for the suspend signal. This is, however, not enough.
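Here is a minimal sketch of the protect-and-fault step, assuming Linux and plain POSIX calls rather than Mono's wrappers. All the names (safe_points_init, request_stop, segv_handler, parked_at_safe_point) are hypothetical. To keep it single-threaded and runnable, the handler does not wait for a restart signal; it just records that a safe point was hit and re-opens the page so the faulting load can be retried. Note that mprotect is not on the POSIX list of async-signal-safe functions, so calling it from the handler is a shortcut for the demo only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *safe_point_page;  /* hypothetical polling page */
static size_t page_size;
static volatile sig_atomic_t parked_at_safe_point;

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)safe_point_page;

    if (addr < base || addr >= base + page_size) {
        /* Not our page: a genuine crash. Restore the default action
         * and return, so the retried access kills the process. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }

    /* A real runtime would save the stack pointer here and wait for
     * the restart signal. The demo only records the event. */
    parked_at_safe_point = 1;
    mprotect(safe_point_page, page_size, PROT_READ);
}

/* Map the page and install the handler. Returns 0 on success. */
static int safe_points_init(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    page_size = (size_t)ps;

    void *p = mmap(NULL, page_size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    safe_point_page = p;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;   /* needed to receive si_addr */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* What the stopping thread does first: make every poll fault. */
static int request_stop(void)
{
    return mprotect(safe_point_page, page_size, PROT_NONE);
}

static void safe_point_poll(void)
{
    char probe = *(volatile char *)safe_point_page;
    (void)probe;
}
```

Before request_stop is called, a poll is an ordinary load. After it, the next poll lands in segv_handler, which can tell a safe point apart from a real crash by comparing si_addr against the page bounds.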
There are two situations where the above approach will not stop all threads. The first involves one or more threads executing native code. Since no dereference instructions have been emitted into native code, native code will not segfault. The second occurs when a thread does not reach a safe point before the collection begins, despite running managed code. We cannot do much about threads executing native code, except perhaps fall back to the old suspend-via-signal scheme. For a managed thread we may consider waiting till it encounters a safe point and segfaults. We can't wait forever, though, so we might need to fall back to the suspend-via-signal method for such threads as well.
This problem is currently solved by first waiting for a short timeout (50 µs tentatively, will need tuning) and then sending a suspend signal to all threads. The signal handlers of threads that have already stopped do nothing, while those of threads still running prepare the current thread for a GC and enter the wait loop. Depending on how it was stopped (through a safe point segfault or a suspend signal), each thread is marked as parked at a safe point or not. Synchronization issues are resolved using a CAS (compare-and-swap). The timeout is not required for correctness, but it allows a larger number of threads to converge to a safe point before a collection.
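One way to picture the CAS is as a per-thread state word that both handlers try to claim. The sketch below uses C11 atomics and is only my guess at the shape of it: thread_info, the stop_state values and try_mark_stopped are hypothetical names, not the ones in the patch. Whichever handler wins the compare-and-swap records how the thread was stopped, and the loser sees that the thread is already stopped and does nothing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-thread stop states. */
enum stop_state {
    RUNNING = 0,
    STOPPED_AT_SAFE_POINT, /* stopped by the polling segfault */
    STOPPED_BY_SIGNAL      /* stopped by the fallback suspend signal */
};

/* Hypothetical per-thread record; the real one holds much more. */
typedef struct {
    _Atomic int stop_state;
} thread_info;

/* Called from either handler. Returns true if this caller is the one
 * that stopped the thread, false if it had already been stopped. */
static bool try_mark_stopped(thread_info *t, enum stop_state how)
{
    assert(how != RUNNING);
    int expected = RUNNING;
    /* Succeeds only if the state is still RUNNING, so exactly one of
     * the two stopping paths can win for a given collection. */
    return atomic_compare_exchange_strong(&t->stop_state, &expected,
                                          (int)how);
}

/* Called when the world is restarted. */
static void mark_running(thread_info *t)
{
    atomic_store(&t->stop_state, RUNNING);
}
```

For example, if the safe point handler calls try_mark_stopped first, a suspend signal arriving later gets false back and can return immediately, and the collector knows that this thread's stack can be scanned precisely.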
This method does not yet work with AOT (ahead-of-time) compilation. The safe_point_page is allocated while JITting the CIL instructions, so when the AOT image is saved, the address of that page ends up embedded in the pre-compiled instructions. The next time the AOT code is loaded and executed, it tries to dereference the same old pointer (which essentially has no meaning in the new execution context), leading to spurious segmentation faults. Currently, no polling code is emitted when AOT-compiling.
The next checkpoint will involve benchmarking and tweaking for better performance. I will also look into implementing support for AOT compilation, though that is likely to be slower and trickier to implement.