# Conservative GC Algorithmic Overview

This is a description of the algorithms and data structures used in our
conservative garbage collector. I expect the level of detail to increase with
time. For a survey of GC algorithms, e.g. see Paul Wilson's
["Uniprocessor Garbage Collection Techniques"](ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps)
excellent paper. For an overview of the collector interface, see
[here](gcinterface.md).

This description is targeted primarily at someone trying to understand the
source code. It specifically refers to variable and function names. It may
also be useful for understanding the algorithms at a higher level.

The description here assumes that the collector is used in default mode.
In particular, we assume that it used as a garbage collector, and not just
a leak detector. We initially assume that it is used in stop-the-world,
non-incremental mode, though the presence of the incremental collector will
be apparent in the design. We assume the default finalization model, but the
code affected by that is very localized.

Table of Contents
  * [Introduction](#introduction)
  * [Allocation](#allocation)
  * [Mark phase](#mark-phase)
  * [Sweep phase](#sweep-phase)
  * [Finalization](#finalization)
  * [Generational Collection and Dirty Bits](#generational-collection-and-dirty-bits)
  * [Black-listing](#black-listing)
  * [Thread support](#thread-support)
  * [Thread-local allocation](#thread-local-allocation)

## Introduction

The garbage collector uses a modified mark-sweep algorithm. Conceptually
it operates roughly in four phases, which are performed occasionally as part
of a memory allocation:

  1. _Preparation phase_

     Each object has an associated mark bit. Clear all mark
     bits, indicating that all objects are potentially unreachable.

  2. _Mark phase_

     Marks all objects that can be reachable via chains of pointers from
     variables. Often the collector has no real information about the location
     of pointer variables in the heap, so it views all static data areas,
     stacks and registers as potentially containing pointers. Any bit
     patterns that represent addresses inside heap objects managed by the
     collector are viewed as pointers. Unless the client program has made heap
     object layout information available to the collector, any heap objects
     found to be reachable from variables are again scanned similarly.

  3. _Sweep phase_

     Scans the heap for inaccessible, and hence unmarked,
     objects, and returns them to an appropriate free list for reuse. This is
     not really a separate phase; even in non-incremental mode this operation
     is usually performed on demand during an allocation that discovers an
     empty free list. Thus the sweep phase is very unlikely to touch a page
     that would not have been touched shortly thereafter anyway.

  4. _Finalization phase_

     Unreachable objects which had been registered for
     finalization are enqueued for finalization outside the collector.

The remaining sections describe the memory allocation data structures, and
then the last 3 collection phases in more detail. We conclude by outlining
some of the additional features implemented in the collector.

## Allocation

The collector includes its own memory allocator. The allocator obtains memory
from the system in a platform-dependent way. Under UNIX, it uses either
`malloc`, `sbrk`, or `mmap`.

Most static data used by the allocator, as well as that needed by the rest
of the garbage collector is stored inside the `_GC_arrays` structure. This
allows the garbage collector to easily ignore the collectors own data
structures when it searches for root pointers. Other allocator and collector
internal data structures are allocated dynamically with `GC_scratch_alloc`.
The latter does not allow for deallocation, and is therefore used only for
permanent data structures.

The allocator returns objects of different _kinds_. Different _kinds_ are
handled somewhat differently by certain parts of the garbage collector.
Certain kinds are scanned for pointers, others are not. Some may have
per-object type descriptors that determine pointer locations. Or a specific
kind may correspond to one specific object layout. Two built-in kinds are
uncollectible. In spite of that, it is very likely that most C clients of the
collector currently use at most two kinds: `NORMAL` and `PTRFREE` objects. The
[GCJ](https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcj/) runtime also makes heavy
use of a kind (allocated with `GC_gcj_malloc`) that stores type information
at a known offset in method tables.

The collector uses a two level allocator. A large block is defined to be one
larger than half of `HBLKSIZE`, which is a power of 2, typically on the order
of the page size.

Large block sizes are rounded up to the next multiple of `HBLKSIZE` and then
allocated by `GC_allochblk`. The collector use an approximate best fit
algorithm by keeping free lists for several large block sizes. The actual
implementation of `GC_allochblk` is significantly complicated by black-listing
issues (see below).

Small blocks are allocated in chunks of size `HBLKSIZE`. Each chunk
is dedicated to only one object size and kind.

The allocator maintains separate free lists for each size and kind of object.
Associated with each kind is an array of free list pointers, with entry
`freelist[i]` pointing to a free list of size 'i' objects. Index `i` is
expressed in granules, which are the minimum allocatable unit, typically 8 or
16 bytes. The free lists themselves are linked through the first word in each
object (see `obj_link` macro).

Once a large block is split for use in smaller objects, it can only be used
for objects of that size, unless the collector discovers a completely empty
chunk. Completely empty chunks are restored to the appropriate large block
free list.

In order to avoid allocating blocks for too many distinct object sizes, the
collector normally does not directly allocate objects of every possible
request size. Instead, the request is rounded up to one of a smaller number
of allocated sizes, for which free lists are maintained. The exact allocated
sizes are computed on demand, but subject to the constraint that they increase
roughly in geometric progression. Thus objects requested early in the
execution are likely to be allocated with exactly the requested size, subject
to alignment constraints. See `GC_init_size_map` for details.

The actual size rounding operation during small object allocation
is implemented as a table lookup in `GC_size_map` which maps a requested
allocation size in bytes to a number of granules.

Both collector initialization and computation of allocated sizes are handled
carefully so that they do not slow down the small object fast allocation path.
An attempt to allocate before the collector is initialized, or before the
appropriate `GC_size_map` entry is computed, will take the same path as an
allocation attempt with an empty free list. This results in a call to the slow
path code (`GC_generic_malloc_inner`) which performs the appropriate
initialization checks.

In non-incremental mode, we make a decision about whether to garbage collect
whenever an allocation would otherwise have failed with the current heap size.
If the total amount of allocation since the last collection is less than the
heap size divided by `GC_free_space_divisor`, we try to expand the heap.
Otherwise, we initiate a garbage collection. This ensures that the amount
of garbage collection work per allocated byte remains constant.

The above is in fact an oversimplification of the real heap expansion and GC
triggering heuristic, which adjusts slightly for root size and certain kinds
of fragmentation. In particular:

  * Programs with a large root set size and little live heap memory will
  expand the heap to amortize the cost of scanning the roots.
  * GC actually collects more frequently in non-incremental mode. The large
  block allocator usually refuses to split large heap blocks once the garbage
  collection threshold is reached. This often has the effect of collecting
  well before the heap fills up, thus reducing fragmentation and working set
  size at the expense of GC time. (If the collector is configured not to unmap
  unused pages, GC chooses an intermediate strategy depending on how much
  large object allocation has taken place in the past.)
  * In calculating the amount of allocation since the last collection we give
  partial credit for objects we expect to be explicitly deallocated. Even
  if all objects are explicitly managed, it is often desirable to collect
  on rare occasion, since that is our only mechanism for coalescing completely
  empty chunks.

It has been suggested that this should be adjusted so that we favor expansion
if the resulting heap still fits into physical memory. In many cases, that
would no doubt help. But it is tricky to do this in a way that remains robust
if multiple application are contending for a single pool of physical memory.

## Mark phase

At each collection, the collector marks all objects that are possibly
reachable from pointer variables. Since it cannot generally tell where pointer
variables are located, it scans the following _root segments_ for pointers:

  * The registers. Depending on the architecture, this may be done using
  assembly code, or by calling a `setjmp`-like function which saves register
  contents on the stack.
  * The stack(s). In the case of a single-threaded application, on most
  platforms this is done by scanning the memory between (an approximation of)
  the current stack pointer and `GC_stackbottom`. (For Intel Itanium, the
  register stack scanned separately.) The `GC_stackbottom` variable is set in
  a highly platform-specific way depending on the appropriate configuration
  information in `gcconfig.h`. Note that the currently active stack needs
  to be scanned carefully, since callee-save registers of client code may
  appear inside collector stack frames, which may change during the mark
  process. This is addressed by scanning some sections of the stack _eagerly_,
  effectively capturing a snapshot at one point in time.
  * Static data region(s). In the simplest case, this is the region between
  `DATASTART` and `DATAEND`, as defined in `gcconfig.h`. However, in most
  cases, this will also involve static data regions associated with dynamic
  libraries. These are identified by the mostly platform-specific code
  in `dyn_load.c`.  The marker maintains an explicit stack of memory regions
  that are known to be accessible, but that have not yet been searched for
  contained pointers. Each stack entry contains the starting address of the
  block to be scanned, as well as a descriptor of the block. If no layout
  information is available for the block, then the descriptor is simply
  a length. (For other possibilities, see `gc_mark.h`.)

At the beginning of the mark phase, all root segments (as described above) are
pushed on the stack by `GC_push_roots`. (Registers and eagerly scanned stack
sections are processed by pushing the referenced objects instead of the stack
section itself.) If `ALL_INTERIOR_POINTERS` is not defined, then stack roots
require special treatment. In this case, the normal marking code ignores
interior pointers, but `GC_push_all_stack` explicitly checks for interior
pointers and pushes descriptors for target objects.

The marker is structured to allow incremental marking. Each call
to `GC_mark_some` performs a small amount of work towards marking the heap.
It maintains explicit state in the form of `GC_mark_state`, which identifies
a particular sub-phase. Some other pieces of state, most notably the mark
stack, identify how much work remains to be done in each sub-phase. The normal
progression of mark states for a stop-the-world collection is:

  1. `MS_INVALID` indicating that there may be accessible unmarked objects.
  In this case `GC_objects_are_marked` will simultaneously be false, so the
  mark state is advanced to
  2. `MS_PUSH_UNCOLLECTABLE` indicating that it suffices to push uncollectible
  objects, roots, and then mark everything reachable from them. `GC_scan_ptr`
  is advanced through the heap until all uncollectible objects are pushed, and
  objects reachable from them are marked. At that point, the next call
  to `GC_mark_some` calls `GC_push_roots` to push the roots. It, then,
  advances the mark state to
  3. `MS_ROOTS_PUSHED` asserting that once the mark stack is empty, all
  reachable objects are marked. Once in this state, we work only on emptying
  the mark stack. Once this is completed, the state changes to
  4. `MS_NONE` indicating that reachable objects are marked.

The core mark routine `GC_mark_from`, is called repeatedly by several of the
sub-phases when the mark stack starts to fill up. It is also called repeatedly
in `MS_ROOTS_PUSHED` state to empty the mark stack. The routine is designed
to only perform a limited amount of marking at each call, so that it can
also be used by the incremental collector. It is fairly carefully tuned,
since it usually consumes a large majority of the garbage collection time.

The fact that it performs only a small amount of work per call also allows
it to be used as the core routine of the parallel marker. In that case it is
normally invoked on thread-private mark stacks instead of the global mark
stack. More details can be found [here](scale.md).

The marker correctly handles mark stack overflows. Whenever the mark stack
overflows, the mark state is reset to `MS_INVALID`. Since there are already
marked objects in the heap, this eventually forces a complete scan of the
heap, searching for pointers, during which any unmarked objects referenced
by marked objects are again pushed on the mark stack. This process is repeated
until the mark phase completes without a stack overflow. Each time the stack
overflows, an attempt is made to grow the mark stack. All pieces of the
collector that push regions onto the mark stack have to be careful to ensure
forward progress, even in case of repeated mark stack overflows. Every mark
attempt results in additional marked objects.

Each mark stack entry is processed by examining all candidate pointers in the
range described by the entry. If the region has no associated type information
then this typically requires that each 4-byte aligned quantity (8-byte aligned
if 64-bit pointers) be considered a candidate pointer.

We determine whether a candidate pointer is actually the address of a heap
block. This is done in the following steps:

  * The candidate pointer is checked against rough heap bounds. These heap
  bounds are maintained such that all actual heap objects fall between them.
  In order to facilitate black-listing (see below) we also include address
  regions that the heap is likely to expand into. Most non-pointers fail this
  initial test.
  * The candidate pointer is divided into two pieces; the most significant
  bits identify a `HBLKSIZE`-sized page in the address space, and the least
  significant bits specify an offset within that page. (A hardware page may
  actually consist of multiple such pages. Normally, HBLKSIZE is usually the
  page size divided by a small power of two. Alternatively, if the collector
  is built with `-DLARGE_CONFIG`, such a page may consist of multiple hardware
  pages.)
  * The page address part of the candidate pointer is looked up in
  a [table](tree.md). Each table entry contains either 0, indicating that
  the page is not part of the garbage collected heap, a small integer _n_,
  indicating that the page is part of large object, starting at least _n_
  pages back, or a pointer to a descriptor for the page. In the first case,
  the candidate pointer is not a true pointer and can be safely ignored.
  In the last two cases, we can obtain a descriptor for the page containing
  the beginning of the object.
  * The starting address of the referenced object is computed. The page
  descriptor contains the size of the object(s) in that page, the object kind,
  and the necessary mark bits for those objects. The size information can be
  used to map the candidate pointer to the object starting address.
  To accelerate this process, the page header also contains a pointer to
  a precomputed map of page offsets to displacements from the beginning of an
  object. The use of this map avoids a potentially slow integer remainder
  operation in computing the object start address.
  * The mark bit for the target object is checked and set. If the object was
  previously unmarked, the object is pushed on the mark stack. The descriptor
  is read from the page descriptor. (This is computed from information stored
  in `GC_obj_kinds` when the page is first allocated.)

At the end of the mark phase, mark bits for left-over free lists are cleared,
in case a free list was accidentally marked due to a stray pointer.

## Sweep phase

At the end of the mark phase, all blocks in the heap are examined. Unmarked
large objects are immediately returned to the large object free list. Each
small object page is checked to see if all mark bits are clear. If so, the
entire page is returned to the large object free list. Small object pages
containing some reachable object are queued for later sweeping, unless
we determine that the page contains very little free space, in which case
it is not examined further.

This initial sweep pass touches only block headers, not the blocks themselves.
Thus it does not require significant paging, even if large sections of the
heap are not in physical memory.

Nonempty small object pages are swept when an allocation attempt encounters
an empty free list for that object size and kind. Pages for the correct size
and kind are repeatedly swept until at least one empty block is found.
Sweeping such a page involves scanning the mark bit array in the page header,
and building a free list linked through the first words in the objects
themselves. This does involve touching the appropriate data page, but in most
cases it will be touched only just before it is used for allocation. Hence any
paging is essentially unavoidable.

Except in the case of pointer-free objects, we maintain the invariant that any
object in a small object free list is cleared (except possibly for the link
field). Thus it becomes the burden of the small object sweep routine to clear
objects. This has the advantage that we can easily recover from accidentally
marking a free list, though that could also be handled by other means. The
collector currently spends a fair amount of time clearing objects, and this
approach should probably be revisited. In most configurations, we use
specialized sweep routines to handle common small object sizes. Since
we allocate one mark bit per word, it becomes easier to examine the relevant
mark bits if the object size divides the word length evenly. We also suitably
unroll the inner sweep loop in each case. (It is conceivable that
profile-based procedure cloning in the compiler could make this unnecessary
and counterproductive. I know of no existing compiler to which this applies.)

The sweeping of small object pages could be avoided completely at the expense
of examining mark bits directly in the allocator. This would probably be more
expensive, since each allocation call would have to reload a large amount
of state (e.g. next object address to be swept, position in mark bit table)
before it could do its work. The current scheme keeps the allocator simple and
allows useful optimizations in the sweeper.

## Finalization

Both `GC_register_disappearing_link` and `GC_register_finalizer` add the
request to a corresponding hash table. The hash table is allocated out of
collected memory, but the reference to the finalizable object is hidden from
the collector. Currently finalization requests are processed non-incrementally
at the end of a mark cycle.

The collector makes an initial pass over the table of finalizable objects,
pushing the contents of unmarked objects onto the mark stack. After pushing
each object, the marker is invoked to mark all objects reachable from it. The
object itself is not explicitly marked. This assures that objects on which
a finalizer depends are neither collected nor finalized.

If in the process of marking from an object the object itself becomes marked,
we have uncovered a cycle involving the object. This usually results in
a warning from the collector. Such objects are not finalized, since it may be
unsafe to do so. See the more detailed discussion of
[finalization semantics](finalization.md).

Any objects remaining unmarked at the end of this process are added to a queue
of objects whose finalizers can be run. Depending on collector configuration,
finalizers are dequeued and run either implicitly during allocation calls,
or explicitly in response to a user request. (Note that the former
is unfortunately both the default and not generally safe. If finalizers
perform synchronization, it may result in deadlocks. Nontrivial finalizers
generally need to perform synchronization, and thus require a different
collector configuration.)

The collector provides a mechanism for replacing the procedure that is used
to mark through objects. This is used both to provide support for Java-style
unordered finalization, and to ignore certain kinds of cycles, e.g. those
arising from C++ implementations of virtual inheritance.

## Generational Collection and Dirty Bits

We basically use the concurrent and generational GC algorithm described in
["Mostly Parallel Garbage Collection"](http://www.hboehm.info/gc/papers/pldi91.ps.Z),
by Boehm, Demers, and Shenker.

The most significant modification is that the collector always starts running
in the allocating thread. There is no separate garbage collector thread. (If
parallel GC is enabled, helper threads may also be woken up.) If an allocation
attempt either requests a large object, or encounters an empty small object
free list, and notices that there is a collection in progress, it immediately
performs a small amount of marking work as described above.

This change was made both because we wanted to easily accommodate
single-threaded environments, and because a separate GC thread requires very
careful control over the scheduler to prevent the mutator from out-running the
collector, and hence provoking unneeded heap growth.

In incremental mode, the heap is always expanded when we encounter
insufficient space for an allocation. Garbage collection is triggered whenever
we notice that more than `GC_heap_size / 2 * GC_free_space_divisor` bytes
of allocation have taken place. After `GC_full_freq` minor collections a major
collection is started.

All collections initially run uninterrupted until a predetermined amount
of time (15 ms by default) has expired. If this allows the collection
to complete entirely, we can avoid correcting for data structure modifications
during the collection. If it does not complete, we return control to the
mutator, and perform small amounts of additional GC work during those later
allocations that cannot be satisfied from small object free lists. When
marking completes, the set of modified pages is retrieved, and we mark once
again from marked objects on those pages, this time with the mutator stopped.

We keep track of modified pages using one of several distinct mechanisms:

  * (`MPROTECT_VDB`) By write-protecting physical pages and catching write
  faults. This is implemented for many Unix-like systems and for Win32. It is
  not possible in a few environments.
  * (`GWW_VDB`) By using the Win32 `GetWriteWatch` function to read dirty
  bits.
  * (`PROC_VDB`) By retrieving dirty bit information from /proc. (Currently
  only Sun's Solaris supports this. Though this is considerably cleaner,
  performance may actually be better with `mprotect` and signals.)
  * (`SOFT_VDB`) By retrieving Linux soft-dirty bit information from /proc.
  * (`PCR_VDB`) By relying on an external dirty bit implementation, in this
  case the one in Xerox PCR.
  * Through explicit mutator cooperation. This enabled by
  `GC_set_manual_vdb_allowed(1)` call, and requires the client code to call
  `GC_ptr_store_and_dirty` or `GC_end_stubborn_change` (followed by a number
  of `GC_reachable_here` calls), and is rarely used.
  * (`DEFAULT_VDB`) By treating all pages as dirty. This is the default
  if none of the other techniques is known to be usable. (Practical only for
  testing.)

## Black-listing

The collector implements _black-listing_ of pages, as described in
["Space Efficient Conservative Collection", PLDI'93](https://dl.acm.org/doi/10.1145/173262.155109)
by Boehm, also available
[here](https://www.cs.rice.edu/~javaplt/311/Readings/pldi93.pdf).

During the mark phase, the collector tracks _near misses_, i.e. attempts
to follow a _pointer_ to just outside the garbage-collected heap, or to
a currently unallocated page inside the heap. Pages that have been the targets
of such near misses are likely to be the targets of misidentified _pointers_
in the future. To minimize the future damage caused by such misidentification,
they will be allocated only to small pointer-free objects.

The collector understands two different kinds of black-listing. A page may be
black listed for interior pointer references (`GC_add_to_black_list_stack`),
if it was the target of a near miss from a location that requires interior
pointer recognition, e.g. the stack, or the heap if `GC_all_interior_pointers`
is set. In this case, we also avoid allocating large blocks that include this
page.

If the near miss came from a source that did not require interior pointer
recognition, it is black-listed with `GC_add_to_black_list_normal`. A page
black-listed in this way may appear inside a large object, so long as it is
not the first page of a large object.

The `GC_allochblk` routine respects black-listing when assigning a block to
a particular object kind and size. It occasionally drops (i.e. allocates and
forgets) blocks that are completely black-listed in order to avoid excessively
long large block free lists containing only unusable blocks. This would
otherwise become an issue if there is low demand for small pointer-free
objects.

## Thread support

We support several different threading models. Unfortunately Pthreads, the
only reasonably well standardized thread model, supports too narrow
an interface for conservative garbage collection. There appears to be no
completely portable way to allow the collector to coexist with various
Pthreads implementations. Hence we currently support only the more common
Pthreads implementations.

In particular, it is very difficult for the collector to stop all other
threads in the system and examine the register contents. This is currently
accomplished with very different mechanisms for some Pthreads implementations.
For Linux/HPUX/OSF1, Solaris and Irix it sends signals to individual Pthreads
and has them wait in the signal handler.

The Linux and Irix implementations use only documented Pthreads calls, but
rely on extensions to their semantics. The Linux implementation
`pthread_stop_world.c` relies on only very mild extensions to the pthreads
semantics, and already supports a large number of other Unix-like pthreads
implementations. Our goal is to make this the only pthread support in the
collector.

All implementations must intercept thread creation and a few other
thread-specific calls to allow enumeration of threads and location of thread
stacks. This is current accomplished with `#define`'s in `gc.h` (really
`gc_pthread_redirects.h`), or optionally by using `ld`'s function call
wrapping mechanism under Linux.

The collector support several facilities to enhance the processor-scalability
and thread performance of the collector. These are discussed in more detail
[here](scale.md). We briefly outline the data approach to thread-local
allocation in the next section.

## Thread-local allocation

If thread-local allocation is enabled (which is true in the default
configuration for most supported platforms), the collector keeps separate
arrays of free lists for each thread.

The free list arrays associated with each thread are only used to satisfy
requests for objects that are both very small, and belong to one of a small
number of well-known kinds. These include _normal_, pointer-free, _gcj_ and
_disclaim_ objects.

Thread-local free list entries contain either a pointer to the first element
of a free list, or they contain a counter of the number of allocation
granules, corresponding to objects of this size, allocated so far. Initially
they contain the value one, i.e. a small counter value.

Thread-local allocation goes directly through the global allocator if the
object is of a size or kind not covered by the local free lists.

If there is an appropriate local free list, the allocator checks whether
it contains a sufficiently small counter value. If so, the counter is simply
incremented by a value, and the global allocator is used. In this way,
the initial few allocations of a given size bypass the local allocator.
A thread that only allocates a handful of objects of a given size will not
build up its own free list for that size. This avoids wasting space for
unpopular objects sizes or kinds.

Once the counter passes a threshold, `GC_malloc_many` is called to allocate
roughly `HBLKSIZE` space and put it on the corresponding local free list.
Further allocations of that size and kind then use this free list, and no
longer need to acquire the allocation lock. The allocation procedure
is otherwise similar to the global free lists. The local free lists are also
linked using the first word in the object. In most cases this means they
require considerably less time.

Local free lists are treated by most of the rest of the collector as though
they were in-use reachable data. This requires some care, since pointer-free
objects are not normally traced, and hence a special tracing procedure
is required to mark all objects on pointer-free and gcj local free lists.

On thread exit, any remaining thread-local free list entries are transferred
back to the global free list.

Note that if the collector is configured for thread-local allocation (the
default for most platforms), `GC_malloc` and friends only use thread-local
allocation.

For some more details see [here](scale.md), and the technical report entitled
["Fast Multiprocessor Memory Allocation and Garbage Collection"](http://www.hpl.hp.com/techreports/2000/HPL-2000-165.html).