Diffstat (limited to 'docs/users_guide/runtime_control.rst')
-rw-r--r-- docs/users_guide/runtime_control.rst | 50
1 files changed, 50 insertions, 0 deletions
diff --git a/docs/users_guide/runtime_control.rst b/docs/users_guide/runtime_control.rst
index 19135c61ce..1ae51ddc49 100644
--- a/docs/users_guide/runtime_control.rst
+++ b/docs/users_guide/runtime_control.rst
@@ -643,6 +643,56 @@ performance.
``-F`` parameter will be reduced in order to avoid exceeding the
maximum heap size.
+.. rts-flag:: --numa
+ --numa=<mask>
+
+ .. index::
+ single: NUMA, enabling in the runtime
+
+ Enable NUMA-aware memory allocation in the runtime (only available
+ with ``-threaded``, and only on Linux currently).
+
+ Background: some systems have a Non-Uniform Memory Architecture,
+ whereby main memory is split into banks which are "local" to
+ specific CPU cores. Accessing local memory is faster than
+ accessing remote memory. The OS provides APIs for allocating
+ local memory and binding threads to particular CPU cores, so that
+ we can ensure certain memory accesses are using local memory.
+
+ The ``--numa`` option tells the RTS to tune its memory usage to
+ maximize local memory accesses. In particular, the RTS will:
+
+ - Determine the number of NUMA nodes (N) by querying the OS.
+ - Manage separate memory pools for each node.
+ - Map capabilities to NUMA nodes. Capability C is mapped to
+ NUMA node C mod N.
+ - Bind worker threads on a capability to the appropriate node.
+ - Allocate the nursery from node-local memory.
+ - Perform other memory allocation, including in the GC, from
+ node-local memory.
+ - Prefer, when load-balancing, to migrate threads to another
+ capability on the same node.
+
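The capability-to-node rule in the list above ("Capability C is mapped to NUMA node C mod N") can be illustrated with a minimal sketch. The helper name ``numa_node_for_capability`` is hypothetical and is not part of the RTS API; the sketch also assumes no ``<mask>`` restriction is in effect, since the text does not say how the mapping interacts with a mask.

```c
#include <assert.h>

/* Hypothetical helper (not an RTS function): returns the NUMA node that
 * capability `cap` would be mapped to when the OS reports `num_nodes`
 * NUMA nodes, following the documented rule "C mod N". */
static unsigned numa_node_for_capability(unsigned cap, unsigned num_nodes)
{
    assert(num_nodes > 0);   /* the modulus needs at least one node */
    return cap % num_nodes;  /* capability C -> node C mod N */
}
```

Under this rule, with two NUMA nodes and eight capabilities, even-numbered capabilities land on node 0 and odd-numbered ones on node 1, so consecutive capabilities alternate between nodes.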
+ The ``--numa`` flag is typically beneficial when a program is
+ using all cores of a large multi-core NUMA system, with a large
+ allocation area (``-A``). All memory accesses to the allocation
+ area will then go to local memory, which can avoid a significant
+ number of remote memory accesses. A runtime speedup on the order
+ of 10% is typical, but this can vary considerably depending on the
+ hardware and the memory behaviour of the program.
+
+ Note that the RTS will not set CPU affinity for bound threads and
+ threads entering Haskell from C/C++, so if your program uses bound
+ threads you should ensure that each bound thread calls the RTS API
+ ``rts_setInCallCapability(c,1)`` from C/C++ before calling into
+ Haskell. Otherwise there could be a mismatch between the CPU that
+ the thread is running on and the memory it is using while running
+ Haskell code, which will negate any benefits of ``--numa``.
+
+ If an explicit ``<mask>`` is given, it is interpreted as a bitmap
+ that selects the NUMA nodes on which to run the program. For
+ example, ``--numa=3`` would run the program on NUMA nodes 0 and 1.
+
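To make the bitmap reading of the mask concrete, here is a small sketch that lists the node numbers selected by a mask value. It assumes bit *i* of the mask selects NUMA node *i*, which matches the ``--numa=3`` example above (3 is binary 11, so bits 0 and 1 are set). The helper ``numa_nodes_from_mask`` is hypothetical and only illustrates the decoding; it is not how the RTS itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an RTS function): writes the indices of the
 * set bits of `mask` into `nodes` (at most `max_nodes` entries) and
 * returns how many were written. Bit i set means "use NUMA node i". */
static size_t numa_nodes_from_mask(unsigned long mask,
                                   unsigned nodes[], size_t max_nodes)
{
    size_t count = 0;

    assert(nodes != NULL);
    /* Walk the mask from the least significant bit upwards. */
    for (unsigned bit = 0; mask != 0 && count < max_nodes; bit++, mask >>= 1) {
        if (mask & 1UL) {
            nodes[count++] = bit;
        }
    }
    return count;
}
```

With this reading, a mask of 3 yields nodes 0 and 1, while a mask of 5 (binary 101) would yield nodes 0 and 2.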
.. _rts-options-statistics:
RTS options to produce runtime statistics