author    Matti Picus <matti.picus@gmail.com>    2020-02-19 08:15:52 +0200
committer GitHub <noreply@github.com>            2020-02-18 22:15:52 -0800
commit    5eff78bb16df99ffe2e9cad86c4ec649893c0646 (patch)
tree      41181e6fa03959787935bf415c8c888d3f476280
parent    4d2b5850488013319ff8354a1e764a0a2064fe63 (diff)
NEP: edit and move NEP 38 to accepted status (#15543)
-rw-r--r--  doc/neps/Makefile                         |   9
-rw-r--r--  doc/neps/nep-0038-SIMD-optimizations.rst  | 143
2 files changed, 118 insertions, 34 deletions
diff --git a/doc/neps/Makefile b/doc/neps/Makefile
index 3c023ae9b..799e86888 100644
--- a/doc/neps/Makefile
+++ b/doc/neps/Makefile
@@ -2,15 +2,18 @@
#
# You can set these variables from the command line.
-SPHINXOPTS = -W
-SPHINXBUILD = sphinx-build
+SPHINXOPTS ?=
+SPHINXBUILD ?= LANG=C sphinx-build
+
+# Internal variables
SPHINXPROJ = NumPyEnhancementProposals
SOURCEDIR = .
BUILDDIR = _build
+ALLSPHINXOPTS = -WT --keep-going -n -d $(SPHINXOPTS)
# Put it first so that "make" without argument is like "make help".
help:
- @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(ALLSPHINXOPTS) $(O)
.PHONY: help Makefile index
diff --git a/doc/neps/nep-0038-SIMD-optimizations.rst b/doc/neps/nep-0038-SIMD-optimizations.rst
index a19cded6b..ab16868e4 100644
--- a/doc/neps/nep-0038-SIMD-optimizations.rst
+++ b/doc/neps/nep-0038-SIMD-optimizations.rst
@@ -3,10 +3,10 @@ NEP 38 — Using SIMD optimization instructions for performance
=============================================================
:Author: Sayed Adel, Matti Picus, Ralf Gommers
-:Status: Draft
+:Status: Accepted
:Type: Standards
:Created: 2019-11-25
-:Resolution: none
+:Resolution: http://numpy-discussion.10968.n7.nabble.com/NEP-38-Universal-SIMD-intrinsics-td47854.html
Abstract
@@ -15,7 +15,7 @@ Abstract
While compilers are getting better at using hardware-specific routines to
optimize code, they sometimes do not produce optimal results. Also, we would
like to be able to copy binary optimized C-extension modules from one machine
-to another with the same base architecture (x86, ARM, PowerPC) but with
+to another with the same base architecture (x86, ARM, or PowerPC) but with
different capabilities without recompiling.
We have a mechanism in the ufunc machinery to `build alternative loops`_
@@ -37,17 +37,15 @@ architectures. The steps proposed are to:
Motivation and Scope
--------------------
-Traditionally NumPy has counted on the compilers to generate optimal code
+Traditionally NumPy has depended on compilers to generate optimal code
specifically for the target architecture.
However, few users today compile NumPy locally for their machines. Most use the
binary packages which must provide run-time support for the lowest-common
denominator CPU architecture. Thus NumPy cannot take advantage of
more advanced features of their CPU processors, since they may not be available
-on all users' systems. The ufunc machinery already has a loop-selection
-protocol based on dtypes, so it is easy to extend this to also select an
-optimal loop for specifically available CPU features at runtime.
+on all users' systems.
-Traditionally, these features have been exposed through `intrinsics`_ which are
+Traditionally, CPU features have been exposed through `intrinsics`_ which are
compiler-specific instructions that map directly to assembly instructions.
Recently there were discussions about the effectiveness of adding more
intrinsics (e.g., `gh-11113`_ for AVX optimizations for floats). In the past,
@@ -60,6 +58,7 @@ Recently, OpenCV moved to using `universal intrinsics`_ in the Hardware
Abstraction Layer (HAL) which provided a nice abstraction for common shared
Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar
mechanism for NumPy. There are three stages to using the mechanism:
+
- Infrastructure is provided in the code for abstract intrinsics. The ufunc
machinery will be extended using sets of these abstract intrinsics, so that
a single ufunc will be expressed as a set of loops, going from a minimal to
@@ -78,6 +77,12 @@ The current NEP proposes only to use the runtime feature detection and optimal
loop selection mechanism for ufuncs. Future NEPS may propose other uses for the
proposed solution.
+The ufunc machinery already has the ability to select an optimal loop for
+specifically available CPU features at runtime, currently used for ``avx2``,
+``fma`` and ``avx512f`` loops (in the generated ``__umath_generated.c`` file);
+universal intrinsics would extend the generated code to include more loop
+variants.
+
Usage and Impact
----------------
@@ -123,7 +128,7 @@ which instruction sets can be used at *runtime* via environment variables.
Diagnostics
```````````
-A new dictionary `__cpu_features__` will be available to python. The keys are
+A new dictionary ``__cpu_features__`` will be available to Python. The keys are
the available features, and each value is a boolean indicating whether that
feature is available. Various new private
C functions will be used internally to query available features. These
@@ -136,10 +141,54 @@ Workflow for adding a new CPU architecture-specific optimization
NumPy will always have a baseline C implementation for any code that may be
a candidate for SIMD vectorization. If a contributor wants to add SIMD
support for some architecture (typically the one of most interest to them),
-this is the proposed workflow:
-
-TODO (see https://github.com/numpy/numpy/pull/13516#issuecomment-558859638,
-needs to be worked out more)
+this comment is the beginning of a tutorial on how to do so:
+https://github.com/numpy/numpy/pull/13516#issuecomment-558859638
+
+.. _tradeoffs:
+
+At the time of writing, NumPy has a number of ``avx512f``, ``avx2``, and ``fma``
+SIMD loops for many ufuncs. These would likely be the first candidates
+to be ported to universal intrinsics. The expectation is that the new
+implementation may cause a regression in benchmarks, but not increase the
+size of the binary. If the regression is not minimal, we may choose to keep
+the x86-specific code for that platform and use the universal intrinsic code
+for other platforms.
+
+Any new PRs to implement ufuncs using intrinsics will be expected to use the
+universal intrinsics. If it can be demonstrated that the use of universal
+intrinsics is too awkward or is not performant enough, platform specific code
+may be accepted as well. In rare cases, a single-platform-only PR may be
+accepted, but it would have to be examined within the framework of preferring
+a solution using universal intrinsics.
+
+The subjective criteria for accepting new loops are:
+
+- correctness: the new code must not decrease accuracy by more than 1-3 ULPs
+ even at edge points in the algorithm.
+- code bloat: both source code size and especially binary size of the compiled
+ wheel.
+- maintainability: how readable the code is.
+- performance: benchmarks must show a significant performance boost.
+
+.. _new-intrinsics:
+
+Adding a new intrinsic
+~~~~~~~~~~~~~~~~~~~~~~
+
+If a contributor wants to use a platform-specific SIMD instruction that is not
+yet supported as a universal intrinsic, then:
+
+1. It should be added as a universal intrinsic for all platforms
+2. If it does not have an equivalent instruction on other platforms (e.g.
+ ``_mm512_mask_i32gather_ps`` in ``AVX512``), then no universal intrinsic
+   should be added and a platform-specific ``ufunc`` or a short helper function
+   should be written instead. If such a helper function is used, it must be
+   wrapped with the feature macros, and a reasonable non-intrinsic fallback
+   must be provided for use by default.
+
+We expect (2) to be the exception. The contributor and maintainers should
+consider whether that single-platform intrinsic is worth it compared to using
+the best available universal-intrinsic-based implementation.
Reuse by other projects
```````````````````````
@@ -157,19 +206,8 @@ There should be no impact on backwards compatibility.
Detailed description
--------------------
-Two new build options are available to ``runtests.py`` and ``setup.py build``.
-The absolute minimum required features to compile are defined by
-``--cpu-baseline``. For instance, on ``x86_64`` this defaults to ``SSE3``. The
-set of additional intrinsics that can be detected and used as sets of
-requirements to dispatch on are set by ``--cpu-dispatch``. For instance, on
-``x86_64`` this defaults to ``[SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
-FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
-AVX512_CLX, AVX512_CNL, AVX512_ICL]``. These features are all mapped to a
-c-level boolean array ``npy__cpu_have``, and a c-level convenience function
-``npy_cpu_have(int feature_id)`` queries this array.
-
-The CPU-specific features are then mapped to unversal intrinsics which are
-for all x86 SIMD variants, ARM SIMD variants etc. For example, the
+The CPU-specific features are mapped to universal intrinsics which are
+similar for all x86 SIMD variants, ARM SIMD variants, etc. For example, the
NumPy universal intrinsic ``npyv_load_u32`` maps to:
* ``vld1q_u32`` for ARM based NEON
@@ -180,20 +218,46 @@ Anyone writing a SIMD loop will use the ``npyv_load_u32`` macro instead of the
architecture specific intrinsic. The code also supplies guard macros for
compilation and runtime, so that the proper loops can be chosen.
+Two new build options are available to ``runtests.py`` and ``setup.py``:
+``--cpu-baseline`` and ``--cpu-dispatch``.
+The absolute minimum required features to compile are defined by
+``--cpu-baseline``. For instance, on ``x86_64`` this defaults to ``SSE3``. The
+minimum features will be enabled if the compiler supports them. The
+set of additional intrinsics that can be detected and used as sets of
+requirements to dispatch on are set by ``--cpu-dispatch``. For instance, on
+``x86_64`` this defaults to ``[SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
+FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
+AVX512_CLX, AVX512_CNL, AVX512_ICL]``. These features are all mapped to a
+C-level boolean array ``npy__cpu_have``; a C-level convenience function
+``npy_cpu_have(int feature_id)`` queries this array, and the results are
+stored in ``__cpu_features__`` at runtime.
+
+When importing the ufuncs, the available compiled loops' required features are
+matched to the ones discovered. The loop with the best match is marked to be
+called by the ufunc.
+
Related Work
------------
-- PIXMAX TBD: what is it?
+- `Pixman`_ is the library used by Cairo and X to manipulate pixels. It uses
+ a technique like the one described here to fill a structure with function
+ pointers at runtime. These functions are similar to ufunc loops.
- `Eigen`_ is a C++ template library for linear algebra: matrices, vectors,
numerical solvers, and related algorithms. It is a higher level-abstraction
than the intrinsics discussed here.
- `xsimd`_ is a header-only C++ library for x86 and ARM that implements the
mathematical functions used in the algorithms of ``boost.SIMD``.
+- `Simd`_ is a high-level image processing and machine learning library with
+ optimizations for different platforms.
- OpenCV used to have the one-implementation-per-architecture design, but more
recently moved to a design that is quite similar to what is proposed in this
NEP. The top-level `dispatch code`_ includes a `generic header`_ that is
`specialized at compile time`_ by the CMake build system.
-
+- `VOLK`_ is a GPL3 library used by gnuradio and others to abstract SIMD
+ intrinsics. They offer a set of high-level operations which have been
+ optimized for each architecture.
+- The C++ Standards Committee has proposed `class templates`_ for portable
+ SIMD programming via vector types, and `namespaces`_ for the templates.
Implementation
--------------
@@ -223,15 +287,27 @@ implementing and maintaining that platform's loop code.
Discussion
----------
-.. note::
- Will include a summary of the discussion once the NEP is published
+Most of the discussion took place on `gh-15228`_, the PR to accept this NEP.
+Discussion on the mailing list mentioned `VOLK`_ which was added to
+the section on related work. The question of maintainability was also raised,
+both on the mailing list and in `gh-15228`_, and resolved as follows:
+
+- If contributors want to leverage a specific SIMD instruction, will they be
+  expected to add a software implementation of this instruction for all other
+ architectures too? (see the `new-intrinsics`_ part of the workflow).
+- On whom does the burden lie to verify the code and benchmarks for all
+ architectures? What happens if adding a universal ufunc in place of
+ architecture-specific code helps one architecture but harms performance
+ on another? (answered in the tradeoffs_ part of the workflow).
References and Footnotes
------------------------
.. _`build alternative loops`: https://github.com/numpy/numpy/blob/v1.17.4/numpy/core/code_generators/generate_umath.py#L50
.. _`is chosen`: https://github.com/numpy/numpy/blob/v1.17.4/numpy/core/code_generators/generate_umath.py#L1038
-.. _`gh-11113"`: https://github.com/numpy/numpy/pull/11113
+.. _`gh-11113`: https://github.com/numpy/numpy/pull/11113
+.. _`gh-15228`: https://github.com/numpy/numpy/pull/15228
+.. _`gh-13516`: https://github.com/numpy/numpy/pull/13516
.. _`fast avx512 routines`: https://github.com/numpy/numpy/pulls?q=is%3Apr+avx512+is%3Aclosed
.. [1] Each NEP must either be explicitly labeled as placed in the public domain (see
@@ -240,12 +316,17 @@ References and Footnotes
.. _Open Publication License: https://www.opencontent.org/openpub/
.. _`xsimd`: https://xsimd.readthedocs.io/en/latest/
+.. _`Pixman`: https://gitlab.freedesktop.org/pixman
+.. _`VOLK`: https://www.libvolk.org/doxygen/index.html
.. _`Eigen`: http://eigen.tuxfamily.org/index.php?title=Main_Page
+.. _`Simd`: https://github.com/ermig1979/Simd
.. _`dispatch code`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/src/arithm.dispatch.cpp
.. _`generic header`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/src/arithm.simd.hpp
.. _`specialized at compile time`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/CMakeLists.txt#L3-#L13
.. _`intrinsics`: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics
.. _`universal intrinsics`: https://docs.opencv.org/master/df/d91/group__core__hal__intrin.html
+.. _`class templates`: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
+.. _`namespaces`: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf
Copyright
---------