author    Matti Picus <matti.picus@gmail.com>    2020-02-19 08:15:52 +0200
committer GitHub <noreply@github.com>            2020-02-18 22:15:52 -0800
commit    5eff78bb16df99ffe2e9cad86c4ec649893c0646 (patch)
tree      41181e6fa03959787935bf415c8c888d3f476280
parent    4d2b5850488013319ff8354a1e764a0a2064fe63 (diff)
NEP: edit and move NEP 38 to accepted status (#15543)
-rw-r--r--  doc/neps/Makefile                         |   9
-rw-r--r--  doc/neps/nep-0038-SIMD-optimizations.rst  | 143
2 files changed, 118 insertions, 34 deletions
diff --git a/doc/neps/Makefile b/doc/neps/Makefile
index 3c023ae9b..799e86888 100644
--- a/doc/neps/Makefile
+++ b/doc/neps/Makefile
@@ -2,15 +2,18 @@
#
# You can set these variables from the command line.
-SPHINXOPTS = -W
-SPHINXBUILD = sphinx-build
+SPHINXOPTS ?=
+SPHINXBUILD ?= LANG=C sphinx-build
+
+# Internal variables
SPHINXPROJ = NumPyEnhancementProposals
SOURCEDIR = .
BUILDDIR = _build
+ALLSPHINXOPTS = -WT --keep-going -n -d $(SPHINXOPTS)
# Put it first so that "make" without argument is like "make help".
help:
- @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(ALLSPHINXOPTS) $(O)
.PHONY: help Makefile index
diff --git a/doc/neps/nep-0038-SIMD-optimizations.rst b/doc/neps/nep-0038-SIMD-optimizations.rst
index a19cded6b..ab16868e4 100644
--- a/doc/neps/nep-0038-SIMD-optimizations.rst
+++ b/doc/neps/nep-0038-SIMD-optimizations.rst
@@ -3,10 +3,10 @@ NEP 38 — Using SIMD optimization instructions for performance
=============================================================
:Author: Sayed Adel, Matti Picus, Ralf Gommers
-:Status: Draft
+:Status: Accepted
:Type: Standards
:Created: 2019-11-25
-:Resolution: none
+:Resolution: http://numpy-discussion.10968.n7.nabble.com/NEP-38-Universal-SIMD-intrinsics-td47854.html
Abstract
@@ -15,7 +15,7 @@ Abstract
While compilers are getting better at using hardware-specific routines to
optimize code, they sometimes do not produce optimal results. Also, we would
like to be able to copy binary optimized C-extension modules from one machine
-to another with the same base architecture (x86, ARM, PowerPC) but with
+to another with the same base architecture (x86, ARM, or PowerPC) but with
different capabilities without recompiling.
We have a mechanism in the ufunc machinery to `build alternative loops`_
@@ -37,17 +37,15 @@ architectures. The steps proposed are to:
Motivation and Scope
--------------------
-Traditionally NumPy has counted on the compilers to generate optimal code
+Traditionally NumPy has depended on compilers to generate optimal code
specifically for the target architecture.
However, few users today compile NumPy locally for their machines. Most use the
binary packages which must provide run-time support for the lowest-common
denominator CPU architecture. Thus NumPy cannot take advantage of
more advanced features of their CPU processors, since they may not be available
-on all users' systems. The ufunc machinery already has a loop-selection
-protocol based on dtypes, so it is easy to extend this to also select an
-optimal loop for specifically available CPU features at runtime.
+on all users' systems.
-Traditionally, these features have been exposed through `intrinsics`_ which are
+Traditionally, CPU features have been exposed through `intrinsics`_ which are
compiler-specific instructions that map directly to assembly instructions.
Recently there were discussions about the effectiveness of adding more
intrinsics (e.g., `gh-11113`_ for AVX optimizations for floats). In the past,
@@ -60,6 +58,7 @@ Recently, OpenCV moved to using `universal intrinsics`_ in the Hardware
Abstraction Layer (HAL) which provided a nice abstraction for common shared
Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar
mechanism for NumPy. There are three stages to using the mechanism:
+
- Infrastructure is provided in the code for abstract intrinsics. The ufunc
machinery will be extended using sets of these abstract intrinsics, so that
a single ufunc will be expressed as a set of loops, going from a minimal to
@@ -78,6 +77,12 @@ The current NEP proposes only to use the runtime feature detection and optimal
loop selection mechanism for ufuncs. Future NEPS may propose other uses for the
proposed solution.
+The ufunc machinery already has the ability to select an optimal loop for
+specifically available CPU features at runtime, currently used for ``avx2``,
+``fma`` and ``avx512f`` loops (in the generated ``__umath_generated.c`` file);
+universal intrinsics would extend the generated code to include more loop
+variants.
+
Usage and Impact
----------------
@@ -123,7 +128,7 @@ which instruction sets can be used at *runtime* via environment variables.
Diagnostics
```````````
-A new dictionary `__cpu_features__` will be available to python. The keys are
+A new dictionary ``__cpu_features__`` will be available to Python. The keys are
the available features, and each value is a boolean indicating whether that
feature is available. Various new private
C functions will be used internally to query available features. These
@@ -136,10 +141,54 @@ Workflow for adding a new CPU architecture-specific optimization
NumPy will always have a baseline C implementation for any code that may be
a candidate for SIMD vectorization. If a contributor wants to add SIMD
support for some architecture (typically the one of most interest to them),
-this is the proposed workflow:
-
-TODO (see https://github.com/numpy/numpy/pull/13516#issuecomment-558859638,
-needs to be worked out more)
+this comment is the beginning of a tutorial on how to do so:
+https://github.com/numpy/numpy/pull/13516#issuecomment-558859638
+
+.. _tradeoffs:
+
+At the time of writing, NumPy has a number of ``avx512f``, ``avx2``, and ``fma``
+SIMD loops for many ufuncs. These would likely be the first candidates
+to be ported to universal intrinsics. The expectation is that the new
+implementation may cause a regression in benchmarks, but not increase the
+size of the binary. If the regression is not minimal, we may choose to keep
+the x86-specific code for that platform and use the universal intrinsic code
+for other platforms.
+
+Any new PRs to implement ufuncs using intrinsics will be expected to use the
+universal intrinsics. If it can be demonstrated that the use of universal
+intrinsics is too awkward or is not performant enough, platform specific code
+may be accepted as well. In rare cases, a single-platform-only PR may be
+accepted, but it would have to be examined within the framework of preferring
+a solution using universal intrinsics.
+
+The subjective criteria for accepting new loops are:
+
+- correctness: the new code must not decrease accuracy by more than 1-3 ULPs
+ even at edge points in the algorithm.
+- code bloat: both source code size and especially binary size of the compiled
+ wheel.
+- maintainability: how readable the code is.
+- performance: benchmarks must show a significant performance boost.
+
+.. _new-intrinsics:
+
+Adding a new intrinsic
+~~~~~~~~~~~~~~~~~~~~~~
+
+If a contributor wants to use a platform-specific SIMD instruction that is not
+yet supported as a universal intrinsic, then:
+
+1. It should be added as a universal intrinsic for all platforms
+2. If it does not have an equivalent instruction on other platforms (e.g.
+ ``_mm512_mask_i32gather_ps`` in ``AVX512``), then no universal intrinsic
+   should be added and a platform-specific ``ufunc`` or a short helper function
+   should be written instead. If such a helper function is used, it must be
+   wrapped with the feature macros, and a reasonable non-intrinsic fallback
+   must be provided for use by default.
+
+We expect (2) to be the exception. The contributor and maintainers should
+consider whether that single-platform intrinsic is worth it compared to using
+the best available universal-intrinsic-based implementation.
Reuse by other projects
```````````````````````
@@ -157,19 +206,8 @@ There should be no impact on backwards compatibility.
Detailed description
--------------------
-Two new build options are available to ``runtests.py`` and ``setup.py build``.
-The absolute minimum required features to compile are defined by
-``--cpu-baseline``. For instance, on ``x86_64`` this defaults to ``SSE3``. The
-set of additional intrinsics that can be detected and used as sets of
-requirements to dispatch on are set by ``--cpu-dispatch``. For instance, on
-``x86_64`` this defaults to ``[SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
-FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
-AVX512_CLX, AVX512_CNL, AVX512_ICL]``. These features are all mapped to a
-c-level boolean array ``npy__cpu_have``, and a c-level convenience function
-``npy_cpu_have(int feature_id)`` queries this array.
-
-The CPU-specific features are then mapped to unversal intrinsics which are
-for all x86 SIMD variants, ARM SIMD variants etc. For example, the
+The CPU-specific features are mapped to universal intrinsics which are
+similar for all x86 SIMD variants, ARM SIMD variants, etc. For example, the
NumPy universal intrinsic ``npyv_load_u32`` maps to:
* ``vld1q_u32`` for ARM based NEON
@@ -180,20 +218,46 @@ Anyone writing a SIMD loop will use the ``npyv_load_u32`` macro instead of the
architecture specific intrinsic. The code also supplies guard macros for
compilation and runtime, so that the proper loops can be chosen.
+Two new build options are available to ``runtests.py`` and ``setup.py``:
+``--cpu-baseline`` and ``--cpu-dispatch``.
+The absolute minimum required features to compile are defined by
+``--cpu-baseline``. For instance, on ``x86_64`` this defaults to ``SSE3``. The
+minimum features will be enabled if the compiler supports them. The
+set of additional intrinsics that can be detected and used as sets of
+requirements to dispatch on are set by ``--cpu-dispatch``. For instance, on
+``x86_64`` this defaults to ``[SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
+FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
+AVX512_CLX, AVX512_CNL, AVX512_ICL]``. These features are all mapped to a
+C-level boolean array ``npy__cpu_have``; a C-level convenience function
+``npy_cpu_have(int feature_id)`` queries this array, and the results are
+stored in ``__cpu_features__`` at runtime.
+
+When importing the ufuncs, the available compiled loops' required features are
+matched to the ones discovered. The loop with the best match is marked to be
+called by the ufunc.
+
Related Work
------------
-- PIXMAX TBD: what is it?
+- `Pixman`_ is the library used by Cairo and X to manipulate pixels. It uses
+ a technique like the one described here to fill a structure with function
+ pointers at runtime. These functions are similar to ufunc loops.
- `Eigen`_ is a C++ template library for linear algebra: matrices, vectors,
numerical solvers, and related algorithms. It is a higher level-abstraction
than the intrinsics discussed here.
- `xsimd`_ is a header-only C++ library for x86 and ARM that implements the
mathematical functions used in the algorithms of ``boost.SIMD``.
+- `Simd`_ is a high-level image processing and machine learning library with
+ optimizations for different platforms.
- OpenCV used to have the one-implementation-per-architecture design, but more
recently moved to a design that is quite similar to what is proposed in this
NEP. The top-level `dispatch code`_ includes a `generic header`_ that is
`specialized at compile time`_ by the CMake build system.
-
+- `VOLK`_ is a GPL3 library used by gnuradio and others to abstract SIMD
+ intrinsics. They offer a set of high-level operations which have been
+ optimized for each architecture.
+- The C++ Standards Committee has proposed `class templates`_ for portable
+ SIMD programming via vector types, and `namespaces`_ for the templates.
Implementation
--------------
@@ -223,15 +287,27 @@ implementing and maintaining that platform's loop code.
Discussion
----------
-.. note::
- Will include a summary of the discussion once the NEP is published
+Most of the discussion took place on `gh-15228`_, the PR to accept this NEP.
+Discussion on the mailing list mentioned `VOLK`_ which was added to
+the section on related work. The question of maintainability was also raised,
+both on the mailing list and in `gh-15228`_, and resolved as follows:
+
+- If contributors want to leverage a specific SIMD instruction, will they be
+  expected to add a software implementation of this instruction for all other
+ architectures too? (see the `new-intrinsics`_ part of the workflow).
+- On whom does the burden lie to verify the code and benchmarks for all
+ architectures? What happens if adding a universal ufunc in place of
+ architecture-specific code helps one architecture but harms performance
+ on another? (answered in the tradeoffs_ part of the workflow).
References and Footnotes
------------------------
.. _`build alternative loops`: https://github.com/numpy/numpy/blob/v1.17.4/numpy/core/code_generators/generate_umath.py#L50
.. _`is chosen`: https://github.com/numpy/numpy/blob/v1.17.4/numpy/core/code_generators/generate_umath.py#L1038
-.. _`gh-11113"`: https://github.com/numpy/numpy/pull/11113
+.. _`gh-11113`: https://github.com/numpy/numpy/pull/11113
+.. _`gh-15228`: https://github.com/numpy/numpy/pull/15228
+.. _`gh-13516`: https://github.com/numpy/numpy/pull/13516
.. _`fast avx512 routines`: https://github.com/numpy/numpy/pulls?q=is%3Apr+avx512+is%3Aclosed
.. [1] Each NEP must either be explicitly labeled as placed in the public domain (see
@@ -240,12 +316,17 @@ References and Footnotes
.. _Open Publication License: https://www.opencontent.org/openpub/
.. _`xsimd`: https://xsimd.readthedocs.io/en/latest/
+.. _`Pixman`: https://gitlab.freedesktop.org/pixman
+.. _`VOLK`: https://www.libvolk.org/doxygen/index.html
.. _`Eigen`: http://eigen.tuxfamily.org/index.php?title=Main_Page
+.. _`Simd`: https://github.com/ermig1979/Simd
.. _`dispatch code`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/src/arithm.dispatch.cpp
.. _`generic header`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/src/arithm.simd.hpp
.. _`specialized at compile time`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/CMakeLists.txt#L3-#L13
.. _`intrinsics`: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics
.. _`universal intrinsics`: https://docs.opencv.org/master/df/d91/group__core__hal__intrin.html
+.. _`class templates`: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
+.. _`namespaces`: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf
Copyright
---------