Diffstat (limited to 'doc/source/reference')
37 files changed, 1207 insertions, 664 deletions
diff --git a/doc/source/reference/arrays.classes.rst b/doc/source/reference/arrays.classes.rst index 3a4ed2168..92c271f6b 100644 --- a/doc/source/reference/arrays.classes.rst +++ b/doc/source/reference/arrays.classes.rst @@ -330,6 +330,8 @@ NumPy provides several hooks that classes can customize: returned by :func:`__array__`. This practice will return ``TypeError``. +.. _matrix-objects: + Matrix objects ============== diff --git a/doc/source/reference/arrays.datetime.rst b/doc/source/reference/arrays.datetime.rst index c5947620e..e3b8d270d 100644 --- a/doc/source/reference/arrays.datetime.rst +++ b/doc/source/reference/arrays.datetime.rst @@ -13,16 +13,15 @@ support datetime functionality. The data type is called "datetime64", so named because "datetime" is already taken by the datetime library included in Python. -.. note:: The datetime API is *experimental* in 1.7.0, and may undergo changes - in future versions of NumPy. Basic Datetimes =============== -The most basic way to create datetimes is from strings in -ISO 8601 date or datetime format. The unit for internal storage -is automatically selected from the form of the string, and can -be either a :ref:`date unit <arrays.dtypes.dateunits>` or a +The most basic way to create datetimes is from strings in ISO 8601 date +or datetime format. It is also possible to create datetimes from an integer by +offset relative to the Unix epoch (00:00:00 UTC on 1 January 1970). +The unit for internal storage is automatically selected from the +form of the string, and can be either a :ref:`date unit <arrays.dtypes.dateunits>` or a :ref:`time unit <arrays.dtypes.timeunits>`. The date units are years ('Y'), months ('M'), weeks ('W'), and days ('D'), while the time units are hours ('h'), minutes ('m'), seconds ('s'), milliseconds ('ms'), and @@ -36,6 +35,11 @@ letters, for a "Not A Time" value. >>> np.datetime64('2005-02-25') numpy.datetime64('2005-02-25') + + From an integer and a date unit, 1 year since the UNIX epoch: + + >>> np.datetime64(1, 'Y') + numpy.datetime64('1971') Using months for the unit: diff --git a/doc/source/reference/arrays.ndarray.rst b/doc/source/reference/arrays.ndarray.rst index 191367058..f2204752d 100644 --- a/doc/source/reference/arrays.ndarray.rst +++ b/doc/source/reference/arrays.ndarray.rst @@ -567,10 +567,8 @@ Matrix Multiplication: .. note:: Matrix operators ``@`` and ``@=`` were introduced in Python 3.5 - following PEP465. NumPy 1.10.0 has a preliminary implementation of ``@`` - for testing purposes. Further documentation can be found in the - :func:`matmul` documentation. - + following :pep:`465`, and the ``@`` operator has been introduced in NumPy + 1.10.0. Further information can be found in the :func:`matmul` documentation. Special methods =============== diff --git a/doc/source/reference/arrays.scalars.rst b/doc/source/reference/arrays.scalars.rst index 4b5da2e13..abef66692 100644 --- a/doc/source/reference/arrays.scalars.rst +++ b/doc/source/reference/arrays.scalars.rst @@ -94,112 +94,180 @@ Python Boolean scalar. .. tip:: The default data type in NumPy is :class:`float_`. .. autoclass:: numpy.generic - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.number - :exclude-members: + :members: __init__ + :exclude-members: __init__ Integer types ~~~~~~~~~~~~~ .. autoclass:: numpy.integer - :exclude-members: + :members: __init__ + :exclude-members: __init__ + +.. 
note:: + + The numpy integer types mirror the behavior of C integers, and can therefore + be subject to :ref:`overflow-errors`. Signed integer types ++++++++++++++++++++ .. autoclass:: numpy.signedinteger - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.byte - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.short - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.intc - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.int_ - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.longlong - :exclude-members: + :members: __init__ + :exclude-members: __init__ Unsigned integer types ++++++++++++++++++++++ .. autoclass:: numpy.unsignedinteger - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.ubyte - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.ushort - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.uintc - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.uint - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.ulonglong - :exclude-members: + :members: __init__ + :exclude-members: __init__ Inexact types ~~~~~~~~~~~~~ .. autoclass:: numpy.inexact - :exclude-members: + :members: __init__ + :exclude-members: __init__ + +.. note:: + + Inexact scalars are printed using the fewest decimal digits needed to + distinguish their value from other values of the same datatype, + by judicious rounding. See the ``unique`` parameter of + `format_float_positional` and `format_float_scientific`. + + This means that variables with equal binary values but whose datatypes are of + different precisions may display differently:: + + >>> f16 = np.float16("0.1") + >>> f32 = np.float32(f16) + >>> f64 = np.float64(f32) + >>> f16 == f32 == f64 + True + >>> f16, f32, f64 + (0.1, 0.099975586, 0.0999755859375) + + Note that none of these floats hold the exact value :math:`\frac{1}{10}`; + ``f16`` prints as ``0.1`` because it is as close to that value as possible, + whereas the other types do not as they have more precision and therefore have + closer values. + + Conversely, floating-point scalars of different precisions which approximate + the same decimal value may compare unequal despite printing identically: + + >>> f16 = np.float16("0.1") + >>> f32 = np.float32("0.1") + >>> f64 = np.float64("0.1") + >>> f16 == f32 == f64 + False + >>> f16, f32, f64 + (0.1, 0.1, 0.1) Floating-point types ++++++++++++++++++++ .. autoclass:: numpy.floating - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.half - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.single - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.double - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.longdouble - :exclude-members: + :members: __init__ + :exclude-members: __init__ Complex floating-point types ++++++++++++++++++++++++++++ .. autoclass:: numpy.complexfloating - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.csingle - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. 
autoclass:: numpy.cdouble - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.clongdouble - :exclude-members: + :members: __init__ + :exclude-members: __init__ Other types ~~~~~~~~~~~ .. autoclass:: numpy.bool_ - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.datetime64 - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.timedelta64 - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.object_ - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. note:: @@ -222,16 +290,20 @@ arrays. (In the character codes ``#`` is an integer denoting how many elements the data type consists of.) .. autoclass:: numpy.flexible - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.bytes_ - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.str_ - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. autoclass:: numpy.void - :exclude-members: + :members: __init__ + :exclude-members: __init__ .. warning:: @@ -246,6 +318,8 @@ elements the data type consists of.) convention more consistent with other Python modules such as the :mod:`struct` module. +.. _sized-aliases: + Sized aliases ~~~~~~~~~~~~~ @@ -278,8 +352,8 @@ are also provided. uint32 uint64 - Alias for the unsigned integer types (one of `numpy.byte`, `numpy.short`, - `numpy.intc`, `numpy.int_` and `numpy.longlong`) with the specified number + Alias for the unsigned integer types (one of `numpy.ubyte`, `numpy.ushort`, + `numpy.uintc`, `numpy.uint` and `numpy.ulonglong`) with the specified number of bits. Compatible with the C99 ``uint8_t``, ``uint16_t``, ``uint32_t``, and @@ -297,8 +371,8 @@ are also provided. .. attribute:: uintp - Alias for the unsigned integer type (one of `numpy.byte`, `numpy.short`, - `numpy.intc`, `numpy.int_` and `np.longlong`) that is the same size as a + Alias for the unsigned integer type (one of `numpy.ubyte`, `numpy.ushort`, + `numpy.uintc`, `numpy.uint` and `np.ulonglong`) that is the same size as a pointer. Compatible with the C ``uintptr_t``. diff --git a/doc/source/reference/c-api/array.rst b/doc/source/reference/c-api/array.rst index 3aa541b79..26a8f643d 100644 --- a/doc/source/reference/c-api/array.rst +++ b/doc/source/reference/c-api/array.rst @@ -22,8 +22,8 @@ Array structure and data access These macros access the :c:type:`PyArrayObject` structure members and are defined in ``ndarraytypes.h``. The input argument, *arr*, can be any -:c:type:`PyObject *<PyObject>` that is directly interpretable as a -:c:type:`PyArrayObject *` (any instance of the :c:data:`PyArray_Type` +:c:expr:`PyObject *` that is directly interpretable as a +:c:expr:`PyArrayObject *` (any instance of the :c:data:`PyArray_Type` and its sub-types). .. c:function:: int PyArray_NDIM(PyArrayObject *arr) @@ -151,6 +151,16 @@ and its sub-types). `numpy.ndarray.item` is identical to PyArray_GETITEM. +.. c:function:: int PyArray_FinalizeFunc(PyArrayObject* arr, PyObject* obj) + + The function pointed to by the CObject + :obj:`~numpy.class.__array_finalize__`. + The first argument is the newly created sub-type. The second argument + (if not NULL) is the "parent" array (if the array was created using + slicing or some other operation where a clearly-distinguishable parent + is present). This routine can do anything it wants to. It should + return a -1 on error and 0 otherwise. 
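The Python-level counterpart of ``PyArray_FinalizeFunc`` is the ``__array_finalize__`` hook on ndarray subclasses. A minimal sketch of how the "parent" argument described above surfaces in Python (the ``TaggedArray`` class and its ``tag`` attribute are illustrative only, not part of NumPy):

.. code-block:: python

    import numpy as np

    class TaggedArray(np.ndarray):
        def __new__(cls, input_array, tag=None):
            obj = np.asarray(input_array).view(cls)
            obj.tag = tag
            return obj

        def __array_finalize__(self, obj):
            # ``obj`` is None on explicit construction; for slicing and
            # view casting it is the "parent" array, the same parent the
            # C-level finalize function receives as its second argument.
            if obj is None:
                return
            self.tag = getattr(obj, 'tag', None)

    a = TaggedArray([1, 2, 3], tag='demo')
    assert a[1:].tag == 'demo'   # the slice inherited the parent's tag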
+ Data access ^^^^^^^^^^^ @@ -825,7 +835,7 @@ General check of Python Type Evaluates true if *op* is an instance of (a subclass of) :c:data:`PyArray_Type` and has 0 dimensions. -.. c:function:: PyArray_IsScalar(op, cls) +.. c:macro:: PyArray_IsScalar(op, cls) Evaluates true if *op* is an instance of ``Py{cls}ArrType_Type``. @@ -864,8 +874,8 @@ Data-type checking For the typenum macros, the argument is an integer representing an enumerated array data type. For the array type checking macros the -argument must be a :c:type:`PyObject *<PyObject>` that can be directly interpreted as a -:c:type:`PyArrayObject *`. +argument must be a :c:expr:`PyObject *` that can be directly interpreted as a +:c:expr:`PyArrayObject *`. .. c:function:: int PyTypeNum_ISUNSIGNED(int num) @@ -1022,7 +1032,7 @@ argument must be a :c:type:`PyObject *<PyObject>` that can be directly interpret .. c:function:: int PyArray_EquivByteorders(int b1, int b2) - True if byteorder characters ( :c:data:`NPY_LITTLE`, + True if byteorder characters *b1* and *b2* ( :c:data:`NPY_LITTLE`, :c:data:`NPY_BIG`, :c:data:`NPY_NATIVE`, :c:data:`NPY_IGNORE` ) are either equal or equivalent as to their specification of a native byte order. Thus, on a little-endian machine :c:data:`NPY_LITTLE` @@ -1250,8 +1260,8 @@ Converting data types function returns :c:data:`NPY_FALSE`. -New data types -^^^^^^^^^^^^^^ +User-defined data types +^^^^^^^^^^^^^^^^^^^^^^^ .. c:function:: void PyArray_InitArrFuncs(PyArray_ArrFuncs* f) @@ -1295,6 +1305,13 @@ New data types *descr* can be cast safely to a data-type whose type_number is *totype*. +.. c:function:: int PyArray_TypeNumFromName( \ + char const *str) + + Given a string return the type-number for the data-type with that string as + the type-object name. + Returns ``NPY_NOTYPE`` without setting an error if no type can be found. + Only works for user-defined data-types. Special functions for NPY_OBJECT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -2647,6 +2664,12 @@ cost of a slight overhead. - If the position of iter is changed, any subsequent call to PyArrayNeighborhoodIter_Next is undefined behavior, and PyArrayNeighborhoodIter_Reset must be called. + - If the position of iter is not the beginning of the data and the + underlying data for iter is contiguous, the iterator will point to the + start of the data instead of position pointed by iter. + To avoid this situation, iter should be moved to the required position + only after the creation of iterator, and PyArrayNeighborhoodIter_Reset + must be called. .. code-block:: c @@ -2656,7 +2679,7 @@ cost of a slight overhead. /*For a 3x3 kernel */ bounds = {-1, 1, -1, 1}; - neigh_iter = (PyArrayNeighborhoodIterObject*)PyArrayNeighborhoodIter_New( + neigh_iter = (PyArrayNeighborhoodIterObject*)PyArray_NeighborhoodIterNew( iter, bounds, NPY_NEIGHBORHOOD_ITER_ZERO_PADDING, NULL); for(i = 0; i < iter->size; ++i) { @@ -2781,14 +2804,14 @@ Data-type descriptors Data-type objects must be reference counted so be aware of the action on the data-type reference of different C-API calls. The standard rule is that when a data-type object is returned it is a - new reference. Functions that take :c:type:`PyArray_Descr *` objects and + new reference. Functions that take :c:expr:`PyArray_Descr *` objects and return arrays steal references to the data-type their inputs unless otherwise noted. Therefore, you must own a reference to any data-type object used as input to such a function. .. 
c:function:: int PyArray_DescrCheck(PyObject* obj) - Evaluates as true if *obj* is a data-type object ( :c:type:`PyArray_Descr *` ). + Evaluates as true if *obj* is a data-type object ( :c:expr:`PyArray_Descr *` ). .. c:function:: PyArray_Descr* PyArray_DescrNew(PyArray_Descr* obj) @@ -2815,12 +2838,14 @@ Data-type descriptors (recursively). The value of *newendian* is one of these macros: +.. + dedent the enumeration of flags to avoid missing references sphinx warnings - .. c:macro:: NPY_IGNORE - NPY_SWAP - NPY_NATIVE - NPY_LITTLE - NPY_BIG +.. c:macro:: NPY_IGNORE + NPY_SWAP + NPY_NATIVE + NPY_LITTLE + NPY_BIG If a byteorder of :c:data:`NPY_IGNORE` is encountered it is left alone. If newendian is :c:data:`NPY_SWAP`, then all byte-orders @@ -3485,10 +3510,6 @@ Miscellaneous Macros Evaluates as True if arrays *a1* and *a2* have the same shape. -.. c:var:: a - -.. c:var:: b - .. c:macro:: PyArray_MAX(a,b) Returns the maximum of *a* and *b*. If (*a*) or (*b*) are @@ -3547,22 +3568,22 @@ Miscellaneous Macros Enumerated Types ^^^^^^^^^^^^^^^^ -.. c:type:: NPY_SORTKIND +.. c:enum:: NPY_SORTKIND A special variable-type which can take on different values to indicate the sorting algorithm being used. - .. c:var:: NPY_QUICKSORT + .. c:enumerator:: NPY_QUICKSORT - .. c:var:: NPY_HEAPSORT + .. c:enumerator:: NPY_HEAPSORT - .. c:var:: NPY_MERGESORT + .. c:enumerator:: NPY_MERGESORT - .. c:var:: NPY_STABLESORT + .. c:enumerator:: NPY_STABLESORT Used as an alias of :c:data:`NPY_MERGESORT` and vica versa. - .. c:var:: NPY_NSORTS + .. c:enumerator:: NPY_NSORTS Defined to be the number of sorts. It is fixed at three by the need for backwards compatibility, and consequently :c:data:`NPY_MERGESORT` and @@ -3570,90 +3591,90 @@ Enumerated Types of several stable sorting algorithms depending on the data type. -.. c:type:: NPY_SCALARKIND +.. c:enum:: NPY_SCALARKIND A special variable type indicating the number of "kinds" of scalars distinguished in determining scalar-coercion rules. This variable can take on the values: - .. c:var:: NPY_NOSCALAR + .. c:enumerator:: NPY_NOSCALAR - .. c:var:: NPY_BOOL_SCALAR + .. c:enumerator:: NPY_BOOL_SCALAR - .. c:var:: NPY_INTPOS_SCALAR + .. c:enumerator:: NPY_INTPOS_SCALAR - .. c:var:: NPY_INTNEG_SCALAR + .. c:enumerator:: NPY_INTNEG_SCALAR - .. c:var:: NPY_FLOAT_SCALAR + .. c:enumerator:: NPY_FLOAT_SCALAR - .. c:var:: NPY_COMPLEX_SCALAR + .. c:enumerator:: NPY_COMPLEX_SCALAR - .. c:var:: NPY_OBJECT_SCALAR + .. c:enumerator:: NPY_OBJECT_SCALAR - .. c:var:: NPY_NSCALARKINDS + .. c:enumerator:: NPY_NSCALARKINDS Defined to be the number of scalar kinds (not including :c:data:`NPY_NOSCALAR`). -.. c:type:: NPY_ORDER +.. c:enum:: NPY_ORDER An enumeration type indicating the element order that an array should be interpreted in. When a brand new array is created, generally only **NPY_CORDER** and **NPY_FORTRANORDER** are used, whereas when one or more inputs are provided, the order can be based on them. - .. c:var:: NPY_ANYORDER + .. c:enumerator:: NPY_ANYORDER Fortran order if all the inputs are Fortran, C otherwise. - .. c:var:: NPY_CORDER + .. c:enumerator:: NPY_CORDER C order. - .. c:var:: NPY_FORTRANORDER + .. c:enumerator:: NPY_FORTRANORDER Fortran order. - .. c:var:: NPY_KEEPORDER + .. c:enumerator:: NPY_KEEPORDER An order as close to the order of the inputs as possible, even if the input is in neither C nor Fortran order. -.. c:type:: NPY_CLIPMODE +.. c:enum:: NPY_CLIPMODE A variable type indicating the kind of clipping that should be applied in certain functions. - .. 
c:var:: NPY_RAISE + .. c:enumerator:: NPY_RAISE The default for most operations, raises an exception if an index is out of bounds. - .. c:var:: NPY_CLIP + .. c:enumerator:: NPY_CLIP Clips an index to the valid range if it is out of bounds. - .. c:var:: NPY_WRAP + .. c:enumerator:: NPY_WRAP Wraps an index to the valid range if it is out of bounds. -.. c:type:: NPY_SEARCHSIDE +.. c:enum:: NPY_SEARCHSIDE A variable type indicating whether the index returned should be that of the first suitable location (if :c:data:`NPY_SEARCHLEFT`) or of the last (if :c:data:`NPY_SEARCHRIGHT`). - .. c:var:: NPY_SEARCHLEFT + .. c:enumerator:: NPY_SEARCHLEFT - .. c:var:: NPY_SEARCHRIGHT + .. c:enumerator:: NPY_SEARCHRIGHT -.. c:type:: NPY_SELECTKIND +.. c:enum:: NPY_SELECTKIND A variable type indicating the selection algorithm being used. - .. c:var:: NPY_INTROSELECT + .. c:enumerator:: NPY_INTROSELECT -.. c:type:: NPY_CASTING +.. c:enum:: NPY_CASTING .. versionadded:: 1.6 @@ -3661,25 +3682,25 @@ Enumerated Types be. This is used by the iterator added in NumPy 1.6, and is intended to be used more broadly in a future version. - .. c:var:: NPY_NO_CASTING + .. c:enumerator:: NPY_NO_CASTING Only allow identical types. - .. c:var:: NPY_EQUIV_CASTING + .. c:enumerator:: NPY_EQUIV_CASTING Allow identical and casts involving byte swapping. - .. c:var:: NPY_SAFE_CASTING + .. c:enumerator:: NPY_SAFE_CASTING Only allow casts which will not cause values to be rounded, truncated, or otherwise changed. - .. c:var:: NPY_SAME_KIND_CASTING + .. c:enumerator:: NPY_SAME_KIND_CASTING Allow any safe casts, and casts between types of the same kind. For example, float64 -> float32 is permitted with this rule. - .. c:var:: NPY_UNSAFE_CASTING + .. c:enumerator:: NPY_UNSAFE_CASTING Allow any cast, no matter what kind of data loss may occur. diff --git a/doc/source/reference/c-api/coremath.rst b/doc/source/reference/c-api/coremath.rst index 338c584a1..e129fdd77 100644 --- a/doc/source/reference/c-api/coremath.rst +++ b/doc/source/reference/c-api/coremath.rst @@ -46,30 +46,30 @@ Floating point classification corresponding single and extension precision macro are available with the suffix F and L. -.. c:function:: int npy_isnan(x) +.. c:macro:: npy_isnan(x) This is a macro, and is equivalent to C99 isnan: works for single, double - and extended precision, and return a non 0 value is x is a NaN. + and extended precision, and return a non 0 value if x is a NaN. -.. c:function:: int npy_isfinite(x) +.. c:macro:: npy_isfinite(x) This is a macro, and is equivalent to C99 isfinite: works for single, - double and extended precision, and return a non 0 value is x is neither a + double and extended precision, and return a non 0 value if x is neither a NaN nor an infinity. -.. c:function:: int npy_isinf(x) +.. c:macro:: npy_isinf(x) This is a macro, and is equivalent to C99 isinf: works for single, double - and extended precision, and return a non 0 value is x is infinite (positive + and extended precision, and return a non 0 value if x is infinite (positive and negative). -.. c:function:: int npy_signbit(x) +.. c:macro:: npy_signbit(x) This is a macro, and is equivalent to C99 signbit: works for single, double - and extended precision, and return a non 0 value is x has the signbit set + and extended precision, and return a non 0 value if x has the signbit set (that is the number is negative). -.. c:function:: double npy_copysign(double x, double y) +.. 
c:macro:: npy_copysign(x, y) This is a function equivalent to C99 copysign: return x with the same sign as y. Works for any value, including inf and nan. Single and extended diff --git a/doc/source/reference/c-api/dtype.rst b/doc/source/reference/c-api/dtype.rst index a1a53cdb6..382e45dc0 100644 --- a/doc/source/reference/c-api/dtype.rst +++ b/doc/source/reference/c-api/dtype.rst @@ -25,157 +25,157 @@ select the precision desired. Enumerated Types ---------------- -.. c:var:: NPY_TYPES +.. c:enumerator:: NPY_TYPES There is a list of enumerated types defined providing the basic 24 data types plus some useful generic names. Whenever the code requires a type number, one of these enumerated types is requested. The types are all called ``NPY_{NAME}``: -.. c:var:: NPY_BOOL +.. c:enumerator:: NPY_BOOL The enumeration value for the boolean type, stored as one byte. It may only be set to the values 0 and 1. -.. c:var:: NPY_BYTE -.. c:var:: NPY_INT8 +.. c:enumerator:: NPY_BYTE +.. c:enumerator:: NPY_INT8 The enumeration value for an 8-bit/1-byte signed integer. -.. c:var:: NPY_SHORT -.. c:var:: NPY_INT16 +.. c:enumerator:: NPY_SHORT +.. c:enumerator:: NPY_INT16 The enumeration value for a 16-bit/2-byte signed integer. -.. c:var:: NPY_INT -.. c:var:: NPY_INT32 +.. c:enumerator:: NPY_INT +.. c:enumerator:: NPY_INT32 The enumeration value for a 32-bit/4-byte signed integer. -.. c:var:: NPY_LONG +.. c:enumerator:: NPY_LONG Equivalent to either NPY_INT or NPY_LONGLONG, depending on the platform. -.. c:var:: NPY_LONGLONG -.. c:var:: NPY_INT64 +.. c:enumerator:: NPY_LONGLONG +.. c:enumerator:: NPY_INT64 The enumeration value for a 64-bit/8-byte signed integer. -.. c:var:: NPY_UBYTE -.. c:var:: NPY_UINT8 +.. c:enumerator:: NPY_UBYTE +.. c:enumerator:: NPY_UINT8 The enumeration value for an 8-bit/1-byte unsigned integer. -.. c:var:: NPY_USHORT -.. c:var:: NPY_UINT16 +.. c:enumerator:: NPY_USHORT +.. c:enumerator:: NPY_UINT16 The enumeration value for a 16-bit/2-byte unsigned integer. -.. c:var:: NPY_UINT -.. c:var:: NPY_UINT32 +.. c:enumerator:: NPY_UINT +.. c:enumerator:: NPY_UINT32 The enumeration value for a 32-bit/4-byte unsigned integer. -.. c:var:: NPY_ULONG +.. c:enumerator:: NPY_ULONG Equivalent to either NPY_UINT or NPY_ULONGLONG, depending on the platform. -.. c:var:: NPY_ULONGLONG -.. c:var:: NPY_UINT64 +.. c:enumerator:: NPY_ULONGLONG +.. c:enumerator:: NPY_UINT64 The enumeration value for a 64-bit/8-byte unsigned integer. -.. c:var:: NPY_HALF -.. c:var:: NPY_FLOAT16 +.. c:enumerator:: NPY_HALF +.. c:enumerator:: NPY_FLOAT16 The enumeration value for a 16-bit/2-byte IEEE 754-2008 compatible floating point type. -.. c:var:: NPY_FLOAT -.. c:var:: NPY_FLOAT32 +.. c:enumerator:: NPY_FLOAT +.. c:enumerator:: NPY_FLOAT32 The enumeration value for a 32-bit/4-byte IEEE 754 compatible floating point type. -.. c:var:: NPY_DOUBLE -.. c:var:: NPY_FLOAT64 +.. c:enumerator:: NPY_DOUBLE +.. c:enumerator:: NPY_FLOAT64 The enumeration value for a 64-bit/8-byte IEEE 754 compatible floating point type. -.. c:var:: NPY_LONGDOUBLE +.. c:enumerator:: NPY_LONGDOUBLE The enumeration value for a platform-specific floating point type which is at least as large as NPY_DOUBLE, but larger on many platforms. -.. c:var:: NPY_CFLOAT -.. c:var:: NPY_COMPLEX64 +.. c:enumerator:: NPY_CFLOAT +.. c:enumerator:: NPY_COMPLEX64 The enumeration value for a 64-bit/8-byte complex type made up of two NPY_FLOAT values. -.. c:var:: NPY_CDOUBLE -.. c:var:: NPY_COMPLEX128 +.. c:enumerator:: NPY_CDOUBLE +.. 
c:enumerator:: NPY_COMPLEX128 The enumeration value for a 128-bit/16-byte complex type made up of two NPY_DOUBLE values. -.. c:var:: NPY_CLONGDOUBLE +.. c:enumerator:: NPY_CLONGDOUBLE The enumeration value for a platform-specific complex floating point type which is made up of two NPY_LONGDOUBLE values. -.. c:var:: NPY_DATETIME +.. c:enumerator:: NPY_DATETIME The enumeration value for a data type which holds dates or datetimes with a precision based on selectable date or time units. -.. c:var:: NPY_TIMEDELTA +.. c:enumerator:: NPY_TIMEDELTA The enumeration value for a data type which holds lengths of times in integers of selectable date or time units. -.. c:var:: NPY_STRING +.. c:enumerator:: NPY_STRING The enumeration value for ASCII strings of a selectable size. The strings have a fixed maximum size within a given array. -.. c:var:: NPY_UNICODE +.. c:enumerator:: NPY_UNICODE The enumeration value for UCS4 strings of a selectable size. The strings have a fixed maximum size within a given array. -.. c:var:: NPY_OBJECT +.. c:enumerator:: NPY_OBJECT The enumeration value for references to arbitrary Python objects. -.. c:var:: NPY_VOID +.. c:enumerator:: NPY_VOID Primarily used to hold struct dtypes, but can contain arbitrary binary data. Some useful aliases of the above types are -.. c:var:: NPY_INTP +.. c:enumerator:: NPY_INTP The enumeration value for a signed integer type which is the same size as a (void \*) pointer. This is the type used by all arrays of indices. -.. c:var:: NPY_UINTP +.. c:enumerator:: NPY_UINTP The enumeration value for an unsigned integer type which is the same size as a (void \*) pointer. -.. c:var:: NPY_MASK +.. c:enumerator:: NPY_MASK The enumeration value of the type used for masks, such as with the :c:data:`NPY_ITER_ARRAYMASK` iterator flag. This is equivalent to :c:data:`NPY_UINT8`. -.. c:var:: NPY_DEFAULT_TYPE +.. c:enumerator:: NPY_DEFAULT_TYPE The default type to use when no dtype is explicitly specified, for example when calling np.zero(shape). This is equivalent to @@ -297,9 +297,13 @@ Boolean Unsigned versions of the integers can be defined by pre-pending a 'u' to the front of the integer name. -.. c:type:: npy_(u)byte +.. c:type:: npy_byte - (unsigned) char + char + +.. c:type:: npy_ubyte + + unsigned char .. c:type:: npy_short @@ -309,14 +313,14 @@ to the front of the integer name. unsigned short -.. c:type:: npy_uint - - unsigned int - .. c:type:: npy_int int +.. c:type:: npy_uint + + unsigned int + .. c:type:: npy_int16 16-bit integer @@ -341,13 +345,21 @@ to the front of the integer name. 64-bit unsigned integer -.. c:type:: npy_(u)long +.. c:type:: npy_long - (unsigned) long int + long int -.. c:type:: npy_(u)longlong +.. c:type:: npy_ulong - (unsigned long long int) + unsigned long int + +.. c:type:: npy_longlong + + long long int + +.. c:type:: npy_ulonglong + + unsigned long long int .. c:type:: npy_intp @@ -367,18 +379,30 @@ to the front of the integer name. 16-bit float -.. c:type:: npy_(c)float +.. c:type:: npy_float 32-bit float -.. c:type:: npy_(c)double +.. c:type:: npy_cfloat + + 32-bit complex float + +.. c:type:: npy_double 64-bit double -.. c:type:: npy_(c)longdouble +.. c:type:: npy_cdouble + + 64-bit complex double + +.. c:type:: npy_longdouble long double +.. c:type:: npy_clongdouble + + long complex double + complex types are structures with **.real** and **.imag** members (in that order). 
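The typedefs above surface in Python as same-named scalar type aliases. A short sketch of the correspondence (pointer and ``long`` widths are platform-dependent; the fixed-width relations below hold everywhere):

.. code-block:: python

    import ctypes
    import numpy as np

    # np.intp is defined to be pointer-sized, matching npy_intp.
    assert np.dtype(np.intp).itemsize == ctypes.sizeof(ctypes.c_void_p)

    # npy_cfloat corresponds to np.csingle: two 32-bit floats.
    assert np.dtype(np.csingle).itemsize == 2 * np.dtype(np.single).itemsize

    # Complex scalars expose the (.real, .imag) structure members, in that order.
    z = np.cdouble(3.0 + 4.0j)
    print(z.real, z.imag)   # 3.0 4.0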
diff --git a/doc/source/reference/c-api/iterator.rst b/doc/source/reference/c-api/iterator.rst index ae96bb3fb..2208cdd2f 100644 --- a/doc/source/reference/c-api/iterator.rst +++ b/doc/source/reference/c-api/iterator.rst @@ -312,318 +312,322 @@ Construction and Destruction Flags that may be passed in ``flags``, applying to the whole iterator, are: +.. + dedent the enumeration of flags to avoid missing references sphinx warnings - .. c:macro:: NPY_ITER_C_INDEX +.. c:macro:: NPY_ITER_C_INDEX - Causes the iterator to track a raveled flat index matching C - order. This option cannot be used with :c:data:`NPY_ITER_F_INDEX`. + Causes the iterator to track a raveled flat index matching C + order. This option cannot be used with :c:data:`NPY_ITER_F_INDEX`. - .. c:macro:: NPY_ITER_F_INDEX +.. c:macro:: NPY_ITER_F_INDEX - Causes the iterator to track a raveled flat index matching Fortran - order. This option cannot be used with :c:data:`NPY_ITER_C_INDEX`. + Causes the iterator to track a raveled flat index matching Fortran + order. This option cannot be used with :c:data:`NPY_ITER_C_INDEX`. + +.. c:macro:: NPY_ITER_MULTI_INDEX - .. c:macro:: NPY_ITER_MULTI_INDEX - - Causes the iterator to track a multi-index. - This prevents the iterator from coalescing axes to - produce bigger inner loops. If the loop is also not buffered - and no index is being tracked (`NpyIter_RemoveAxis` can be called), - then the iterator size can be ``-1`` to indicate that the iterator - is too large. This can happen due to complex broadcasting and - will result in errors being created when the setting the iterator - range, removing the multi index, or getting the next function. - However, it is possible to remove axes again and use the iterator - normally if the size is small enough after removal. - - .. c:macro:: NPY_ITER_EXTERNAL_LOOP - - Causes the iterator to skip iteration of the innermost - loop, requiring the user of the iterator to handle it. - - This flag is incompatible with :c:data:`NPY_ITER_C_INDEX`, - :c:data:`NPY_ITER_F_INDEX`, and :c:data:`NPY_ITER_MULTI_INDEX`. - - .. c:macro:: NPY_ITER_DONT_NEGATE_STRIDES - - This only affects the iterator when :c:type:`NPY_KEEPORDER` is - specified for the order parameter. By default with - :c:type:`NPY_KEEPORDER`, the iterator reverses axes which have - negative strides, so that memory is traversed in a forward - direction. This disables this step. Use this flag if you - want to use the underlying memory-ordering of the axes, - but don't want an axis reversed. This is the behavior of - ``numpy.ravel(a, order='K')``, for instance. - - .. c:macro:: NPY_ITER_COMMON_DTYPE - - Causes the iterator to convert all the operands to a common - data type, calculated based on the ufunc type promotion rules. - Copying or buffering must be enabled. - - If the common data type is known ahead of time, don't use this - flag. Instead, set the requested dtype for all the operands. - - .. c:macro:: NPY_ITER_REFS_OK - - Indicates that arrays with reference types (object - arrays or structured arrays containing an object type) - may be accepted and used in the iterator. If this flag - is enabled, the caller must be sure to check whether - :c:func:`NpyIter_IterationNeedsAPI(iter)` is true, in which case - it may not release the GIL during iteration. - - .. c:macro:: NPY_ITER_ZEROSIZE_OK - - Indicates that arrays with a size of zero should be permitted. 
- Since the typical iteration loop does not naturally work with - zero-sized arrays, you must check that the IterSize is larger - than zero before entering the iteration loop. - Currently only the operands are checked, not a forced shape. - - .. c:macro:: NPY_ITER_REDUCE_OK - - Permits writeable operands with a dimension with zero - stride and size greater than one. Note that such operands - must be read/write. - - When buffering is enabled, this also switches to a special - buffering mode which reduces the loop length as necessary to - not trample on values being reduced. - - Note that if you want to do a reduction on an automatically - allocated output, you must use :c:func:`NpyIter_GetOperandArray` - to get its reference, then set every value to the reduction - unit before doing the iteration loop. In the case of a - buffered reduction, this means you must also specify the - flag :c:data:`NPY_ITER_DELAY_BUFALLOC`, then reset the iterator - after initializing the allocated operand to prepare the - buffers. - - .. c:macro:: NPY_ITER_RANGED - - Enables support for iteration of sub-ranges of the full - ``iterindex`` range ``[0, NpyIter_IterSize(iter))``. Use - the function :c:func:`NpyIter_ResetToIterIndexRange` to specify - a range for iteration. - - This flag can only be used with :c:data:`NPY_ITER_EXTERNAL_LOOP` - when :c:data:`NPY_ITER_BUFFERED` is enabled. This is because - without buffering, the inner loop is always the size of the - innermost iteration dimension, and allowing it to get cut up - would require special handling, effectively making it more - like the buffered version. - - .. c:macro:: NPY_ITER_BUFFERED - - Causes the iterator to store buffering data, and use buffering - to satisfy data type, alignment, and byte-order requirements. - To buffer an operand, do not specify the :c:data:`NPY_ITER_COPY` - or :c:data:`NPY_ITER_UPDATEIFCOPY` flags, because they will - override buffering. Buffering is especially useful for Python - code using the iterator, allowing for larger chunks - of data at once to amortize the Python interpreter overhead. - - If used with :c:data:`NPY_ITER_EXTERNAL_LOOP`, the inner loop - for the caller may get larger chunks than would be possible - without buffering, because of how the strides are laid out. - - Note that if an operand is given the flag :c:data:`NPY_ITER_COPY` - or :c:data:`NPY_ITER_UPDATEIFCOPY`, a copy will be made in preference - to buffering. Buffering will still occur when the array was - broadcast so elements need to be duplicated to get a constant - stride. - - In normal buffering, the size of each inner loop is equal - to the buffer size, or possibly larger if - :c:data:`NPY_ITER_GROWINNER` is specified. If - :c:data:`NPY_ITER_REDUCE_OK` is enabled and a reduction occurs, - the inner loops may become smaller depending - on the structure of the reduction. - - .. c:macro:: NPY_ITER_GROWINNER - - When buffering is enabled, this allows the size of the inner - loop to grow when buffering isn't necessary. This option - is best used if you're doing a straight pass through all the - data, rather than anything with small cache-friendly arrays - of temporary values for each inner loop. - - .. c:macro:: NPY_ITER_DELAY_BUFALLOC - - When buffering is enabled, this delays allocation of the - buffers until :c:func:`NpyIter_Reset` or another reset function is - called. This flag exists to avoid wasteful copying of - buffer data when making multiple copies of a buffered - iterator for multi-threaded iteration. 
- - Another use of this flag is for setting up reduction operations. - After the iterator is created, and a reduction output - is allocated automatically by the iterator (be sure to use - READWRITE access), its value may be initialized to the reduction - unit. Use :c:func:`NpyIter_GetOperandArray` to get the object. - Then, call :c:func:`NpyIter_Reset` to allocate and fill the buffers - with their initial values. - - .. c:macro:: NPY_ITER_COPY_IF_OVERLAP - - If any write operand has overlap with any read operand, eliminate all - overlap by making temporary copies (enabling UPDATEIFCOPY for write - operands, if necessary). A pair of operands has overlap if there is - a memory address that contains data common to both arrays. - - Because exact overlap detection has exponential runtime - in the number of dimensions, the decision is made based - on heuristics, which has false positives (needless copies in unusual - cases) but has no false negatives. - - If any read/write overlap exists, this flag ensures the result of the - operation is the same as if all operands were copied. - In cases where copies would need to be made, **the result of the - computation may be undefined without this flag!** + Causes the iterator to track a multi-index. + This prevents the iterator from coalescing axes to + produce bigger inner loops. If the loop is also not buffered + and no index is being tracked (`NpyIter_RemoveAxis` can be called), + then the iterator size can be ``-1`` to indicate that the iterator + is too large. This can happen due to complex broadcasting and + will result in errors being created when the setting the iterator + range, removing the multi index, or getting the next function. + However, it is possible to remove axes again and use the iterator + normally if the size is small enough after removal. + +.. c:macro:: NPY_ITER_EXTERNAL_LOOP + + Causes the iterator to skip iteration of the innermost + loop, requiring the user of the iterator to handle it. + + This flag is incompatible with :c:data:`NPY_ITER_C_INDEX`, + :c:data:`NPY_ITER_F_INDEX`, and :c:data:`NPY_ITER_MULTI_INDEX`. + +.. c:macro:: NPY_ITER_DONT_NEGATE_STRIDES + + This only affects the iterator when :c:type:`NPY_KEEPORDER` is + specified for the order parameter. By default with + :c:type:`NPY_KEEPORDER`, the iterator reverses axes which have + negative strides, so that memory is traversed in a forward + direction. This disables this step. Use this flag if you + want to use the underlying memory-ordering of the axes, + but don't want an axis reversed. This is the behavior of + ``numpy.ravel(a, order='K')``, for instance. + +.. c:macro:: NPY_ITER_COMMON_DTYPE + + Causes the iterator to convert all the operands to a common + data type, calculated based on the ufunc type promotion rules. + Copying or buffering must be enabled. + + If the common data type is known ahead of time, don't use this + flag. Instead, set the requested dtype for all the operands. + +.. c:macro:: NPY_ITER_REFS_OK + + Indicates that arrays with reference types (object + arrays or structured arrays containing an object type) + may be accepted and used in the iterator. If this flag + is enabled, the caller must be sure to check whether + :c:expr:`NpyIter_IterationNeedsAPI(iter)` is true, in which case + it may not release the GIL during iteration. + +.. c:macro:: NPY_ITER_ZEROSIZE_OK + + Indicates that arrays with a size of zero should be permitted. 
+ Since the typical iteration loop does not naturally work with + zero-sized arrays, you must check that the IterSize is larger + than zero before entering the iteration loop. + Currently only the operands are checked, not a forced shape. + +.. c:macro:: NPY_ITER_REDUCE_OK + + Permits writeable operands with a dimension with zero + stride and size greater than one. Note that such operands + must be read/write. + + When buffering is enabled, this also switches to a special + buffering mode which reduces the loop length as necessary to + not trample on values being reduced. + + Note that if you want to do a reduction on an automatically + allocated output, you must use :c:func:`NpyIter_GetOperandArray` + to get its reference, then set every value to the reduction + unit before doing the iteration loop. In the case of a + buffered reduction, this means you must also specify the + flag :c:data:`NPY_ITER_DELAY_BUFALLOC`, then reset the iterator + after initializing the allocated operand to prepare the + buffers. + +.. c:macro:: NPY_ITER_RANGED + + Enables support for iteration of sub-ranges of the full + ``iterindex`` range ``[0, NpyIter_IterSize(iter))``. Use + the function :c:func:`NpyIter_ResetToIterIndexRange` to specify + a range for iteration. + + This flag can only be used with :c:data:`NPY_ITER_EXTERNAL_LOOP` + when :c:data:`NPY_ITER_BUFFERED` is enabled. This is because + without buffering, the inner loop is always the size of the + innermost iteration dimension, and allowing it to get cut up + would require special handling, effectively making it more + like the buffered version. + +.. c:macro:: NPY_ITER_BUFFERED + + Causes the iterator to store buffering data, and use buffering + to satisfy data type, alignment, and byte-order requirements. + To buffer an operand, do not specify the :c:data:`NPY_ITER_COPY` + or :c:data:`NPY_ITER_UPDATEIFCOPY` flags, because they will + override buffering. Buffering is especially useful for Python + code using the iterator, allowing for larger chunks + of data at once to amortize the Python interpreter overhead. + + If used with :c:data:`NPY_ITER_EXTERNAL_LOOP`, the inner loop + for the caller may get larger chunks than would be possible + without buffering, because of how the strides are laid out. + + Note that if an operand is given the flag :c:data:`NPY_ITER_COPY` + or :c:data:`NPY_ITER_UPDATEIFCOPY`, a copy will be made in preference + to buffering. Buffering will still occur when the array was + broadcast so elements need to be duplicated to get a constant + stride. + + In normal buffering, the size of each inner loop is equal + to the buffer size, or possibly larger if + :c:data:`NPY_ITER_GROWINNER` is specified. If + :c:data:`NPY_ITER_REDUCE_OK` is enabled and a reduction occurs, + the inner loops may become smaller depending + on the structure of the reduction. + +.. c:macro:: NPY_ITER_GROWINNER + + When buffering is enabled, this allows the size of the inner + loop to grow when buffering isn't necessary. This option + is best used if you're doing a straight pass through all the + data, rather than anything with small cache-friendly arrays + of temporary values for each inner loop. + +.. c:macro:: NPY_ITER_DELAY_BUFALLOC + + When buffering is enabled, this delays allocation of the + buffers until :c:func:`NpyIter_Reset` or another reset function is + called. This flag exists to avoid wasteful copying of + buffer data when making multiple copies of a buffered + iterator for multi-threaded iteration. 
+ + Another use of this flag is for setting up reduction operations. + After the iterator is created, and a reduction output + is allocated automatically by the iterator (be sure to use + READWRITE access), its value may be initialized to the reduction + unit. Use :c:func:`NpyIter_GetOperandArray` to get the object. + Then, call :c:func:`NpyIter_Reset` to allocate and fill the buffers + with their initial values. + +.. c:macro:: NPY_ITER_COPY_IF_OVERLAP + + If any write operand has overlap with any read operand, eliminate all + overlap by making temporary copies (enabling UPDATEIFCOPY for write + operands, if necessary). A pair of operands has overlap if there is + a memory address that contains data common to both arrays. + + Because exact overlap detection has exponential runtime + in the number of dimensions, the decision is made based + on heuristics, which has false positives (needless copies in unusual + cases) but has no false negatives. + + If any read/write overlap exists, this flag ensures the result of the + operation is the same as if all operands were copied. + In cases where copies would need to be made, **the result of the + computation may be undefined without this flag!** Flags that may be passed in ``op_flags[i]``, where ``0 <= i < nop``: +.. + dedent the enumeration of flags to avoid missing references sphinx warnings + +.. c:macro:: NPY_ITER_READWRITE +.. c:macro:: NPY_ITER_READONLY +.. c:macro:: NPY_ITER_WRITEONLY + + Indicate how the user of the iterator will read or write + to ``op[i]``. Exactly one of these flags must be specified + per operand. Using ``NPY_ITER_READWRITE`` or ``NPY_ITER_WRITEONLY`` + for a user-provided operand may trigger `WRITEBACKIFCOPY`` + semantics. The data will be written back to the original array + when ``NpyIter_Deallocate`` is called. + +.. c:macro:: NPY_ITER_COPY + + Allow a copy of ``op[i]`` to be made if it does not + meet the data type or alignment requirements as specified + by the constructor flags and parameters. + +.. c:macro:: NPY_ITER_UPDATEIFCOPY - .. c:macro:: NPY_ITER_READWRITE - .. c:macro:: NPY_ITER_READONLY - .. c:macro:: NPY_ITER_WRITEONLY + Triggers :c:data:`NPY_ITER_COPY`, and when an array operand + is flagged for writing and is copied, causes the data + in a copy to be copied back to ``op[i]`` when + ``NpyIter_Deallocate`` is called. - Indicate how the user of the iterator will read or write - to ``op[i]``. Exactly one of these flags must be specified - per operand. Using ``NPY_ITER_READWRITE`` or ``NPY_ITER_WRITEONLY`` - for a user-provided operand may trigger `WRITEBACKIFCOPY`` - semantics. The data will be written back to the original array - when ``NpyIter_Deallocate`` is called. - - .. c:macro:: NPY_ITER_COPY + If the operand is flagged as write-only and a copy is needed, + an uninitialized temporary array will be created and then copied + to back to ``op[i]`` on calling ``NpyIter_Deallocate``, instead of + doing the unnecessary copy operation. - Allow a copy of ``op[i]`` to be made if it does not - meet the data type or alignment requirements as specified - by the constructor flags and parameters. +.. c:macro:: NPY_ITER_NBO +.. c:macro:: NPY_ITER_ALIGNED +.. c:macro:: NPY_ITER_CONTIG - .. c:macro:: NPY_ITER_UPDATEIFCOPY - - Triggers :c:data:`NPY_ITER_COPY`, and when an array operand - is flagged for writing and is copied, causes the data - in a copy to be copied back to ``op[i]`` when - ``NpyIter_Deallocate`` is called. 
- - If the operand is flagged as write-only and a copy is needed, - an uninitialized temporary array will be created and then copied - to back to ``op[i]`` on calling ``NpyIter_Deallocate``, instead of - doing the unnecessary copy operation. - - .. c:macro:: NPY_ITER_NBO - .. c:macro:: NPY_ITER_ALIGNED - .. c:macro:: NPY_ITER_CONTIG - - Causes the iterator to provide data for ``op[i]`` - that is in native byte order, aligned according to - the dtype requirements, contiguous, or any combination. - - By default, the iterator produces pointers into the - arrays provided, which may be aligned or unaligned, and - with any byte order. If copying or buffering is not - enabled and the operand data doesn't satisfy the constraints, - an error will be raised. + Causes the iterator to provide data for ``op[i]`` + that is in native byte order, aligned according to + the dtype requirements, contiguous, or any combination. - The contiguous constraint applies only to the inner loop, - successive inner loops may have arbitrary pointer changes. + By default, the iterator produces pointers into the + arrays provided, which may be aligned or unaligned, and + with any byte order. If copying or buffering is not + enabled and the operand data doesn't satisfy the constraints, + an error will be raised. - If the requested data type is in non-native byte order, - the NBO flag overrides it and the requested data type is - converted to be in native byte order. + The contiguous constraint applies only to the inner loop, + successive inner loops may have arbitrary pointer changes. - .. c:macro:: NPY_ITER_ALLOCATE + If the requested data type is in non-native byte order, + the NBO flag overrides it and the requested data type is + converted to be in native byte order. - This is for output arrays, and requires that the flag - :c:data:`NPY_ITER_WRITEONLY` or :c:data:`NPY_ITER_READWRITE` - be set. If ``op[i]`` is NULL, creates a new array with - the final broadcast dimensions, and a layout matching - the iteration order of the iterator. +.. c:macro:: NPY_ITER_ALLOCATE - When ``op[i]`` is NULL, the requested data type - ``op_dtypes[i]`` may be NULL as well, in which case it is - automatically generated from the dtypes of the arrays which - are flagged as readable. The rules for generating the dtype - are the same is for UFuncs. Of special note is handling - of byte order in the selected dtype. If there is exactly - one input, the input's dtype is used as is. Otherwise, - if more than one input dtypes are combined together, the - output will be in native byte order. + This is for output arrays, and requires that the flag + :c:data:`NPY_ITER_WRITEONLY` or :c:data:`NPY_ITER_READWRITE` + be set. If ``op[i]`` is NULL, creates a new array with + the final broadcast dimensions, and a layout matching + the iteration order of the iterator. + + When ``op[i]`` is NULL, the requested data type + ``op_dtypes[i]`` may be NULL as well, in which case it is + automatically generated from the dtypes of the arrays which + are flagged as readable. The rules for generating the dtype + are the same is for UFuncs. Of special note is handling + of byte order in the selected dtype. If there is exactly + one input, the input's dtype is used as is. Otherwise, + if more than one input dtypes are combined together, the + output will be in native byte order. + + After being allocated with this flag, the caller may retrieve + the new array by calling :c:func:`NpyIter_GetOperandArray` and + getting the i-th object in the returned C array. 
The caller + must call Py_INCREF on it to claim a reference to the array. - After being allocated with this flag, the caller may retrieve - the new array by calling :c:func:`NpyIter_GetOperandArray` and - getting the i-th object in the returned C array. The caller - must call Py_INCREF on it to claim a reference to the array. +.. c:macro:: NPY_ITER_NO_SUBTYPE - .. c:macro:: NPY_ITER_NO_SUBTYPE + For use with :c:data:`NPY_ITER_ALLOCATE`, this flag disables + allocating an array subtype for the output, forcing + it to be a straight ndarray. - For use with :c:data:`NPY_ITER_ALLOCATE`, this flag disables - allocating an array subtype for the output, forcing - it to be a straight ndarray. + TODO: Maybe it would be better to introduce a function + ``NpyIter_GetWrappedOutput`` and remove this flag? - TODO: Maybe it would be better to introduce a function - ``NpyIter_GetWrappedOutput`` and remove this flag? +.. c:macro:: NPY_ITER_NO_BROADCAST - .. c:macro:: NPY_ITER_NO_BROADCAST + Ensures that the input or output matches the iteration + dimensions exactly. - Ensures that the input or output matches the iteration - dimensions exactly. +.. c:macro:: NPY_ITER_ARRAYMASK - .. c:macro:: NPY_ITER_ARRAYMASK + .. versionadded:: 1.7 + + Indicates that this operand is the mask to use for + selecting elements when writing to operands which have + the :c:data:`NPY_ITER_WRITEMASKED` flag applied to them. + Only one operand may have :c:data:`NPY_ITER_ARRAYMASK` flag + applied to it. - .. versionadded:: 1.7 + The data type of an operand with this flag should be either + :c:data:`NPY_BOOL`, :c:data:`NPY_MASK`, or a struct dtype + whose fields are all valid mask dtypes. In the latter case, + it must match up with a struct operand being WRITEMASKED, + as it is specifying a mask for each field of that array. - Indicates that this operand is the mask to use for - selecting elements when writing to operands which have - the :c:data:`NPY_ITER_WRITEMASKED` flag applied to them. - Only one operand may have :c:data:`NPY_ITER_ARRAYMASK` flag - applied to it. + This flag only affects writing from the buffer back to + the array. This means that if the operand is also + :c:data:`NPY_ITER_READWRITE` or :c:data:`NPY_ITER_WRITEONLY`, + code doing iteration can write to this operand to + control which elements will be untouched and which ones will be + modified. This is useful when the mask should be a combination + of input masks. - The data type of an operand with this flag should be either - :c:data:`NPY_BOOL`, :c:data:`NPY_MASK`, or a struct dtype - whose fields are all valid mask dtypes. In the latter case, - it must match up with a struct operand being WRITEMASKED, - as it is specifying a mask for each field of that array. +.. c:macro:: NPY_ITER_WRITEMASKED - This flag only affects writing from the buffer back to - the array. This means that if the operand is also - :c:data:`NPY_ITER_READWRITE` or :c:data:`NPY_ITER_WRITEONLY`, - code doing iteration can write to this operand to - control which elements will be untouched and which ones will be - modified. This is useful when the mask should be a combination - of input masks. + .. versionadded:: 1.7 - .. c:macro:: NPY_ITER_WRITEMASKED + This array is the mask for all `writemasked <numpy.nditer>` + operands. Code uses the ``writemasked`` flag which indicates + that only elements where the chosen ARRAYMASK operand is True + will be written to. In general, the iterator does not enforce + this, it is up to the code doing the iteration to follow that + promise. - .. 
versionadded:: 1.7 + When ``writemasked`` flag is used, and this operand is buffered, + this changes how data is copied from the buffer into the array. + A masked copying routine is used, which only copies the + elements in the buffer for which ``writemasked`` + returns true from the corresponding element in the ARRAYMASK + operand. - This array is the mask for all `writemasked <numpy.nditer>` - operands. Code uses the ``writemasked`` flag which indicates - that only elements where the chosen ARRAYMASK operand is True - will be written to. In general, the iterator does not enforce - this, it is up to the code doing the iteration to follow that - promise. - - When ``writemasked`` flag is used, and this operand is buffered, - this changes how data is copied from the buffer into the array. - A masked copying routine is used, which only copies the - elements in the buffer for which ``writemasked`` - returns true from the corresponding element in the ARRAYMASK - operand. +.. c:macro:: NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE - .. c:macro:: NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE + In memory overlap checks, assume that operands with + ``NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE`` enabled are accessed only + in the iterator order. - In memory overlap checks, assume that operands with - ``NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE`` enabled are accessed only - in the iterator order. + This enables the iterator to reason about data dependency, + possibly avoiding unnecessary copies. - This enables the iterator to reason about data dependency, - possibly avoiding unnecessary copies. - - This flag has effect only if ``NPY_ITER_COPY_IF_OVERLAP`` is enabled - on the iterator. + This flag has effect only if ``NPY_ITER_COPY_IF_OVERLAP`` is enabled + on the iterator. .. c:function:: NpyIter* NpyIter_AdvancedNew( \ npy_intp nop, PyArrayObject** op, npy_uint32 flags, NPY_ORDER order, \ @@ -738,7 +742,7 @@ Construction and Destruction the iterator. Any cached functions or pointers from the iterator must be retrieved again! - After calling this function, :c:func:`NpyIter_HasMultiIndex(iter)` will + After calling this function, :c:expr:`NpyIter_HasMultiIndex(iter)` will return false. Returns ``NPY_SUCCEED`` or ``NPY_FAIL``. diff --git a/doc/source/reference/c-api/types-and-structures.rst b/doc/source/reference/c-api/types-and-structures.rst index 763f985a6..54a1e09e1 100644 --- a/doc/source/reference/c-api/types-and-structures.rst +++ b/doc/source/reference/c-api/types-and-structures.rst @@ -7,7 +7,7 @@ Python Types and C-Structures Several new types are defined in the C-code. Most of these are accessible from Python, but a few are not exposed due to their limited -use. Every new Python type has an associated :c:type:`PyObject *<PyObject>` with an +use. Every new Python type has an associated :c:expr:`PyObject *` with an internal structure that includes a pointer to a "method table" that defines how the new object behaves in Python. When you receive a Python object into C code, you always get a pointer to a @@ -61,7 +61,7 @@ hierarchy of actual Python types. PyArray_Type and PyArrayObject ------------------------------ -.. c:var:: PyArray_Type +.. c:var:: PyTypeObject PyArray_Type The Python type of the ndarray is :c:data:`PyArray_Type`. In C, every ndarray is a pointer to a :c:type:`PyArrayObject` structure. The ob_type @@ -201,7 +201,7 @@ PyArray_Type and PyArrayObject PyArrayDescr_Type and PyArray_Descr ----------------------------------- -.. c:var:: PyArrayDescr_Type +.. 
c:var:: PyTypeObject PyArrayDescr_Type The :c:data:`PyArrayDescr_Type` is the built-in type of the data-type-descriptor objects used to describe how the bytes comprising @@ -636,7 +636,7 @@ PyArrayDescr_Type and PyArray_Descr Either ``NULL`` or a dictionary containing low-level casting functions for user- defined data-types. Each function is - wrapped in a :c:type:`PyCapsule *<PyCapsule>` and keyed by + wrapped in a :c:expr:`PyCapsule *` and keyed by the data-type number. .. c:member:: NPY_SCALARKIND scalarkind(PyArrayObject* arr) @@ -754,7 +754,7 @@ The :c:data:`PyArray_Type` can also be sub-typed. PyUFunc_Type and PyUFuncObject ------------------------------ -.. c:var:: PyUFunc_Type +.. c:var:: PyTypeObject PyUFunc_Type The ufunc object is implemented by creation of the :c:data:`PyUFunc_Type`. It is a very simple type that implements only @@ -811,13 +811,14 @@ PyUFunc_Type and PyUFuncObject char *core_signature; PyUFunc_TypeResolutionFunc *type_resolver; PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector; - PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector; + void *reserved2; npy_uint32 *op_flags; npy_uint32 *iter_flags; /* new in API version 0x0000000D */ npy_intp *core_dim_sizes; npy_uint32 *core_dim_flags; PyObject *identity_value; + /* Further private slots (size depends on the NumPy version) */ } PyUFuncObject; .. c:macro: PyObject_HEAD @@ -957,18 +958,17 @@ PyUFunc_Type and PyUFuncObject .. c:member:: PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector - A function which returns an inner loop. The ``legacy`` in the name arises - because for NumPy 1.6 a better variant had been planned. This variant - has not yet come about. + .. deprecated:: 1.22 + + Some fallback support for this slot exists, but will be removed + eventually. A univiersal function which relied on this will have + eventually have to be ported. + See ref:`NEP 41 <NEP41>` and ref:`NEP 43 <NEP43>` .. c:member:: void *reserved2 For a possible future loop selector with a different signature. - .. c:member:: PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector - - Function which returns a masked inner loop for the ufunc - .. c:member:: npy_uint32 op_flags Override the default operand flags for each ufunc operand. @@ -1006,7 +1006,7 @@ PyUFunc_Type and PyUFuncObject PyArrayIter_Type and PyArrayIterObject -------------------------------------- -.. c:var:: PyArrayIter_Type +.. c:var:: PyTypeObject PyArrayIter_Type This is an iterator object that makes it easy to loop over an N-dimensional array. It is the object returned from the flat @@ -1110,13 +1110,13 @@ the internal structure of the iterator object, and merely interact with it through the use of the macros :c:func:`PyArray_ITER_NEXT` (it), :c:func:`PyArray_ITER_GOTO` (it, dest), or :c:func:`PyArray_ITER_GOTO1D` (it, index). All of these macros require the argument *it* to be a -:c:type:`PyArrayIterObject *`. +:c:expr:`PyArrayIterObject *`. PyArrayMultiIter_Type and PyArrayMultiIterObject ------------------------------------------------ -.. c:var:: PyArrayMultiIter_Type +.. c:var:: PyTypeObject PyArrayMultiIter_Type This type provides an iterator that encapsulates the concept of broadcasting. It allows :math:`N` arrays to be broadcast together @@ -1178,7 +1178,7 @@ PyArrayMultiIter_Type and PyArrayMultiIterObject PyArrayNeighborhoodIter_Type and PyArrayNeighborhoodIterObject -------------------------------------------------------------- -.. c:var:: PyArrayNeighborhoodIter_Type +.. 
c:var:: PyTypeObject PyArrayNeighborhoodIter_Type This is an iterator object that makes it easy to loop over an N-dimensional neighborhood. @@ -1217,7 +1217,7 @@ PyArrayNeighborhoodIter_Type and PyArrayNeighborhoodIterObject PyArrayFlags_Type and PyArrayFlagsObject ---------------------------------------- -.. c:var:: PyArrayFlags_Type +.. c:var:: PyTypeObject PyArrayFlags_Type When the flags attribute is retrieved from Python, a special builtin object of this type is constructed. This special type makes @@ -1466,7 +1466,7 @@ for completeness and assistance in understanding the code. to define a 1-d loop for a ufunc for every defined signature of a user-defined data-type. -.. c:var:: PyArrayMapIter_Type +.. c:var:: PyTypeObject PyArrayMapIter_Type Advanced indexing is handled with this Python type. It is simply a loose wrapper around the C-structure containing the variables diff --git a/doc/source/reference/c-api/ufunc.rst b/doc/source/reference/c-api/ufunc.rst index 9eb70c3fb..95dc47839 100644 --- a/doc/source/reference/c-api/ufunc.rst +++ b/doc/source/reference/c-api/ufunc.rst @@ -283,20 +283,6 @@ Functions signature is an array of data-type numbers indicating the inputs followed by the outputs assumed by the 1-d loop. -.. c:function:: int PyUFunc_GenericFunction( \ - PyUFuncObject* self, PyObject* args, PyObject* kwds, PyArrayObject** mps) - - .. deprecated:: NumPy 1.19 - - Unless NumPy is made aware of an issue with this, this function - is scheduled for rapid removal without replacement. - - Instead of this function ``PyObject_Call(ufunc, args, kwds)`` should be - used. The above function differs from this because it ignores support - for non-array, or array subclasses as inputs. - To ensure identical behaviour, it may be necessary to convert all inputs - using ``PyArray_FromAny(obj, NULL, 0, 0, NPY_ARRAY_ENSUREARRAY, NULL)``. - .. c:function:: int PyUFunc_checkfperr(int errmask, PyObject* errobj) A simple interface to the IEEE error-flag checking support. The diff --git a/doc/source/reference/global_state.rst b/doc/source/reference/global_state.rst index b59467210..f18481235 100644 --- a/doc/source/reference/global_state.rst +++ b/doc/source/reference/global_state.rst @@ -84,17 +84,3 @@ contiguous in memory. Most users will have no reason to change these; for details see the :ref:`memory layout <memory-layout>` documentation. -Using the new casting implementation ------------------------------------- - -Within NumPy 1.20 it is possible to enable the new experimental casting -implementation for testing purposes. To do this set:: - - NPY_USE_NEW_CASTINGIMPL=1 - -Setting the flag is only useful to aid with NumPy developement to ensure the -new version is bug free and should be avoided for production code. -It is a helpful test for projects that either create custom datatypes or -use for example complicated structured dtypes. The flag is expected to be -removed in 1.21 with the new version being always in use. - diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst index 6eb74cd77..f12d923df 100644 --- a/doc/source/reference/index.rst +++ b/doc/source/reference/index.rst @@ -1,5 +1,7 @@ .. _reference: +.. module:: numpy + ############### NumPy Reference ############### @@ -7,9 +9,6 @@ NumPy Reference :Release: |version| :Date: |today| - -.. module:: numpy - This reference manual details functions, modules, and objects included in NumPy, describing what they are and what they do. 
For learning how to use NumPy, see the :ref:`complete documentation <numpy_docs_mainpage>`. diff --git a/doc/source/reference/random/bit_generators/index.rst b/doc/source/reference/random/bit_generators/index.rst index 315657172..c5c349806 100644 --- a/doc/source/reference/random/bit_generators/index.rst +++ b/doc/source/reference/random/bit_generators/index.rst @@ -15,10 +15,13 @@ Supported BitGenerators The included BitGenerators are: -* PCG-64 - The default. A fast generator that supports many parallel streams - and can be advanced by an arbitrary amount. See the documentation for - :meth:`~.PCG64.advance`. PCG-64 has a period of :math:`2^{128}`. See the `PCG - author's page`_ for more details about this class of PRNG. +* PCG-64 - The default. A fast generator that can be advanced by an arbitrary + amount. See the documentation for :meth:`~.PCG64.advance`. PCG-64 has + a period of :math:`2^{128}`. See the `PCG author's page`_ for more details + about this class of PRNG. +* PCG-64 DXSM - An upgraded version of PCG-64 with better statistical + properties in parallel contexts. See :ref:`upgrading-pcg64` for more + information on these improvements. * MT19937 - The standard Python BitGenerator. Adds a `MT19937.jumped` function that returns a new generator with state as-if :math:`2^{128}` draws have been made. @@ -43,6 +46,7 @@ The included BitGenerators are: MT19937 <mt19937> PCG64 <pcg64> + PCG64DXSM <pcg64dxsm> Philox <philox> SFC64 <sfc64> @@ -105,6 +109,44 @@ If you need to generate a good seed "offline", then ``SeedSequence().entropy`` or using ``secrets.randbits(128)`` from the standard library are both convenient ways. +If you need to run several stochastic simulations in parallel, best practice +is to construct a random generator instance for each simulation. +To make sure that the random streams have distinct initial states, you can use +the `spawn` method of `~SeedSequence`. For instance, here we construct a list +of 12 instances: + +.. code-block:: python + + from numpy.random import PCG64, SeedSequence + + # High quality initial entropy + entropy = 0x87351080e25cb0fad77a44a3be03b491 + base_seq = SeedSequence(entropy) + child_seqs = base_seq.spawn(12) # a list of 12 SeedSequences + generators = [PCG64(seq) for seq in child_seqs] + +.. end_block + + +An alternative way is to use the fact that a `~SeedSequence` can be initialized +by a tuple of elements. Here we use a base entropy value and an integer +``worker_id`` + +.. code-block:: python + + from numpy.random import PCG64, SeedSequence + + # High quality initial entropy + entropy = 0x87351080e25cb0fad77a44a3be03b491 + sequences = [SeedSequence((entropy, worker_id)) for worker_id in range(12)] + generators = [PCG64(seq) for seq in sequences] + +.. end_block + +Note that the sequences produced by the latter method will be distinct from +those constructed via `~SeedSequence.spawn`. + + .. autosummary:: :toctree: generated/ diff --git a/doc/source/reference/random/bit_generators/mt19937.rst b/doc/source/reference/random/bit_generators/mt19937.rst index 71875db4e..d05ea7c6f 100644 --- a/doc/source/reference/random/bit_generators/mt19937.rst +++ b/doc/source/reference/random/bit_generators/mt19937.rst @@ -4,7 +4,8 @@ Mersenne Twister (MT19937) .. currentmodule:: numpy.random .. 
autoclass:: MT19937
-   :exclude-members:
+   :members: __init__
+   :exclude-members: __init__

State
=====
diff --git a/doc/source/reference/random/bit_generators/pcg64.rst b/doc/source/reference/random/bit_generators/pcg64.rst
index edac4620b..889965f77 100644
--- a/doc/source/reference/random/bit_generators/pcg64.rst
+++ b/doc/source/reference/random/bit_generators/pcg64.rst
@@ -4,7 +4,8 @@ Permuted Congruential Generator (64-bit, PCG64)
.. currentmodule:: numpy.random

.. autoclass:: PCG64
-   :exclude-members:
+   :members: __init__
+   :exclude-members: __init__

State
=====
diff --git a/doc/source/reference/random/bit_generators/pcg64dxsm.rst b/doc/source/reference/random/bit_generators/pcg64dxsm.rst
new file mode 100644
index 000000000..e37efa5d3
--- /dev/null
+++ b/doc/source/reference/random/bit_generators/pcg64dxsm.rst
@@ -0,0 +1,32 @@
+Permuted Congruential Generator (64-bit, PCG64 DXSM)
+----------------------------------------------------
+
+.. currentmodule:: numpy.random
+
+.. autoclass:: PCG64DXSM
+   :members: __init__
+   :exclude-members: __init__
+
+State
+=====
+
+.. autosummary::
+   :toctree: generated/
+
+   ~PCG64DXSM.state
+
+Parallel generation
+===================
+.. autosummary::
+   :toctree: generated/
+
+   ~PCG64DXSM.advance
+   ~PCG64DXSM.jumped
+
+Extending
+=========
+.. autosummary::
+   :toctree: generated/
+
+   ~PCG64DXSM.cffi
+   ~PCG64DXSM.ctypes
diff --git a/doc/source/reference/random/bit_generators/philox.rst b/doc/source/reference/random/bit_generators/philox.rst
index 8eba2d351..3c2fa4cc5 100644
--- a/doc/source/reference/random/bit_generators/philox.rst
+++ b/doc/source/reference/random/bit_generators/philox.rst
@@ -4,7 +4,8 @@ Philox Counter-based RNG
.. currentmodule:: numpy.random

.. autoclass:: Philox
-   :exclude-members:
+   :members: __init__
+   :exclude-members: __init__

State
=====
diff --git a/doc/source/reference/random/bit_generators/sfc64.rst b/doc/source/reference/random/bit_generators/sfc64.rst
index d34124a33..8cb255bc1 100644
--- a/doc/source/reference/random/bit_generators/sfc64.rst
+++ b/doc/source/reference/random/bit_generators/sfc64.rst
@@ -4,7 +4,8 @@ SFC64 Small Fast Chaotic PRNG
.. currentmodule:: numpy.random

.. autoclass:: SFC64
-   :exclude-members:
+   :members: __init__
+   :exclude-members: __init__

State
=====
diff --git a/doc/source/reference/random/c-api.rst b/doc/source/reference/random/c-api.rst
index a79da7a49..de403ce98 100644
--- a/doc/source/reference/random/c-api.rst
+++ b/doc/source/reference/random/c-api.rst
@@ -3,6 +3,8 @@ C API for random

.. currentmodule:: numpy.random

+.. versionadded:: 1.19.0
+
Access to various distributions below is available via Cython or C-wrapper
libraries like CFFI. All the functions accept a :c:type:`bitgen_t` as their
first argument. To access these from Cython or C, you must link with the
@@ -40,9 +42,9 @@ The functions are named with the following conventions:
- The functions without "standard" in their name require additional parameters
  to describe the distributions.

-- ``zig`` in the name are based on a ziggurat lookup algorithm is used instead
-  of calculating the ``log``, which is significantly faster. The non-ziggurat
-  variants are used in corner cases and for legacy compatibility.
+- Functions with ``inv`` in their name are based on the slower inverse method
+  rather than the significantly faster ziggurat lookup algorithm. The
+  non-ziggurat variants are used in corner cases and for legacy compatibility.

.. 
c:function:: double random_standard_uniform(bitgen_t *bitgen_state) @@ -53,6 +55,8 @@ The functions are named with the following conventions: .. c:function:: void random_standard_exponential_fill(bitgen_t *bitgen_state, npy_intp cnt, double *out) +.. c:function:: void random_standard_exponential_inv_fill(bitgen_t *bitgen_state, npy_intp cnt, double *out) + .. c:function:: double random_standard_normal(bitgen_t* bitgen_state) .. c:function:: void random_standard_normal_fill(bitgen_t *bitgen_state, npy_intp count, double *out) @@ -69,6 +73,8 @@ The functions are named with the following conventions: .. c:function:: void random_standard_exponential_fill_f(bitgen_t *bitgen_state, npy_intp cnt, float *out) +.. c:function:: void random_standard_exponential_inv_fill_f(bitgen_t *bitgen_state, npy_intp cnt, float *out) + .. c:function:: float random_standard_normal_f(bitgen_t* bitgen_state) .. c:function:: float random_standard_gamma_f(bitgen_t *bitgen_state, float shape) diff --git a/doc/source/reference/random/generator.rst b/doc/source/reference/random/generator.rst index 8706e1de2..7934be98a 100644 --- a/doc/source/reference/random/generator.rst +++ b/doc/source/reference/random/generator.rst @@ -15,7 +15,8 @@ can be changed by passing an instantized BitGenerator to ``Generator``. .. autofunction:: default_rng .. autoclass:: Generator - :exclude-members: + :members: __init__ + :exclude-members: __init__ Accessing the BitGenerator ========================== @@ -70,13 +71,13 @@ By default, `Generator.permuted` returns a copy. To operate in-place with `Generator.permuted`, pass the same array as the first argument *and* as the value of the ``out`` parameter. For example, - >>> rg = np.random.default_rng() + >>> rng = np.random.default_rng() >>> x = np.arange(0, 15).reshape(3, 5) >>> x array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]) - >>> y = rg.permuted(x, axis=1, out=x) + >>> y = rng.permuted(x, axis=1, out=x) >>> x array([[ 1, 0, 2, 4, 3], # random [ 6, 7, 8, 9, 5], @@ -96,13 +97,13 @@ which dimension of the input array to use as the sequence. In the case of a two-dimensional array, ``axis=0`` will, in effect, rearrange the rows of the array, and ``axis=1`` will rearrange the columns. For example - >>> rg = np.random.default_rng() + >>> rng = np.random.default_rng() >>> x = np.arange(0, 15).reshape(3, 5) >>> x array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]) - >>> rg.permutation(x, axis=1) + >>> rng.permutation(x, axis=1) array([[ 1, 3, 2, 0, 4], # random [ 6, 8, 7, 5, 9], [11, 13, 12, 10, 14]]) @@ -115,7 +116,7 @@ how `numpy.sort` treats it. Each slice along the given axis is shuffled independently of the others. Compare the following example of the use of `Generator.permuted` to the above example of `Generator.permutation`: - >>> rg.permuted(x, axis=1) + >>> rng.permuted(x, axis=1) array([[ 1, 0, 2, 4, 3], # random [ 5, 7, 6, 9, 8], [10, 14, 12, 13, 11]]) @@ -130,9 +131,9 @@ Shuffling non-NumPy sequences a sequence that is not a NumPy array, it shuffles that sequence in-place. 
For example, - >>> rg = np.random.default_rng() + >>> rng = np.random.default_rng() >>> a = ['A', 'B', 'C', 'D', 'E'] - >>> rg.shuffle(a) # shuffle the list in-place + >>> rng.shuffle(a) # shuffle the list in-place >>> a ['B', 'D', 'A', 'E', 'C'] # random diff --git a/doc/source/reference/random/index.rst b/doc/source/reference/random/index.rst index 13ce7c40c..96cd47017 100644 --- a/doc/source/reference/random/index.rst +++ b/doc/source/reference/random/index.rst @@ -25,7 +25,7 @@ nep-0019-rng-policy.html>`_ for context on the updated random Numpy number routines. The legacy `RandomState` random number routines are still available, but limited to a single BitGenerator. See :ref:`new-or-different` for a complete list of improvements and differences from the legacy -``Randomstate``. +``RandomState``. For convenience and backward compatibility, a single `RandomState` instance's methods are imported into the numpy.random namespace, see @@ -84,10 +84,10 @@ different .. code-block:: python try: - rg_integers = rg.integers + rng_integers = rng.integers except AttributeError: - rg_integers = rg.randint - a = rg_integers(1000) + rng_integers = rng.randint + a = rng_integers(1000) Seeds can be passed to any of the BitGenerators. The provided value is mixed via `SeedSequence` to spread a possible sequence of seeds across a wider @@ -97,8 +97,8 @@ is wrapped with a `Generator`. .. code-block:: python from numpy.random import Generator, PCG64 - rg = Generator(PCG64(12345)) - rg.standard_normal() + rng = Generator(PCG64(12345)) + rng.standard_normal() Here we use `default_rng` to create an instance of `Generator` to generate a random float: @@ -146,10 +146,10 @@ As a convenience NumPy provides the `default_rng` function to hide these details: >>> from numpy.random import default_rng ->>> rg = default_rng(12345) ->>> print(rg) +>>> rng = default_rng(12345) +>>> print(rng) Generator(PCG64) ->>> print(rg.random()) +>>> print(rng.random()) 0.22733602246716966 One can also instantiate `Generator` directly with a `BitGenerator` instance. @@ -158,16 +158,16 @@ To use the default `PCG64` bit generator, one can instantiate it directly and pass it to `Generator`: >>> from numpy.random import Generator, PCG64 ->>> rg = Generator(PCG64(12345)) ->>> print(rg) +>>> rng = Generator(PCG64(12345)) +>>> print(rng) Generator(PCG64) Similarly to use the older `MT19937` bit generator (not recommended), one can instantiate it directly and pass it to `Generator`: >>> from numpy.random import Generator, MT19937 ->>> rg = Generator(MT19937(12345)) ->>> print(rg) +>>> rng = Generator(MT19937(12345)) +>>> print(rng) Generator(MT19937) What's New or Different @@ -222,6 +222,9 @@ one of three ways: * :ref:`independent-streams` * :ref:`parallel-jumped` +Users with a very large amount of parallelism will want to consult +:ref:`upgrading-pcg64`. + Concepts -------- .. toctree:: @@ -230,6 +233,7 @@ Concepts generator Legacy Generator (RandomState) <legacy> BitGenerators, SeedSequences <bit_generators/index> + Upgrading PCG64 with PCG64DXSM <upgrading-pcg64> Features -------- diff --git a/doc/source/reference/random/legacy.rst b/doc/source/reference/random/legacy.rst index 6cf4775b8..42437dbb6 100644 --- a/doc/source/reference/random/legacy.rst +++ b/doc/source/reference/random/legacy.rst @@ -48,7 +48,8 @@ using the state of the `RandomState`: .. 
autoclass:: RandomState - :exclude-members: + :members: __init__ + :exclude-members: __init__ Seeding and State ================= diff --git a/doc/source/reference/random/new-or-different.rst b/doc/source/reference/random/new-or-different.rst index 6cab0f729..a81543926 100644 --- a/doc/source/reference/random/new-or-different.rst +++ b/doc/source/reference/random/new-or-different.rst @@ -58,18 +58,18 @@ And in more detail: from numpy.random import Generator, PCG64 import numpy.random - rg = Generator(PCG64()) - %timeit -n 1 rg.standard_normal(100000) + rng = Generator(PCG64()) + %timeit -n 1 rng.standard_normal(100000) %timeit -n 1 numpy.random.standard_normal(100000) .. ipython:: python - %timeit -n 1 rg.standard_exponential(100000) + %timeit -n 1 rng.standard_exponential(100000) %timeit -n 1 numpy.random.standard_exponential(100000) .. ipython:: python - %timeit -n 1 rg.standard_gamma(3.0, 100000) + %timeit -n 1 rng.standard_gamma(3.0, 100000) %timeit -n 1 numpy.random.standard_gamma(3.0, 100000) @@ -94,9 +94,9 @@ And in more detail: .. ipython:: python - rg = Generator(PCG64(0)) - rg.random(3, dtype='d') - rg.random(3, dtype='f') + rng = Generator(PCG64(0)) + rng.random(3, dtype='d') + rng.random(3, dtype='f') * Optional ``out`` argument that allows existing arrays to be filled for select distributions @@ -112,7 +112,7 @@ And in more detail: .. ipython:: python existing = np.zeros(4) - rg.random(out=existing[:2]) + rng.random(out=existing[:2]) print(existing) * Optional ``axis`` argument for methods like `~.Generator.choice`, @@ -121,9 +121,9 @@ And in more detail: .. ipython:: python - rg = Generator(PCG64(123456789)) + rng = Generator(PCG64(123456789)) a = np.arange(12).reshape((3, 4)) a - rg.choice(a, axis=1, size=5) - rg.shuffle(a, axis=1) # Shuffle in-place + rng.choice(a, axis=1, size=5) + rng.shuffle(a, axis=1) # Shuffle in-place a diff --git a/doc/source/reference/random/parallel.rst b/doc/source/reference/random/parallel.rst index 721584014..7f0207bde 100644 --- a/doc/source/reference/random/parallel.rst +++ b/doc/source/reference/random/parallel.rst @@ -88,10 +88,11 @@ territory ([2]_). estimate the naive upper bound on a napkin and take comfort knowing that the probability is actually lower. -.. [2] In this calculation, we can ignore the amount of numbers drawn from each - stream. Each of the PRNGs we provide has some extra protection built in +.. [2] In this calculation, we can mostly ignore the amount of numbers drawn from each + stream. See :ref:`upgrading-pcg64` for the technical details about + `PCG64`. The other PRNGs we provide have some extra protection built in that avoids overlaps if the `~SeedSequence` pools differ in the - slightest bit. `PCG64` has :math:`2^{127}` separate cycles + slightest bit. `PCG64DXSM` has :math:`2^{127}` separate cycles determined by the seed in addition to the position in the :math:`2^{128}` long period for each cycle, so one has to both get on or near the same cycle *and* seed a nearby position in the cycle. @@ -150,12 +151,14 @@ BitGenerator, the size of the jump and the bits in the default unsigned random are listed below. 
+-----------------+-------------------------+-------------------------+-------------------------+ -| BitGenerator | Period | Jump Size | Bits | +| BitGenerator | Period | Jump Size | Bits per Draw | +=================+=========================+=========================+=========================+ -| MT19937 | :math:`2^{19937}` | :math:`2^{128}` | 32 | +| MT19937 | :math:`2^{19937}-1` | :math:`2^{128}` | 32 | +-----------------+-------------------------+-------------------------+-------------------------+ | PCG64 | :math:`2^{128}` | :math:`~2^{127}` ([3]_) | 64 | +-----------------+-------------------------+-------------------------+-------------------------+ +| PCG64DXSM | :math:`2^{128}` | :math:`~2^{127}` ([3]_) | 64 | ++-----------------+-------------------------+-------------------------+-------------------------+ | Philox | :math:`2^{256}` | :math:`2^{128}` | 64 | +-----------------+-------------------------+-------------------------+-------------------------+ diff --git a/doc/source/reference/random/performance.py b/doc/source/reference/random/performance.py index 28a42eb0d..794142836 100644 --- a/doc/source/reference/random/performance.py +++ b/doc/source/reference/random/performance.py @@ -1,14 +1,13 @@ -from collections import OrderedDict from timeit import repeat import pandas as pd import numpy as np -from numpy.random import MT19937, PCG64, Philox, SFC64 +from numpy.random import MT19937, PCG64, PCG64DXSM, Philox, SFC64 -PRNGS = [MT19937, PCG64, Philox, SFC64] +PRNGS = [MT19937, PCG64, PCG64DXSM, Philox, SFC64] -funcs = OrderedDict() +funcs = {} integers = 'integers(0, 2**{bits},size=1000000, dtype="uint{bits}")' funcs['32-bit Unsigned Ints'] = integers.format(bits=32) funcs['64-bit Unsigned Ints'] = integers.format(bits=64) @@ -26,10 +25,10 @@ rg = Generator({prng}()) """ test = "rg.{func}" -table = OrderedDict() +table = {} for prng in PRNGS: print(prng) - col = OrderedDict() + col = {} for key in funcs: t = repeat(test.format(func=funcs[key]), setup.format(prng=prng().__class__.__name__), @@ -38,7 +37,7 @@ for prng in PRNGS: col = pd.Series(col) table[prng().__class__.__name__] = col -npfuncs = OrderedDict() +npfuncs = {} npfuncs.update(funcs) npfuncs['32-bit Unsigned Ints'] = 'randint(2**32,dtype="uint32",size=1000000)' npfuncs['64-bit Unsigned Ints'] = 'randint(2**64,dtype="uint64",size=1000000)' @@ -54,7 +53,7 @@ for key in npfuncs: col[key] = 1000 * min(t) table['RandomState'] = pd.Series(col) -columns = ['MT19937','PCG64','Philox','SFC64', 'RandomState'] +columns = ['MT19937', 'PCG64', 'PCG64DXSM', 'Philox', 'SFC64', 'RandomState'] table = pd.DataFrame(table) order = np.log(table).mean().sort_values().index table = table.T diff --git a/doc/source/reference/random/performance.rst b/doc/source/reference/random/performance.rst index 74dad4cc3..85855be59 100644 --- a/doc/source/reference/random/performance.rst +++ b/doc/source/reference/random/performance.rst @@ -5,9 +5,12 @@ Performance Recommendation ************** -The recommended generator for general use is `PCG64`. It is -statistically high quality, full-featured, and fast on most platforms, but -somewhat slow when compiled for 32-bit processes. + +The recommended generator for general use is `PCG64` or its upgraded variant +`PCG64DXSM` for heavily-parallel use cases. They are statistically high quality, +full-featured, and fast on most platforms, but somewhat slow when compiled for +32-bit processes. See :ref:`upgrading-pcg64` for details on when heavy +parallelism would indicate using `PCG64DXSM`. 
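+
+For example, a minimal sketch of explicitly selecting the upgraded variant
+(assuming a NumPy build that provides ``PCG64DXSM``, introduced above):
+
+.. code-block:: python
+
+    from numpy.random import Generator, PCG64DXSM
+
+    # Wrap the DXSM bit generator in a Generator, exactly as with PCG64
+    rng = Generator(PCG64DXSM(12345))
+    vals = rng.standard_normal(5)
+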
`Philox` is fairly slow, but its statistical properties have very
high quality, and it is easy to get assuredly-independent stream by using
@@ -39,49 +42,48 @@ Integer performance has a similar ordering.

The pattern is similar for other, more complex generators. The normal
performance of the legacy `RandomState` generator is much
-lower than the other since it uses the Box-Muller transformation rather
-than the Ziggurat generator. The performance gap for Exponentials is also
+lower than the others since it uses the Box-Muller transform rather
+than the Ziggurat method. The performance gap for Exponentials is also
large due to the cost of computing the log function to invert the CDF.
-The column labeled MT19973 is used the same 32-bit generator as
-`RandomState` but produces random values using
-`Generator`.
+The column labeled MT19937 uses the same 32-bit generator as
+`RandomState` but produces random variates using `Generator`.

.. csv-table::
-    :header: ,MT19937,PCG64,Philox,SFC64,RandomState
-    :widths: 14,14,14,14,14,14
-
-    32-bit Unsigned Ints,3.2,2.7,4.9,2.7,3.2
-    64-bit Unsigned Ints,5.6,3.7,6.3,2.9,5.7
-    Uniforms,7.3,4.1,8.1,3.1,7.3
-    Normals,13.1,10.2,13.5,7.8,34.6
-    Exponentials,7.9,5.4,8.5,4.1,40.3
-    Gammas,34.8,28.0,34.7,25.1,58.1
-    Binomials,25.0,21.4,26.1,19.5,25.2
-    Laplaces,45.1,40.7,45.5,38.1,45.6
-    Poissons,67.6,52.4,69.2,46.4,78.1
+    :header: ,MT19937,PCG64,PCG64DXSM,Philox,SFC64,RandomState
+    :widths: 14,14,14,14,14,14,14
+
+    32-bit Unsigned Ints,3.3,1.9,2.0,3.3,1.8,3.1
+    64-bit Unsigned Ints,5.6,3.2,2.9,4.9,2.5,5.5
+    Uniforms,5.9,3.1,2.9,5.0,2.6,6.0
+    Normals,13.9,10.8,10.5,12.0,8.3,56.8
+    Exponentials,9.1,6.0,5.8,8.1,5.4,63.9
+    Gammas,37.2,30.8,28.9,34.0,27.5,77.0
+    Binomials,21.3,17.4,17.6,19.3,15.6,21.4
+    Laplaces,73.2,72.3,76.1,73.0,72.3,82.5
+    Poissons,111.7,103.4,100.5,109.4,90.7,115.2

The next table presents the performance in percentage relative to values
generated by the legacy generator, ``RandomState(MT19937())``. The overall
performance was computed using a geometric mean.

.. csv-table::
-    :header: ,MT19937,PCG64,Philox,SFC64
-    :widths: 14,14,14,14,14
-
-    32-bit Unsigned Ints,101,121,67,121
-    64-bit Unsigned Ints,102,156,91,199
-    Uniforms,100,179,90,235
-    Normals,263,338,257,443
-    Exponentials,507,752,474,985
-    Gammas,167,207,167,231
-    Binomials,101,118,96,129
-    Laplaces,101,112,100,120
-    Poissons,116,149,113,168
-    Overall,144,192,132,225
+    :header: ,MT19937,PCG64,PCG64DXSM,Philox,SFC64
+    :widths: 14,14,14,14,14,14
+
+    32-bit Unsigned Ints,96,162,160,96,175
+    64-bit Unsigned Ints,97,171,188,113,218
+    Uniforms,102,192,206,121,233
+    Normals,409,526,541,471,684
+    Exponentials,701,1071,1101,784,1179
+    Gammas,207,250,266,227,281
+    Binomials,100,123,122,111,138
+    Laplaces,113,114,108,113,114
+    Poissons,103,111,115,105,127
+    Overall,159,219,225,174,251

.. note::
-   All timings were taken using Linux on an i5-3570 processor.
+   All timings were taken using Linux on an AMD Ryzen 9 3900X processor.

Performance on different Operating Systems
******************************************
@@ -98,33 +100,33 @@ across tables.
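+
+The ``Overall`` rows in these tables are geometric means over the
+distribution rows, not arithmetic means. A minimal sketch of that
+aggregation, using the MT19937 column of the relative-performance table
+above (as in the ``performance.py`` script):
+
+.. code-block:: python
+
+    import numpy as np
+
+    # MT19937 column of the table of percentages relative to RandomState
+    mt19937 = np.array([96, 97, 102, 409, 701, 207, 100, 113, 103], float)
+    overall = np.exp(np.log(mt19937).mean())  # ~159, the "Overall" entry
+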
64-bit Linux
~~~~~~~~~~~~

-=================== ========= ======= ======== =======
-Distribution        MT19937   PCG64   Philox   SFC64
-=================== ========= ======= ======== =======
-32-bit Unsigned Int 100       119.8   67.7     120.2
-64-bit Unsigned Int 100       152.9   90.8     213.3
-Uniforms            100       179.0   87.0     232.0
-Normals             100       128.5   99.2     167.8
-Exponentials        100       148.3   93.0     189.3
-**Overall**         100       144.3   86.8     180.0
-=================== ========= ======= ======== =======
+===================== ========= ======= =========== ======== =======
+Distribution          MT19937   PCG64   PCG64DXSM   Philox   SFC64
+===================== ========= ======= =========== ======== =======
+32-bit Unsigned Ints  100       168     166         100      182
+64-bit Unsigned Ints  100       176     193         116      224
+Uniforms              100       188     202         118      228
+Normals               100       128     132         115      167
+Exponentials          100       152     157         111      168
+**Overall**           100       161     168         112      192
+===================== ========= ======= =========== ======== =======

64-bit Windows
~~~~~~~~~~~~~~
-The relative performance on 64-bit Linux and 64-bit Windows is broadly similar.
-
+The relative performance on 64-bit Linux and 64-bit Windows is broadly similar
+with the notable exception of the Philox generator.

-=================== ========= ======= ======== =======
-Distribution        MT19937   PCG64   Philox   SFC64
-=================== ========= ======= ======== =======
-32-bit Unsigned Int 100       129.1   35.0     135.0
-64-bit Unsigned Int 100       146.9   35.7     176.5
-Uniforms            100       165.0   37.0     192.0
-Normals             100       128.5   48.5     158.0
-Exponentials        100       151.6   39.0     172.8
-**Overall**         100       143.6   38.7     165.7
-=================== ========= ======= ======== =======
+===================== ========= ======= =========== ======== =======
+Distribution          MT19937   PCG64   PCG64DXSM   Philox   SFC64
+===================== ========= ======= =========== ======== =======
+32-bit Unsigned Ints  100       155     131         29       150
+64-bit Unsigned Ints  100       157     143         25       154
+Uniforms              100       151     144         24       155
+Normals               100       129     128         37       150
+Exponentials          100       150     145         28       159
+**Overall**           100       148     138         28       154
+===================== ========= ======= =========== ======== =======

32-bit Windows
@@ -134,20 +136,20 @@ The performance of 64-bit generators on 32-bit Windows is much lower than on 64-bit
operating systems due to register width. MT19937, the generator that has been
in NumPy since 2005, operates on 32-bit integers.

-=================== ========= ======= ======== =======
-Distribution        MT19937   PCG64   Philox   SFC64
-=================== ========= ======= ======== =======
-32-bit Unsigned Int 100       30.5    21.1     77.9
-64-bit Unsigned Int 100       26.3    19.2     97.0
-Uniforms            100       28.0    23.0     106.0
-Normals             100       40.1    31.3     112.6
-Exponentials        100       33.7    26.3     109.8
-**Overall**         100       31.4    23.8     99.8
-=================== ========= ======= ======== =======
+===================== ========= ======= =========== ======== =======
+Distribution          MT19937   PCG64   PCG64DXSM   Philox   SFC64
+===================== ========= ======= =========== ======== =======
+32-bit Unsigned Ints  100       24      34          14       57
+64-bit Unsigned Ints  100       21      32          14       74
+Uniforms              100       21      34          16       73
+Normals               100       36      57          28       101
+Exponentials          100       28      44          20       88
+**Overall**           100       25      39          18       77
+===================== ========= ======= =========== ======== =======

.. note::
-   Linux timings used Ubuntu 18.04 and GCC 7.4. Windows timings were made on
+   Linux timings used Ubuntu 20.04 and GCC 9.3.0. Windows timings were made on
   Windows 10 using Microsoft C/C++ Optimizing Compiler Version 19 (Visual
-   Studio 2015). All timings were produced on an i5-3570 processor.
+   Studio 2019). All timings were produced on an AMD Ryzen 9 3900X processor.
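+
+Timings of this kind can be reproduced with a sketch along the following
+lines (the ``number`` and ``repeat`` choices here are illustrative, not the
+exact benchmark configuration):
+
+.. code-block:: python
+
+    from timeit import repeat
+
+    from numpy.random import Generator, PCG64
+
+    rng = Generator(PCG64())
+    # Best-of-three time, in milliseconds, to draw one million normals
+    times = repeat("rng.standard_normal(1_000_000)",
+                   globals={"rng": rng}, number=1, repeat=3)
+    print(1000 * min(times))
+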
diff --git a/doc/source/reference/random/upgrading-pcg64.rst b/doc/source/reference/random/upgrading-pcg64.rst new file mode 100644 index 000000000..9e540ace9 --- /dev/null +++ b/doc/source/reference/random/upgrading-pcg64.rst @@ -0,0 +1,152 @@ +.. _upgrading-pcg64: + +.. currentmodule:: numpy.random + +Upgrading ``PCG64`` with ``PCG64DXSM`` +-------------------------------------- + +Uses of the `PCG64` `BitGenerator` in a massively-parallel context have been +shown to have statistical weaknesses that were not apparent at the first +release in numpy 1.17. Most users will never observe this weakness and are +safe to continue to use `PCG64`. We have introduced a new `PCG64DXSM` +`BitGenerator` that will eventually become the new default `BitGenerator` +implementation used by `default_rng` in future releases. `PCG64DXSM` solves +the statistical weakness while preserving the performance and the features of +`PCG64`. + +Does this affect me? +==================== + +If you + + 1. only use a single `Generator` instance, + 2. only use `RandomState` or the functions in `numpy.random`, + 3. only use the `PCG64.jumped` method to generate parallel streams, + 4. explicitly use a `BitGenerator` other than `PCG64`, + +then this weakness does not affect you at all. Carry on. + +If you use moderate numbers of parallel streams created with `default_rng` or +`SeedSequence.spawn`, in the 1000s, then the chance of observing this weakness +is negligibly small. You can continue to use `PCG64` comfortably. + +If you use very large numbers of parallel streams, in the millions, and draw +large amounts of numbers from each, then the chance of observing this weakness +can become non-negligible, if still small. An example of such a use case would +be a very large distributed reinforcement learning problem with millions of +long Monte Carlo playouts each generating billions of random number draws. Such +use cases should consider using `PCG64DXSM` explicitly or another +modern `BitGenerator` like `SFC64` or `Philox`, but it is unlikely that any +old results you may have calculated are invalid. In any case, the weakness is +a kind of `Birthday Paradox <https://en.wikipedia.org/wiki/Birthday_problem>`_ +collision. That is, a single pair of parallel streams out of the millions, +considered together, might fail a stringent set of statistical tests of +randomness. The remaining millions of streams would all be perfectly fine, and +the effect of the bad pair in the whole calculation is very likely to be +swamped by the remaining streams in most applications. + +.. _upgrading-pcg64-details: + +Technical Details +================= + +Like many PRNG algorithms, `PCG64` is constructed from a transition function, +which advances a 128-bit state, and an output function, that mixes the 128-bit +state into a 64-bit integer to be output. One of the guiding design principles +of the PCG family of PRNGs is to balance the computational cost (and +pseudorandomness strength) between the transition function and the output +function. The transition function is a 128-bit linear congruential generator +(LCG), which consists of multiplying the 128-bit state with a fixed +multiplication constant and then adding a user-chosen increment, in 128-bit +modular arithmetic. LCGs are well-analyzed PRNGs with known weaknesses, though +128-bit LCGs are large enough to pass stringent statistical tests on their own, +with only the trivial output function. 
The output function of `PCG64` is +intended to patch up some of those known weaknesses by doing "just enough" +scrambling of the bits to assist in the statistical properties without adding +too much computational cost. + +One of these known weaknesses is that advancing the state of the LCG by steps +numbering a power of two (``bg.advance(2**N)``) will leave the lower ``N`` bits +identical to the state that was just left. For a single stream drawn from +sequentially, this is of little consequence. The remaining :math:`128-N` bits provide +plenty of pseudorandomness that will be mixed in for any practical ``N`` that can +be observed in a single stream, which is why one does not need to worry about +this if you only use a single stream in your application. Similarly, the +`PCG64.jumped` method uses a carefully chosen number of steps to avoid creating +these collisions. However, once you start creating "randomly-initialized" +parallel streams, either using OS entropy by calling `default_rng` repeatedly +or using `SeedSequence.spawn`, then we need to consider how many lower bits +need to "collide" in order to create a bad pair of streams, and then evaluate +the probability of creating such a collision. +`Empirically <https://github.com/numpy/numpy/issues/16313>`_, it has been +determined that if one shares the lower 58 bits of state and shares an +increment, then the pair of streams, when interleaved, will fail +`PractRand <http://pracrand.sourceforge.net/>`_ in +a reasonable amount of time, after drawing a few gigabytes of data. Following +the standard Birthday Paradox calculations for a collision of 58 bits, we can +see that we can create :math:`2^{29}`, or about half a billion, streams which is when +the probability of such a collision becomes high. Half a billion streams is +quite high, and the amount of data each stream needs to draw before the +statistical correlations become apparent to even the strict ``PractRand`` tests +is in the gigabytes. But this is on the horizon for very large applications +like distributed reinforcement learning. There are reasons to expect that even +in these applications a collision probably will not have a practical effect in +the total result, since the statistical problem is constrained to just the +colliding pair. + +Now, let us consider the case when the increment is not constrained to be the +same. Our implementation of `PCG64` seeds both the state and the increment; +that is, two calls to `default_rng` (almost certainly) have different states +and increments. Upon our first release, we believed that having the seeded +increment would provide a certain amount of extra protection, that one would +have to be "close" in both the state space and increment space in order to +observe correlations (``PractRand`` failures) in a pair of streams. If that were +true, then the "bottleneck" for collisions would be the 128-bit entropy pool +size inside of `SeedSequence` (and 128-bit collisions are in the +"preposterously unlikely" category). Unfortunately, this is not true. + +One of the known properties of an LCG is that different increments create +*distinct* streams, but with a known relationship. Each LCG has an orbit that +traverses all :math:`2^{128}` different 128-bit states. 
Two LCGs with different
+increments are related in that one can "rotate" the orbit of the first LCG
+(advance it by a number of steps that we can compute from the two increments)
+such that both LCGs will then always have the same state, up to an
+additive constant and maybe an inversion of the bits. If you then iterate both
+streams in lockstep, then the states will *always* remain related by that same
+additive constant (and the inversion, if present). Recall that `PCG64` is
+constructed from both a transition function (the LCG) and an output function.
+It was expected that the scrambling effect of the output function would have
+been strong enough to make the distinct streams practically independent (i.e.
+"passing the ``PractRand`` tests") unless the two increments were
+pathologically related to each other (e.g. 1 and 3). The output function XSL-RR
+of the then-standard PCG algorithm that we implemented in `PCG64` turns out to
+be too weak to cover up for the 58-bit collision of the underlying LCG that we
+described above. For any given pair of increments, the size of the "colliding"
+space of states is the same, so for this weakness, the extra distinctness
+provided by the increments does not translate into extra protection from
+statistical correlations that ``PractRand`` can detect.
+
+Fortunately, strengthening the output function is able to correct this weakness
+and *does* turn the extra distinctness provided by differing increments into
+additional protection from these low-bit collisions. To the `PCG author's
+credit <https://github.com/numpy/numpy/issues/13635#issuecomment-506088698>`_,
+she had developed a stronger output function in response to related discussions
+during the long birth of the new `BitGenerator` system. We NumPy developers
+chose to be "conservative" and use the XSL-RR variant that had undergone
+a longer period of testing at that time. The DXSM output function adopts
+a "xorshift-multiply" construction used in strong integer hashes that has much
+better avalanche properties than the XSL-RR output function. While there are
+"pathological" pairs of increments that induce "bad" additive constants that
+relate the two streams, the vast majority of pairs induce "good" additive
+constants that make the merely-distinct streams of LCG states into
+practically-independent output streams. Indeed, now the claim we once made
+about `PCG64` is actually true of `PCG64DXSM`: collisions are possible, but
+both streams have to simultaneously be both "close" in the 128-bit state space
+*and* "close" in the 127-bit increment space, so that would be less likely than
+the negligible chance of colliding in the 128-bit internal `SeedSequence` pool.
+The DXSM output function is more computationally intensive than XSL-RR, but
+some optimizations in the LCG more than make up for the performance hit on most
+machines, so `PCG64DXSM` is a good, safe upgrade. There are, of course, an
+infinite number of stronger output functions that one could consider, but most
+will have a greater computational cost, and the DXSM output function has now
+received many CPU cycles of testing via ``PractRand``.
diff --git a/doc/source/reference/routines.array-creation.rst b/doc/source/reference/routines.array-creation.rst
index e718f0052..30780c286 100644
--- a/doc/source/reference/routines.array-creation.rst
+++ b/doc/source/reference/routines.array-creation.rst
@@ -7,8 +7,8 @@ Array creation routines

.. 
currentmodule:: numpy -Ones and zeros --------------- +From shape or value +------------------- .. autosummary:: :toctree: generated/ diff --git a/doc/source/reference/routines.ctypeslib.rst b/doc/source/reference/routines.ctypeslib.rst index 3a059f5d9..c6127ca64 100644 --- a/doc/source/reference/routines.ctypeslib.rst +++ b/doc/source/reference/routines.ctypeslib.rst @@ -11,3 +11,10 @@ C-Types Foreign Function Interface (:mod:`numpy.ctypeslib`) .. autofunction:: as_ctypes_type .. autofunction:: load_library .. autofunction:: ndpointer + +.. class:: c_intp + + A `ctypes` signed integer type of the same size as `numpy.intp`. + + Depending on the platform, it can be an alias for either `~ctypes.c_int`, + `~ctypes.c_long` or `~ctypes.c_longlong`. diff --git a/doc/source/reference/routines.linalg.rst b/doc/source/reference/routines.linalg.rst index 86e168b26..76b7ab82c 100644 --- a/doc/source/reference/routines.linalg.rst +++ b/doc/source/reference/routines.linalg.rst @@ -30,10 +30,26 @@ flexible broadcasting options. For example, `numpy.linalg.solve` can handle "stacked" arrays, while `scipy.linalg.solve` accepts only a single square array as its first argument. +.. note:: + + The term *matrix* as it is used on this page indicates a 2d `numpy.array` + object, and *not* a `numpy.matrix` object. The latter is no longer + recommended, even for linear algebra. See + :ref:`the matrix object documentation<matrix-objects>` for + more information. + +The ``@`` operator +------------------ + +Introduced in NumPy 1.10.0, the ``@`` operator is preferable to +other methods when computing the matrix product between 2d arrays. The +:func:`numpy.matmul` function implements the ``@`` operator. + .. currentmodule:: numpy Matrix and vector products -------------------------- + .. autosummary:: :toctree: generated/ diff --git a/doc/source/reference/routines.other.rst b/doc/source/reference/routines.other.rst index aefd680bb..339857409 100644 --- a/doc/source/reference/routines.other.rst +++ b/doc/source/reference/routines.other.rst @@ -55,4 +55,11 @@ Matlab-like Functions :toctree: generated/ who - disp
\ No newline at end of file
+    disp
+
+Exceptions
+----------
+.. autosummary::
+   :toctree: generated/
+
+   AxisError
diff --git a/doc/source/reference/routines.polynomials.classes.rst b/doc/source/reference/routines.polynomials.classes.rst
index 10331e9c1..5f575bed1 100644
--- a/doc/source/reference/routines.polynomials.classes.rst
+++ b/doc/source/reference/routines.polynomials.classes.rst
@@ -290,7 +290,8 @@ polynomials up to degree 5 are plotted below.
    >>> import matplotlib.pyplot as plt
    >>> from numpy.polynomial import Chebyshev as T
    >>> x = np.linspace(-1, 1, 100)
-    >>> for i in range(6): ax = plt.plot(x, T.basis(i)(x), lw=2, label="$T_%d$"%i)
+    >>> for i in range(6):
+    ...     ax = plt.plot(x, T.basis(i)(x), lw=2, label=f"$T_{i}$")
    ...
    >>> plt.legend(loc="upper left")
    <matplotlib.legend.Legend object at 0x3b3ee10>
@@ -304,7 +305,8 @@ The same plots over the range -2 <= `x` <= 2 look very different:
    >>> import matplotlib.pyplot as plt
    >>> from numpy.polynomial import Chebyshev as T
    >>> x = np.linspace(-2, 2, 100)
-    >>> for i in range(6): ax = plt.plot(x, T.basis(i)(x), lw=2, label="$T_%d$"%i)
+    >>> for i in range(6):
+    ...     ax = plt.plot(x, T.basis(i)(x), lw=2, label=f"$T_{i}$")
    ...
    >>> plt.legend(loc="lower right")
    <matplotlib.legend.Legend object at 0x3b3ee10>
diff --git a/doc/source/reference/routines.polynomials.rst b/doc/source/reference/routines.polynomials.rst
index e74c5a683..ecfb012f0 100644
--- a/doc/source/reference/routines.polynomials.rst
+++ b/doc/source/reference/routines.polynomials.rst
@@ -9,24 +9,165 @@ of the `numpy.polynomial` package, introduced in NumPy 1.4.
Prior to NumPy 1.4, `numpy.poly1d` was the class of choice and it is still
available in order to maintain backward compatibility.
-However, the newer Polynomial package is more complete than `numpy.poly1d`
-and its convenience classes are better behaved in the numpy environment.
+However, the newer `polynomial package <numpy.polynomial>` is more complete
+and its `convenience classes <routines.polynomials.classes>` provide a
+more consistent, better-behaved interface for working with polynomial
+expressions.
Therefore :mod:`numpy.polynomial` is recommended for new coding.

-Transition notice
------------------
-The various routines in the Polynomial package all deal with
-series whose coefficients go from degree zero upward,
-which is the *reverse order* of the Poly1d convention.
-The easy way to remember this is that indexes
-correspond to degree, i.e., coef[i] is the coefficient of the term of
-degree i.
+.. note:: **Terminology**
+   The term *polynomial module* refers to the old API defined in
+   `numpy.lib.polynomial`, which includes the :class:`numpy.poly1d` class and
+   the polynomial functions prefixed with *poly* accessible from the `numpy`
+   namespace (e.g. `numpy.polyadd`, `numpy.polyval`, `numpy.polyfit`, etc.).
+
+   The term *polynomial package* refers to the new API defined in
+   `numpy.polynomial`, which includes the convenience classes for the
+   different kinds of polynomials (`numpy.polynomial.Polynomial`,
+   `numpy.polynomial.Chebyshev`, etc.).
+
+Transitioning from `numpy.poly1d` to `numpy.polynomial`
+-------------------------------------------------------
+
+As noted above, the :class:`poly1d class <numpy.poly1d>` and associated
+functions defined in ``numpy.lib.polynomial``, such as `numpy.polyfit`
+and `numpy.poly`, are considered legacy and should **not** be used in new
+code.
+Since NumPy version 1.4, the `numpy.polynomial` package is preferred for
+working with polynomials.
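+
+For instance, a small sketch of the same quadratic :math:`x^{2} + 2x + 3`
+expressed in both APIs; note the reversed coefficient order, which the
+table below summarizes:
+
+.. code-block:: python
+
+    import numpy as np
+    from numpy.polynomial import Polynomial
+
+    p_old = np.poly1d([1, 2, 3])   # coefficients from highest degree down
+    p_new = Polynomial([3, 2, 1])  # coefficients from degree zero up
+    assert p_old(2.0) == p_new(2.0) == 11.0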
+
+Quick Reference
+~~~~~~~~~~~~~~~
+
+The following table highlights some of the main differences between the
+legacy polynomial module and the polynomial package for common tasks.
+The `~numpy.polynomial.polynomial.Polynomial` class is imported for brevity::
+
+  from numpy.polynomial import Polynomial
+
+
++------------------------+------------------------------+---------------------------------------+
+| **How to...**          | Legacy (`numpy.poly1d`)      | `numpy.polynomial`                    |
++------------------------+------------------------------+---------------------------------------+
+| Create a               | ``p = np.poly1d([1, 2, 3])`` | ``p = Polynomial([3, 2, 1])``         |
+| polynomial object      |                              |                                       |
+| from coefficients [1]_ |                              |                                       |
++------------------------+------------------------------+---------------------------------------+
+| Create a polynomial    | ``r = np.poly([-1, 1])``     | ``p = Polynomial.fromroots([-1, 1])`` |
+| object from roots      | ``p = np.poly1d(r)``         |                                       |
++------------------------+------------------------------+---------------------------------------+
+| Fit a polynomial of    |                              |                                       |
+| degree ``deg`` to data | ``np.polyfit(x, y, deg)``    | ``Polynomial.fit(x, y, deg)``         |
++------------------------+------------------------------+---------------------------------------+
+
+
+.. [1] Note the reversed ordering of the coefficients
+
+Transition Guide
+~~~~~~~~~~~~~~~~
+
+There are significant differences between ``numpy.lib.polynomial`` and
+`numpy.polynomial`.
+The most significant difference is the ordering of the coefficients for the
+polynomial expressions.
+The various routines in `numpy.polynomial` all
+deal with series whose coefficients go from degree zero upward,
+which is the *reverse order* of the poly1d convention.
+The easy way to remember this is that indices
+correspond to degree, i.e., ``coef[i]`` is the coefficient of the term of
+degree *i*.
+
+Though the difference in convention may be confusing, it is straightforward to
+convert from the legacy polynomial API to the new.
+For example, the following demonstrates how you would convert a `numpy.poly1d`
+instance representing the expression :math:`x^{2} + 2x + 3` to a
+`~numpy.polynomial.polynomial.Polynomial` instance representing the same
+expression::
+
+    >>> p1d = np.poly1d([1, 2, 3])
+    >>> p = np.polynomial.Polynomial(p1d.coef[::-1])
+
+In addition to the ``coef`` attribute, polynomials from the polynomial
+package also have ``domain`` and ``window`` attributes.
+These attributes are most relevant when fitting
+polynomials to data, though it should be noted that polynomials with
+different ``domain`` and ``window`` attributes are not considered equal, and
+can't be mixed in arithmetic::
+
+    >>> p1 = np.polynomial.Polynomial([1, 2, 3])
+    >>> p1
+    Polynomial([1., 2., 3.], domain=[-1, 1], window=[-1, 1])
+    >>> p2 = np.polynomial.Polynomial([1, 2, 3], domain=[-2, 2])
+    >>> p1 == p2
+    False
+    >>> p1 + p2
+    Traceback (most recent call last):
+        ...
+    TypeError: Domains differ
+
+See the documentation for the
+`convenience classes <routines.polynomials.classes>` for further details on
+the ``domain`` and ``window`` attributes.
+
+Another major difference between the legacy polynomial module and the
+polynomial package is polynomial fitting. In the old module, fitting was
+done via the `~numpy.polyfit` function. In the polynomial package, the
+`~numpy.polynomial.polynomial.Polynomial.fit` class method is preferred. For
+example, consider a simple linear fit to the following data:
+
+.. 
ipython:: python + + rng = np.random.default_rng() + x = np.arange(10) + y = np.arange(10) + rng.standard_normal(10) + +With the legacy polynomial module, a linear fit (i.e. polynomial of degree 1) +could be applied to these data with `~numpy.polyfit`: + +.. ipython:: python + + np.polyfit(x, y, deg=1) + +With the new polynomial API, the `~numpy.polynomial.polynomial.Polynomial.fit` +class method is preferred: + +.. ipython:: python + + p_fitted = np.polynomial.Polynomial.fit(x, y, deg=1) + p_fitted + +Note that the coefficients are given *in the scaled domain* defined by the +linear mapping between the ``window`` and ``domain``. +`~numpy.polynomial.polynomial.Polynomial.convert` can be used to get the +coefficients in the unscaled data domain. + +.. ipython:: python + + p_fitted.convert() + +Documentation for the `~numpy.polynomial` Package +------------------------------------------------- + +In addition to standard power series polynomials, the polynomial package +provides several additional kinds of polynomials including Chebyshev, +Hermite (two subtypes), Laguerre, and Legendre polynomials. +Each of these has an associated +`convenience class <routines.polynomials.classes>` available from the +`numpy.polynomial` namespace that provides a consistent interface for working +with polynomials regardless of their type. .. toctree:: :maxdepth: 1 routines.polynomials.classes + +Documentation pertaining to specific functions defined for each kind of +polynomial individually can be found in the corresponding module documentation: + +.. toctree:: + :maxdepth: 1 + routines.polynomials.polynomial routines.polynomials.chebyshev routines.polynomials.hermite @@ -36,6 +177,9 @@ degree i. routines.polynomials.polyutils +Documentation for Legacy Polynomials +------------------------------------ + .. toctree:: :maxdepth: 2 diff --git a/doc/source/reference/routines.testing.rst b/doc/source/reference/routines.testing.rst index 98ce3f377..d9e98e941 100644 --- a/doc/source/reference/routines.testing.rst +++ b/doc/source/reference/routines.testing.rst @@ -18,9 +18,6 @@ Asserts .. autosummary:: :toctree: generated/ - assert_almost_equal - assert_approx_equal - assert_array_almost_equal assert_allclose assert_array_almost_equal_nulp assert_array_max_ulp @@ -32,6 +29,19 @@ Asserts assert_warns assert_string_equal +Asserts (not recommended) +------------------------- +It is recommended to use one of `assert_allclose`, +`assert_array_almost_equal_nulp` or `assert_array_max_ulp` instead of these +functions for more consistent floating point comparisons. + +.. autosummary:: + :toctree: generated/ + + assert_almost_equal + assert_approx_equal + assert_array_almost_equal + Decorators ---------- .. 
autosummary::
   :toctree: generated/
diff --git a/doc/source/reference/simd/simd-optimizations.py b/doc/source/reference/simd/simd-optimizations.py
index 5d6da50e3..a78302db5 100644
--- a/doc/source/reference/simd/simd-optimizations.py
+++ b/doc/source/reference/simd/simd-optimizations.py
@@ -8,7 +8,7 @@ gen_path = path.dirname(path.realpath(__file__))
 from numpy.distutils.ccompiler_opt import CCompilerOpt
 class FakeCCompilerOpt(CCompilerOpt):
-    fake_info = ""
+    fake_info = ("arch", "compiler", "extra_args")
     # disable caching no need for it
     conf_nocache = True
     def __init__(self, *args, **kwargs):
@@ -101,7 +101,7 @@ def features_table_sections(name, ftable=None, gtable=None, tab_size=4):
     return content

 def features_table(arch, cc="gcc", pretty_name=None, **kwargs):
-    FakeCCompilerOpt.fake_info = arch + cc
+    FakeCCompilerOpt.fake_info = (arch, cc, '')
     ccopt = FakeCCompilerOpt(cpu_baseline="max")
     features = ccopt.cpu_baseline_names()
     ftable = ccopt.gen_features_table(features, **kwargs)
@@ -112,12 +112,12 @@ def features_table(arch, cc="gcc", pretty_name=None, **kwargs):
     return features_table_sections(pretty_name, ftable, gtable, **kwargs)

 def features_table_diff(arch, cc, cc_vs="gcc", pretty_name=None, **kwargs):
-    FakeCCompilerOpt.fake_info = arch + cc
+    FakeCCompilerOpt.fake_info = (arch, cc, '')
     ccopt = FakeCCompilerOpt(cpu_baseline="max")
     fnames = ccopt.cpu_baseline_names()
     features = {f:ccopt.feature_implies(f) for f in fnames}
-    FakeCCompilerOpt.fake_info = arch + cc_vs
+    FakeCCompilerOpt.fake_info = (arch, cc_vs, '')
     ccopt_vs = FakeCCompilerOpt(cpu_baseline="max")
     fnames_vs = ccopt_vs.cpu_baseline_names()
     features_vs = {f:ccopt_vs.feature_implies(f) for f in fnames_vs}
diff --git a/doc/source/reference/simd/simd-optimizations.rst b/doc/source/reference/simd/simd-optimizations.rst
index 59a4892b2..956824321 100644
--- a/doc/source/reference/simd/simd-optimizations.rst
+++ b/doc/source/reference/simd/simd-optimizations.rst
@@ -96,8 +96,8 @@ NOTES
   arguments must be enclosed in quotes.

- The operand ``+`` is only added for nominal reasons, For example:
-    ``--cpu-basline= "min avx2"`` is equivalent to ``--cpu-basline="min + avx2"``.
-    ``--cpu-basline="min,avx2"`` is equivalent to ``--cpu-basline`="min,+avx2"``
+    ``--cpu-baseline= "min avx2"`` is equivalent to ``--cpu-baseline="min + avx2"``.
+    ``--cpu-baseline="min,avx2"`` is equivalent to ``--cpu-baseline="min,+avx2"``

- If the CPU feature is not supported by the user platform or compiler,
  it will be skipped rather than raising a fatal error.
diff --git a/doc/source/reference/ufuncs.rst b/doc/source/reference/ufuncs.rst
index 06fbe28dd..3eae4e159 100644
--- a/doc/source/reference/ufuncs.rst
+++ b/doc/source/reference/ufuncs.rst
@@ -266,7 +266,7 @@ can generate this table for your system with the code given in the Figure.
 S - - - - - - - - - - - - - - - - - - - - Y Y Y Y - -
 U - - - - - - - - - - - - - - - - - - - - - Y Y Y - -
 V - - - - - - - - - - - - - - - - - - - - - - Y Y - -
-O - - - - - - - - - - - - - - - - - - - - - - Y Y - -
+O - - - - - - - - - - - - - - - - - - - - - - - Y - -
 M - - - - - - - - - - - - - - - - - - - - - - Y Y Y -
 m - - - - - - - - - - - - - - - - - - - - - - Y Y - Y
@@ -430,8 +430,10 @@ advanced usage and will not typically be used.

 .. versionadded:: 1.6

-   Overrides the dtype of the calculation and output arrays. Similar to
-   *signature*.
+   Overrides the DType of the output arrays the same way as the *signature*.
+   This should ensure a matching precision of the calculation. 
The exact
+   calculation DTypes chosen may depend on the ufunc and the inputs may be
+   cast to this DType to perform the calculation.

*subok*

@@ -442,20 +444,31 @@ advanced usage and will not typically be used.

*signature*

-   Either a data-type, a tuple of data-types, or a special signature
-   string indicating the input and output types of a ufunc. This argument
-   allows you to provide a specific signature for the 1-d loop to use
-   in the underlying calculation. If the loop specified does not exist
-   for the ufunc, then a TypeError is raised. Normally, a suitable loop is
-   found automatically by comparing the input types with what is
-   available and searching for a loop with data-types to which all inputs
-   can be cast safely. This keyword argument lets you bypass that
-   search and choose a particular loop. A list of available signatures is
-   provided by the **types** attribute of the ufunc object. For backwards
-   compatibility this argument can also be provided as *sig*, although
-   the long form is preferred. Note that this should not be confused with
-   the generalized ufunc :ref:`signature <details-of-signature>` that is
-   stored in the **signature** attribute of the of the ufunc object.
+   Either a DType, a tuple of DTypes, or a special signature string
+   indicating the input and output types of a ufunc.
+
+   This argument allows the user to specify exact DTypes to be used for the
+   calculation. Casting will be used as necessary. The actual DType of the
+   input arrays is not considered unless ``signature`` is ``None`` for
+   that array.
+
+   When all DTypes are fixed, a specific loop is chosen or an error raised
+   if no matching loop exists.
+   If some DTypes are not specified and left ``None``, the behaviour may
+   depend on the ufunc.
+   At this time, a list of available signatures is provided by the **types**
+   attribute of the ufunc. (This list may be missing DTypes not defined
+   by NumPy.)
+
+   The ``signature`` only specifies the DType class/type. For example, it
+   can specify that the operation should be a ``datetime64`` or ``float64``
+   operation. It does not specify the ``datetime64`` time-unit or the
+   ``float64`` byte-order.
+
+   For backwards compatibility this argument can also be provided as *sig*,
+   although the long form is preferred. Note that this should not be
+   confused with the generalized ufunc :ref:`signature <details-of-signature>`
+   that is stored in the **signature** attribute of the ufunc object.

*extobj*

@@ -628,8 +641,8 @@ Math operations
   for large calculations. If your arrays are large, complicated
   expressions can take longer than absolutely necessary due to the
   creation and (later) destruction of temporary calculation
-   spaces. For example, the expression ``G = a * b + c`` is equivalent to
-   ``t1 = A * B; G = T1 + C; del t1``. It will be more quickly executed
+   spaces. For example, the expression ``G = A * B + C`` is equivalent to
+   ``T1 = A * B; G = T1 + C; del T1``. It will be more quickly executed
   as ``G = A * B; add(G, C, G)`` which is the same as
   ``G = A * B; G += C``.
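+
+   A short sketch of this pattern (the arrays ``A``, ``B``, ``C`` are
+   arbitrary stand-ins):
+
+   .. code-block:: python
+
+      import numpy as np
+
+      rng = np.random.default_rng(0)
+      A, B, C = rng.random((3, 1000))
+
+      G = A * B            # one temporary holds the product
+      np.add(G, C, out=G)  # accumulate in place, i.e. G += C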