60 files changed, 2661 insertions, 2362 deletions
diff --git a/doc/neps/nep-0041-improved-dtype-support.rst b/doc/neps/nep-0041-improved-dtype-support.rst
index 56ff5eac6..6dc4ea50c 100644
--- a/doc/neps/nep-0041-improved-dtype-support.rst
+++ b/doc/neps/nep-0041-improved-dtype-support.rst
@@ -514,22 +514,22 @@ are not yet fully clear, we anticipate, and accept the following changes:
 
 * **C-API**:
 
-    * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which uses
-      ``type(dtype) is np.dtype``. When compiling against an old NumPy version,
-      the macro may have to be replaced with the corresponding
-      ``PyObject_IsInstance`` call. (If this is a problem, we could backport
-      fixing the macro)
-
-   * The UFunc machinery changes will break *limited* parts of the current
-     implementation. Replacing e.g. the default ``TypeResolver`` is expected
-     to remain supported for a time, although optimized masked inner loop iteration
-     (which is not even used *within* NumPy) will no longer be supported.
-
-   * All functions currently defined on the dtypes, such as
-     ``PyArray_Descr->f->nonzero``, will be defined and accessed differently.
-     This means that in the long run lowlevel access code will
-     have to be changed to use the new API. Such changes are expected to be
-     necessary in very few project.
+  * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which uses
+    ``type(dtype) is np.dtype``. When compiling against an old NumPy version,
+    the macro may have to be replaced with the corresponding
+    ``PyObject_IsInstance`` call. (If this is a problem, we could backport
+    a fix for the macro.)
+
+  * The UFunc machinery changes will break *limited* parts of the current
+    implementation. Replacing e.g. the default ``TypeResolver`` is expected
+    to remain supported for a time, although optimized masked inner loop iteration
+    (which is not even used *within* NumPy) will no longer be supported.
+
+  * All functions currently defined on the dtypes, such as
+    ``PyArray_Descr->f->nonzero``, will be defined and accessed differently.
+    This means that in the long run low-level access code will
+    have to be changed to use the new API. Such changes are expected to be
+    necessary in very few projects.
 
 * **dtype implementors (C-API)**:
 
@@ -541,16 +541,16 @@ are not yet fully clear, we anticipate, and accept the following changes:
     At least in some code paths, a similar mechanism is already used.
 
   * The ``scalarkind`` slot and registration of scalar casting will be
-     removed/ignored without replacement.
-     It currently allows partial value-based casting.
-     The ``PyArray_ScalarKind`` function will continue to work for builtin types,
-     but will not be used internally and be deprecated.
-
-   * Currently user dtypes are defined as instances of ``np.dtype``.
-     The creation works by the user providing a prototype instance.
-     NumPy will need to modify at least the type during registration.
-     This has no effect for either ``rational`` or ``quaternion`` and mutation
-     of the structure seems unlikely after registration.
+    removed/ignored without replacement.
+    It currently allows partial value-based casting.
+    The ``PyArray_ScalarKind`` function will continue to work for builtin types,
+    but will no longer be used internally and will be deprecated.
+
+  * Currently user dtypes are defined as instances of ``np.dtype``.
+    The creation works by the user providing a prototype instance.
+    NumPy will need to modify at least the type during registration.
+    This has no effect for either ``rational`` or ``quaternion`` and mutation
+    of the structure seems unlikely after registration.
 
 Since there is a fairly large API surface concerning datatypes, further changes
 or the limitation certain function to currently existing datatypes is
diff --git a/doc/neps/nep-0042-new-dtypes.rst b/doc/neps/nep-0042-new-dtypes.rst
index 1f476114f..b37555892 100644
--- a/doc/neps/nep-0042-new-dtypes.rst
+++ b/doc/neps/nep-0042-new-dtypes.rst
@@ -267,7 +267,7 @@ information is currently provided and will be defined on the class:
   deprecated. This may be relaxed if a use-case arises.
 
 Additionally, existing methods (and C-side fields) will be provided.
-However, the fields ``kind`` and and ``char`` will be set to ``\0``
+However, the fields ``kind`` and ``char`` will be set to ``\0``
 (NULL character) on the C-side.
 While discouraged, except for NumPy builtin types, ``kind`` both will return
 the ``__qualname__`` of the object to ensure uniqueness for all DTypes.
@@ -307,7 +307,7 @@ is the ``np.datetime64`` scalar.
 
 A potential DType such as ``Categorical`` will not be required to have a clear type
 associated with it. Instead, the ``type`` may be ``object`` and the
-categoircal's values are arbitrary objects.
+categorical's values are arbitrary objects.
 Unlike with well-defined scalars, this ``type`` cannot
 not be used for the dtype discovery necessary for coercion
 (compare section `DType Discovery during Array Coercion`_).
@@ -659,7 +659,7 @@ should be "minutes".
 Common DType Operations
 ^^^^^^^^^^^^^^^^^^^^^^^
 
-Numpy currently provides functions like ``np.result_type`` and
+NumPy currently provides functions like ``np.result_type`` and
 ``np.promote_types`` for determining common types.
 These differ in that ``np.result_type`` can take arrays and scalars as input
 and implements value based promotion [1]_.
@@ -972,7 +972,7 @@ In general we could implement certain casts, such as ``int8`` to ``int24``
 even if the user only provides an ``int16 -> int24`` cast.
 This proposal currently does not provide this functionality.  However,
 it could be extended in the future to either find such casts dynamically,
-or at least allow ``adjust_descriptors`` to return arbitray ``dtypes``.
+or at least allow ``adjust_descriptors`` to return arbitrary ``dtypes``.
 If ``CastingImpl[Int8, Int24].adjust_descriptors((int8, int24))`` returns
 ``(int16, int24)``, the actual casting process could be extended to include
 the ``int8 -> int16`` cast.  Unlike the above example, which is limited
diff --git a/doc/release/upcoming_changes/17068.compatibility.rst b/doc/release/upcoming_changes/17068.compatibility.rst
new file mode 100644
index 000000000..7aa4e58ae
--- /dev/null
+++ b/doc/release/upcoming_changes/17068.compatibility.rst
@@ -0,0 +1,4 @@
+f2py generated code may return unicode instead of byte strings
+--------------------------------------------------------------
+Some byte strings previously returned by f2py generated code may now be unicode
+strings. This results from the ongoing Python2 -> Python3 cleanup.
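Callers that consume strings from f2py-generated code may want to accept both return types during the transition. A minimal sketch, assuming a hypothetical helper named `as_text` (not part of NumPy or f2py) and assuming the byte strings are ASCII-encoded:

```python
def as_text(value, encoding="ascii"):
    """Return ``value`` as ``str`` whether it arrives as bytes or str.

    ``as_text`` is a hypothetical helper for illustration only; the
    default encoding is an assumption, not something f2py guarantees.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Code written this way behaves the same before and after the change described in this note.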
diff --git a/doc/release/upcoming_changes/17116.expired.rst b/doc/release/upcoming_changes/17116.expired.rst
new file mode 100644
index 000000000..d8a3a43d5
--- /dev/null
+++ b/doc/release/upcoming_changes/17116.expired.rst
@@ -0,0 +1,2 @@
+* The 14-year deprecation of ``np.ctypeslib.ctypes_load_library`` has expired.
+  Use :func:`~numpy.ctypeslib.load_library` instead, which is identical.
diff --git a/doc/source/reference/routines.ctypeslib.rst b/doc/source/reference/routines.ctypeslib.rst
index 562638e9c..3a059f5d9 100644
--- a/doc/source/reference/routines.ctypeslib.rst
+++ b/doc/source/reference/routines.ctypeslib.rst
@@ -9,6 +9,5 @@ C-Types Foreign Function Interface (:mod:`numpy.ctypeslib`)
 .. autofunction:: as_array
 .. autofunction:: as_ctypes
 .. autofunction:: as_ctypes_type
-.. autofunction:: ctypes_load_library
 .. autofunction:: load_library
 .. autofunction:: ndpointer
diff --git a/numpy/__init__.py b/numpy/__init__.py
index c594928ce..41c3dc42d 100644
--- a/numpy/__init__.py
+++ b/numpy/__init__.py
@@ -215,12 +215,11 @@ else:
     del Arrayterator
 
     # These names were removed in NumPy 1.20.  For at least one release,
-    # attempts to access these names in the numpy namespace will have an
-    # error message that refers to NEP 32 and points to the numpy_financial
-    # library.
+    # attempts to access these names in the numpy namespace will trigger
+    # a warning, and calling the function will raise an exception.
     _financial_names = ['fv', 'ipmt', 'irr', 'mirr', 'nper', 'npv', 'pmt',
                         'ppmt', 'pv', 'rate']
-    __expired_attrs__ = {
+    __expired_functions__ = {
         name: (f'In accordance with NEP 32, the function {name} was removed '
                'from NumPy version 1.20.  A replacement for this function '
                'is available in the numpy_financial library: '
@@ -241,13 +240,19 @@ else:
         # module level getattr is only supported in 3.7 onwards
         # https://www.python.org/dev/peps/pep-0562/
         def __getattr__(attr):
-            # Raise AttributeError for expired attributes
+            # Warn for expired attributes, and return a dummy function
+            # that always raises an exception.
             try:
-                msg = __expired_attrs__[attr]
+                msg = __expired_functions__[attr]
             except KeyError:
                 pass
             else:
-                raise AttributeError(msg)
+                warnings.warn(msg, RuntimeWarning)
+
+                def _expired(*args, **kwds):
+                    raise RuntimeError(msg)
+
+                return _expired
 
             # Emit warnings for deprecated attributes
             try:
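The new behaviour in this hunk (warn on attribute access, raise on call) can be reproduced outside NumPy. The sketch below is an illustration under stated assumptions: `make_module`, the module name `fake_numpy`, and the message text are all hypothetical, and the stand-in module is built with `types.ModuleType` so that the PEP 562 module-level `__getattr__` hook can be exercised in a single file.

```python
import types
import warnings


def make_module():
    # Hypothetical stand-in module; not NumPy itself.
    mod = types.ModuleType("fake_numpy")
    expired = {"fv": "fv was removed; see the numpy_financial library"}

    def __getattr__(attr):
        try:
            msg = expired[attr]
        except KeyError:
            raise AttributeError(
                f"module {mod.__name__!r} has no attribute {attr!r}"
            ) from None
        # Accessing the name only warns ...
        warnings.warn(msg, RuntimeWarning, stacklevel=2)

        # ... while calling the returned placeholder raises.
        def _expired(*args, **kwds):
            raise RuntimeError(msg)

        return _expired

    # PEP 562: a function named __getattr__ in the module namespace is
    # used as the fallback for failed attribute lookups.
    mod.__getattr__ = __getattr__
    return mod
```

One consequence of this design is that `hasattr(module, name)` returns `True` for an expired name, because the lookup succeeds and only emits a warning.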
diff --git a/numpy/__init__.pyi b/numpy/__init__.pyi
index fad5e1774..c6cc94440 100644
--- a/numpy/__init__.pyi
+++ b/numpy/__init__.pyi
@@ -1237,3 +1237,114 @@ def amin(
     initial: _NumberLike = ...,
     where: _ArrayLikeBool = ...,
 ) -> Union[number, ndarray]: ...
+
+# TODO: `np.prod()`: For object arrays `initial` does not necessarily
+# have to be a numerical scalar.
+# The only requirement is that it is compatible
+# with the `.__mul__()` method(s) of the passed array's elements.
+
+# Note that the same situation holds for all wrappers around
+# `np.ufunc.reduce`, e.g. `np.sum()` (`.__add__()`).
+
+@overload
+def prod(
+    a: _Number,
+    axis: Optional[_ShapeLike] = ...,
+    dtype: DtypeLike = ...,
+    out: None = ...,
+    keepdims: bool = ...,
+    initial: _NumberLike = ...,
+    where: _ArrayLikeBool = ...,
+) -> _Number: ...
+@overload
+def prod(
+    a: ArrayLike,
+    axis: None = ...,
+    dtype: DtypeLike = ...,
+    out: None = ...,
+    keepdims: Literal[False] = ...,
+    initial: _NumberLike = ...,
+    where: _ArrayLikeBool = ...,
+) -> number: ...
+@overload
+def prod(
+    a: ArrayLike,
+    axis: Optional[_ShapeLike] = ...,
+    dtype: DtypeLike = ...,
+    out: Optional[ndarray] = ...,
+    keepdims: bool = ...,
+    initial: _NumberLike = ...,
+    where: _ArrayLikeBool = ...,
+) -> Union[number, ndarray]: ...
+def cumprod(
+    a: ArrayLike,
+    axis: Optional[int] = ...,
+    dtype: DtypeLike = ...,
+    out: Optional[ndarray] = ...,
+) -> ndarray: ...
+def ndim(a: ArrayLike) -> int: ...
+def size(a: ArrayLike, axis: Optional[int] = ...) -> int: ...
+@overload
+def around(
+    a: _Number, decimals: int = ..., out: Optional[ndarray] = ...
+) -> _Number: ...
+@overload
+def around(
+    a: _NumberLike, decimals: int = ..., out: Optional[ndarray] = ...
+) -> number: ...
+@overload
+def around(
+    a: ArrayLike, decimals: int = ..., out: Optional[ndarray] = ...
+) -> ndarray: ...
+@overload
+def mean(
+    a: ArrayLike,
+    axis: None = ...,
+    dtype: DtypeLike = ...,
+    out: None = ...,
+    keepdims: Literal[False] = ...,
+) -> number: ...
+@overload
+def mean(
+    a: ArrayLike,
+    axis: Optional[_ShapeLike] = ...,
+    dtype: DtypeLike = ...,
+    out: Optional[ndarray] = ...,
+    keepdims: bool = ...,
+) -> Union[number, ndarray]: ...
+@overload
+def std(
+    a: ArrayLike,
+    axis: None = ...,
+    dtype: DtypeLike = ...,
+    out: None = ...,
+    ddof: int = ...,
+    keepdims: Literal[False] = ...,
+) -> number: ...
+@overload
+def std(
+    a: ArrayLike,
+    axis: Optional[_ShapeLike] = ...,
+    dtype: DtypeLike = ...,
+    out: Optional[ndarray] = ...,
+    ddof: int = ...,
+    keepdims: bool = ...,
+) -> Union[number, ndarray]: ...
+@overload
+def var(
+    a: ArrayLike,
+    axis: None = ...,
+    dtype: DtypeLike = ...,
+    out: None = ...,
+    ddof: int = ...,
+    keepdims: Literal[False] = ...,
+) -> number: ...
+@overload
+def var(
+    a: ArrayLike,
+    axis: Optional[_ShapeLike] = ...,
+    dtype: DtypeLike = ...,
+    out: Optional[ndarray] = ...,
+    ddof: int = ...,
+    keepdims: bool = ...,
+) -> Union[number, ndarray]: ...
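The stubs above use `Literal[False]` on `keepdims` to let a type checker pick the scalar return type when no dimensions are kept. A self-contained sketch of the same pattern follows; `total` is a hypothetical function invented for illustration, it operates on plain lists rather than arrays, and it assumes Python 3.8 or later for `typing.Literal`.

```python
from typing import List, Literal, overload


@overload
def total(values: List[float], keepdims: Literal[False] = ...) -> float: ...
@overload
def total(values: List[float], keepdims: Literal[True]) -> List[float]: ...
def total(values, keepdims=False):
    # Runtime implementation; the overloads above only guide type checkers.
    result = float(sum(values))
    return [result] if keepdims else result
```

A checker that understands overloads would infer `float` for `total([1.0, 2.0])` and `List[float]` for `total([1.0, 2.0], keepdims=True)`.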
diff --git a/numpy/core/arrayprint.py b/numpy/core/arrayprint.py
index 5d9642ea8..ad1530419 100644
--- a/numpy/core/arrayprint.py
+++ b/numpy/core/arrayprint.py
@@ -1628,6 +1628,3 @@ def set_string_function(f, repr=True):
             return multiarray.set_string_function(_default_array_str, 0)
     else:
         return multiarray.set_string_function(f, repr)
-
-set_string_function(_default_array_str, False)
-set_string_function(_default_array_repr, True)
diff --git a/numpy/core/setup.py b/numpy/core/setup.py
index aede12080..a4a84397d 100644
--- a/numpy/core/setup.py
+++ b/numpy/core/setup.py
@@ -790,6 +790,8 @@ def configuration(parent_package='',top_path=None):
             join('src', 'multiarray', 'descriptor.h'),
             join('src', 'multiarray', 'dtypemeta.h'),
             join('src', 'multiarray', 'dragon4.h'),
+            join('src', 'multiarray', 'einsum_debug.h'),
+            join('src', 'multiarray', 'einsum_sumprod.h'),
             join('src', 'multiarray', 'getset.h'),
             join('src', 'multiarray', 'hashdescr.h'),
             join('src', 'multiarray', 'iterators.h'),
@@ -853,6 +855,7 @@ def configuration(parent_package='',top_path=None):
             join('src', 'multiarray', 'dragon4.c'),
             join('src', 'multiarray', 'dtype_transfer.c'),
             join('src', 'multiarray', 'einsum.c.src'),
+            join('src', 'multiarray', 'einsum_sumprod.c.src'),
             join('src', 'multiarray', 'flagsobject.c'),
             join('src', 'multiarray', 'getset.c'),
             join('src', 'multiarray', 'hashdescr.c'),
diff --git a/numpy/core/src/common/array_assign.c b/numpy/core/src/common/array_assign.c
index d626d1260..e365b49e4 100644
--- a/numpy/core/src/common/array_assign.c
+++ b/numpy/core/src/common/array_assign.c
@@ -67,12 +67,12 @@ broadcast_strides(int ndim, npy_intp const *shape,
 broadcast_error: {
         PyObject *errmsg;
 
-        errmsg = PyUString_FromFormat("could not broadcast %s from shape ",
+        errmsg = PyUnicode_FromFormat("could not broadcast %s from shape ",
                                 strides_name);
         PyUString_ConcatAndDel(&errmsg,
                 build_shape_string(strides_ndim, strides_shape));
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromString(" into shape "));
+                PyUnicode_FromString(" into shape "));
         PyUString_ConcatAndDel(&errmsg,
                 build_shape_string(ndim, shape));
         PyErr_SetObject(PyExc_ValueError, errmsg);
diff --git a/numpy/core/src/common/npy_cblas.h b/numpy/core/src/common/npy_cblas.h
index 97308238a..072993ec2 100644
--- a/numpy/core/src/common/npy_cblas.h
+++ b/numpy/core/src/common/npy_cblas.h
@@ -47,8 +47,10 @@ enum CBLAS_SIDE {CblasLeft=141, CblasRight=142};
 
 #ifdef HAVE_BLAS_ILP64
 #define CBLAS_INT npy_int64
+#define CBLAS_INT_MAX NPY_MAX_INT64
 #else
 #define CBLAS_INT int
+#define CBLAS_INT_MAX INT_MAX
 #endif
 
 #define BLASNAME(name) CBLAS_FUNC(name)
@@ -59,6 +61,39 @@ enum CBLAS_SIDE {CblasLeft=141, CblasRight=142};
 #undef BLASINT
 #undef BLASNAME
 
+
+/*
+ * Convert NumPy stride to BLAS stride. Returns 0 if conversion cannot be done
+ * (BLAS won't handle negative or zero strides the way we want).
+ */
+static NPY_INLINE CBLAS_INT
+blas_stride(npy_intp stride, unsigned itemsize)
+{
+    /*
+     * Should probably check pointer alignment also, but this may cause
+     * problems if we require complex to be 16 byte aligned.
+     */
+    if (stride > 0 && (stride % itemsize) == 0) {
+        stride /= itemsize;
+        if (stride <= CBLAS_INT_MAX) {
+            return stride;
+        }
+    }
+    return 0;
+}
+
+/*
+ * Define a chunksize for CBLAS.
+ *
+ * The chunksize is the greatest power of two less than CBLAS_INT_MAX.
+ */
+#if NPY_MAX_INTP > CBLAS_INT_MAX
+# define NPY_CBLAS_CHUNK  (CBLAS_INT_MAX / 2 + 1)
+#else
+# define NPY_CBLAS_CHUNK  NPY_MAX_INTP
+#endif
+
+
 #ifdef __cplusplus
 }
 #endif
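The stride conversion and chunk-size rule added to this header can be modelled in a few lines of Python to check the arithmetic. This is only an illustrative model of the C code above, not NumPy API: the Python names `blas_stride` and `cblas_chunk` are stand-ins, and `INT_MAX` is assumed to be `2**31 - 1` (a 32-bit `int`).

```python
INT_MAX = 2**31 - 1  # assumption: 32-bit C int


def blas_stride(stride, itemsize, cblas_int_max=INT_MAX):
    """Model of the C helper: byte stride -> element stride, or 0 if unusable."""
    if stride > 0 and stride % itemsize == 0:
        stride //= itemsize
        if stride <= cblas_int_max:
            return stride
    return 0


def cblas_chunk(cblas_int_max, max_intp):
    """Model of NPY_CBLAS_CHUNK for the given integer limits."""
    if max_intp > cblas_int_max:
        return cblas_int_max // 2 + 1
    return max_intp
```

With a 32-bit BLAS integer and a 64-bit `npy_intp`, the chunk size works out to `2**30`, consistent with the comment that it is the greatest power of two below `CBLAS_INT_MAX`.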
diff --git a/numpy/core/src/multiarray/_multiarray_tests.c.src b/numpy/core/src/multiarray/_multiarray_tests.c.src
index da631c830..9c1fa0bad 100644
--- a/numpy/core/src/multiarray/_multiarray_tests.c.src
+++ b/numpy/core/src/multiarray/_multiarray_tests.c.src
@@ -1902,7 +1902,7 @@ PrintFloat_Printf_g(PyObject *obj, int precision)
         PyOS_snprintf(str, sizeof(str), "%.*g", precision, val);
     }
 
-    return PyUString_FromString(str);
+    return PyUnicode_FromString(str);
 }
 
 
diff --git a/numpy/core/src/multiarray/arrayobject.c b/numpy/core/src/multiarray/arrayobject.c
index 95c650674..5da1b5f29 100644
--- a/numpy/core/src/multiarray/arrayobject.c
+++ b/numpy/core/src/multiarray/arrayobject.c
@@ -416,7 +416,7 @@ WARN_IN_DEALLOC(PyObject* warning, const char * msg) {
     if (PyErr_WarnEx(warning, msg, 1) < 0) {
         PyObject * s;
 
-        s = PyUString_FromString("array_dealloc");
+        s = PyUnicode_FromString("array_dealloc");
         if (s) {
             PyErr_WriteUnraisable(s);
             Py_DECREF(s);
diff --git a/numpy/core/src/multiarray/arraytypes.c.src b/numpy/core/src/multiarray/arraytypes.c.src
index 9508fb5ad..3fee587b9 100644
--- a/numpy/core/src/multiarray/arraytypes.c.src
+++ b/numpy/core/src/multiarray/arraytypes.c.src
@@ -865,7 +865,7 @@ VOID_setitem(PyObject *op, void *input, void *vap)
             npy_intp names_size = PyTuple_GET_SIZE(descr->names);
 
             if (names_size != PyTuple_Size(op)) {
-                errmsg = PyUString_FromFormat(
+                errmsg = PyUnicode_FromFormat(
                         "could not assign tuple of length %zd to structure "
                         "with %" NPY_INTP_FMT " fields.",
                         PyTuple_Size(op), names_size);
diff --git a/numpy/core/src/multiarray/buffer.c b/numpy/core/src/multiarray/buffer.c
index 25bb2d195..af40cdc2c 100644
--- a/numpy/core/src/multiarray/buffer.c
+++ b/numpy/core/src/multiarray/buffer.c
@@ -931,7 +931,7 @@ _descriptor_from_pep3118_format(char const *s)
     }
     *p = '\0';
 
-    str = PyUString_FromStringAndSize(buf, strlen(buf));
+    str = PyUnicode_FromStringAndSize(buf, strlen(buf));
     if (str == NULL) {
         free(buf);
         return NULL;
diff --git a/numpy/core/src/multiarray/common.c b/numpy/core/src/multiarray/common.c
index 3d3ac7709..5f8250fb7 100644
--- a/numpy/core/src/multiarray/common.c
+++ b/numpy/core/src/multiarray/common.c
@@ -133,7 +133,7 @@ NPY_NO_EXPORT PyArray_Descr *
 _array_typedescr_fromstr(char const *c_str)
 {
     PyArray_Descr *descr = NULL;
-    PyObject *stringobj = PyString_FromString(c_str);
+    PyObject *stringobj = PyBytes_FromString(c_str);
 
     if (stringobj == NULL) {
         return NULL;
@@ -264,10 +264,10 @@ convert_shape_to_string(npy_intp n, npy_intp const *vals, char *ending)
     for (i = 0; i < n && vals[i] < 0; i++);
 
     if (i == n) {
-        return PyUString_FromFormat("()%s", ending);
+        return PyUnicode_FromFormat("()%s", ending);
     }
     else {
-        ret = PyUString_FromFormat("(%" NPY_INTP_FMT, vals[i++]);
+        ret = PyUnicode_FromFormat("(%" NPY_INTP_FMT, vals[i++]);
         if (ret == NULL) {
             return NULL;
         }
@@ -275,10 +275,10 @@ convert_shape_to_string(npy_intp n, npy_intp const *vals, char *ending)
 
     for (; i < n; ++i) {
         if (vals[i] < 0) {
-            tmp = PyUString_FromString(",newaxis");
+            tmp = PyUnicode_FromString(",newaxis");
         }
         else {
-            tmp = PyUString_FromFormat(",%" NPY_INTP_FMT, vals[i]);
+            tmp = PyUnicode_FromFormat(",%" NPY_INTP_FMT, vals[i]);
         }
         if (tmp == NULL) {
             Py_DECREF(ret);
@@ -292,10 +292,10 @@ convert_shape_to_string(npy_intp n, npy_intp const *vals, char *ending)
     }
 
     if (i == 1) {
-        tmp = PyUString_FromFormat(",)%s", ending);
+        tmp = PyUnicode_FromFormat(",)%s", ending);
     }
     else {
-        tmp = PyUString_FromFormat(")%s", ending);
+        tmp = PyUnicode_FromFormat(")%s", ending);
     }
     PyUString_ConcatAndDel(&ret, tmp);
     return ret;
@@ -310,7 +310,7 @@ dot_alignment_error(PyArrayObject *a, int i, PyArrayObject *b, int j)
              *shape1 = NULL, *shape2 = NULL,
              *shape1_i = NULL, *shape2_j = NULL;
 
-    format = PyUString_FromString("shapes %s and %s not aligned:"
+    format = PyUnicode_FromString("shapes %s and %s not aligned:"
                                   " %d (dim %d) != %d (dim %d)");
 
     shape1 = convert_shape_to_string(PyArray_NDIM(a), PyArray_DIMS(a), "");
@@ -333,7 +333,7 @@ dot_alignment_error(PyArrayObject *a, int i, PyArrayObject *b, int j)
         goto end;
     }
 
-    errmsg = PyUString_Format(format, fmt_args);
+    errmsg = PyUnicode_Format(format, fmt_args);
     if (errmsg != NULL) {
         PyErr_SetObject(PyExc_ValueError, errmsg);
     }
diff --git a/numpy/core/src/multiarray/common.h b/numpy/core/src/multiarray/common.h
index 793cefaf8..4410825fa 100644
--- a/numpy/core/src/multiarray/common.h
+++ b/numpy/core/src/multiarray/common.h
@@ -292,43 +292,6 @@ npy_memchr(char * haystack, char needle,
     return p;
 }
 
-/*
- * Convert NumPy stride to BLAS stride. Returns 0 if conversion cannot be done
- * (BLAS won't handle negative or zero strides the way we want).
- */
-static NPY_INLINE int
-blas_stride(npy_intp stride, unsigned itemsize)
-{
-    /*
-     * Should probably check pointer alignment also, but this may cause
-     * problems if we require complex to be 16 byte aligned.
-     */
-    if (stride > 0 && npy_is_aligned((void *)stride, itemsize)) {
-        stride /= itemsize;
-#ifndef HAVE_BLAS_ILP64
-        if (stride <= INT_MAX) {
-#else
-        if (stride <= NPY_MAX_INT64) {
-#endif
-            return stride;
-        }
-    }
-    return 0;
-}
-
-/*
- * Define a chunksize for CBLAS. CBLAS counts in integers.
- */
-#if NPY_MAX_INTP > INT_MAX
-# ifndef HAVE_BLAS_ILP64
-#  define NPY_CBLAS_CHUNK  (INT_MAX / 2 + 1)
-# else
-#  define NPY_CBLAS_CHUNK  (NPY_MAX_INT64 / 2 + 1)
-# endif
-#else
-# define NPY_CBLAS_CHUNK  NPY_MAX_INTP
-#endif
-
 #include "ucsnarrow.h"
 
 /*
diff --git a/numpy/core/src/multiarray/convert.c b/numpy/core/src/multiarray/convert.c
index 41a10afdc..b68b9322d 100644
--- a/numpy/core/src/multiarray/convert.c
+++ b/numpy/core/src/multiarray/convert.c
@@ -248,13 +248,13 @@ PyArray_ToFile(PyArrayObject *self, FILE *fp, char *sep, char *format)
                     return -1;
                 }
                 PyTuple_SET_ITEM(tupobj,0,obj);
-                obj = PyUString_FromString((const char *)format);
+                obj = PyUnicode_FromString((const char *)format);
                 if (obj == NULL) {
                     Py_DECREF(tupobj);
                     Py_DECREF(it);
                     return -1;
                 }
-                strobj = PyUString_Format(obj, tupobj);
+                strobj = PyUnicode_Format(obj, tupobj);
                 Py_DECREF(obj);
                 Py_DECREF(tupobj);
                 if (strobj == NULL) {
diff --git a/numpy/core/src/multiarray/convert_datatype.c b/numpy/core/src/multiarray/convert_datatype.c
index bd038c53a..1f5845eb7 100644
--- a/numpy/core/src/multiarray/convert_datatype.c
+++ b/numpy/core/src/multiarray/convert_datatype.c
@@ -95,8 +95,11 @@ PyArray_GetCastFunc(PyArray_Descr *descr, int type_num)
             key = PyLong_FromLong(type_num);
             cobj = PyDict_GetItem(obj, key);
             Py_DECREF(key);
-            if (cobj && NpyCapsule_Check(cobj)) {
-                castfunc = NpyCapsule_AsVoidPtr(cobj);
+            if (cobj && PyCapsule_CheckExact(cobj)) {
+                castfunc = PyCapsule_GetPointer(cobj, NULL);
+                if (castfunc == NULL) {
+                    return NULL;
+                }
             }
         }
     }
diff --git a/numpy/core/src/multiarray/ctors.c b/numpy/core/src/multiarray/ctors.c
index 3ff397817..6add032bf 100644
--- a/numpy/core/src/multiarray/ctors.c
+++ b/numpy/core/src/multiarray/ctors.c
@@ -868,11 +868,14 @@ PyArray_NewFromDescr_int(
 
         func = PyObject_GetAttr((PyObject *)fa, npy_ma_str_array_finalize);
         if (func && func != Py_None) {
-            if (NpyCapsule_Check(func)) {
+            if (PyCapsule_CheckExact(func)) {
                 /* A C-function is stored here */
                 PyArray_FinalizeFunc *cfunc;
-                cfunc = NpyCapsule_AsVoidPtr(func);
+                cfunc = PyCapsule_GetPointer(func, NULL);
                 Py_DECREF(func);
+                if (cfunc == NULL) {
+                    goto fail;
+                }
                 if (cfunc((PyArrayObject *)fa, obj) < 0) {
                     goto fail;
                 }
@@ -1747,7 +1750,7 @@ PyArray_FromStructInterface(PyObject *input)
             return Py_NotImplemented;
         }
     }
-    if (!NpyCapsule_Check(attr)) {
+    if (!PyCapsule_CheckExact(attr)) {
         if (PyType_Check(input) && PyObject_HasAttrString(attr, "__get__")) {
             /*
              * If the input is a class `attr` should be a property-like object.
@@ -1759,7 +1762,10 @@ PyArray_FromStructInterface(PyObject *input)
         }
         goto fail;
     }
-    inter = NpyCapsule_AsVoidPtr(attr);
+    inter = PyCapsule_GetPointer(attr, NULL);
+    if (inter == NULL) {
+        goto fail;
+    }
     if (inter->two != 2) {
         goto fail;
     }
diff --git a/numpy/core/src/multiarray/datetime.c b/numpy/core/src/multiarray/datetime.c
index 3649bbe4c..348473309 100644
--- a/numpy/core/src/multiarray/datetime.c
+++ b/numpy/core/src/multiarray/datetime.c
@@ -1435,14 +1435,14 @@ raise_if_datetime64_metadata_cast_error(char *object_type,
     }
     else {
         PyObject *errmsg;
-        errmsg = PyUString_FromFormat("Cannot cast %s "
+        errmsg = PyUnicode_FromFormat("Cannot cast %s "
                     "from metadata ", object_type);
         errmsg = append_metastr_to_string(src_meta, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromString(" to "));
+                PyUnicode_FromString(" to "));
         errmsg = append_metastr_to_string(dst_meta, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromFormat(" according to the rule %s",
+                PyUnicode_FromFormat(" according to the rule %s",
                         npy_casting_to_string(casting)));
         PyErr_SetObject(PyExc_TypeError, errmsg);
         Py_DECREF(errmsg);
@@ -1467,14 +1467,14 @@ raise_if_timedelta64_metadata_cast_error(char *object_type,
     }
     else {
         PyObject *errmsg;
-        errmsg = PyUString_FromFormat("Cannot cast %s "
+        errmsg = PyUnicode_FromFormat("Cannot cast %s "
                     "from metadata ", object_type);
         errmsg = append_metastr_to_string(src_meta, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromString(" to "));
+                PyUnicode_FromString(" to "));
         errmsg = append_metastr_to_string(dst_meta, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromFormat(" according to the rule %s",
+                PyUnicode_FromFormat(" according to the rule %s",
                         npy_casting_to_string(casting)));
         PyErr_SetObject(PyExc_TypeError, errmsg);
         Py_DECREF(errmsg);
@@ -1601,15 +1601,15 @@ compute_datetime_metadata_greatest_common_divisor(
 
 incompatible_units: {
         PyObject *errmsg;
-        errmsg = PyUString_FromString("Cannot get "
+        errmsg = PyUnicode_FromString("Cannot get "
                     "a common metadata divisor for "
                     "NumPy datetime metadata ");
         errmsg = append_metastr_to_string(meta1, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromString(" and "));
+                PyUnicode_FromString(" and "));
         errmsg = append_metastr_to_string(meta2, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromString(" because they have "
+                PyUnicode_FromString(" because they have "
                     "incompatible nonlinear base time units"));
         PyErr_SetObject(PyExc_TypeError, errmsg);
         Py_DECREF(errmsg);
@@ -1617,12 +1617,12 @@ incompatible_units: {
     }
 units_overflow: {
         PyObject *errmsg;
-        errmsg = PyUString_FromString("Integer overflow "
+        errmsg = PyUnicode_FromString("Integer overflow "
                     "getting a common metadata divisor for "
                     "NumPy datetime metadata ");
         errmsg = append_metastr_to_string(meta1, 0, errmsg);
         PyUString_ConcatAndDel(&errmsg,
-                PyUString_FromString(" and "));
+                PyUnicode_FromString(" and "));
         errmsg = append_metastr_to_string(meta2, 0, errmsg);
         PyErr_SetObject(PyExc_OverflowError, errmsg);
         Py_DECREF(errmsg);
@@ -1747,7 +1747,7 @@ convert_datetime_metadata_to_tuple(PyArray_DatetimeMetaData *meta)
     }
 
     PyTuple_SET_ITEM(dt_tuple, 0,
-            PyUString_FromString(_datetime_strings[meta->base]));
+            PyUnicode_FromString(_datetime_strings[meta->base]));
     PyTuple_SET_ITEM(dt_tuple, 1,
             PyLong_FromLong(meta->num));
 
@@ -1771,7 +1771,7 @@ convert_datetime_metadata_tuple_to_datetime_metadata(PyObject *tuple,
 
     if (!PyTuple_Check(tuple)) {
         PyObject *errmsg;
-        errmsg = PyUString_FromString("Require tuple for tuple to NumPy "
+        errmsg = PyUnicode_FromString("Require tuple for tuple to NumPy "
                                       "datetime metadata conversion, not ");
         PyUString_ConcatAndDel(&errmsg, PyObject_Repr(tuple));
         PyErr_SetObject(PyExc_TypeError, errmsg);
@@ -1973,7 +1973,7 @@ append_metastr_to_string(PyArray_DatetimeMetaData *meta,
     if (meta->base == NPY_FR_GENERIC) {
         /* Without brackets, give a string "generic" */
         if (skip_brackets) {
-            PyUString_ConcatAndDel(&ret, PyUString_FromString("generic"));
+            PyUString_ConcatAndDel(&ret, PyUnicode_FromString("generic"));
             return ret;
         }
         /* But with brackets, append nothing */
@@ -1994,18 +1994,18 @@ append_metastr_to_string(PyArray_DatetimeMetaData *meta,
 
     if (num == 1) {
         if (skip_brackets) {
-            res = PyUString_FromFormat("%s", basestr);
+            res = PyUnicode_FromFormat("%s", basestr);
         }
         else {
-            res = PyUString_FromFormat("[%s]", basestr);
+            res = PyUnicode_FromFormat("[%s]", basestr);
         }
     }
     else {
         if (skip_brackets) {
-            res = PyUString_FromFormat("%d%s", num, basestr);
+            res = PyUnicode_FromFormat("%d%s", num, basestr);
         }
         else {
-            res = PyUString_FromFormat("[%d%s]", num, basestr);
+            res = PyUnicode_FromFormat("[%d%s]", num, basestr);
         }
     }
 
diff --git a/numpy/core/src/multiarray/descriptor.c b/numpy/core/src/multiarray/descriptor.c
index f47f0ce06..95597b812 100644
--- a/numpy/core/src/multiarray/descriptor.c
+++ b/numpy/core/src/multiarray/descriptor.c
@@ -472,7 +472,7 @@ _convert_from_array_descr(PyObject *obj, int align)
         if (PyUnicode_GetLength(name) == 0) {
             Py_DECREF(name);
             if (title == NULL) {
-                name = PyUString_FromFormat("f%d", i);
+                name = PyUnicode_FromFormat("f%d", i);
                 if (name == NULL) {
                     goto fail;
                 }
@@ -673,7 +673,7 @@ _convert_from_list(PyObject *obj, int align)
         }
         PyTuple_SET_ITEM(tup, 0, (PyObject *)conv);
         PyTuple_SET_ITEM(tup, 1, size_obj);
-        PyObject *key = PyUString_FromFormat("f%d", i);
+        PyObject *key = PyUnicode_FromFormat("f%d", i);
         if (!key) {
             Py_DECREF(tup);
             goto fail;
@@ -1887,10 +1887,10 @@ arraydescr_protocol_typestr_get(PyArray_Descr *self)
         size >>= 2;
     }
     if (self->type_num == NPY_OBJECT) {
-        ret = PyUString_FromFormat("%c%c", endian, basic_);
+        ret = PyUnicode_FromFormat("%c%c", endian, basic_);
     }
     else {
-        ret = PyUString_FromFormat("%c%c%d", endian, basic_, size);
+        ret = PyUnicode_FromFormat("%c%c%d", endian, basic_, size);
     }
     if (PyDataType_ISDATETIME(self)) {
         PyArray_DatetimeMetaData *meta;
@@ -1974,7 +1974,7 @@ arraydescr_protocol_descr_get(PyArray_Descr *self)
         if (dobj == NULL) {
             return NULL;
         }
-        PyTuple_SET_ITEM(dobj, 0, PyUString_FromString(""));
+        PyTuple_SET_ITEM(dobj, 0, PyUnicode_FromString(""));
         PyTuple_SET_ITEM(dobj, 1, arraydescr_protocol_typestr_get(self));
         res = PyList_New(1);
         if (res == NULL) {
@@ -2450,7 +2450,7 @@ arraydescr_reduce(PyArray_Descr *self, PyObject *NPY_UNUSED(args))
         if (self->type_num == NPY_UNICODE) {
             elsize >>= 2;
         }
-        obj = PyUString_FromFormat("%c%d",self->kind, elsize);
+        obj = PyUnicode_FromFormat("%c%d",self->kind, elsize);
     }
     PyTuple_SET_ITEM(ret, 1, Py_BuildValue("(NOO)", obj, Py_False, Py_True));
 
@@ -2492,7 +2492,7 @@ arraydescr_reduce(PyArray_Descr *self, PyObject *NPY_UNUSED(args))
         PyTuple_SET_ITEM(state, 0, PyLong_FromLong(3));
     }
 
-    PyTuple_SET_ITEM(state, 1, PyUString_FromFormat("%c", endian));
+    PyTuple_SET_ITEM(state, 1, PyUnicode_FromFormat("%c", endian));
     PyTuple_SET_ITEM(state, 2, arraydescr_subdescr_get(self));
     if (PyDataType_HASFIELDS(self)) {
         Py_INCREF(self->names);
@@ -2894,7 +2894,7 @@ arraydescr_setstate(PyArray_Descr *self, PyObject *args)
         PyArray_DatetimeMetaData temp_dt_data;
 
         if ((! PyTuple_Check(metadata)) || (PyTuple_Size(metadata) != 2)) {
-            errmsg = PyUString_FromString("Invalid datetime dtype (metadata, c_metadata): ");
+            errmsg = PyUnicode_FromString("Invalid datetime dtype (metadata, c_metadata): ");
             PyUString_ConcatAndDel(&errmsg, PyObject_Repr(metadata));
             PyErr_SetObject(PyExc_ValueError, errmsg);
             Py_DECREF(errmsg);
@@ -3393,7 +3393,7 @@ arraydescr_field_subset_view(PyArray_Descr *self, PyObject *ind)
         /* disallow duplicate field indices */
         if (PyDict_Contains(fields, name)) {
             PyObject *msg = NULL;
-            PyObject *fmt = PyUString_FromString(
+            PyObject *fmt = PyUnicode_FromString(
                                    "duplicate field of name {!r}");
             if (fmt != NULL) {
                 msg = PyObject_CallMethod(fmt, "format", "O", name);
diff --git a/numpy/core/src/multiarray/dragon4.c b/numpy/core/src/multiarray/dragon4.c
index 553d0effb..a7b252a77 100644
--- a/numpy/core/src/multiarray/dragon4.c
+++ b/numpy/core/src/multiarray/dragon4.c
@@ -3093,7 +3093,7 @@ Dragon4_Positional_##Type##_opt(npy_type *val, Dragon4_Options *opt)\
         free_dragon4_bigint_scratch(scratch);\
         return NULL;\
     }\
-    ret = PyUString_FromString(scratch->repr);\
+    ret = PyUnicode_FromString(scratch->repr);\
     free_dragon4_bigint_scratch(scratch);\
     return ret;\
 }\
@@ -3130,7 +3130,7 @@ Dragon4_Scientific_##Type##_opt(npy_type *val, Dragon4_Options *opt)\
         free_dragon4_bigint_scratch(scratch);\
         return NULL;\
     }\
-    ret = PyUString_FromString(scratch->repr);\
+    ret = PyUnicode_FromString(scratch->repr);\
     free_dragon4_bigint_scratch(scratch);\
     return ret;\
 }\
diff --git a/numpy/core/src/multiarray/einsum.c.src b/numpy/core/src/multiarray/einsum.c.src
index 2538e05c6..6ad375f67 100644
--- a/numpy/core/src/multiarray/einsum.c.src
+++ b/numpy/core/src/multiarray/einsum.c.src
@@ -16,7 +16,6 @@
 #define _MULTIARRAYMODULE
 #include <numpy/npy_common.h>
 #include <numpy/arrayobject.h>
-#include <numpy/halffloat.h>
 #include <npy_pycompat.h>
 
 #include <ctype.h>
@@ -25,1898 +24,8 @@
 #include "common.h"
 #include "ctors.h"
 
-#ifdef NPY_HAVE_SSE_INTRINSICS
-#define EINSUM_USE_SSE1 1
-#else
-#define EINSUM_USE_SSE1 0
-#endif
-
-#ifdef NPY_HAVE_SSE2_INTRINSICS
-#define EINSUM_USE_SSE2 1
-#else
-#define EINSUM_USE_SSE2 0
-#endif
-
-#if EINSUM_USE_SSE1
-#include <xmmintrin.h>
-#endif
-
-#if EINSUM_USE_SSE2
-#include <emmintrin.h>
-#endif
-
-#define EINSUM_IS_SSE_ALIGNED(x) ((((npy_intp)x)&0xf) == 0)
-
-/********** PRINTF DEBUG TRACING **************/
-#define NPY_EINSUM_DBG_TRACING 0
-
-#if NPY_EINSUM_DBG_TRACING
-#define NPY_EINSUM_DBG_PRINT(s) printf("%s", s);
-#define NPY_EINSUM_DBG_PRINT1(s, p1) printf(s, p1);
-#define NPY_EINSUM_DBG_PRINT2(s, p1, p2) printf(s, p1, p2);
-#define NPY_EINSUM_DBG_PRINT3(s, p1, p2, p3) printf(s);
-#else
-#define NPY_EINSUM_DBG_PRINT(s)
-#define NPY_EINSUM_DBG_PRINT1(s, p1)
-#define NPY_EINSUM_DBG_PRINT2(s, p1, p2)
-#define NPY_EINSUM_DBG_PRINT3(s, p1, p2, p3)
-#endif
-/**********************************************/
-
-/**begin repeat
- * #name = byte, short, int, long, longlong,
- *         ubyte, ushort, uint, ulong, ulonglong,
- *         half, float, double, longdouble,
- *         cfloat, cdouble, clongdouble#
- * #type = npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- *         npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
- *         npy_half, npy_float, npy_double, npy_longdouble,
- *         npy_cfloat, npy_cdouble, npy_clongdouble#
- * #temptype = npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- *             npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
- *             npy_float, npy_float, npy_double, npy_longdouble,
- *             npy_float, npy_double, npy_longdouble#
- * #to = ,,,,,
- *       ,,,,,
- *       npy_float_to_half,,,,
- *       ,,#
- * #from = ,,,,,
- *         ,,,,,
- *         npy_half_to_float,,,,
- *         ,,#
- * #complex = 0*5,
- *            0*5,
- *            0*4,
- *            1*3#
- * #float32 = 0*5,
- *            0*5,
- *            0,1,0,0,
- *            0*3#
- * #float64 = 0*5,
- *            0*5,
- *            0,0,1,0,
- *            0*3#
- */
-
-/**begin repeat1
- * #nop = 1, 2, 3, 1000#
- * #noplabel = one, two, three, any#
- */
-static void
-@name@_sum_of_products_@noplabel@(int nop, char **dataptr,
-                                npy_intp const *strides, npy_intp count)
-{
-#if (@nop@ == 1) || (@nop@ <= 3 && !@complex@)
-    char *data0 = dataptr[0];
-    npy_intp stride0 = strides[0];
-#endif
-#if (@nop@ == 2 || @nop@ == 3) && !@complex@
-    char *data1 = dataptr[1];
-    npy_intp stride1 = strides[1];
-#endif
-#if (@nop@ == 3) && !@complex@
-    char *data2 = dataptr[2];
-    npy_intp stride2 = strides[2];
-#endif
-#if (@nop@ == 1) || (@nop@ <= 3 && !@complex@)
-    char *data_out = dataptr[@nop@];
-    npy_intp stride_out = strides[@nop@];
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_@noplabel@ (%d)\n", (int)count);
-
-    while (count--) {
-#if !@complex@
-#  if @nop@ == 1
-        *(@type@ *)data_out = @to@(@from@(*(@type@ *)data0) +
-                                         @from@(*(@type@ *)data_out));
-        data0 += stride0;
-        data_out += stride_out;
-#  elif @nop@ == 2
-        *(@type@ *)data_out = @to@(@from@(*(@type@ *)data0) *
-                                         @from@(*(@type@ *)data1) +
-                                         @from@(*(@type@ *)data_out));
-        data0 += stride0;
-        data1 += stride1;
-        data_out += stride_out;
-#  elif @nop@ == 3
-        *(@type@ *)data_out = @to@(@from@(*(@type@ *)data0) *
-                                         @from@(*(@type@ *)data1) *
-                                         @from@(*(@type@ *)data2) +
-                                         @from@(*(@type@ *)data_out));
-        data0 += stride0;
-        data1 += stride1;
-        data2 += stride2;
-        data_out += stride_out;
-#  else
-        @temptype@ temp = @from@(*(@type@ *)dataptr[0]);
-        int i;
-        for (i = 1; i < nop; ++i) {
-            temp *= @from@(*(@type@ *)dataptr[i]);
-        }
-        *(@type@ *)dataptr[nop] = @to@(temp +
-                                           @from@(*(@type@ *)dataptr[i]));
-        for (i = 0; i <= nop; ++i) {
-            dataptr[i] += strides[i];
-        }
-#  endif
-#else /* complex */
-#  if @nop@ == 1
-        ((@temptype@ *)data_out)[0] = ((@temptype@ *)data0)[0] +
-                                         ((@temptype@ *)data_out)[0];
-        ((@temptype@ *)data_out)[1] = ((@temptype@ *)data0)[1] +
-                                         ((@temptype@ *)data_out)[1];
-        data0 += stride0;
-        data_out += stride_out;
-#  else
-#    if @nop@ <= 3
-#define _SUMPROD_NOP @nop@
-#    else
-#define _SUMPROD_NOP nop
-#    endif
-        @temptype@ re, im, tmp;
-        int i;
-        re = ((@temptype@ *)dataptr[0])[0];
-        im = ((@temptype@ *)dataptr[0])[1];
-        for (i = 1; i < _SUMPROD_NOP; ++i) {
-            tmp = re * ((@temptype@ *)dataptr[i])[0] -
-                  im * ((@temptype@ *)dataptr[i])[1];
-            im = re * ((@temptype@ *)dataptr[i])[1] +
-                 im * ((@temptype@ *)dataptr[i])[0];
-            re = tmp;
-        }
-        ((@temptype@ *)dataptr[_SUMPROD_NOP])[0] = re +
-                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[0];
-        ((@temptype@ *)dataptr[_SUMPROD_NOP])[1] = im +
-                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[1];
-
-        for (i = 0; i <= _SUMPROD_NOP; ++i) {
-            dataptr[i] += strides[i];
-        }
-#undef _SUMPROD_NOP
-#  endif
-#endif
-    }
-}
-
-#if @nop@ == 1
-
-static void
-@name@_sum_of_products_contig_one(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @type@ *data0 = (@type@ *)dataptr[0];
-    @type@ *data_out = (@type@ *)dataptr[1];
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_one (%d)\n",
-                                                            (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-#if !@complex@
-            data_out[@i@] = @to@(@from@(data0[@i@]) +
-                                 @from@(data_out[@i@]));
-#else
-            ((@temptype@ *)data_out + 2*@i@)[0] =
-                                    ((@temptype@ *)data0 + 2*@i@)[0] +
-                                    ((@temptype@ *)data_out + 2*@i@)[0];
-            ((@temptype@ *)data_out + 2*@i@)[1] =
-                                    ((@temptype@ *)data0 + 2*@i@)[1] +
-                                    ((@temptype@ *)data_out + 2*@i@)[1];
-#endif
-/**end repeat2**/
-        case 0:
-            return;
-    }
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-#if !@complex@
-        data_out[@i@] = @to@(@from@(data0[@i@]) +
-                             @from@(data_out[@i@]));
-#else /* complex */
-        ((@temptype@ *)data_out + 2*@i@)[0] =
-                                ((@temptype@ *)data0 + 2*@i@)[0] +
-                                ((@temptype@ *)data_out + 2*@i@)[0];
-        ((@temptype@ *)data_out + 2*@i@)[1] =
-                                ((@temptype@ *)data0 + 2*@i@)[1] +
-                                ((@temptype@ *)data_out + 2*@i@)[1];
-#endif
-/**end repeat2**/
-        data0 += 8;
-        data_out += 8;
-    }
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-#elif @nop@ == 2 && !@complex@
-
-static void
-@name@_sum_of_products_contig_two(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @type@ *data0 = (@type@ *)dataptr[0];
-    @type@ *data1 = (@type@ *)dataptr[1];
-    @type@ *data_out = (@type@ *)dataptr[2];
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, b;
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, b;
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_two (%d)\n",
-                                                            (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-            data_out[@i@] = @to@(@from@(data0[@i@]) *
-                                 @from@(data1[@i@]) +
-                                 @from@(data_out[@i@]));
-/**end repeat2**/
-        case 0:
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
-        EINSUM_IS_SSE_ALIGNED(data_out)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            a = _mm_mul_ps(_mm_load_ps(data0+@i@), _mm_load_ps(data1+@i@));
-            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
-            _mm_store_ps(data_out+@i@, b);
-/**end repeat2**/
-            data0 += 8;
-            data1 += 8;
-            data_out += 8;
-        }
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
-        EINSUM_IS_SSE_ALIGNED(data_out)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            a = _mm_mul_pd(_mm_load_pd(data0+@i@), _mm_load_pd(data1+@i@));
-            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
-            _mm_store_pd(data_out+@i@, b);
-/**end repeat2**/
-            data0 += 8;
-            data1 += 8;
-            data_out += 8;
-        }
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-/**begin repeat2
- * #i = 0, 4#
- */
-        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), _mm_loadu_ps(data1+@i@));
-        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
-        _mm_storeu_ps(data_out+@i@, b);
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), _mm_loadu_pd(data1+@i@));
-        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
-        _mm_storeu_pd(data_out+@i@, b);
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        data_out[@i@] = @to@(@from@(data0[@i@]) *
-                             @from@(data1[@i@]) +
-                             @from@(data_out[@i@]));
-/**end repeat2**/
-#endif
-        data0 += 8;
-        data1 += 8;
-        data_out += 8;
-    }
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-/* Some extra specializations for the two operand case */
-static void
-@name@_sum_of_products_stride0_contig_outcontig_two(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @temptype@ value0 = @from@(*(@type@ *)dataptr[0]);
-    @type@ *data1 = (@type@ *)dataptr[1];
-    @type@ *data_out = (@type@ *)dataptr[2];
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, b, value0_sse;
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, b, value0_sse;
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_stride0_contig_outcontig_two (%d)\n",
-                                                    (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-            data_out[@i@] = @to@(value0 *
-                                 @from@(data1[@i@]) +
-                                 @from@(data_out[@i@]));
-/**end repeat2**/
-        case 0:
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    value0_sse = _mm_set_ps1(value0);
-
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data1) && EINSUM_IS_SSE_ALIGNED(data_out)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            a = _mm_mul_ps(value0_sse, _mm_load_ps(data1+@i@));
-            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
-            _mm_store_ps(data_out+@i@, b);
-/**end repeat2**/
-            data1 += 8;
-            data_out += 8;
-        }
-
-        /* Finish off the loop */
-        if (count > 0) {
-            goto finish_after_unrolled_loop;
-        }
-        else {
-            return;
-        }
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    value0_sse = _mm_set1_pd(value0);
-
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data1) && EINSUM_IS_SSE_ALIGNED(data_out)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            a = _mm_mul_pd(value0_sse, _mm_load_pd(data1+@i@));
-            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
-            _mm_store_pd(data_out+@i@, b);
-/**end repeat2**/
-            data1 += 8;
-            data_out += 8;
-        }
-
-        /* Finish off the loop */
-        if (count > 0) {
-            goto finish_after_unrolled_loop;
-        }
-        else {
-            return;
-        }
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-/**begin repeat2
- * #i = 0, 4#
- */
-        a = _mm_mul_ps(value0_sse, _mm_loadu_ps(data1+@i@));
-        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
-        _mm_storeu_ps(data_out+@i@, b);
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        a = _mm_mul_pd(value0_sse, _mm_loadu_pd(data1+@i@));
-        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
-        _mm_storeu_pd(data_out+@i@, b);
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        data_out[@i@] = @to@(value0 *
-                             @from@(data1[@i@]) +
-                             @from@(data_out[@i@]));
-/**end repeat2**/
-#endif
-        data1 += 8;
-        data_out += 8;
-    }
-
-    /* Finish off the loop */
-    if (count > 0) {
-        goto finish_after_unrolled_loop;
-    }
-}
-
-static void
-@name@_sum_of_products_contig_stride0_outcontig_two(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @type@ *data0 = (@type@ *)dataptr[0];
-    @temptype@ value1 = @from@(*(@type@ *)dataptr[1]);
-    @type@ *data_out = (@type@ *)dataptr[2];
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, b, value1_sse;
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, b, value1_sse;
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_stride0_outcontig_two (%d)\n",
-                                                    (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-            data_out[@i@] = @to@(@from@(data0[@i@])*
-                                 value1  +
-                                 @from@(data_out[@i@]));
-/**end repeat2**/
-        case 0:
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    value1_sse = _mm_set_ps1(value1);
-
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data_out)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            a = _mm_mul_ps(_mm_load_ps(data0+@i@), value1_sse);
-            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
-            _mm_store_ps(data_out+@i@, b);
-/**end repeat2**/
-            data0 += 8;
-            data_out += 8;
-        }
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    value1_sse = _mm_set1_pd(value1);
-
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data_out)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            a = _mm_mul_pd(_mm_load_pd(data0+@i@), value1_sse);
-            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
-            _mm_store_pd(data_out+@i@, b);
-/**end repeat2**/
-            data0 += 8;
-            data_out += 8;
-        }
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-/**begin repeat2
- * #i = 0, 4#
- */
-        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), value1_sse);
-        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
-        _mm_storeu_ps(data_out+@i@, b);
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), value1_sse);
-        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
-        _mm_storeu_pd(data_out+@i@, b);
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        data_out[@i@] = @to@(@from@(data0[@i@])*
-                             value1  +
-                             @from@(data_out[@i@]));
-/**end repeat2**/
-#endif
-        data0 += 8;
-        data_out += 8;
-    }
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-static void
-@name@_sum_of_products_contig_contig_outstride0_two(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @type@ *data0 = (@type@ *)dataptr[0];
-    @type@ *data1 = (@type@ *)dataptr[1];
-    @temptype@ accum = 0;
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, accum_sse = _mm_setzero_ps();
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, accum_sse = _mm_setzero_pd();
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_contig_outstride0_two (%d)\n",
-                                                    (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-            accum += @from@(data0[@i@]) * @from@(data1[@i@]);
-/**end repeat2**/
-        case 0:
-            *(@type@ *)dataptr[2] = @to@(@from@(*(@type@ *)dataptr[2]) + accum);
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-            _mm_prefetch(data0 + 512, _MM_HINT_T0);
-            _mm_prefetch(data1 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            a = _mm_mul_ps(_mm_load_ps(data0+@i@), _mm_load_ps(data1+@i@));
-            accum_sse = _mm_add_ps(accum_sse, a);
-/**end repeat2**/
-            data0 += 8;
-            data1 += 8;
-        }
-
-        /* Add the four SSE values and put in accum */
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        _mm_store_ss(&accum, accum_sse);
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-            _mm_prefetch(data0 + 512, _MM_HINT_T0);
-            _mm_prefetch(data1 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            a = _mm_mul_pd(_mm_load_pd(data0+@i@), _mm_load_pd(data1+@i@));
-            accum_sse = _mm_add_pd(accum_sse, a);
-/**end repeat2**/
-            data0 += 8;
-            data1 += 8;
-        }
-
-        /* Add the two SSE2 values and put in accum */
-        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-        accum_sse = _mm_add_pd(a, accum_sse);
-        _mm_store_sd(&accum, accum_sse);
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-        _mm_prefetch(data0 + 512, _MM_HINT_T0);
-        _mm_prefetch(data1 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 4#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), _mm_loadu_ps(data1+@i@));
-        accum_sse = _mm_add_ps(accum_sse, a);
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-        _mm_prefetch(data0 + 512, _MM_HINT_T0);
-        _mm_prefetch(data1 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), _mm_loadu_pd(data1+@i@));
-        accum_sse = _mm_add_pd(accum_sse, a);
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        accum += @from@(data0[@i@]) * @from@(data1[@i@]);
-/**end repeat2**/
-#endif
-        data0 += 8;
-        data1 += 8;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Add the four SSE values and put in accum */
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    _mm_store_ss(&accum, accum_sse);
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Add the two SSE2 values and put in accum */
-    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-    accum_sse = _mm_add_pd(a, accum_sse);
-    _mm_store_sd(&accum, accum_sse);
-#endif
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-static void
-@name@_sum_of_products_stride0_contig_outstride0_two(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @temptype@ value0 = @from@(*(@type@ *)dataptr[0]);
-    @type@ *data1 = (@type@ *)dataptr[1];
-    @temptype@ accum = 0;
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, accum_sse = _mm_setzero_ps();
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, accum_sse = _mm_setzero_pd();
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_stride0_contig_outstride0_two (%d)\n",
-                                                    (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-            accum += @from@(data1[@i@]);
-/**end repeat2**/
-        case 0:
-            *(@type@ *)dataptr[2] = @to@(@from@(*(@type@ *)dataptr[2]) + value0 * accum);
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data1)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            accum_sse = _mm_add_ps(accum_sse, _mm_load_ps(data1+@i@));
-/**end repeat2**/
-            data1 += 8;
-        }
-        /* Add the four SSE values and put in accum */
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        _mm_store_ss(&accum, accum_sse);
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data1)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            accum_sse = _mm_add_pd(accum_sse, _mm_load_pd(data1+@i@));
-/**end repeat2**/
-            data1 += 8;
-        }
-        /* Add the two SSE2 values and put in accum */
-        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-        accum_sse = _mm_add_pd(a, accum_sse);
-        _mm_store_sd(&accum, accum_sse);
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-/**begin repeat2
- * #i = 0, 4#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        accum_sse = _mm_add_ps(accum_sse, _mm_loadu_ps(data1+@i@));
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        accum_sse = _mm_add_pd(accum_sse, _mm_loadu_pd(data1+@i@));
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        accum += @from@(data1[@i@]);
-/**end repeat2**/
-#endif
-        data1 += 8;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Add the four SSE values and put in accum */
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    _mm_store_ss(&accum, accum_sse);
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Add the two SSE2 values and put in accum */
-    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-    accum_sse = _mm_add_pd(a, accum_sse);
-    _mm_store_sd(&accum, accum_sse);
-#endif
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-static void
-@name@_sum_of_products_contig_stride0_outstride0_two(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @type@ *data0 = (@type@ *)dataptr[0];
-    @temptype@ value1 = @from@(*(@type@ *)dataptr[1]);
-    @temptype@ accum = 0;
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, accum_sse = _mm_setzero_ps();
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, accum_sse = _mm_setzero_pd();
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_stride0_outstride0_two (%d)\n",
-                                                    (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-            accum += @from@(data0[@i@]);
-/**end repeat2**/
-        case 0:
-            *(@type@ *)dataptr[2] = @to@(@from@(*(@type@ *)dataptr[2]) + accum * value1);
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            accum_sse = _mm_add_ps(accum_sse, _mm_load_ps(data0+@i@));
-/**end repeat2**/
-            data0 += 8;
-        }
-        /* Add the four SSE values and put in accum */
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        _mm_store_ss(&accum, accum_sse);
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            accum_sse = _mm_add_pd(accum_sse, _mm_load_pd(data0+@i@));
-/**end repeat2**/
-            data0 += 8;
-        }
-        /* Add the two SSE2 values and put in accum */
-        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-        accum_sse = _mm_add_pd(a, accum_sse);
-        _mm_store_sd(&accum, accum_sse);
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-/**begin repeat2
- * #i = 0, 4#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        accum_sse = _mm_add_ps(accum_sse, _mm_loadu_ps(data0+@i@));
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        accum_sse = _mm_add_pd(accum_sse, _mm_loadu_pd(data0+@i@));
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        accum += @from@(data0[@i@]);
-/**end repeat2**/
-#endif
-        data0 += 8;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Add the four SSE values and put in accum */
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    _mm_store_ss(&accum, accum_sse);
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Add the two SSE2 values and put in accum */
-    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-    accum_sse = _mm_add_pd(a, accum_sse);
-    _mm_store_sd(&accum, accum_sse);
-#endif
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-#elif @nop@ == 3 && !@complex@
-
-static void
-@name@_sum_of_products_contig_three(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    @type@ *data0 = (@type@ *)dataptr[0];
-    @type@ *data1 = (@type@ *)dataptr[1];
-    @type@ *data2 = (@type@ *)dataptr[2];
-    @type@ *data_out = (@type@ *)dataptr[3];
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        data_out[@i@] = @to@(@from@(data0[@i@]) *
-                             @from@(data1[@i@]) *
-                             @from@(data2[@i@]) +
-                             @from@(data_out[@i@]));
-/**end repeat2**/
-        data0 += 8;
-        data1 += 8;
-        data2 += 8;
-        data_out += 8;
-    }
-
-    /* Finish off the loop */
-
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-    if (count-- == 0) {
-        return;
-    }
-    data_out[@i@] = @to@(@from@(data0[@i@]) *
-                         @from@(data1[@i@]) *
-                         @from@(data2[@i@]) +
-                         @from@(data_out[@i@]));
-/**end repeat2**/
-}
-
-#else /* @nop@ > 3 || @complex */
-
-static void
-@name@_sum_of_products_contig_@noplabel@(int nop, char **dataptr,
-                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
-{
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_@noplabel@ (%d)\n",
-                                                    (int)count);
-
-    while (count--) {
-#if !@complex@
-        @temptype@ temp = @from@(*(@type@ *)dataptr[0]);
-        int i;
-        for (i = 1; i < nop; ++i) {
-            temp *= @from@(*(@type@ *)dataptr[i]);
-        }
-        *(@type@ *)dataptr[nop] = @to@(temp +
-                                           @from@(*(@type@ *)dataptr[i]));
-        for (i = 0; i <= nop; ++i) {
-            dataptr[i] += sizeof(@type@);
-        }
-#else /* complex */
-#  if @nop@ <= 3
-#    define _SUMPROD_NOP @nop@
-#  else
-#    define _SUMPROD_NOP nop
-#  endif
-        @temptype@ re, im, tmp;
-        int i;
-        re = ((@temptype@ *)dataptr[0])[0];
-        im = ((@temptype@ *)dataptr[0])[1];
-        for (i = 1; i < _SUMPROD_NOP; ++i) {
-            tmp = re * ((@temptype@ *)dataptr[i])[0] -
-                  im * ((@temptype@ *)dataptr[i])[1];
-            im = re * ((@temptype@ *)dataptr[i])[1] +
-                 im * ((@temptype@ *)dataptr[i])[0];
-            re = tmp;
-        }
-        ((@temptype@ *)dataptr[_SUMPROD_NOP])[0] = re +
-                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[0];
-        ((@temptype@ *)dataptr[_SUMPROD_NOP])[1] = im +
-                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[1];
-
-        for (i = 0; i <= _SUMPROD_NOP; ++i) {
-            dataptr[i] += sizeof(@type@);
-        }
-#  undef _SUMPROD_NOP
-#endif
-    }
-}
-
-#endif /* functions for various @nop@ */
-
-#if @nop@ == 1
-
-static void
-@name@_sum_of_products_contig_outstride0_one(int nop, char **dataptr,
-                                npy_intp const *strides, npy_intp count)
-{
-#if @complex@
-    @temptype@ accum_re = 0, accum_im = 0;
-    @temptype@ *data0 = (@temptype@ *)dataptr[0];
-#else
-    @temptype@ accum = 0;
-    @type@ *data0 = (@type@ *)dataptr[0];
-#endif
-
-#if EINSUM_USE_SSE1 && @float32@
-    __m128 a, accum_sse = _mm_setzero_ps();
-#elif EINSUM_USE_SSE2 && @float64@
-    __m128d a, accum_sse = _mm_setzero_pd();
-#endif
-
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_outstride0_one (%d)\n",
-                                                    (int)count);
-
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat2
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-#if !@complex@
-            accum += @from@(data0[@i@]);
-#else /* complex */
-            accum_re += data0[2*@i@+0];
-            accum_im += data0[2*@i@+1];
-#endif
-/**end repeat2**/
-        case 0:
-#if @complex@
-            ((@temptype@ *)dataptr[1])[0] += accum_re;
-            ((@temptype@ *)dataptr[1])[1] += accum_im;
-#else
-            *((@type@ *)dataptr[1]) = @to@(accum +
-                                    @from@(*((@type@ *)dataptr[1])));
-#endif
-            return;
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-            _mm_prefetch(data0 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 4#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            accum_sse = _mm_add_ps(accum_sse, _mm_load_ps(data0+@i@));
-/**end repeat2**/
-            data0 += 8;
-        }
-
-        /* Add the four SSE values and put in accum */
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-        accum_sse = _mm_add_ps(a, accum_sse);
-        _mm_store_ss(&accum, accum_sse);
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Use aligned instructions if possible */
-    if (EINSUM_IS_SSE_ALIGNED(data0)) {
-        /* Unroll the loop by 8 */
-        while (count >= 8) {
-            count -= 8;
-
-            _mm_prefetch(data0 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-            /*
-             * NOTE: This accumulation changes the order, so will likely
-             *       produce slightly different results.
-             */
-            accum_sse = _mm_add_pd(accum_sse, _mm_load_pd(data0+@i@));
-/**end repeat2**/
-            data0 += 8;
-        }
-
-        /* Add the two SSE2 values and put in accum */
-        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-        accum_sse = _mm_add_pd(a, accum_sse);
-        _mm_store_sd(&accum, accum_sse);
-
-        /* Finish off the loop */
-        goto finish_after_unrolled_loop;
-    }
-#endif
-
-    /* Unroll the loop by 8 */
-    while (count >= 8) {
-        count -= 8;
-
-#if EINSUM_USE_SSE1 && @float32@
-        _mm_prefetch(data0 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 4#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        accum_sse = _mm_add_ps(accum_sse, _mm_loadu_ps(data0+@i@));
-/**end repeat2**/
-#elif EINSUM_USE_SSE2 && @float64@
-        _mm_prefetch(data0 + 512, _MM_HINT_T0);
-
-/**begin repeat2
- * #i = 0, 2, 4, 6#
- */
-        /*
-         * NOTE: This accumulation changes the order, so will likely
-         *       produce slightly different results.
-         */
-        accum_sse = _mm_add_pd(accum_sse, _mm_loadu_pd(data0+@i@));
-/**end repeat2**/
-#else
-/**begin repeat2
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-#  if !@complex@
-        accum += @from@(data0[@i@]);
-#  else /* complex */
-        accum_re += data0[2*@i@+0];
-        accum_im += data0[2*@i@+1];
-#  endif
-/**end repeat2**/
-#endif
-
-#if !@complex@
-        data0 += 8;
-#else
-        data0 += 8*2;
-#endif
-    }
-
-#if EINSUM_USE_SSE1 && @float32@
-    /* Add the four SSE values and put in accum */
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
-    accum_sse = _mm_add_ps(a, accum_sse);
-    _mm_store_ss(&accum, accum_sse);
-#elif EINSUM_USE_SSE2 && @float64@
-    /* Add the two SSE2 values and put in accum */
-    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
-    accum_sse = _mm_add_pd(a, accum_sse);
-    _mm_store_sd(&accum, accum_sse);
-#endif
-
-    /* Finish off the loop */
-    goto finish_after_unrolled_loop;
-}
-
-#endif /* @nop@ == 1 */
-
-static void
-@name@_sum_of_products_outstride0_@noplabel@(int nop, char **dataptr,
-                                npy_intp const *strides, npy_intp count)
-{
-#if @complex@
-    @temptype@ accum_re = 0, accum_im = 0;
-#else
-    @temptype@ accum = 0;
-#endif
-
-#if (@nop@ == 1) || (@nop@ <= 3 && !@complex@)
-    char *data0 = dataptr[0];
-    npy_intp stride0 = strides[0];
-#endif
-#if (@nop@ == 2 || @nop@ == 3) && !@complex@
-    char *data1 = dataptr[1];
-    npy_intp stride1 = strides[1];
-#endif
-#if (@nop@ == 3) && !@complex@
-    char *data2 = dataptr[2];
-    npy_intp stride2 = strides[2];
-#endif
-
-    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_outstride0_@noplabel@ (%d)\n",
-                                                    (int)count);
-
-    while (count--) {
-#if !@complex@
-#  if @nop@ == 1
-        accum += @from@(*(@type@ *)data0);
-        data0 += stride0;
-#  elif @nop@ == 2
-        accum += @from@(*(@type@ *)data0) *
-                 @from@(*(@type@ *)data1);
-        data0 += stride0;
-        data1 += stride1;
-#  elif @nop@ == 3
-        accum += @from@(*(@type@ *)data0) *
-                 @from@(*(@type@ *)data1) *
-                 @from@(*(@type@ *)data2);
-        data0 += stride0;
-        data1 += stride1;
-        data2 += stride2;
-#  else
-        @temptype@ temp = @from@(*(@type@ *)dataptr[0]);
-        int i;
-        for (i = 1; i < nop; ++i) {
-            temp *= @from@(*(@type@ *)dataptr[i]);
-        }
-        accum += temp;
-        for (i = 0; i < nop; ++i) {
-            dataptr[i] += strides[i];
-        }
-#  endif
-#else /* complex */
-#  if @nop@ == 1
-        accum_re += ((@temptype@ *)data0)[0];
-        accum_im += ((@temptype@ *)data0)[1];
-        data0 += stride0;
-#  else
-#    if @nop@ <= 3
-#define _SUMPROD_NOP @nop@
-#    else
-#define _SUMPROD_NOP nop
-#    endif
-        @temptype@ re, im, tmp;
-        int i;
-        re = ((@temptype@ *)dataptr[0])[0];
-        im = ((@temptype@ *)dataptr[0])[1];
-        for (i = 1; i < _SUMPROD_NOP; ++i) {
-            tmp = re * ((@temptype@ *)dataptr[i])[0] -
-                  im * ((@temptype@ *)dataptr[i])[1];
-            im = re * ((@temptype@ *)dataptr[i])[1] +
-                 im * ((@temptype@ *)dataptr[i])[0];
-            re = tmp;
-        }
-        accum_re += re;
-        accum_im += im;
-        for (i = 0; i < _SUMPROD_NOP; ++i) {
-            dataptr[i] += strides[i];
-        }
-#undef _SUMPROD_NOP
-#  endif
-#endif
-    }
-
-#if @complex@
-#  if @nop@ <= 3
-    ((@temptype@ *)dataptr[@nop@])[0] += accum_re;
-    ((@temptype@ *)dataptr[@nop@])[1] += accum_im;
-#  else
-    ((@temptype@ *)dataptr[nop])[0] += accum_re;
-    ((@temptype@ *)dataptr[nop])[1] += accum_im;
-#  endif
-#else
-#  if @nop@ <= 3
-    *((@type@ *)dataptr[@nop@]) = @to@(accum +
-                                    @from@(*((@type@ *)dataptr[@nop@])));
-#  else
-    *((@type@ *)dataptr[nop]) = @to@(accum +
-                                    @from@(*((@type@ *)dataptr[nop])));
-#  endif
-#endif
-
-}
-
-/**end repeat1**/
-
-/**end repeat**/
-
-
-/* Do OR of ANDs for the boolean type */
-
-/**begin repeat
- * #nop = 1, 2, 3, 1000#
- * #noplabel = one, two, three, any#
- */
-
-static void
-bool_sum_of_products_@noplabel@(int nop, char **dataptr,
-                                npy_intp const *strides, npy_intp count)
-{
-#if (@nop@ <= 3)
-    char *data0 = dataptr[0];
-    npy_intp stride0 = strides[0];
-#endif
-#if (@nop@ == 2 || @nop@ == 3)
-    char *data1 = dataptr[1];
-    npy_intp stride1 = strides[1];
-#endif
-#if (@nop@ == 3)
-    char *data2 = dataptr[2];
-    npy_intp stride2 = strides[2];
-#endif
-#if (@nop@ <= 3)
-    char *data_out = dataptr[@nop@];
-    npy_intp stride_out = strides[@nop@];
-#endif
-
-    while (count--) {
-#if @nop@ == 1
-        *(npy_bool *)data_out = *(npy_bool *)data0 ||
-                                  *(npy_bool *)data_out;
-        data0 += stride0;
-        data_out += stride_out;
-#elif @nop@ == 2
-        *(npy_bool *)data_out = (*(npy_bool *)data0 &&
-                                   *(npy_bool *)data1) ||
-                                   *(npy_bool *)data_out;
-        data0 += stride0;
-        data1 += stride1;
-        data_out += stride_out;
-#elif @nop@ == 3
-        *(npy_bool *)data_out = (*(npy_bool *)data0 &&
-                                   *(npy_bool *)data1 &&
-                                   *(npy_bool *)data2) ||
-                                   *(npy_bool *)data_out;
-        data0 += stride0;
-        data1 += stride1;
-        data2 += stride2;
-        data_out += stride_out;
-#else
-        npy_bool temp = *(npy_bool *)dataptr[0];
-        int i;
-        for (i = 1; i < nop; ++i) {
-            temp = temp && *(npy_bool *)dataptr[i];
-        }
-        *(npy_bool *)dataptr[nop] = temp || *(npy_bool *)dataptr[i];
-        for (i = 0; i <= nop; ++i) {
-            dataptr[i] += strides[i];
-        }
-#endif
-    }
-}
-
-static void
-bool_sum_of_products_contig_@noplabel@(int nop, char **dataptr,
-                                npy_intp const *strides, npy_intp count)
-{
-#if (@nop@ <= 3)
-    char *data0 = dataptr[0];
-#endif
-#if (@nop@ == 2 || @nop@ == 3)
-    char *data1 = dataptr[1];
-#endif
-#if (@nop@ == 3)
-    char *data2 = dataptr[2];
-#endif
-#if (@nop@ <= 3)
-    char *data_out = dataptr[@nop@];
-#endif
-
-#if (@nop@ <= 3)
-/* This is placed before the main loop to make small counts faster */
-finish_after_unrolled_loop:
-    switch (count) {
-/**begin repeat1
- * #i = 6, 5, 4, 3, 2, 1, 0#
- */
-        case @i@+1:
-#  if @nop@ == 1
-            ((npy_bool *)data_out)[@i@] = ((npy_bool *)data0)[@i@] ||
-                                            ((npy_bool *)data_out)[@i@];
-#  elif @nop@ == 2
-            ((npy_bool *)data_out)[@i@] =
-                            (((npy_bool *)data0)[@i@] &&
-                             ((npy_bool *)data1)[@i@]) ||
-                                ((npy_bool *)data_out)[@i@];
-#  elif @nop@ == 3
-            ((npy_bool *)data_out)[@i@] =
-                           (((npy_bool *)data0)[@i@] &&
-                            ((npy_bool *)data1)[@i@] &&
-                            ((npy_bool *)data2)[@i@]) ||
-                                ((npy_bool *)data_out)[@i@];
-#  endif
-/**end repeat1**/
-        case 0:
-            return;
-    }
-#endif
-
-/* Unroll the loop by 8 for fixed-size nop */
-#if (@nop@ <= 3)
-    while (count >= 8) {
-        count -= 8;
-#else
-    while (count--) {
-#endif
-
-#  if @nop@ == 1
-/**begin repeat1
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        *((npy_bool *)data_out + @i@) = (*((npy_bool *)data0 + @i@)) ||
-                                        (*((npy_bool *)data_out + @i@));
-/**end repeat1**/
-        data0 += 8*sizeof(npy_bool);
-        data_out += 8*sizeof(npy_bool);
-#  elif @nop@ == 2
-/**begin repeat1
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        *((npy_bool *)data_out + @i@) =
-                        ((*((npy_bool *)data0 + @i@)) &&
-                         (*((npy_bool *)data1 + @i@))) ||
-                            (*((npy_bool *)data_out + @i@));
-/**end repeat1**/
-        data0 += 8*sizeof(npy_bool);
-        data1 += 8*sizeof(npy_bool);
-        data_out += 8*sizeof(npy_bool);
-#  elif @nop@ == 3
-/**begin repeat1
- * #i = 0, 1, 2, 3, 4, 5, 6, 7#
- */
-        *((npy_bool *)data_out + @i@) =
-                       ((*((npy_bool *)data0 + @i@)) &&
-                        (*((npy_bool *)data1 + @i@)) &&
-                        (*((npy_bool *)data2 + @i@))) ||
-                            (*((npy_bool *)data_out + @i@));
-/**end repeat1**/
-        data0 += 8*sizeof(npy_bool);
-        data1 += 8*sizeof(npy_bool);
-        data2 += 8*sizeof(npy_bool);
-        data_out += 8*sizeof(npy_bool);
-#  else
-        npy_bool temp = *(npy_bool *)dataptr[0];
-        int i;
-        for (i = 1; i < nop; ++i) {
-            temp = temp && *(npy_bool *)dataptr[i];
-        }
-        *(npy_bool *)dataptr[nop] = temp || *(npy_bool *)dataptr[i];
-        for (i = 0; i <= nop; ++i) {
-            dataptr[i] += sizeof(npy_bool);
-        }
-#  endif
-    }
-
-    /* If the loop was unrolled, we need to finish it off */
-#if (@nop@ <= 3)
-    goto finish_after_unrolled_loop;
-#endif
-}
-
-static void
-bool_sum_of_products_outstride0_@noplabel@(int nop, char **dataptr,
-                                npy_intp const *strides, npy_intp count)
-{
-    npy_bool accum = 0;
-
-#if (@nop@ <= 3)
-    char *data0 = dataptr[0];
-    npy_intp stride0 = strides[0];
-#endif
-#if (@nop@ == 2 || @nop@ == 3)
-    char *data1 = dataptr[1];
-    npy_intp stride1 = strides[1];
-#endif
-#if (@nop@ == 3)
-    char *data2 = dataptr[2];
-    npy_intp stride2 = strides[2];
-#endif
-
-    while (count--) {
-#if @nop@ == 1
-        accum = *(npy_bool *)data0 || accum;
-        data0 += stride0;
-#elif @nop@ == 2
-        accum = (*(npy_bool *)data0 && *(npy_bool *)data1) || accum;
-        data0 += stride0;
-        data1 += stride1;
-#elif @nop@ == 3
-        accum = (*(npy_bool *)data0 &&
-                 *(npy_bool *)data1 &&
-                 *(npy_bool *)data2) || accum;
-        data0 += stride0;
-        data1 += stride1;
-        data2 += stride2;
-#else
-        npy_bool temp = *(npy_bool *)dataptr[0];
-        int i;
-        for (i = 1; i < nop; ++i) {
-            temp = temp && *(npy_bool *)dataptr[i];
-        }
-        accum = temp || accum;
-        for (i = 0; i <= nop; ++i) {
-            dataptr[i] += strides[i];
-        }
-#endif
-    }
-
-#  if @nop@ <= 3
-    *((npy_bool *)dataptr[@nop@]) = accum || *((npy_bool *)dataptr[@nop@]);
-#  else
-    *((npy_bool *)dataptr[nop]) = accum || *((npy_bool *)dataptr[nop]);
-#  endif
-}
-
-/**end repeat**/
-
-typedef void (*sum_of_products_fn)(int, char **, npy_intp const*, npy_intp);
-
-/* These tables need to match up with the type enum */
-static sum_of_products_fn
-_contig_outstride0_unary_specialization_table[NPY_NTYPES] = {
-/**begin repeat
- * #name = bool,
- *         byte, ubyte,
- *         short, ushort,
- *         int, uint,
- *         long, ulong,
- *         longlong, ulonglong,
- *         float, double, longdouble,
- *         cfloat, cdouble, clongdouble,
- *         object, string, unicode, void,
- *         datetime, timedelta, half#
- * #use = 0,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1, 1,
- *        1, 1, 1,
- *        0, 0, 0, 0,
- *        0, 0, 1#
- */
-#if @use@
-    &@name@_sum_of_products_contig_outstride0_one,
-#else
-    NULL,
-#endif
-/**end repeat**/
-}; /* End of _contig_outstride0_unary_specialization_table */
-
-static sum_of_products_fn _binary_specialization_table[NPY_NTYPES][5] = {
-/**begin repeat
- * #name = bool,
- *         byte, ubyte,
- *         short, ushort,
- *         int, uint,
- *         long, ulong,
- *         longlong, ulonglong,
- *         float, double, longdouble,
- *         cfloat, cdouble, clongdouble,
- *         object, string, unicode, void,
- *         datetime, timedelta, half#
- * #use = 0,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1, 1,
- *        0, 0, 0,
- *        0, 0, 0, 0,
- *        0, 0, 1#
- */
-#if @use@
-{
-    &@name@_sum_of_products_stride0_contig_outstride0_two,
-    &@name@_sum_of_products_stride0_contig_outcontig_two,
-    &@name@_sum_of_products_contig_stride0_outstride0_two,
-    &@name@_sum_of_products_contig_stride0_outcontig_two,
-    &@name@_sum_of_products_contig_contig_outstride0_two,
-},
-#else
-    {NULL, NULL, NULL, NULL, NULL},
-#endif
-/**end repeat**/
-}; /* End of _binary_specialization_table */
-
-static sum_of_products_fn _outstride0_specialized_table[NPY_NTYPES][4] = {
-/**begin repeat
- * #name = bool,
- *         byte, ubyte,
- *         short, ushort,
- *         int, uint,
- *         long, ulong,
- *         longlong, ulonglong,
- *         float, double, longdouble,
- *         cfloat, cdouble, clongdouble,
- *         object, string, unicode, void,
- *         datetime, timedelta, half#
- * #use = 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1, 1,
- *        1, 1, 1,
- *        0, 0, 0, 0,
- *        0, 0, 1#
- */
-#if @use@
-{
-    &@name@_sum_of_products_outstride0_any,
-    &@name@_sum_of_products_outstride0_one,
-    &@name@_sum_of_products_outstride0_two,
-    &@name@_sum_of_products_outstride0_three
-},
-#else
-    {NULL, NULL, NULL, NULL},
-#endif
-/**end repeat**/
-}; /* End of _outstride0_specialized_table */
-
-static sum_of_products_fn _allcontig_specialized_table[NPY_NTYPES][4] = {
-/**begin repeat
- * #name = bool,
- *         byte, ubyte,
- *         short, ushort,
- *         int, uint,
- *         long, ulong,
- *         longlong, ulonglong,
- *         float, double, longdouble,
- *         cfloat, cdouble, clongdouble,
- *         object, string, unicode, void,
- *         datetime, timedelta, half#
- * #use = 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1, 1,
- *        1, 1, 1,
- *        0, 0, 0, 0,
- *        0, 0, 1#
- */
-#if @use@
-{
-    &@name@_sum_of_products_contig_any,
-    &@name@_sum_of_products_contig_one,
-    &@name@_sum_of_products_contig_two,
-    &@name@_sum_of_products_contig_three
-},
-#else
-    {NULL, NULL, NULL, NULL},
-#endif
-/**end repeat**/
-}; /* End of _allcontig_specialized_table */
-
-static sum_of_products_fn _unspecialized_table[NPY_NTYPES][4] = {
-/**begin repeat
- * #name = bool,
- *         byte, ubyte,
- *         short, ushort,
- *         int, uint,
- *         long, ulong,
- *         longlong, ulonglong,
- *         float, double, longdouble,
- *         cfloat, cdouble, clongdouble,
- *         object, string, unicode, void,
- *         datetime, timedelta, half#
- * #use = 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1,
- *        1, 1, 1,
- *        1, 1, 1,
- *        0, 0, 0, 0,
- *        0, 0, 1#
- */
-#if @use@
-{
-    &@name@_sum_of_products_any,
-    &@name@_sum_of_products_one,
-    &@name@_sum_of_products_two,
-    &@name@_sum_of_products_three
-},
-#else
-    {NULL, NULL, NULL, NULL},
-#endif
-/**end repeat**/
-}; /* End of _unnspecialized_table */
-
-static sum_of_products_fn
-get_sum_of_products_function(int nop, int type_num,
-                             npy_intp itemsize, npy_intp const *fixed_strides)
-{
-    int iop;
-
-    if (type_num >= NPY_NTYPES) {
-        return NULL;
-    }
-
-    /* contiguous reduction */
-    if (nop == 1 && fixed_strides[0] == itemsize && fixed_strides[1] == 0) {
-        sum_of_products_fn ret =
-            _contig_outstride0_unary_specialization_table[type_num];
-        if (ret != NULL) {
-            return ret;
-        }
-    }
-
-    /* nop of 2 has more specializations */
-    if (nop == 2) {
-        /* Encode the zero/contiguous strides */
-        int code;
-        code = (fixed_strides[0] == 0) ? 0 :
-                    (fixed_strides[0] == itemsize) ? 2*2*1 : 8;
-        code += (fixed_strides[1] == 0) ? 0 :
-                    (fixed_strides[1] == itemsize) ? 2*1 : 8;
-        code += (fixed_strides[2] == 0) ? 0 :
-                    (fixed_strides[2] == itemsize) ? 1 : 8;
-        if (code >= 2 && code < 7) {
-            sum_of_products_fn ret =
-                        _binary_specialization_table[type_num][code-2];
-            if (ret != NULL) {
-                return ret;
-            }
-        }
-    }
-
-    /* Inner loop with an output stride of 0 */
-    if (fixed_strides[nop] == 0) {
-        return _outstride0_specialized_table[type_num][nop <= 3 ? nop : 0];
-    }
-
-    /* Check for all contiguous */
-    for (iop = 0; iop < nop + 1; ++iop) {
-        if (fixed_strides[iop] != itemsize) {
-            break;
-        }
-    }
-
-    /* Contiguous loop */
-    if (iop == nop + 1) {
-        return _allcontig_specialized_table[type_num][nop <= 3 ? nop : 0];
-    }
-
-    /* None of the above specializations caught it, general loops */
-    return _unspecialized_table[type_num][nop <= 3 ? nop : 0];
-}
+#include "einsum_sumprod.h"
+#include "einsum_debug.h"
 
 
 /*
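
Note on the two-operand dispatch removed from this file above: `get_sum_of_products_function` packs the three fixed strides into a small integer code and uses it to pick a row of `_binary_specialization_table`. The following is a minimal standalone sketch of that encoding only. It assumes a hypothetical `intp_t` typedef in place of `npy_intp`, and the helper names `encode_binary_strides` and `binary_table_index` are illustrative; they are not part of NumPy.

```c
#include <stddef.h>

/* Hypothetical stand-in for npy_intp so the sketch compiles without NumPy. */
typedef ptrdiff_t intp_t;

/*
 * Encode the three fixed strides of a two-operand inner loop.
 * Each stride contributes 0 if it is zero, a distinct bit (4, 2 or 1)
 * if it is contiguous (== itemsize), and 8 otherwise, so any other
 * stride pushes the code out of the specialized range.
 */
static int
encode_binary_strides(const intp_t *fixed_strides, intp_t itemsize)
{
    int code;
    code  = (fixed_strides[0] == 0) ? 0 :
            (fixed_strides[0] == itemsize) ? 4 : 8;
    code += (fixed_strides[1] == 0) ? 0 :
            (fixed_strides[1] == itemsize) ? 2 : 8;
    code += (fixed_strides[2] == 0) ? 0 :
            (fixed_strides[2] == itemsize) ? 1 : 8;
    return code;
}

/*
 * Return the column index into a 5-entry specialization row,
 * or -1 when no two-operand specialization applies.
 */
static int
binary_table_index(const intp_t *fixed_strides, intp_t itemsize)
{
    int code = encode_binary_strides(fixed_strides, itemsize);
    return (code >= 2 && code < 7) ? code - 2 : -1;
}
```

Under this encoding, strides `{0, itemsize, 0}` select index 0 (the `stride0_contig_outstride0` entry) and `{itemsize, itemsize, 0}` select index 4 (`contig_contig_outstride0`). The fully contiguous case yields code 7, which falls outside the range and is left to the later all-contiguous check.
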
diff --git a/numpy/core/src/multiarray/einsum_debug.h b/numpy/core/src/multiarray/einsum_debug.h
new file mode 100644
index 000000000..9aa81fcbd
--- /dev/null
+++ b/numpy/core/src/multiarray/einsum_debug.h
@@ -0,0 +1,28 @@
+/*
+ * This file provides debug macros used by the other einsum files.
+ *
+ * Copyright (c) 2011 by Mark Wiebe (mwwiebe@gmail.com)
+ * The University of British Columbia
+ *
+ * See LICENSE.txt for the license.
+ */
+#ifndef _NPY_MULTIARRAY_EINSUM_DEBUG_H
+#define _NPY_MULTIARRAY_EINSUM_DEBUG_H
+
+/********** PRINTF DEBUG TRACING **************/
+#define NPY_EINSUM_DBG_TRACING 0
+
+#if NPY_EINSUM_DBG_TRACING
+#include <stdio.h>
+#define NPY_EINSUM_DBG_PRINT(s) printf("%s", s);
+#define NPY_EINSUM_DBG_PRINT1(s, p1) printf(s, p1);
+#define NPY_EINSUM_DBG_PRINT2(s, p1, p2) printf(s, p1, p2);
+#define NPY_EINSUM_DBG_PRINT3(s, p1, p2, p3) printf(s, p1, p2, p3);
+#else
+#define NPY_EINSUM_DBG_PRINT(s)
+#define NPY_EINSUM_DBG_PRINT1(s, p1)
+#define NPY_EINSUM_DBG_PRINT2(s, p1, p2)
+#define NPY_EINSUM_DBG_PRINT3(s, p1, p2, p3)
+#endif
+
+#endif
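
Note on the debug header above: with `NPY_EINSUM_DBG_TRACING` set to 0, every tracing macro expands to nothing, so its arguments are never evaluated. The sketch below illustrates the same compile-time switch in isolation. The `EXAMPLE_` names and the `traced_sum` helper are hypothetical and exist only for this illustration; they are not NumPy identifiers.

```c
#include <stdio.h>

/* Hypothetical names; they mirror the pattern of the header, not its API. */
#define EXAMPLE_DBG_TRACING 0

#if EXAMPLE_DBG_TRACING
#define EXAMPLE_DBG_PRINT1(s, p1) printf(s, p1);
#else
#define EXAMPLE_DBG_PRINT1(s, p1)
#endif

/*
 * Sum n ints, tracing each step. The macro argument has a side effect
 * on purpose: with tracing disabled the macro expands to nothing, so
 * the increment of *trace_calls is never compiled in.
 */
static int
traced_sum(const int *values, int n, int *trace_calls)
{
    int i, total = 0;
    for (i = 0; i < n; ++i) {
        total += values[i];
        EXAMPLE_DBG_PRINT1("step %d\n", ++(*trace_calls));
    }
    return total;
}
```

One consequence of this design is that expressions passed to the macros should be free of side effects the surrounding code relies on, since they disappear in non-tracing builds.
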
diff --git a/numpy/core/src/multiarray/einsum_sumprod.c.src b/numpy/core/src/multiarray/einsum_sumprod.c.src
new file mode 100644
index 000000000..c58e74287
--- /dev/null
+++ b/numpy/core/src/multiarray/einsum_sumprod.c.src
@@ -0,0 +1,1897 @@
+/*
+ * This file provides optimized sum of product implementations used internally
+ * by einsum.
+ *
+ * Copyright (c) 2011 by Mark Wiebe (mwwiebe@gmail.com)
+ * The University of British Columbia
+ *
+ * See LICENSE.txt for the license.
+ */
+
+#define NPY_NO_DEPRECATED_API NPY_API_VERSION
+#define _MULTIARRAYMODULE
+
+#include <numpy/npy_common.h>
+#include <numpy/ndarraytypes.h>  /* for NPY_NTYPES */
+#include <numpy/halffloat.h>
+
+#include "einsum_sumprod.h"
+#include "einsum_debug.h"
+
+
+#ifdef NPY_HAVE_SSE_INTRINSICS
+#define EINSUM_USE_SSE1 1
+#else
+#define EINSUM_USE_SSE1 0
+#endif
+
+#ifdef NPY_HAVE_SSE2_INTRINSICS
+#define EINSUM_USE_SSE2 1
+#else
+#define EINSUM_USE_SSE2 0
+#endif
+
+#if EINSUM_USE_SSE1
+#include <xmmintrin.h>
+#endif
+
+#if EINSUM_USE_SSE2
+#include <emmintrin.h>
+#endif
+
+#define EINSUM_IS_SSE_ALIGNED(x) ((((npy_intp)x)&0xf) == 0)
+
+/**********************************************/
+
+/**begin repeat
+ * #name = byte, short, int, long, longlong,
+ *         ubyte, ushort, uint, ulong, ulonglong,
+ *         half, float, double, longdouble,
+ *         cfloat, cdouble, clongdouble#
+ * #type = npy_byte, npy_short, npy_int, npy_long, npy_longlong,
+ *         npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
+ *         npy_half, npy_float, npy_double, npy_longdouble,
+ *         npy_cfloat, npy_cdouble, npy_clongdouble#
+ * #temptype = npy_byte, npy_short, npy_int, npy_long, npy_longlong,
+ *             npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
+ *             npy_float, npy_float, npy_double, npy_longdouble,
+ *             npy_float, npy_double, npy_longdouble#
+ * #to = ,,,,,
+ *       ,,,,,
+ *       npy_float_to_half,,,,
+ *       ,,#
+ * #from = ,,,,,
+ *         ,,,,,
+ *         npy_half_to_float,,,,
+ *         ,,#
+ * #complex = 0*5,
+ *            0*5,
+ *            0*4,
+ *            1*3#
+ * #float32 = 0*5,
+ *            0*5,
+ *            0,1,0,0,
+ *            0*3#
+ * #float64 = 0*5,
+ *            0*5,
+ *            0,0,1,0,
+ *            0*3#
+ */
+
+/**begin repeat1
+ * #nop = 1, 2, 3, 1000#
+ * #noplabel = one, two, three, any#
+ */
+static void
+@name@_sum_of_products_@noplabel@(int nop, char **dataptr,
+                                npy_intp const *strides, npy_intp count)
+{
+#if (@nop@ == 1) || (@nop@ <= 3 && !@complex@)
+    char *data0 = dataptr[0];
+    npy_intp stride0 = strides[0];
+#endif
+#if (@nop@ == 2 || @nop@ == 3) && !@complex@
+    char *data1 = dataptr[1];
+    npy_intp stride1 = strides[1];
+#endif
+#if (@nop@ == 3) && !@complex@
+    char *data2 = dataptr[2];
+    npy_intp stride2 = strides[2];
+#endif
+#if (@nop@ == 1) || (@nop@ <= 3 && !@complex@)
+    char *data_out = dataptr[@nop@];
+    npy_intp stride_out = strides[@nop@];
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_@noplabel@ (%d)\n", (int)count);
+
+    while (count--) {
+#if !@complex@
+#  if @nop@ == 1
+        *(@type@ *)data_out = @to@(@from@(*(@type@ *)data0) +
+                                         @from@(*(@type@ *)data_out));
+        data0 += stride0;
+        data_out += stride_out;
+#  elif @nop@ == 2
+        *(@type@ *)data_out = @to@(@from@(*(@type@ *)data0) *
+                                         @from@(*(@type@ *)data1) +
+                                         @from@(*(@type@ *)data_out));
+        data0 += stride0;
+        data1 += stride1;
+        data_out += stride_out;
+#  elif @nop@ == 3
+        *(@type@ *)data_out = @to@(@from@(*(@type@ *)data0) *
+                                         @from@(*(@type@ *)data1) *
+                                         @from@(*(@type@ *)data2) +
+                                         @from@(*(@type@ *)data_out));
+        data0 += stride0;
+        data1 += stride1;
+        data2 += stride2;
+        data_out += stride_out;
+#  else
+        @temptype@ temp = @from@(*(@type@ *)dataptr[0]);
+        int i;
+        for (i = 1; i < nop; ++i) {
+            temp *= @from@(*(@type@ *)dataptr[i]);
+        }
+        *(@type@ *)dataptr[nop] = @to@(temp +
+                                           @from@(*(@type@ *)dataptr[i]));
+        for (i = 0; i <= nop; ++i) {
+            dataptr[i] += strides[i];
+        }
+#  endif
+#else /* complex */
+#  if @nop@ == 1
+        ((@temptype@ *)data_out)[0] = ((@temptype@ *)data0)[0] +
+                                         ((@temptype@ *)data_out)[0];
+        ((@temptype@ *)data_out)[1] = ((@temptype@ *)data0)[1] +
+                                         ((@temptype@ *)data_out)[1];
+        data0 += stride0;
+        data_out += stride_out;
+#  else
+#    if @nop@ <= 3
+#define _SUMPROD_NOP @nop@
+#    else
+#define _SUMPROD_NOP nop
+#    endif
+        @temptype@ re, im, tmp;
+        int i;
+        re = ((@temptype@ *)dataptr[0])[0];
+        im = ((@temptype@ *)dataptr[0])[1];
+        for (i = 1; i < _SUMPROD_NOP; ++i) {
+            tmp = re * ((@temptype@ *)dataptr[i])[0] -
+                  im * ((@temptype@ *)dataptr[i])[1];
+            im = re * ((@temptype@ *)dataptr[i])[1] +
+                 im * ((@temptype@ *)dataptr[i])[0];
+            re = tmp;
+        }
+        ((@temptype@ *)dataptr[_SUMPROD_NOP])[0] = re +
+                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[0];
+        ((@temptype@ *)dataptr[_SUMPROD_NOP])[1] = im +
+                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[1];
+
+        for (i = 0; i <= _SUMPROD_NOP; ++i) {
+            dataptr[i] += strides[i];
+        }
+#undef _SUMPROD_NOP
+#  endif
+#endif
+    }
+}
+
+#if @nop@ == 1
+
+static void
+@name@_sum_of_products_contig_one(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @type@ *data0 = (@type@ *)dataptr[0];
+    @type@ *data_out = (@type@ *)dataptr[1];
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_one (%d)\n",
+                                                            (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+#if !@complex@
+            data_out[@i@] = @to@(@from@(data0[@i@]) +
+                                 @from@(data_out[@i@]));
+#else
+            ((@temptype@ *)data_out + 2*@i@)[0] =
+                                    ((@temptype@ *)data0 + 2*@i@)[0] +
+                                    ((@temptype@ *)data_out + 2*@i@)[0];
+            ((@temptype@ *)data_out + 2*@i@)[1] =
+                                    ((@temptype@ *)data0 + 2*@i@)[1] +
+                                    ((@temptype@ *)data_out + 2*@i@)[1];
+#endif
+/**end repeat2**/
+        case 0:
+            return;
+    }
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+#if !@complex@
+        data_out[@i@] = @to@(@from@(data0[@i@]) +
+                             @from@(data_out[@i@]));
+#else /* complex */
+        ((@temptype@ *)data_out + 2*@i@)[0] =
+                                ((@temptype@ *)data0 + 2*@i@)[0] +
+                                ((@temptype@ *)data_out + 2*@i@)[0];
+        ((@temptype@ *)data_out + 2*@i@)[1] =
+                                ((@temptype@ *)data0 + 2*@i@)[1] +
+                                ((@temptype@ *)data_out + 2*@i@)[1];
+#endif
+/**end repeat2**/
+        data0 += 8;
+        data_out += 8;
+    }
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+#elif @nop@ == 2 && !@complex@
+
+static void
+@name@_sum_of_products_contig_two(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @type@ *data0 = (@type@ *)dataptr[0];
+    @type@ *data1 = (@type@ *)dataptr[1];
+    @type@ *data_out = (@type@ *)dataptr[2];
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, b;
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, b;
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_two (%d)\n",
+                                                            (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+            data_out[@i@] = @to@(@from@(data0[@i@]) *
+                                 @from@(data1[@i@]) +
+                                 @from@(data_out[@i@]));
+/**end repeat2**/
+        case 0:
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
+        EINSUM_IS_SSE_ALIGNED(data_out)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            a = _mm_mul_ps(_mm_load_ps(data0+@i@), _mm_load_ps(data1+@i@));
+            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
+            _mm_store_ps(data_out+@i@, b);
+/**end repeat2**/
+            data0 += 8;
+            data1 += 8;
+            data_out += 8;
+        }
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
+        EINSUM_IS_SSE_ALIGNED(data_out)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            a = _mm_mul_pd(_mm_load_pd(data0+@i@), _mm_load_pd(data1+@i@));
+            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
+            _mm_store_pd(data_out+@i@, b);
+/**end repeat2**/
+            data0 += 8;
+            data1 += 8;
+            data_out += 8;
+        }
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), _mm_loadu_ps(data1+@i@));
+        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
+        _mm_storeu_ps(data_out+@i@, b);
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), _mm_loadu_pd(data1+@i@));
+        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
+        _mm_storeu_pd(data_out+@i@, b);
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        data_out[@i@] = @to@(@from@(data0[@i@]) *
+                             @from@(data1[@i@]) +
+                             @from@(data_out[@i@]));
+/**end repeat2**/
+#endif
+        data0 += 8;
+        data1 += 8;
+        data_out += 8;
+    }
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+/* Some extra specializations for the two-operand case */
+static void
+@name@_sum_of_products_stride0_contig_outcontig_two(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @temptype@ value0 = @from@(*(@type@ *)dataptr[0]);
+    @type@ *data1 = (@type@ *)dataptr[1];
+    @type@ *data_out = (@type@ *)dataptr[2];
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, b, value0_sse;
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, b, value0_sse;
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_stride0_contig_outcontig_two (%d)\n",
+                                                    (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+            data_out[@i@] = @to@(value0 *
+                                 @from@(data1[@i@]) +
+                                 @from@(data_out[@i@]));
+/**end repeat2**/
+        case 0:
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    value0_sse = _mm_set_ps1(value0);
+
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data1) && EINSUM_IS_SSE_ALIGNED(data_out)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            a = _mm_mul_ps(value0_sse, _mm_load_ps(data1+@i@));
+            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
+            _mm_store_ps(data_out+@i@, b);
+/**end repeat2**/
+            data1 += 8;
+            data_out += 8;
+        }
+
+        /* Finish off the loop */
+        if (count > 0) {
+            goto finish_after_unrolled_loop;
+        }
+        else {
+            return;
+        }
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    value0_sse = _mm_set1_pd(value0);
+
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data1) && EINSUM_IS_SSE_ALIGNED(data_out)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            a = _mm_mul_pd(value0_sse, _mm_load_pd(data1+@i@));
+            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
+            _mm_store_pd(data_out+@i@, b);
+/**end repeat2**/
+            data1 += 8;
+            data_out += 8;
+        }
+
+        /* Finish off the loop */
+        if (count > 0) {
+            goto finish_after_unrolled_loop;
+        }
+        else {
+            return;
+        }
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        a = _mm_mul_ps(value0_sse, _mm_loadu_ps(data1+@i@));
+        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
+        _mm_storeu_ps(data_out+@i@, b);
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        a = _mm_mul_pd(value0_sse, _mm_loadu_pd(data1+@i@));
+        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
+        _mm_storeu_pd(data_out+@i@, b);
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        data_out[@i@] = @to@(value0 *
+                             @from@(data1[@i@]) +
+                             @from@(data_out[@i@]));
+/**end repeat2**/
+#endif
+        data1 += 8;
+        data_out += 8;
+    }
+
+    /* Finish off the loop */
+    if (count > 0) {
+        goto finish_after_unrolled_loop;
+    }
+}
+
+static void
+@name@_sum_of_products_contig_stride0_outcontig_two(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @type@ *data0 = (@type@ *)dataptr[0];
+    @temptype@ value1 = @from@(*(@type@ *)dataptr[1]);
+    @type@ *data_out = (@type@ *)dataptr[2];
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, b, value1_sse;
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, b, value1_sse;
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_stride0_outcontig_two (%d)\n",
+                                                    (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+            data_out[@i@] = @to@(@from@(data0[@i@]) *
+                                 value1 +
+                                 @from@(data_out[@i@]));
+/**end repeat2**/
+        case 0:
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    value1_sse = _mm_set_ps1(value1);
+
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data_out)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            a = _mm_mul_ps(_mm_load_ps(data0+@i@), value1_sse);
+            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
+            _mm_store_ps(data_out+@i@, b);
+/**end repeat2**/
+            data0 += 8;
+            data_out += 8;
+        }
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    value1_sse = _mm_set1_pd(value1);
+
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data_out)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            a = _mm_mul_pd(_mm_load_pd(data0+@i@), value1_sse);
+            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
+            _mm_store_pd(data_out+@i@, b);
+/**end repeat2**/
+            data0 += 8;
+            data_out += 8;
+        }
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), value1_sse);
+        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
+        _mm_storeu_ps(data_out+@i@, b);
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), value1_sse);
+        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
+        _mm_storeu_pd(data_out+@i@, b);
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        data_out[@i@] = @to@(@from@(data0[@i@]) *
+                             value1 +
+                             @from@(data_out[@i@]));
+/**end repeat2**/
+#endif
+        data0 += 8;
+        data_out += 8;
+    }
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+static void
+@name@_sum_of_products_contig_contig_outstride0_two(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @type@ *data0 = (@type@ *)dataptr[0];
+    @type@ *data1 = (@type@ *)dataptr[1];
+    @temptype@ accum = 0;
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, accum_sse = _mm_setzero_ps();
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, accum_sse = _mm_setzero_pd();
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_contig_outstride0_two (%d)\n",
+                                                    (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+            accum += @from@(data0[@i@]) * @from@(data1[@i@]);
+/**end repeat2**/
+        case 0:
+            *(@type@ *)dataptr[2] = @to@(@from@(*(@type@ *)dataptr[2]) + accum);
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+            _mm_prefetch(data0 + 512, _MM_HINT_T0);
+            _mm_prefetch(data1 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            a = _mm_mul_ps(_mm_load_ps(data0+@i@), _mm_load_ps(data1+@i@));
+            accum_sse = _mm_add_ps(accum_sse, a);
+/**end repeat2**/
+            data0 += 8;
+            data1 += 8;
+        }
+
+        /* Add the four SSE values and put in accum */
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        _mm_store_ss(&accum, accum_sse);
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+            _mm_prefetch(data0 + 512, _MM_HINT_T0);
+            _mm_prefetch(data1 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            a = _mm_mul_pd(_mm_load_pd(data0+@i@), _mm_load_pd(data1+@i@));
+            accum_sse = _mm_add_pd(accum_sse, a);
+/**end repeat2**/
+            data0 += 8;
+            data1 += 8;
+        }
+
+        /* Add the two SSE2 values and put in accum */
+        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+        accum_sse = _mm_add_pd(a, accum_sse);
+        _mm_store_sd(&accum, accum_sse);
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+        _mm_prefetch(data0 + 512, _MM_HINT_T0);
+        _mm_prefetch(data1 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), _mm_loadu_ps(data1+@i@));
+        accum_sse = _mm_add_ps(accum_sse, a);
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+        _mm_prefetch(data0 + 512, _MM_HINT_T0);
+        _mm_prefetch(data1 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), _mm_loadu_pd(data1+@i@));
+        accum_sse = _mm_add_pd(accum_sse, a);
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        accum += @from@(data0[@i@]) * @from@(data1[@i@]);
+/**end repeat2**/
+#endif
+        data0 += 8;
+        data1 += 8;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Add the four SSE values and put in accum */
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    _mm_store_ss(&accum, accum_sse);
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Add the two SSE2 values and put in accum */
+    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+    accum_sse = _mm_add_pd(a, accum_sse);
+    _mm_store_sd(&accum, accum_sse);
+#endif
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+static void
+@name@_sum_of_products_stride0_contig_outstride0_two(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @temptype@ value0 = @from@(*(@type@ *)dataptr[0]);
+    @type@ *data1 = (@type@ *)dataptr[1];
+    @temptype@ accum = 0;
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, accum_sse = _mm_setzero_ps();
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, accum_sse = _mm_setzero_pd();
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_stride0_contig_outstride0_two (%d)\n",
+                                                    (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+            accum += @from@(data1[@i@]);
+/**end repeat2**/
+        case 0:
+            *(@type@ *)dataptr[2] = @to@(@from@(*(@type@ *)dataptr[2]) + value0 * accum);
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data1)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            accum_sse = _mm_add_ps(accum_sse, _mm_load_ps(data1+@i@));
+/**end repeat2**/
+            data1 += 8;
+        }
+        /* Add the four SSE values and put in accum */
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        _mm_store_ss(&accum, accum_sse);
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data1)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            accum_sse = _mm_add_pd(accum_sse, _mm_load_pd(data1+@i@));
+/**end repeat2**/
+            data1 += 8;
+        }
+        /* Add the two SSE2 values and put in accum */
+        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+        accum_sse = _mm_add_pd(a, accum_sse);
+        _mm_store_sd(&accum, accum_sse);
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        accum_sse = _mm_add_ps(accum_sse, _mm_loadu_ps(data1+@i@));
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        accum_sse = _mm_add_pd(accum_sse, _mm_loadu_pd(data1+@i@));
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        accum += @from@(data1[@i@]);
+/**end repeat2**/
+#endif
+        data1 += 8;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Add the four SSE values and put in accum */
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    _mm_store_ss(&accum, accum_sse);
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Add the two SSE2 values and put in accum */
+    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+    accum_sse = _mm_add_pd(a, accum_sse);
+    _mm_store_sd(&accum, accum_sse);
+#endif
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+static void
+@name@_sum_of_products_contig_stride0_outstride0_two(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @type@ *data0 = (@type@ *)dataptr[0];
+    @temptype@ value1 = @from@(*(@type@ *)dataptr[1]);
+    @temptype@ accum = 0;
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, accum_sse = _mm_setzero_ps();
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, accum_sse = _mm_setzero_pd();
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_stride0_outstride0_two (%d)\n",
+                                                    (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+            accum += @from@(data0[@i@]);
+/**end repeat2**/
+        case 0:
+            *(@type@ *)dataptr[2] = @to@(@from@(*(@type@ *)dataptr[2]) + accum * value1);
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            accum_sse = _mm_add_ps(accum_sse, _mm_load_ps(data0+@i@));
+/**end repeat2**/
+            data0 += 8;
+        }
+        /* Add the four SSE values and put in accum */
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        _mm_store_ss(&accum, accum_sse);
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            accum_sse = _mm_add_pd(accum_sse, _mm_load_pd(data0+@i@));
+/**end repeat2**/
+            data0 += 8;
+        }
+        /* Add the two SSE2 values and put in accum */
+        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+        accum_sse = _mm_add_pd(a, accum_sse);
+        _mm_store_sd(&accum, accum_sse);
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        accum_sse = _mm_add_ps(accum_sse, _mm_loadu_ps(data0+@i@));
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        accum_sse = _mm_add_pd(accum_sse, _mm_loadu_pd(data0+@i@));
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        accum += @from@(data0[@i@]);
+/**end repeat2**/
+#endif
+        data0 += 8;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Add the four SSE values and put in accum */
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    _mm_store_ss(&accum, accum_sse);
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Add the two SSE2 values and put in accum */
+    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+    accum_sse = _mm_add_pd(a, accum_sse);
+    _mm_store_sd(&accum, accum_sse);
+#endif
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+#elif @nop@ == 3 && !@complex@
+
+static void
+@name@_sum_of_products_contig_three(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    @type@ *data0 = (@type@ *)dataptr[0];
+    @type@ *data1 = (@type@ *)dataptr[1];
+    @type@ *data2 = (@type@ *)dataptr[2];
+    @type@ *data_out = (@type@ *)dataptr[3];
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        data_out[@i@] = @to@(@from@(data0[@i@]) *
+                             @from@(data1[@i@]) *
+                             @from@(data2[@i@]) +
+                             @from@(data_out[@i@]));
+/**end repeat2**/
+        data0 += 8;
+        data1 += 8;
+        data2 += 8;
+        data_out += 8;
+    }
+
+    /* Finish off the loop */
+
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+    if (count-- == 0) {
+        return;
+    }
+    data_out[@i@] = @to@(@from@(data0[@i@]) *
+                         @from@(data1[@i@]) *
+                         @from@(data2[@i@]) +
+                         @from@(data_out[@i@]));
+/**end repeat2**/
+}
+
+#else /* @nop@ > 3 || @complex@ */
+
+static void
+@name@_sum_of_products_contig_@noplabel@(int nop, char **dataptr,
+                                npy_intp const *NPY_UNUSED(strides), npy_intp count)
+{
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_@noplabel@ (%d)\n",
+                                                    (int)count);
+
+    while (count--) {
+#if !@complex@
+        @temptype@ temp = @from@(*(@type@ *)dataptr[0]);
+        int i;
+        for (i = 1; i < nop; ++i) {
+            temp *= @from@(*(@type@ *)dataptr[i]);
+        }
+        *(@type@ *)dataptr[nop] = @to@(temp +
+                                           @from@(*(@type@ *)dataptr[i]));
+        for (i = 0; i <= nop; ++i) {
+            dataptr[i] += sizeof(@type@);
+        }
+#else /* complex */
+#  if @nop@ <= 3
+#    define _SUMPROD_NOP @nop@
+#  else
+#    define _SUMPROD_NOP nop
+#  endif
+        @temptype@ re, im, tmp;
+        int i;
+        re = ((@temptype@ *)dataptr[0])[0];
+        im = ((@temptype@ *)dataptr[0])[1];
+        for (i = 1; i < _SUMPROD_NOP; ++i) {
+            tmp = re * ((@temptype@ *)dataptr[i])[0] -
+                  im * ((@temptype@ *)dataptr[i])[1];
+            im = re * ((@temptype@ *)dataptr[i])[1] +
+                 im * ((@temptype@ *)dataptr[i])[0];
+            re = tmp;
+        }
+        ((@temptype@ *)dataptr[_SUMPROD_NOP])[0] = re +
+                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[0];
+        ((@temptype@ *)dataptr[_SUMPROD_NOP])[1] = im +
+                                     ((@temptype@ *)dataptr[_SUMPROD_NOP])[1];
+
+        for (i = 0; i <= _SUMPROD_NOP; ++i) {
+            dataptr[i] += sizeof(@type@);
+        }
+#  undef _SUMPROD_NOP
+#endif
+    }
+}
+
+#endif /* functions for various @nop@ */
+
+#if @nop@ == 1
+
+static void
+@name@_sum_of_products_contig_outstride0_one(int nop, char **dataptr,
+                                npy_intp const *strides, npy_intp count)
+{
+#if @complex@
+    @temptype@ accum_re = 0, accum_im = 0;
+    @temptype@ *data0 = (@temptype@ *)dataptr[0];
+#else
+    @temptype@ accum = 0;
+    @type@ *data0 = (@type@ *)dataptr[0];
+#endif
+
+#if EINSUM_USE_SSE1 && @float32@
+    __m128 a, accum_sse = _mm_setzero_ps();
+#elif EINSUM_USE_SSE2 && @float64@
+    __m128d a, accum_sse = _mm_setzero_pd();
+#endif
+
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_outstride0_one (%d)\n",
+                                                    (int)count);
+
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat2
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+#if !@complex@
+            accum += @from@(data0[@i@]);
+#else /* complex */
+            accum_re += data0[2*@i@+0];
+            accum_im += data0[2*@i@+1];
+#endif
+/**end repeat2**/
+        case 0:
+#if @complex@
+            ((@temptype@ *)dataptr[1])[0] += accum_re;
+            ((@temptype@ *)dataptr[1])[1] += accum_im;
+#else
+            *((@type@ *)dataptr[1]) = @to@(accum +
+                                    @from@(*((@type@ *)dataptr[1])));
+#endif
+            return;
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+            _mm_prefetch(data0 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            accum_sse = _mm_add_ps(accum_sse, _mm_load_ps(data0+@i@));
+/**end repeat2**/
+            data0 += 8;
+        }
+
+        /* Add the four SSE values and put in accum */
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+        accum_sse = _mm_add_ps(a, accum_sse);
+        _mm_store_ss(&accum, accum_sse);
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Use aligned instructions if possible */
+    if (EINSUM_IS_SSE_ALIGNED(data0)) {
+        /* Unroll the loop by 8 */
+        while (count >= 8) {
+            count -= 8;
+
+            _mm_prefetch(data0 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+            /*
+             * NOTE: This accumulation changes the order, so will likely
+             *       produce slightly different results.
+             */
+            accum_sse = _mm_add_pd(accum_sse, _mm_load_pd(data0+@i@));
+/**end repeat2**/
+            data0 += 8;
+        }
+
+        /* Add the two SSE2 values and put in accum */
+        a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+        accum_sse = _mm_add_pd(a, accum_sse);
+        _mm_store_sd(&accum, accum_sse);
+
+        /* Finish off the loop */
+        goto finish_after_unrolled_loop;
+    }
+#endif
+
+    /* Unroll the loop by 8 */
+    while (count >= 8) {
+        count -= 8;
+
+#if EINSUM_USE_SSE1 && @float32@
+        _mm_prefetch(data0 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 4#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        accum_sse = _mm_add_ps(accum_sse, _mm_loadu_ps(data0+@i@));
+/**end repeat2**/
+#elif EINSUM_USE_SSE2 && @float64@
+        _mm_prefetch(data0 + 512, _MM_HINT_T0);
+
+/**begin repeat2
+ * #i = 0, 2, 4, 6#
+ */
+        /*
+         * NOTE: This accumulation changes the order, so will likely
+         *       produce slightly different results.
+         */
+        accum_sse = _mm_add_pd(accum_sse, _mm_loadu_pd(data0+@i@));
+/**end repeat2**/
+#else
+/**begin repeat2
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+#  if !@complex@
+        accum += @from@(data0[@i@]);
+#  else /* complex */
+        accum_re += data0[2*@i@+0];
+        accum_im += data0[2*@i@+1];
+#  endif
+/**end repeat2**/
+#endif
+
+#if !@complex@
+        data0 += 8;
+#else
+        data0 += 8*2;
+#endif
+    }
+
+#if EINSUM_USE_SSE1 && @float32@
+    /* Add the four SSE values and put in accum */
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(2,3,0,1));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    a = _mm_shuffle_ps(accum_sse, accum_sse, _MM_SHUFFLE(1,0,3,2));
+    accum_sse = _mm_add_ps(a, accum_sse);
+    _mm_store_ss(&accum, accum_sse);
+#elif EINSUM_USE_SSE2 && @float64@
+    /* Add the two SSE2 values and put in accum */
+    a = _mm_shuffle_pd(accum_sse, accum_sse, _MM_SHUFFLE2(0,1));
+    accum_sse = _mm_add_pd(a, accum_sse);
+    _mm_store_sd(&accum, accum_sse);
+#endif
+
+    /* Finish off the loop */
+    goto finish_after_unrolled_loop;
+}
+
+#endif /* @nop@ == 1 */
+
+static void
+@name@_sum_of_products_outstride0_@noplabel@(int nop, char **dataptr,
+                                npy_intp const *strides, npy_intp count)
+{
+#if @complex@
+    @temptype@ accum_re = 0, accum_im = 0;
+#else
+    @temptype@ accum = 0;
+#endif
+
+#if (@nop@ == 1) || (@nop@ <= 3 && !@complex@)
+    char *data0 = dataptr[0];
+    npy_intp stride0 = strides[0];
+#endif
+#if (@nop@ == 2 || @nop@ == 3) && !@complex@
+    char *data1 = dataptr[1];
+    npy_intp stride1 = strides[1];
+#endif
+#if (@nop@ == 3) && !@complex@
+    char *data2 = dataptr[2];
+    npy_intp stride2 = strides[2];
+#endif
+
+    NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_outstride0_@noplabel@ (%d)\n",
+                                                    (int)count);
+
+    while (count--) {
+#if !@complex@
+#  if @nop@ == 1
+        accum += @from@(*(@type@ *)data0);
+        data0 += stride0;
+#  elif @nop@ == 2
+        accum += @from@(*(@type@ *)data0) *
+                 @from@(*(@type@ *)data1);
+        data0 += stride0;
+        data1 += stride1;
+#  elif @nop@ == 3
+        accum += @from@(*(@type@ *)data0) *
+                 @from@(*(@type@ *)data1) *
+                 @from@(*(@type@ *)data2);
+        data0 += stride0;
+        data1 += stride1;
+        data2 += stride2;
+#  else
+        @temptype@ temp = @from@(*(@type@ *)dataptr[0]);
+        int i;
+        for (i = 1; i < nop; ++i) {
+            temp *= @from@(*(@type@ *)dataptr[i]);
+        }
+        accum += temp;
+        for (i = 0; i < nop; ++i) {
+            dataptr[i] += strides[i];
+        }
+#  endif
+#else /* complex */
+#  if @nop@ == 1
+        accum_re += ((@temptype@ *)data0)[0];
+        accum_im += ((@temptype@ *)data0)[1];
+        data0 += stride0;
+#  else
+#    if @nop@ <= 3
+#define _SUMPROD_NOP @nop@
+#    else
+#define _SUMPROD_NOP nop
+#    endif
+        @temptype@ re, im, tmp;
+        int i;
+        re = ((@temptype@ *)dataptr[0])[0];
+        im = ((@temptype@ *)dataptr[0])[1];
+        for (i = 1; i < _SUMPROD_NOP; ++i) {
+            tmp = re * ((@temptype@ *)dataptr[i])[0] -
+                  im * ((@temptype@ *)dataptr[i])[1];
+            im = re * ((@temptype@ *)dataptr[i])[1] +
+                 im * ((@temptype@ *)dataptr[i])[0];
+            re = tmp;
+        }
+        accum_re += re;
+        accum_im += im;
+        for (i = 0; i < _SUMPROD_NOP; ++i) {
+            dataptr[i] += strides[i];
+        }
+#undef _SUMPROD_NOP
+#  endif
+#endif
+    }
+
+#if @complex@
+#  if @nop@ <= 3
+    ((@temptype@ *)dataptr[@nop@])[0] += accum_re;
+    ((@temptype@ *)dataptr[@nop@])[1] += accum_im;
+#  else
+    ((@temptype@ *)dataptr[nop])[0] += accum_re;
+    ((@temptype@ *)dataptr[nop])[1] += accum_im;
+#  endif
+#else
+#  if @nop@ <= 3
+    *((@type@ *)dataptr[@nop@]) = @to@(accum +
+                                    @from@(*((@type@ *)dataptr[@nop@])));
+#  else
+    *((@type@ *)dataptr[nop]) = @to@(accum +
+                                    @from@(*((@type@ *)dataptr[nop])));
+#  endif
+#endif
+
+}
+
+/**end repeat1**/
+
+/**end repeat**/
+
+
+/* Do OR of ANDs for the boolean type */
+
+/**begin repeat
+ * #nop = 1, 2, 3, 1000#
+ * #noplabel = one, two, three, any#
+ */
+
+static void
+bool_sum_of_products_@noplabel@(int nop, char **dataptr,
+                                npy_intp const *strides, npy_intp count)
+{
+#if (@nop@ <= 3)
+    char *data0 = dataptr[0];
+    npy_intp stride0 = strides[0];
+#endif
+#if (@nop@ == 2 || @nop@ == 3)
+    char *data1 = dataptr[1];
+    npy_intp stride1 = strides[1];
+#endif
+#if (@nop@ == 3)
+    char *data2 = dataptr[2];
+    npy_intp stride2 = strides[2];
+#endif
+#if (@nop@ <= 3)
+    char *data_out = dataptr[@nop@];
+    npy_intp stride_out = strides[@nop@];
+#endif
+
+    while (count--) {
+#if @nop@ == 1
+        *(npy_bool *)data_out = *(npy_bool *)data0 ||
+                                  *(npy_bool *)data_out;
+        data0 += stride0;
+        data_out += stride_out;
+#elif @nop@ == 2
+        *(npy_bool *)data_out = (*(npy_bool *)data0 &&
+                                   *(npy_bool *)data1) ||
+                                   *(npy_bool *)data_out;
+        data0 += stride0;
+        data1 += stride1;
+        data_out += stride_out;
+#elif @nop@ == 3
+        *(npy_bool *)data_out = (*(npy_bool *)data0 &&
+                                   *(npy_bool *)data1 &&
+                                   *(npy_bool *)data2) ||
+                                   *(npy_bool *)data_out;
+        data0 += stride0;
+        data1 += stride1;
+        data2 += stride2;
+        data_out += stride_out;
+#else
+        npy_bool temp = *(npy_bool *)dataptr[0];
+        int i;
+        for (i = 1; i < nop; ++i) {
+            temp = temp && *(npy_bool *)dataptr[i];
+        }
+        *(npy_bool *)dataptr[nop] = temp || *(npy_bool *)dataptr[i];
+        for (i = 0; i <= nop; ++i) {
+            dataptr[i] += strides[i];
+        }
+#endif
+    }
+}
+
+static void
+bool_sum_of_products_contig_@noplabel@(int nop, char **dataptr,
+                                npy_intp const *strides, npy_intp count)
+{
+#if (@nop@ <= 3)
+    char *data0 = dataptr[0];
+#endif
+#if (@nop@ == 2 || @nop@ == 3)
+    char *data1 = dataptr[1];
+#endif
+#if (@nop@ == 3)
+    char *data2 = dataptr[2];
+#endif
+#if (@nop@ <= 3)
+    char *data_out = dataptr[@nop@];
+#endif
+
+#if (@nop@ <= 3)
+/* This is placed before the main loop to make small counts faster */
+finish_after_unrolled_loop:
+    switch (count) {
+/**begin repeat1
+ * #i = 6, 5, 4, 3, 2, 1, 0#
+ */
+        case @i@+1:
+#  if @nop@ == 1
+            ((npy_bool *)data_out)[@i@] = ((npy_bool *)data0)[@i@] ||
+                                            ((npy_bool *)data_out)[@i@];
+#  elif @nop@ == 2
+            ((npy_bool *)data_out)[@i@] =
+                            (((npy_bool *)data0)[@i@] &&
+                             ((npy_bool *)data1)[@i@]) ||
+                                ((npy_bool *)data_out)[@i@];
+#  elif @nop@ == 3
+            ((npy_bool *)data_out)[@i@] =
+                           (((npy_bool *)data0)[@i@] &&
+                            ((npy_bool *)data1)[@i@] &&
+                            ((npy_bool *)data2)[@i@]) ||
+                                ((npy_bool *)data_out)[@i@];
+#  endif
+/**end repeat1**/
+        case 0:
+            return;
+    }
+#endif
+
+/* Unroll the loop by 8 for fixed-size nop */
+#if (@nop@ <= 3)
+    while (count >= 8) {
+        count -= 8;
+#else
+    while (count--) {
+#endif
+
+#  if @nop@ == 1
+/**begin repeat1
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        *((npy_bool *)data_out + @i@) = (*((npy_bool *)data0 + @i@)) ||
+                                        (*((npy_bool *)data_out + @i@));
+/**end repeat1**/
+        data0 += 8*sizeof(npy_bool);
+        data_out += 8*sizeof(npy_bool);
+#  elif @nop@ == 2
+/**begin repeat1
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        *((npy_bool *)data_out + @i@) =
+                        ((*((npy_bool *)data0 + @i@)) &&
+                         (*((npy_bool *)data1 + @i@))) ||
+                            (*((npy_bool *)data_out + @i@));
+/**end repeat1**/
+        data0 += 8*sizeof(npy_bool);
+        data1 += 8*sizeof(npy_bool);
+        data_out += 8*sizeof(npy_bool);
+#  elif @nop@ == 3
+/**begin repeat1
+ * #i = 0, 1, 2, 3, 4, 5, 6, 7#
+ */
+        *((npy_bool *)data_out + @i@) =
+                       ((*((npy_bool *)data0 + @i@)) &&
+                        (*((npy_bool *)data1 + @i@)) &&
+                        (*((npy_bool *)data2 + @i@))) ||
+                            (*((npy_bool *)data_out + @i@));
+/**end repeat1**/
+        data0 += 8*sizeof(npy_bool);
+        data1 += 8*sizeof(npy_bool);
+        data2 += 8*sizeof(npy_bool);
+        data_out += 8*sizeof(npy_bool);
+#  else
+        npy_bool temp = *(npy_bool *)dataptr[0];
+        int i;
+        for (i = 1; i < nop; ++i) {
+            temp = temp && *(npy_bool *)dataptr[i];
+        }
+        *(npy_bool *)dataptr[nop] = temp || *(npy_bool *)dataptr[i];
+        for (i = 0; i <= nop; ++i) {
+            dataptr[i] += sizeof(npy_bool);
+        }
+#  endif
+    }
+
+    /* If the loop was unrolled, we need to finish it off */
+#if (@nop@ <= 3)
+    goto finish_after_unrolled_loop;
+#endif
+}
+
+static void
+bool_sum_of_products_outstride0_@noplabel@(int nop, char **dataptr,
+                                npy_intp const *strides, npy_intp count)
+{
+    npy_bool accum = 0;
+
+#if (@nop@ <= 3)
+    char *data0 = dataptr[0];
+    npy_intp stride0 = strides[0];
+#endif
+#if (@nop@ == 2 || @nop@ == 3)
+    char *data1 = dataptr[1];
+    npy_intp stride1 = strides[1];
+#endif
+#if (@nop@ == 3)
+    char *data2 = dataptr[2];
+    npy_intp stride2 = strides[2];
+#endif
+
+    while (count--) {
+#if @nop@ == 1
+        accum = *(npy_bool *)data0 || accum;
+        data0 += stride0;
+#elif @nop@ == 2
+        accum = (*(npy_bool *)data0 && *(npy_bool *)data1) || accum;
+        data0 += stride0;
+        data1 += stride1;
+#elif @nop@ == 3
+        accum = (*(npy_bool *)data0 &&
+                 *(npy_bool *)data1 &&
+                 *(npy_bool *)data2) || accum;
+        data0 += stride0;
+        data1 += stride1;
+        data2 += stride2;
+#else
+        npy_bool temp = *(npy_bool *)dataptr[0];
+        int i;
+        for (i = 1; i < nop; ++i) {
+            temp = temp && *(npy_bool *)dataptr[i];
+        }
+        accum = temp || accum;
+        for (i = 0; i <= nop; ++i) {
+            dataptr[i] += strides[i];
+        }
+#endif
+    }
+
+#  if @nop@ <= 3
+    *((npy_bool *)dataptr[@nop@]) = accum || *((npy_bool *)dataptr[@nop@]);
+#  else
+    *((npy_bool *)dataptr[nop]) = accum || *((npy_bool *)dataptr[nop]);
+#  endif
+}
+
+/**end repeat**/
+
+/* These tables need to match up with the type enum */
+static sum_of_products_fn
+_contig_outstride0_unary_specialization_table[NPY_NTYPES] = {
+/**begin repeat
+ * #name = bool,
+ *         byte, ubyte,
+ *         short, ushort,
+ *         int, uint,
+ *         long, ulong,
+ *         longlong, ulonglong,
+ *         float, double, longdouble,
+ *         cfloat, cdouble, clongdouble,
+ *         object, string, unicode, void,
+ *         datetime, timedelta, half#
+ * #use = 0,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1, 1,
+ *        1, 1, 1,
+ *        0, 0, 0, 0,
+ *        0, 0, 1#
+ */
+#if @use@
+    &@name@_sum_of_products_contig_outstride0_one,
+#else
+    NULL,
+#endif
+/**end repeat**/
+}; /* End of _contig_outstride0_unary_specialization_table */
+
+static sum_of_products_fn _binary_specialization_table[NPY_NTYPES][5] = {
+/**begin repeat
+ * #name = bool,
+ *         byte, ubyte,
+ *         short, ushort,
+ *         int, uint,
+ *         long, ulong,
+ *         longlong, ulonglong,
+ *         float, double, longdouble,
+ *         cfloat, cdouble, clongdouble,
+ *         object, string, unicode, void,
+ *         datetime, timedelta, half#
+ * #use = 0,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1, 1,
+ *        0, 0, 0,
+ *        0, 0, 0, 0,
+ *        0, 0, 1#
+ */
+#if @use@
+{
+    &@name@_sum_of_products_stride0_contig_outstride0_two,
+    &@name@_sum_of_products_stride0_contig_outcontig_two,
+    &@name@_sum_of_products_contig_stride0_outstride0_two,
+    &@name@_sum_of_products_contig_stride0_outcontig_two,
+    &@name@_sum_of_products_contig_contig_outstride0_two,
+},
+#else
+    {NULL, NULL, NULL, NULL, NULL},
+#endif
+/**end repeat**/
+}; /* End of _binary_specialization_table */
+
+static sum_of_products_fn _outstride0_specialized_table[NPY_NTYPES][4] = {
+/**begin repeat
+ * #name = bool,
+ *         byte, ubyte,
+ *         short, ushort,
+ *         int, uint,
+ *         long, ulong,
+ *         longlong, ulonglong,
+ *         float, double, longdouble,
+ *         cfloat, cdouble, clongdouble,
+ *         object, string, unicode, void,
+ *         datetime, timedelta, half#
+ * #use = 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1, 1,
+ *        1, 1, 1,
+ *        0, 0, 0, 0,
+ *        0, 0, 1#
+ */
+#if @use@
+{
+    &@name@_sum_of_products_outstride0_any,
+    &@name@_sum_of_products_outstride0_one,
+    &@name@_sum_of_products_outstride0_two,
+    &@name@_sum_of_products_outstride0_three
+},
+#else
+    {NULL, NULL, NULL, NULL},
+#endif
+/**end repeat**/
+}; /* End of _outstride0_specialized_table */
+
+static sum_of_products_fn _allcontig_specialized_table[NPY_NTYPES][4] = {
+/**begin repeat
+ * #name = bool,
+ *         byte, ubyte,
+ *         short, ushort,
+ *         int, uint,
+ *         long, ulong,
+ *         longlong, ulonglong,
+ *         float, double, longdouble,
+ *         cfloat, cdouble, clongdouble,
+ *         object, string, unicode, void,
+ *         datetime, timedelta, half#
+ * #use = 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1, 1,
+ *        1, 1, 1,
+ *        0, 0, 0, 0,
+ *        0, 0, 1#
+ */
+#if @use@
+{
+    &@name@_sum_of_products_contig_any,
+    &@name@_sum_of_products_contig_one,
+    &@name@_sum_of_products_contig_two,
+    &@name@_sum_of_products_contig_three
+},
+#else
+    {NULL, NULL, NULL, NULL},
+#endif
+/**end repeat**/
+}; /* End of _allcontig_specialized_table */
+
+static sum_of_products_fn _unspecialized_table[NPY_NTYPES][4] = {
+/**begin repeat
+ * #name = bool,
+ *         byte, ubyte,
+ *         short, ushort,
+ *         int, uint,
+ *         long, ulong,
+ *         longlong, ulonglong,
+ *         float, double, longdouble,
+ *         cfloat, cdouble, clongdouble,
+ *         object, string, unicode, void,
+ *         datetime, timedelta, half#
+ * #use = 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1,
+ *        1, 1, 1,
+ *        1, 1, 1,
+ *        0, 0, 0, 0,
+ *        0, 0, 1#
+ */
+#if @use@
+{
+    &@name@_sum_of_products_any,
+    &@name@_sum_of_products_one,
+    &@name@_sum_of_products_two,
+    &@name@_sum_of_products_three
+},
+#else
+    {NULL, NULL, NULL, NULL},
+#endif
+/**end repeat**/
+}; /* End of _unspecialized_table */
+
+NPY_VISIBILITY_HIDDEN sum_of_products_fn
+get_sum_of_products_function(int nop, int type_num,
+                             npy_intp itemsize, npy_intp const *fixed_strides)
+{
+    int iop;
+
+    if (type_num >= NPY_NTYPES) {
+        return NULL;
+    }
+
+    /* contiguous reduction */
+    if (nop == 1 && fixed_strides[0] == itemsize && fixed_strides[1] == 0) {
+        sum_of_products_fn ret =
+            _contig_outstride0_unary_specialization_table[type_num];
+        if (ret != NULL) {
+            return ret;
+        }
+    }
+
+    /* nop of 2 has more specializations */
+    if (nop == 2) {
+        /* Encode the zero/contiguous strides */
+        int code;
+        code = (fixed_strides[0] == 0) ? 0 :
+                    (fixed_strides[0] == itemsize) ? 2*2*1 : 8;
+        code += (fixed_strides[1] == 0) ? 0 :
+                    (fixed_strides[1] == itemsize) ? 2*1 : 8;
+        code += (fixed_strides[2] == 0) ? 0 :
+                    (fixed_strides[2] == itemsize) ? 1 : 8;
+        if (code >= 2 && code < 7) {
+            sum_of_products_fn ret =
+                        _binary_specialization_table[type_num][code-2];
+            if (ret != NULL) {
+                return ret;
+            }
+        }
+    }
+
+    /* Inner loop with an output stride of 0 */
+    if (fixed_strides[nop] == 0) {
+        return _outstride0_specialized_table[type_num][nop <= 3 ? nop : 0];
+    }
+
+    /* Check for all contiguous */
+    for (iop = 0; iop < nop + 1; ++iop) {
+        if (fixed_strides[iop] != itemsize) {
+            break;
+        }
+    }
+
+    /* Contiguous loop */
+    if (iop == nop + 1) {
+        return _allcontig_specialized_table[type_num][nop <= 3 ? nop : 0];
+    }
+
+    /* None of the above specializations caught it, general loops */
+    return _unspecialized_table[type_num][nop <= 3 ? nop : 0];
+}
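Reviewer note on the `nop == 2` branch of `get_sum_of_products_function` above: the three fixed strides are folded into one small integer. A stride of zero contributes 0, a contiguous stride contributes its bit weight (4, 2, 1 for operand 0, operand 1, output), and any other stride contributes 8, which pushes the code out of the accepted range 2..6. The sketch below reproduces that arithmetic in isolation. It is an illustration under stated assumptions, not part of the patch: `stride_digit`, `binary_stride_code` and `binary_specialization_name` are hypothetical names, `ptrdiff_t` stands in for `npy_intp`, and the strings only echo the suffixes of the functions listed in `_binary_specialization_table`.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): map one stride to its
 * contribution to the code. 0 for a zero stride, `weight` for a
 * contiguous stride, 8 for anything else.
 */
static int
stride_digit(ptrdiff_t stride, ptrdiff_t itemsize, int weight)
{
    if (stride == 0) {
        return 0;
    }
    if (stride == itemsize) {
        return weight;
    }
    return 8;
}

/* Same arithmetic as the `code` computation in the nop == 2 branch. */
static int
binary_stride_code(const ptrdiff_t strides[3], ptrdiff_t itemsize)
{
    return stride_digit(strides[0], itemsize, 4) +
           stride_digit(strides[1], itemsize, 2) +
           stride_digit(strides[2], itemsize, 1);
}

/*
 * Illustrative lookup: codes 2..6 index the five table entries in order.
 * Codes 0 and 1 (both inputs stride 0), 7 (everything contiguous) and
 * anything >= 8 (some general stride) fall through to the later checks.
 */
static const char *
binary_specialization_name(int code)
{
    static const char *const names[5] = {
        "stride0_contig_outstride0",   /* code 2 = 0b010 */
        "stride0_contig_outcontig",    /* code 3 = 0b011 */
        "contig_stride0_outstride0",   /* code 4 = 0b100 */
        "contig_stride0_outcontig",    /* code 5 = 0b101 */
        "contig_contig_outstride0",    /* code 6 = 0b110 */
    };
    return (code >= 2 && code < 7) ? names[code - 2] : NULL;
}
```

For example, with an item size of 8, strides `{8, 8, 0}` (a dot-product style reduction) give code 6, while `{8, 8, 8}` gives 7 and is left to the all-contiguous table.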
diff --git a/numpy/core/src/multiarray/einsum_sumprod.h b/numpy/core/src/multiarray/einsum_sumprod.h
new file mode 100644
index 000000000..c6cf18ec6
--- /dev/null
+++ b/numpy/core/src/multiarray/einsum_sumprod.h
@@ -0,0 +1,12 @@
+#ifndef _NPY_MULTIARRAY_EINSUM_SUMPROD_H
+#define _NPY_MULTIARRAY_EINSUM_SUMPROD_H
+
+#include <numpy/npy_common.h>
+
+typedef void (*sum_of_products_fn)(int, char **, npy_intp const*, npy_intp);
+
+NPY_VISIBILITY_HIDDEN sum_of_products_fn
+get_sum_of_products_function(int nop, int type_num,
+                             npy_intp itemsize, npy_intp const *fixed_strides);
+
+#endif
diff --git a/numpy/core/src/multiarray/flagsobject.c b/numpy/core/src/multiarray/flagsobject.c
index bec0523d5..9b7d8deae 100644
--- a/numpy/core/src/multiarray/flagsobject.c
+++ b/numpy/core/src/multiarray/flagsobject.c
@@ -711,7 +711,7 @@ arrayflags_print(PyArrayFlagsObject *self)
     if (fl & NPY_ARRAY_WARN_ON_WRITE) {
         _warn_on_write = "  (with WARN_ON_WRITE=True)";
     }
-    return PyUString_FromFormat(
+    return PyUnicode_FromFormat(
                         "  %s : %s\n  %s : %s\n"
                         "  %s : %s\n  %s : %s%s\n"
                         "  %s : %s\n  %s : %s\n"
diff --git a/numpy/core/src/multiarray/getset.c b/numpy/core/src/multiarray/getset.c
index c8533539b..3575d6fad 100644
--- a/numpy/core/src/multiarray/getset.c
+++ b/numpy/core/src/multiarray/getset.c
@@ -217,7 +217,7 @@ array_protocol_descr_get(PyArrayObject *self)
     if (dobj == NULL) {
         return NULL;
     }
-    PyTuple_SET_ITEM(dobj, 0, PyString_FromString(""));
+    PyTuple_SET_ITEM(dobj, 0, PyUnicode_FromString(""));
     PyTuple_SET_ITEM(dobj, 1, array_typestr_get(self));
     res = PyList_New(1);
     if (res == NULL) {
@@ -621,7 +621,6 @@ static PyObject *
 array_struct_get(PyArrayObject *self)
 {
     PyArrayInterface *inter;
-    PyObject *ret;
 
     inter = (PyArrayInterface *)PyArray_malloc(sizeof(PyArrayInterface));
     if (inter==NULL) {
@@ -673,8 +672,14 @@ array_struct_get(PyArrayObject *self)
     else {
         inter->descr = NULL;
     }
+    PyObject *ret = PyCapsule_New(inter, NULL, gentype_struct_free);
+    if (ret == NULL) {
+        return NULL;
+    }
     Py_INCREF(self);
-    ret = NpyCapsule_FromVoidPtrAndDesc(inter, self, gentype_struct_free);
+    if (PyCapsule_SetContext(ret, self) < 0) {
+        return NULL;
+    }
     return ret;
 }
 
diff --git a/numpy/core/src/multiarray/mapping.c b/numpy/core/src/multiarray/mapping.c
index db15ff1d5..c0cea0f21 100644
--- a/numpy/core/src/multiarray/mapping.c
+++ b/numpy/core/src/multiarray/mapping.c
@@ -1418,7 +1418,7 @@ _get_field_view(PyArrayObject *arr, PyObject *ind, PyArrayObject **view)
             return 0;
         }
         else if (tup == NULL){
-            PyObject *errmsg = PyUString_FromString("no field of name ");
+            PyObject *errmsg = PyUnicode_FromString("no field of name ");
             PyUString_Concat(&errmsg, ind);
             PyErr_SetObject(PyExc_ValueError, errmsg);
             Py_DECREF(errmsg);
@@ -2438,7 +2438,7 @@ mapiter_fill_info(PyArrayMapIterObject *mit, npy_index_info *indices,
      * Attempt to set a meaningful exception. Could also find out
      * if a boolean index was converted.
      */
-    errmsg = PyUString_FromString("shape mismatch: indexing arrays could not "
+    errmsg = PyUnicode_FromString("shape mismatch: indexing arrays could not "
                                   "be broadcast together with shapes ");
     if (errmsg == NULL) {
         return -1;
@@ -3183,7 +3183,7 @@ PyArray_MapIterNew(npy_index_info *indices , int index_num, int index_type,
     goto finish;
 
   broadcast_error:
-    errmsg = PyUString_FromString("shape mismatch: value array "
+    errmsg = PyUnicode_FromString("shape mismatch: value array "
                     "of shape ");
     if (errmsg == NULL) {
         goto finish;
@@ -3204,7 +3204,7 @@ PyArray_MapIterNew(npy_index_info *indices , int index_num, int index_type,
         goto finish;
     }
 
-    tmp = PyUString_FromString("could not be broadcast to indexing "
+    tmp = PyUnicode_FromString("could not be broadcast to indexing "
                     "result of shape ");
     PyUString_ConcatAndDel(&errmsg, tmp);
     if (errmsg == NULL) {
diff --git a/numpy/core/src/multiarray/methods.c b/numpy/core/src/multiarray/methods.c
index 0519434e8..ae2dceb10 100644
--- a/numpy/core/src/multiarray/methods.c
+++ b/numpy/core/src/multiarray/methods.c
@@ -2585,9 +2585,10 @@ array_complex(PyArrayObject *self, PyObject *NPY_UNUSED(args))
     PyArrayObject *arr;
     PyArray_Descr *dtype;
     PyObject *c;
+
     if (PyArray_SIZE(self) != 1) {
-        PyErr_SetString(PyExc_TypeError, "only length-1 arrays can "\
-                        "be converted to Python scalars");
+        PyErr_SetString(PyExc_TypeError,
+                "only length-1 arrays can be converted to Python scalars");
         return NULL;
     }
 
@@ -2598,38 +2599,18 @@ array_complex(PyArrayObject *self, PyObject *NPY_UNUSED(args))
 
     if (!PyArray_CanCastArrayTo(self, dtype, NPY_SAME_KIND_CASTING) &&
             !(PyArray_TYPE(self) == NPY_OBJECT)) {
-        PyObject *err, *msg_part;
+        PyObject *descr = (PyObject*)PyArray_DESCR(self);
+
         Py_DECREF(dtype);
-        err = PyString_FromString("unable to convert ");
-        if (err == NULL) {
-            return NULL;
-        }
-        msg_part = PyObject_Repr((PyObject*)PyArray_DESCR(self));
-        if (msg_part == NULL) {
-            Py_DECREF(err);
-            return NULL;
-        }
-        PyString_ConcatAndDel(&err, msg_part);
-        if (err == NULL) {
-            return NULL;
-        }
-        msg_part = PyString_FromString(", to complex.");
-        if (msg_part == NULL) {
-            Py_DECREF(err);
-            return NULL;
-        }
-        PyString_ConcatAndDel(&err, msg_part);
-        if (err == NULL) {
-            return NULL;
-        }
-        PyErr_SetObject(PyExc_TypeError, err);
-        Py_DECREF(err);
+        PyErr_Format(PyExc_TypeError,
+                "Unable to convert %R to complex", descr);
         return NULL;
     }
 
     if (PyArray_TYPE(self) == NPY_OBJECT) {
         /* let python try calling __complex__ on the object. */
         PyObject *args, *res;
+
         Py_DECREF(dtype);
         args = Py_BuildValue("(O)", *((PyObject**)PyArray_DATA(self)));
         if (args == NULL) {
diff --git a/numpy/core/src/multiarray/multiarraymodule.c b/numpy/core/src/multiarray/multiarraymodule.c
index 923469edf..db419636d 100644
--- a/numpy/core/src/multiarray/multiarraymodule.c
+++ b/numpy/core/src/multiarray/multiarraymodule.c
@@ -4335,18 +4335,18 @@ NPY_VISIBILITY_HIDDEN PyObject * npy_ma_str_axis2 = NULL;
 static int
 intern_strings(void)
 {
-    npy_ma_str_array = PyUString_InternFromString("__array__");
-    npy_ma_str_array_prepare = PyUString_InternFromString("__array_prepare__");
-    npy_ma_str_array_wrap = PyUString_InternFromString("__array_wrap__");
-    npy_ma_str_array_finalize = PyUString_InternFromString("__array_finalize__");
-    npy_ma_str_ufunc = PyUString_InternFromString("__array_ufunc__");
-    npy_ma_str_implementation = PyUString_InternFromString("_implementation");
-    npy_ma_str_order = PyUString_InternFromString("order");
-    npy_ma_str_copy = PyUString_InternFromString("copy");
-    npy_ma_str_dtype = PyUString_InternFromString("dtype");
-    npy_ma_str_ndmin = PyUString_InternFromString("ndmin");
-    npy_ma_str_axis1 = PyUString_InternFromString("axis1");
-    npy_ma_str_axis2 = PyUString_InternFromString("axis2");
+    npy_ma_str_array = PyUnicode_InternFromString("__array__");
+    npy_ma_str_array_prepare = PyUnicode_InternFromString("__array_prepare__");
+    npy_ma_str_array_wrap = PyUnicode_InternFromString("__array_wrap__");
+    npy_ma_str_array_finalize = PyUnicode_InternFromString("__array_finalize__");
+    npy_ma_str_ufunc = PyUnicode_InternFromString("__array_ufunc__");
+    npy_ma_str_implementation = PyUnicode_InternFromString("_implementation");
+    npy_ma_str_order = PyUnicode_InternFromString("order");
+    npy_ma_str_copy = PyUnicode_InternFromString("copy");
+    npy_ma_str_dtype = PyUnicode_InternFromString("dtype");
+    npy_ma_str_ndmin = PyUnicode_InternFromString("ndmin");
+    npy_ma_str_axis1 = PyUnicode_InternFromString("axis1");
+    npy_ma_str_axis2 = PyUnicode_InternFromString("axis2");
 
     return npy_ma_str_array && npy_ma_str_array_prepare &&
            npy_ma_str_array_wrap && npy_ma_str_array_finalize &&
@@ -4477,14 +4477,14 @@ PyMODINIT_FUNC PyInit__multiarray_umath(void) {
         goto err;
     }
 
-    c_api = NpyCapsule_FromVoidPtr((void *)PyArray_API, NULL);
+    c_api = PyCapsule_New((void *)PyArray_API, NULL, NULL);
     if (c_api == NULL) {
         goto err;
     }
     PyDict_SetItemString(d, "_ARRAY_API", c_api);
     Py_DECREF(c_api);
 
-    c_api = NpyCapsule_FromVoidPtr((void *)PyUFunc_API, NULL);
+    c_api = PyCapsule_New((void *)PyUFunc_API, NULL, NULL);
     if (c_api == NULL) {
         goto err;
     }
@@ -4506,7 +4506,7 @@ PyMODINIT_FUNC PyInit__multiarray_umath(void) {
     PyDict_SetItemString(d, "tracemalloc_domain", s);
     Py_DECREF(s);
 
-    s = PyUString_FromString("3.1");
+    s = PyUnicode_FromString("3.1");
     PyDict_SetItemString(d, "__version__", s);
     Py_DECREF(s);
 
@@ -4540,7 +4540,7 @@ PyMODINIT_FUNC PyInit__multiarray_umath(void) {
     }
     Py_DECREF(s);
 
-    s = NpyCapsule_FromVoidPtr((void *)_datetime_strings, NULL);
+    s = PyCapsule_New((void *)_datetime_strings, NULL, NULL);
     if (s == NULL) {
         goto err;
     }
diff --git a/numpy/core/src/multiarray/nditer_constr.c b/numpy/core/src/multiarray/nditer_constr.c
index a0dda4090..4bc6d2ca1 100644
--- a/numpy/core/src/multiarray/nditer_constr.c
+++ b/numpy/core/src/multiarray/nditer_constr.c
@@ -1755,7 +1755,7 @@ broadcast_error: {
         char *tmpstr;
 
         if (op_axes == NULL) {
-            errmsg = PyUString_FromString("operands could not be broadcast "
+            errmsg = PyUnicode_FromString("operands could not be broadcast "
                                           "together with shapes ");
             if (errmsg == NULL) {
                 return 0;
@@ -1776,7 +1776,7 @@ broadcast_error: {
                 }
             }
             if (itershape != NULL) {
-                tmp = PyUString_FromString("and requested shape ");
+                tmp = PyUnicode_FromString("and requested shape ");
                 if (tmp == NULL) {
                     Py_DECREF(errmsg);
                     return 0;
@@ -1801,7 +1801,7 @@ broadcast_error: {
             Py_DECREF(errmsg);
         }
         else {
-            errmsg = PyUString_FromString("operands could not be broadcast "
+            errmsg = PyUnicode_FromString("operands could not be broadcast "
                                           "together with remapped shapes "
                                           "[original->remapped]: ");
             for (iop = 0; iop < nop; ++iop) {
@@ -1843,7 +1843,7 @@ broadcast_error: {
                 }
             }
             if (itershape != NULL) {
-                tmp = PyUString_FromString("and requested shape ");
+                tmp = PyUnicode_FromString("and requested shape ");
                 if (tmp == NULL) {
                     Py_DECREF(errmsg);
                     return 0;
@@ -1877,11 +1877,11 @@ operand_different_than_broadcast: {
 
         /* Start of error message */
         if (op_flags[iop] & NPY_ITER_READONLY) {
-            errmsg = PyUString_FromString("non-broadcastable operand "
+            errmsg = PyUnicode_FromString("non-broadcastable operand "
                                           "with shape ");
         }
         else {
-            errmsg = PyUString_FromString("non-broadcastable output "
+            errmsg = PyUnicode_FromString("non-broadcastable output "
                                           "operand with shape ");
         }
         if (errmsg == NULL) {
@@ -1913,7 +1913,7 @@ operand_different_than_broadcast: {
                 }
             }
 
-            tmp = PyUString_FromString(" [remapped to ");
+            tmp = PyUnicode_FromString(" [remapped to ");
             if (tmp == NULL) {
                 return 0;
             }
@@ -1932,7 +1932,7 @@ operand_different_than_broadcast: {
             }
         }
 
-        tmp = PyUString_FromString(" doesn't match the broadcast shape ");
+        tmp = PyUnicode_FromString(" doesn't match the broadcast shape ");
         if (tmp == NULL) {
             return 0;
         }
diff --git a/numpy/core/src/multiarray/nditer_pywrap.c b/numpy/core/src/multiarray/nditer_pywrap.c
index e271906c1..5b4836cc9 100644
--- a/numpy/core/src/multiarray/nditer_pywrap.c
+++ b/numpy/core/src/multiarray/nditer_pywrap.c
@@ -1142,7 +1142,7 @@ npyiter_dealloc(NewNpyArrayIterObject *self)
                     "results.", 1) < 0) {
                 PyObject *s;
 
-                s = PyUString_FromString("npyiter_dealloc");
+                s = PyUnicode_FromString("npyiter_dealloc");
                 if (s) {
                     PyErr_WriteUnraisable(s);
                     Py_DECREF(s);
diff --git a/numpy/core/src/multiarray/scalartypes.c.src b/numpy/core/src/multiarray/scalartypes.c.src
index 58b9e2c30..c1bff1e42 100644
--- a/numpy/core/src/multiarray/scalartypes.c.src
+++ b/numpy/core/src/multiarray/scalartypes.c.src
@@ -447,7 +447,7 @@ _void_to_hex(const char* argbuf, const Py_ssize_t arglen,
     }
     memcpy(&retbuf[j], echars, strlen(echars));
 
-    retval = PyUString_FromStringAndSize(retbuf, slen);
+    retval = PyUnicode_FromStringAndSize(retbuf, slen);
     PyMem_Free(retbuf);
 
     return retval;
@@ -518,21 +518,21 @@ datetimetype_repr(PyObject *self)
      */
     if ((scal->obmeta.num == 1 && scal->obmeta.base != NPY_FR_h) ||
             scal->obmeta.base == NPY_FR_GENERIC) {
-        ret = PyUString_FromString("numpy.datetime64('");
+        ret = PyUnicode_FromString("numpy.datetime64('");
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString(iso));
+                PyUnicode_FromString(iso));
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString("')"));
+                PyUnicode_FromString("')"));
     }
     else {
-        ret = PyUString_FromString("numpy.datetime64('");
+        ret = PyUnicode_FromString("numpy.datetime64('");
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString(iso));
+                PyUnicode_FromString(iso));
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString("','"));
+                PyUnicode_FromString("','"));
         ret = append_metastr_to_string(&scal->obmeta, 1, ret);
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString("')"));
+                PyUnicode_FromString("')"));
     }
 
     return ret;
@@ -554,31 +554,31 @@ timedeltatype_repr(PyObject *self)
 
     /* The value */
     if (scal->obval == NPY_DATETIME_NAT) {
-        ret = PyUString_FromString("numpy.timedelta64('NaT'");
+        ret = PyUnicode_FromString("numpy.timedelta64('NaT'");
     }
     else {
         /*
          * Can't use "%lld" if HAVE_LONG_LONG is not defined
          */
 #if defined(HAVE_LONG_LONG)
-        ret = PyUString_FromFormat("numpy.timedelta64(%lld",
+        ret = PyUnicode_FromFormat("numpy.timedelta64(%lld",
                                             (long long)scal->obval);
 #else
-        ret = PyUString_FromFormat("numpy.timedelta64(%ld",
+        ret = PyUnicode_FromFormat("numpy.timedelta64(%ld",
                                             (long)scal->obval);
 #endif
     }
     /* The metadata unit */
     if (scal->obmeta.base == NPY_FR_GENERIC) {
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString(")"));
+                PyUnicode_FromString(")"));
     }
     else {
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString(",'"));
+                PyUnicode_FromString(",'"));
         ret = append_metastr_to_string(&scal->obmeta, 1, ret);
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString("')"));
+                PyUnicode_FromString("')"));
     }
 
     return ret;
@@ -611,7 +611,7 @@ datetimetype_str(PyObject *self)
         return NULL;
     }
 
-    return PyUString_FromString(iso);
+    return PyUnicode_FromString(iso);
 }
 
 static char *_datetime_verbose_strings[NPY_DATETIME_NUMUNITS] = {
@@ -657,21 +657,21 @@ timedeltatype_str(PyObject *self)
     }
 
     if (scal->obval == NPY_DATETIME_NAT) {
-        ret = PyUString_FromString("NaT");
+        ret = PyUnicode_FromString("NaT");
     }
     else {
         /*
          * Can't use "%lld" if HAVE_LONG_LONG is not defined
          */
 #if defined(HAVE_LONG_LONG)
-        ret = PyUString_FromFormat("%lld ",
+        ret = PyUnicode_FromFormat("%lld ",
                                 (long long)(scal->obval * scal->obmeta.num));
 #else
-        ret = PyUString_FromFormat("%ld ",
+        ret = PyUnicode_FromFormat("%ld ",
                                 (long)(scal->obval * scal->obmeta.num));
 #endif
         PyUString_ConcatAndDel(&ret,
-                PyUString_FromString(basestr));
+                PyUnicode_FromString(basestr));
     }
 
     return ret;
@@ -795,7 +795,7 @@ legacy_@name@_format@kind@(@type@ val)
         PyOS_snprintf(buf, sizeof(buf), "(%s%sj)", re, im);
     }
 
-    return PyUString_FromString(buf);
+    return PyUnicode_FromString(buf);
 }
 
 #undef _FMT1
@@ -836,7 +836,7 @@ legacy_@name@_format@kind@(npy_@name@ val){
         strcpy(&buf[cnt],".0");
     }
 
-    return PyUString_FromString(buf);
+    return PyUnicode_FromString(buf);
 }
 
 #undef _FMT1
@@ -904,7 +904,7 @@ c@name@type_@kind@(PyObject *self)
             return NULL;
         }
 
-        PyUString_ConcatAndDel(&istr, PyUString_FromString("j"));
+        PyUString_ConcatAndDel(&istr, PyUnicode_FromString("j"));
         return istr;
     }
 
@@ -915,13 +915,13 @@ c@name@type_@kind@(PyObject *self)
         }
     }
     else if (npy_isnan(val.real)) {
-        rstr = PyUString_FromString("nan");
+        rstr = PyUnicode_FromString("nan");
     }
     else if (val.real > 0){
-        rstr = PyUString_FromString("inf");
+        rstr = PyUnicode_FromString("inf");
     }
     else {
-        rstr = PyUString_FromString("-inf");
+        rstr = PyUnicode_FromString("-inf");
     }
 
     if (npy_isfinite(val.imag)) {
@@ -931,19 +931,19 @@ c@name@type_@kind@(PyObject *self)
         }
     }
     else if (npy_isnan(val.imag)) {
-        istr = PyUString_FromString("+nan");
+        istr = PyUnicode_FromString("+nan");
     }
     else if (val.imag > 0){
-        istr = PyUString_FromString("+inf");
+        istr = PyUnicode_FromString("+inf");
     }
     else {
-        istr = PyUString_FromString("-inf");
+        istr = PyUnicode_FromString("-inf");
     }
 
-    ret = PyUString_FromString("(");
+    ret = PyUnicode_FromString("(");
     PyUString_ConcatAndDel(&ret, rstr);
     PyUString_ConcatAndDel(&ret, istr);
-    PyUString_ConcatAndDel(&ret, PyUString_FromString("j)"));
+    PyUString_ConcatAndDel(&ret, PyUnicode_FromString("j)"));
     return ret;
 }
 
@@ -1147,12 +1147,16 @@ gentype_sizeof(PyObject *self)
 NPY_NO_EXPORT void
 gentype_struct_free(PyObject *ptr)
 {
-    PyArrayInterface *arrif;
-    PyObject *context;
-
-    arrif = (PyArrayInterface*)PyCapsule_GetPointer(ptr, NULL);
-    context = (PyObject *)PyCapsule_GetContext(ptr);
-    Py_DECREF(context);
+    PyArrayInterface *arrif = (PyArrayInterface*)PyCapsule_GetPointer(ptr, NULL);
+    if (arrif == NULL) {
+        PyErr_WriteUnraisable(ptr);
+        return;
+    }
+    PyObject *context = (PyObject *)PyCapsule_GetContext(ptr);
+    if (context == NULL && PyErr_Occurred()) {
+        PyErr_WriteUnraisable(ptr);
+    }
+    Py_XDECREF(context);
     Py_XDECREF(arrif->descr);
     PyArray_free(arrif->shape);
     PyArray_free(arrif);
diff --git a/numpy/core/src/multiarray/shape.c b/numpy/core/src/multiarray/shape.c
index 73bb7933f..397d539c1 100644
--- a/numpy/core/src/multiarray/shape.c
+++ b/numpy/core/src/multiarray/shape.c
@@ -458,7 +458,7 @@ _attempt_nocopy_reshape(PyArrayObject *self, int newnd, const npy_intp *newdims,
 static void
 raise_reshape_size_mismatch(PyArray_Dims *newshape, PyArrayObject *arr)
 {
-    PyObject *msg = PyUString_FromFormat("cannot reshape array of size %zd "
+    PyObject *msg = PyUnicode_FromFormat("cannot reshape array of size %zd "
                                          "into shape ", PyArray_SIZE(arr));
     PyObject *tmp = convert_shape_to_string(newshape->len, newshape->ptr, "");
 
@@ -997,10 +997,10 @@ build_shape_string(npy_intp n, npy_intp const *vals)
     }
 
     if (i == n) {
-        return PyUString_FromFormat("()");
+        return PyUnicode_FromFormat("()");
     }
     else {
-        ret = PyUString_FromFormat("(%" NPY_INTP_FMT, vals[i++]);
+        ret = PyUnicode_FromFormat("(%" NPY_INTP_FMT, vals[i++]);
         if (ret == NULL) {
             return NULL;
         }
@@ -1008,10 +1008,10 @@ build_shape_string(npy_intp n, npy_intp const *vals)
 
     for (; i < n; ++i) {
         if (vals[i] < 0) {
-            tmp = PyUString_FromString(",newaxis");
+            tmp = PyUnicode_FromString(",newaxis");
         }
         else {
-            tmp = PyUString_FromFormat(",%" NPY_INTP_FMT, vals[i]);
+            tmp = PyUnicode_FromFormat(",%" NPY_INTP_FMT, vals[i]);
         }
         if (tmp == NULL) {
             Py_DECREF(ret);
@@ -1024,7 +1024,7 @@ build_shape_string(npy_intp n, npy_intp const *vals)
         }
     }
 
-    tmp = PyUString_FromFormat(")");
+    tmp = PyUnicode_FromFormat(")");
     PyUString_ConcatAndDel(&ret, tmp);
     return ret;
 }
diff --git a/numpy/core/src/multiarray/strfuncs.c b/numpy/core/src/multiarray/strfuncs.c
index 363cbdba2..d9d9b7c0a 100644
--- a/numpy/core/src/multiarray/strfuncs.c
+++ b/numpy/core/src/multiarray/strfuncs.c
@@ -3,14 +3,25 @@
 
 #include <Python.h>
 #include <numpy/arrayobject.h>
-
 #include "npy_pycompat.h"
-
+#include "npy_import.h"
 #include "strfuncs.h"
 
 static PyObject *PyArray_StrFunction = NULL;
 static PyObject *PyArray_ReprFunction = NULL;
 
+/* Set a new exception, chaining any already-pending one as its cause. */
+static void
+npy_PyErr_SetStringChained(PyObject *type, const char *message)
+{
+    PyObject *exc, *val, *tb;
+
+    PyErr_Fetch(&exc, &val, &tb);
+    PyErr_SetString(type, message);
+    npy_PyErr_ChainExceptionsCause(exc, val, tb);
+}
+
+
 /*NUMPY_API
  * Set the array print function to be a Python function.
  */
@@ -36,164 +47,52 @@ PyArray_SetStringFunction(PyObject *op, int repr)
 }
 
 
-/*
- * Extend string. On failure, returns NULL and leaves *strp alone.
- * XXX we do this in multiple places; time for a string library?
- */
-static char *
-extend_str(char **strp, Py_ssize_t n, Py_ssize_t *maxp)
-{
-    char *str = *strp;
-    Py_ssize_t new_cap;
-
-    if (n >= *maxp - 16) {
-        new_cap = *maxp * 2;
-
-        if (new_cap <= *maxp) {     /* overflow */
-            return NULL;
-        }
-        str = PyArray_realloc(*strp, new_cap);
-        if (str != NULL) {
-            *strp = str;
-            *maxp = new_cap;
-        }
-    }
-    return str;
-}
-
-
-static int
-dump_data(char **string, Py_ssize_t *n, Py_ssize_t *max_n, char *data, int nd,
-          npy_intp const *dimensions, npy_intp const *strides, PyArrayObject* self)
-{
-    PyObject *op = NULL, *sp = NULL;
-    char *ostring;
-    npy_intp i, N, ret = 0;
-
-#define CHECK_MEMORY do {                           \
-        if (extend_str(string, *n, max_n) == NULL) {    \
-            ret = -1;                               \
-            goto end;                               \
-        }                                           \
-    } while (0)
-
-    if (nd == 0) {
-        if ((op = PyArray_GETITEM(self, data)) == NULL) {
-            return -1;
-        }
-        sp = PyObject_Repr(op);
-        if (sp == NULL) {
-            ret = -1;
-            goto end;
-        }
-        ostring = PyString_AsString(sp);
-        N = PyString_Size(sp)*sizeof(char);
-        *n += N;
-        CHECK_MEMORY;
-        memmove(*string + (*n - N), ostring, N);
-    }
-    else {
-        CHECK_MEMORY;
-        (*string)[*n] = '[';
-        *n += 1;
-        for (i = 0; i < dimensions[0]; i++) {
-            if (dump_data(string, n, max_n,
-                          data + (*strides)*i,
-                          nd - 1, dimensions + 1,
-                          strides + 1, self) < 0) {
-                return -1;
-            }
-            CHECK_MEMORY;
-            if (i < dimensions[0] - 1) {
-                (*string)[*n] = ',';
-                (*string)[*n+1] = ' ';
-                *n += 2;
-            }
-        }
-        CHECK_MEMORY;
-        (*string)[*n] = ']';
-        *n += 1;
-    }
-
-#undef CHECK_MEMORY
-
-end:
-    Py_XDECREF(op);
-    Py_XDECREF(sp);
-    return ret;
-}
-
-
-static PyObject *
-array_repr_builtin(PyArrayObject *self, int repr)
-{
-    PyObject *ret;
-    char *string;
-    /* max_n initial value is arbitrary, dump_data will extend it */
-    Py_ssize_t n = 0, max_n = PyArray_NBYTES(self) * 4 + 7;
-
-    if ((string = PyArray_malloc(max_n)) == NULL) {
-        return PyErr_NoMemory();
-    }
-
-    if (dump_data(&string, &n, &max_n, PyArray_DATA(self),
-                  PyArray_NDIM(self), PyArray_DIMS(self),
-                  PyArray_STRIDES(self), self) < 0) {
-        PyArray_free(string);
-        return NULL;
-    }
-
-    if (repr) {
-        if (PyArray_ISEXTENDED(self)) {
-            ret = PyUString_FromFormat("array(%s, '%c%d')",
-                                       string,
-                                       PyArray_DESCR(self)->type,
-                                       PyArray_DESCR(self)->elsize);
-        }
-        else {
-            ret = PyUString_FromFormat("array(%s, '%c')",
-                                       string,
-                                       PyArray_DESCR(self)->type);
-        }
-    }
-    else {
-        ret = PyUString_FromStringAndSize(string, n);
-    }
-
-    PyArray_free(string);
-    return ret;
-}
-
-
 NPY_NO_EXPORT PyObject *
 array_repr(PyArrayObject *self)
 {
-    PyObject *s;
+    static PyObject *repr = NULL;
 
-    if (PyArray_ReprFunction == NULL) {
-        s = array_repr_builtin(self, 1);
+    if (PyArray_ReprFunction != NULL) {
+        return PyObject_CallFunctionObjArgs(PyArray_ReprFunction, self, NULL);
     }
-    else {
-        s = PyObject_CallFunctionObjArgs(PyArray_ReprFunction, self, NULL);
+
+    /*
+     * Import lazily: importing numpy.core.arrayprint at module load
+     * leads to circular import problems.
+     */
+    npy_cache_import("numpy.core.arrayprint", "_default_array_repr", &repr);
+    if (repr == NULL) {
+        npy_PyErr_SetStringChained(PyExc_RuntimeError,
+                "Unable to configure default ndarray.__repr__");
+        return NULL;
     }
-    return s;
+    return PyObject_CallFunctionObjArgs(repr, self, NULL);
 }
 
 
 NPY_NO_EXPORT PyObject *
 array_str(PyArrayObject *self)
 {
-    PyObject *s;
+    static PyObject *str = NULL;
 
-    if (PyArray_StrFunction == NULL) {
-        s = array_repr_builtin(self, 0);
+    if (PyArray_StrFunction != NULL) {
+        return PyObject_CallFunctionObjArgs(PyArray_StrFunction, self, NULL);
     }
-    else {
-        s = PyObject_CallFunctionObjArgs(PyArray_StrFunction, self, NULL);
+
+    /*
+     * Import lazily: importing numpy.core.arrayprint at module load
+     * leads to circular import problems.
+     */
+    npy_cache_import("numpy.core.arrayprint", "_default_array_str", &str);
+    if (str == NULL) {
+        npy_PyErr_SetStringChained(PyExc_RuntimeError,
+                "Unable to configure default ndarray.__str__");
+        return NULL;
     }
-    return s;
+    return PyObject_CallFunctionObjArgs(str, self, NULL);
 }
 
+
 NPY_NO_EXPORT PyObject *
 array_format(PyArrayObject *self, PyObject *args)
 {
@@ -221,4 +120,3 @@ array_format(PyArrayObject *self, PyObject *args)
         );
     }
 }
-
diff --git a/numpy/core/src/multiarray/usertypes.c b/numpy/core/src/multiarray/usertypes.c
index b97f0f8b8..6b6c6bd9d 100644
--- a/numpy/core/src/multiarray/usertypes.c
+++ b/numpy/core/src/multiarray/usertypes.c
@@ -272,7 +272,7 @@ PyArray_RegisterCastFunc(PyArray_Descr *descr, int totype,
     if (PyErr_Occurred()) {
         return -1;
     }
-    cobj = NpyCapsule_FromVoidPtr((void *)castfunc, NULL);
+    cobj = PyCapsule_New((void *)castfunc, NULL, NULL);
     if (cobj == NULL) {
         Py_DECREF(key);
         return -1;
diff --git a/numpy/core/src/umath/_rational_tests.c.src b/numpy/core/src/umath/_rational_tests.c.src
index 13e33d0a5..cbb6d9d17 100644
--- a/numpy/core/src/umath/_rational_tests.c.src
+++ b/numpy/core/src/umath/_rational_tests.c.src
@@ -526,11 +526,11 @@ static PyObject*
 pyrational_repr(PyObject* self) {
     rational x = ((PyRational*)self)->r;
     if (d(x)!=1) {
-        return PyUString_FromFormat(
+        return PyUnicode_FromFormat(
                 "rational(%ld,%ld)",(long)x.n,(long)d(x));
     }
     else {
-        return PyUString_FromFormat(
+        return PyUnicode_FromFormat(
                 "rational(%ld)",(long)x.n);
     }
 }
@@ -539,11 +539,11 @@ static PyObject*
 pyrational_str(PyObject* self) {
     rational x = ((PyRational*)self)->r;
     if (d(x)!=1) {
-        return PyUString_FromFormat(
+        return PyUnicode_FromFormat(
                 "%ld/%ld",(long)x.n,(long)d(x));
     }
     else {
-        return PyUString_FromFormat(
+        return PyUnicode_FromFormat(
                 "%ld",(long)x.n);
     }
 }
@@ -1126,7 +1126,7 @@ PyMODINIT_FUNC PyInit__rational_tests(void) {
     if (PyErr_Occurred()) {
         goto fail;
     }
-    numpy_str = PyUString_FromString("numpy");
+    numpy_str = PyUnicode_FromString("numpy");
     if (!numpy_str) {
         goto fail;
     }
diff --git a/numpy/core/src/umath/_umath_tests.c.src b/numpy/core/src/umath/_umath_tests.c.src
index d08aabd64..932c3b5ab 100644
--- a/numpy/core/src/umath/_umath_tests.c.src
+++ b/numpy/core/src/umath/_umath_tests.c.src
@@ -671,7 +671,7 @@ PyMODINIT_FUNC PyInit__umath_tests(void) {
 
     d = PyModule_GetDict(m);
 
-    version = PyString_FromString("0.1");
+    version = PyUnicode_FromString("0.1");
     PyDict_SetItemString(d, "__version__", version);
     Py_DECREF(version);
 
diff --git a/numpy/core/src/umath/extobj.c b/numpy/core/src/umath/extobj.c
index 4a953410a..cd81f7734 100644
--- a/numpy/core/src/umath/extobj.c
+++ b/numpy/core/src/umath/extobj.c
@@ -109,7 +109,7 @@ _error_handler(int method, PyObject *errobj, char *errtype, int retstatus, int *
                     errtype, name);
             goto fail;
         }
-        args = Py_BuildValue("NN", PyUString_FromString(errtype),
+        args = Py_BuildValue("NN", PyUnicode_FromString(errtype),
                 PyLong_FromLong((long) retstatus));
         if (args == NULL) {
             goto fail;
diff --git a/numpy/core/src/umath/override.c b/numpy/core/src/umath/override.c
index bf6e5a698..a0090e302 100644
--- a/numpy/core/src/umath/override.c
+++ b/numpy/core/src/umath/override.c
@@ -605,7 +605,7 @@ PyUFunc_CheckOverride(PyUFuncObject *ufunc, char *method,
         goto fail;
     }
 
-    method_name = PyUString_FromString(method);
+    method_name = PyUnicode_FromString(method);
     if (method_name == NULL) {
         goto fail;
     }
diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c
index 005556fb6..f693eb5c2 100644
--- a/numpy/core/src/umath/ufunc_object.c
+++ b/numpy/core/src/umath/ufunc_object.c
@@ -3318,7 +3318,6 @@ get_binary_op_function(PyUFuncObject *ufunc, int *otype,
                         void **out_innerloopdata)
 {
     int i;
-    PyUFunc_Loop1d *funcdata;
 
     NPY_UF_DBG_PRINT1("Getting binary op function for type number %d\n",
                                 *otype);
@@ -3336,7 +3335,10 @@ get_binary_op_function(PyUFuncObject *ufunc, int *otype,
             return -1;
         }
         else if (obj != NULL) {
-            funcdata = (PyUFunc_Loop1d *)NpyCapsule_AsVoidPtr(obj);
+            PyUFunc_Loop1d *funcdata = PyCapsule_GetPointer(obj, NULL);
+            if (funcdata == NULL) {
+                return -1;
+            }
             while (funcdata != NULL) {
                 int *types = funcdata->arg_types;
 
@@ -5190,9 +5192,12 @@ PyUFunc_RegisterLoopForDescr(PyUFuncObject *ufunc,
             result = -1;
         }
         else {
-            PyUFunc_Loop1d *current;
             int cmp = 1;
-            current = (PyUFunc_Loop1d *)NpyCapsule_AsVoidPtr(cobj);
+            PyUFunc_Loop1d *current = PyCapsule_GetPointer(cobj, NULL);
+            if (current == NULL) {
+                result = -1;
+                goto done;
+            }
             while (current != NULL) {
                 cmp = cmp_arg_types(current->arg_types,
                     arg_typenums, ufunc->nargs);
@@ -5226,6 +5231,7 @@ PyUFunc_RegisterLoopForDescr(PyUFuncObject *ufunc,
         }
     }
 
+done:
     PyArray_free(arg_typenums);
 
     Py_DECREF(key);
@@ -5294,7 +5300,7 @@ PyUFunc_RegisterLoopForType(PyUFuncObject *ufunc,
     }
     /* If it's not there, then make one and return. */
     else if (cobj == NULL) {
-        cobj = NpyCapsule_FromVoidPtr((void *)funcdata, _loop1d_list_free);
+        cobj = PyCapsule_New((void *)funcdata, NULL, _loop1d_list_free);
         if (cobj == NULL) {
             goto fail;
         }
@@ -5312,7 +5318,10 @@ PyUFunc_RegisterLoopForType(PyUFuncObject *ufunc,
          * is exactly like this one, then just replace.
          * Otherwise insert.
          */
-        current = (PyUFunc_Loop1d *)NpyCapsule_AsVoidPtr(cobj);
+        current = PyCapsule_GetPointer(cobj, NULL);
+        if (current == NULL) {
+            goto fail;
+        }
         while (current != NULL) {
             cmp = cmp_arg_types(current->arg_types, newtypes, ufunc->nargs);
             if (cmp >= 0) {
@@ -5383,7 +5392,7 @@ ufunc_dealloc(PyUFuncObject *ufunc)
 static PyObject *
 ufunc_repr(PyUFuncObject *ufunc)
 {
-    return PyUString_FromFormat("<ufunc '%s'>", ufunc->name);
+    return PyUnicode_FromFormat("<ufunc '%s'>", ufunc->name);
 }
 
 static int
@@ -5995,7 +6004,7 @@ ufunc_get_doc(PyUFuncObject *ufunc)
     }
     if (ufunc->doc != NULL) {
         PyUString_ConcatAndDel(&doc,
-            PyUString_FromFormat("\n\n%s", ufunc->doc));
+            PyUnicode_FromFormat("\n\n%s", ufunc->doc));
     }
     return doc;
 }
@@ -6051,7 +6060,7 @@ ufunc_get_types(PyUFuncObject *ufunc)
             t[ni + 2 + j] = _typecharfromnum(ufunc->types[n]);
             n++;
         }
-        str = PyUString_FromStringAndSize(t, no + ni + 2);
+        str = PyUnicode_FromStringAndSize(t, no + ni + 2);
         PyList_SET_ITEM(list, k, str);
     }
     PyArray_free(t);
@@ -6061,7 +6070,7 @@ ufunc_get_types(PyUFuncObject *ufunc)
 static PyObject *
 ufunc_get_name(PyUFuncObject *ufunc)
 {
-    return PyUString_FromString(ufunc->name);
+    return PyUnicode_FromString(ufunc->name);
 }
 
 static PyObject *
@@ -6077,7 +6086,7 @@ ufunc_get_signature(PyUFuncObject *ufunc)
     if (!ufunc->core_enabled) {
         Py_RETURN_NONE;
     }
-    return PyUString_FromString(ufunc->core_signature);
+    return PyUnicode_FromString(ufunc->core_signature);
 }
 
 #undef _typecharfromnum
diff --git a/numpy/core/src/umath/ufunc_type_resolution.c b/numpy/core/src/umath/ufunc_type_resolution.c
index fec3caef2..aa6f34d59 100644
--- a/numpy/core/src/umath/ufunc_type_resolution.c
+++ b/numpy/core/src/umath/ufunc_type_resolution.c
@@ -36,15 +36,15 @@ npy_casting_to_py_object(NPY_CASTING casting)
 {
     switch (casting) {
         case NPY_NO_CASTING:
-            return PyUString_FromString("no");
+            return PyUnicode_FromString("no");
         case NPY_EQUIV_CASTING:
-            return PyUString_FromString("equiv");
+            return PyUnicode_FromString("equiv");
         case NPY_SAFE_CASTING:
-            return PyUString_FromString("safe");
+            return PyUnicode_FromString("safe");
         case NPY_SAME_KIND_CASTING:
-            return PyUString_FromString("same_kind");
+            return PyUnicode_FromString("same_kind");
         case NPY_UNSAFE_CASTING:
-            return PyUString_FromString("unsafe");
+            return PyUnicode_FromString("unsafe");
         default:
             return PyLong_FromLong(casting);
     }
@@ -1336,7 +1336,6 @@ find_userloop(PyUFuncObject *ufunc,
                 void **out_innerloopdata)
 {
     npy_intp i, nin = ufunc->nin, j, nargs = nin + ufunc->nout;
-    PyUFunc_Loop1d *funcdata;
 
     /* Use this to try to avoid repeating the same userdef loop search */
     int last_userdef = -1;
@@ -1368,9 +1367,11 @@ find_userloop(PyUFuncObject *ufunc,
             else if (obj == NULL) {
                 continue;
             }
-            for (funcdata = (PyUFunc_Loop1d *)NpyCapsule_AsVoidPtr(obj);
-                 funcdata != NULL;
-                 funcdata = funcdata->next) {
+            PyUFunc_Loop1d *funcdata = PyCapsule_GetPointer(obj, NULL);
+            if (funcdata == NULL) {
+                return -1;
+            }
+            for (; funcdata != NULL; funcdata = funcdata->next) {
                 int *types = funcdata->arg_types;
 
                 for (j = 0; j < nargs; ++j) {
@@ -1744,7 +1745,6 @@ linear_search_userloop_type_resolver(PyUFuncObject *self,
                         char *out_err_dst_typecode)
 {
     npy_intp i, nop = self->nin + self->nout;
-    PyUFunc_Loop1d *funcdata;
 
     /* Use this to try to avoid repeating the same userdef loop search */
     int last_userdef = -1;
@@ -1776,9 +1776,11 @@ linear_search_userloop_type_resolver(PyUFuncObject *self,
             else if (obj == NULL) {
                 continue;
             }
-            for (funcdata = (PyUFunc_Loop1d *)NpyCapsule_AsVoidPtr(obj);
-                 funcdata != NULL;
-                 funcdata = funcdata->next) {
+            PyUFunc_Loop1d *funcdata = PyCapsule_GetPointer(obj, NULL);
+            if (funcdata == NULL) {
+                return -1;
+            }
+            for (; funcdata != NULL; funcdata = funcdata->next) {
                 int *types = funcdata->arg_types;
                 switch (ufunc_loop_matches(self, op,
                             input_casting, output_casting,
@@ -1816,7 +1818,6 @@ type_tuple_userloop_type_resolver(PyUFuncObject *self,
                         PyArray_Descr **out_dtype)
 {
     int i, j, nin = self->nin, nop = nin + self->nout;
-    PyUFunc_Loop1d *funcdata;
 
     /* Use this to try to avoid repeating the same userdef loop search */
     int last_userdef = -1;
@@ -1844,9 +1845,11 @@ type_tuple_userloop_type_resolver(PyUFuncObject *self,
                 continue;
             }
 
-            for (funcdata = (PyUFunc_Loop1d *)NpyCapsule_AsVoidPtr(obj);
-                 funcdata != NULL;
-                 funcdata = funcdata->next) {
+            PyUFunc_Loop1d *funcdata = PyCapsule_GetPointer(obj, NULL);
+            if (funcdata == NULL) {
+                return -1;
+            }
+            for (; funcdata != NULL; funcdata = funcdata->next) {
                 int *types = funcdata->arg_types;
                 int matched = 1;
 
diff --git a/numpy/core/src/umath/umathmodule.c b/numpy/core/src/umath/umathmodule.c
index 708a27ad0..ba7ac1706 100644
--- a/numpy/core/src/umath/umathmodule.c
+++ b/numpy/core/src/umath/umathmodule.c
@@ -237,23 +237,23 @@ NPY_VISIBILITY_HIDDEN PyObject *npy_um_str_pyvals_name = NULL;
 static int
 intern_strings(void)
 {
-    if (!(npy_um_str_out = PyUString_InternFromString("out"))) return -1;
-    if (!(npy_um_str_where = PyUString_InternFromString("where"))) return -1;
-    if (!(npy_um_str_axes = PyUString_InternFromString("axes"))) return -1;
-    if (!(npy_um_str_axis = PyUString_InternFromString("axis"))) return -1;
-    if (!(npy_um_str_keepdims = PyUString_InternFromString("keepdims"))) return -1;
-    if (!(npy_um_str_casting = PyUString_InternFromString("casting"))) return -1;
-    if (!(npy_um_str_order = PyUString_InternFromString("order"))) return -1;
-    if (!(npy_um_str_dtype = PyUString_InternFromString("dtype"))) return -1;
-    if (!(npy_um_str_subok = PyUString_InternFromString("subok"))) return -1;
-    if (!(npy_um_str_signature = PyUString_InternFromString("signature"))) return -1;
-    if (!(npy_um_str_sig = PyUString_InternFromString("sig"))) return -1;
-    if (!(npy_um_str_extobj = PyUString_InternFromString("extobj"))) return -1;
-    if (!(npy_um_str_array_prepare = PyUString_InternFromString("__array_prepare__"))) return -1;
-    if (!(npy_um_str_array_wrap = PyUString_InternFromString("__array_wrap__"))) return -1;
-    if (!(npy_um_str_array_finalize = PyUString_InternFromString("__array_finalize__"))) return -1;
-    if (!(npy_um_str_ufunc = PyUString_InternFromString("__array_ufunc__"))) return -1;
-    if (!(npy_um_str_pyvals_name = PyUString_InternFromString(UFUNC_PYVALS_NAME))) return -1;
+    if (!(npy_um_str_out = PyUnicode_InternFromString("out"))) return -1;
+    if (!(npy_um_str_where = PyUnicode_InternFromString("where"))) return -1;
+    if (!(npy_um_str_axes = PyUnicode_InternFromString("axes"))) return -1;
+    if (!(npy_um_str_axis = PyUnicode_InternFromString("axis"))) return -1;
+    if (!(npy_um_str_keepdims = PyUnicode_InternFromString("keepdims"))) return -1;
+    if (!(npy_um_str_casting = PyUnicode_InternFromString("casting"))) return -1;
+    if (!(npy_um_str_order = PyUnicode_InternFromString("order"))) return -1;
+    if (!(npy_um_str_dtype = PyUnicode_InternFromString("dtype"))) return -1;
+    if (!(npy_um_str_subok = PyUnicode_InternFromString("subok"))) return -1;
+    if (!(npy_um_str_signature = PyUnicode_InternFromString("signature"))) return -1;
+    if (!(npy_um_str_sig = PyUnicode_InternFromString("sig"))) return -1;
+    if (!(npy_um_str_extobj = PyUnicode_InternFromString("extobj"))) return -1;
+    if (!(npy_um_str_array_prepare = PyUnicode_InternFromString("__array_prepare__"))) return -1;
+    if (!(npy_um_str_array_wrap = PyUnicode_InternFromString("__array_wrap__"))) return -1;
+    if (!(npy_um_str_array_finalize = PyUnicode_InternFromString("__array_finalize__"))) return -1;
+    if (!(npy_um_str_ufunc = PyUnicode_InternFromString("__array_ufunc__"))) return -1;
+    if (!(npy_um_str_pyvals_name = PyUnicode_InternFromString(UFUNC_PYVALS_NAME))) return -1;
     return 0;
 }
 
diff --git a/numpy/core/tests/test_regression.py b/numpy/core/tests/test_regression.py
index 51cf7039f..a97198076 100644
--- a/numpy/core/tests/test_regression.py
+++ b/numpy/core/tests/test_regression.py
@@ -14,7 +14,7 @@ from numpy.testing import (
         assert_raises_regex, assert_warns, suppress_warnings,
         _assert_valid_refcount, HAS_REFCOUNT,
         )
-from numpy.testing._private.utils import _no_tracing
+from numpy.testing._private.utils import _no_tracing, requires_memory
 from numpy.compat import asbytes, asunicode, pickle
 
 try:
@@ -2488,3 +2488,29 @@ class TestRegression:
         assert arr.size * arr.itemsize > 2 ** 31
         c_arr = np.ctypeslib.as_ctypes(arr)
         assert_equal(c_arr._length_, arr.size)
+
+    def test_complex_conversion_error(self):
+        # gh-17068
+        with pytest.raises(TypeError, match=r"Unable to convert dtype.*"):
+            complex(np.array("now", np.datetime64))
+
+    def test__array_interface__descr(self):
+        # gh-17068
+        dt = np.dtype(dict(names=['a', 'b'],
+                           offsets=[0, 0],
+                           formats=[np.int64, np.int64]))
+        descr = np.array((1, 1), dtype=dt).__array_interface__['descr']
+        assert descr == [('', '|V8')]  # instead of [(b'', '|V8')]
+
+    @pytest.mark.skipif(sys.maxsize < 2 ** 31 + 1, reason='overflows 32-bit python')
+    @requires_memory(free_bytes=9e9)
+    def test_dot_big_stride(self):
+        # gh-17111
+        # blas stride = stride//itemsize > int32 max
+        int32_max = np.iinfo(np.int32).max
+        n = int32_max + 3
+        a = np.empty([n], dtype=np.float32)
+        b = a[::n-1]
+        b[...] = 1
+        assert b.strides[0] > int32_max * b.dtype.itemsize
+        assert np.dot(b, b) == 2.0
diff --git a/numpy/ctypeslib.py b/numpy/ctypeslib.py
index 76ba838b7..e8f7750fe 100644
--- a/numpy/ctypeslib.py
+++ b/numpy/ctypeslib.py
@@ -49,12 +49,11 @@ Then, we're ready to call ``foo_func``:
 >>> _lib.foo_func(out, len(out))                #doctest: +SKIP
 
 """
-__all__ = ['load_library', 'ndpointer', 'ctypes_load_library',
-           'c_intp', 'as_ctypes', 'as_array']
+__all__ = ['load_library', 'ndpointer', 'c_intp', 'as_ctypes', 'as_array']
 
 import os
 from numpy import (
-    integer, ndarray, dtype as _dtype, deprecate, array, frombuffer
+    integer, ndarray, dtype as _dtype, array, frombuffer
 )
 from numpy.core.multiarray import _flagdict, flagsobj
 
@@ -75,7 +74,6 @@ if ctypes is None:
 
         """
         raise ImportError("ctypes is not available.")
-    ctypes_load_library = _dummy
     load_library = _dummy
     as_ctypes = _dummy
     as_array = _dummy
@@ -154,8 +152,6 @@ else:
         ## if no successful return in the libname_ext loop:
         raise OSError("no file with expected extension")
 
-    ctypes_load_library = deprecate(load_library, 'ctypes_load_library',
-                                    'load_library')
 
 def _num_fromflags(flaglist):
     num = 0
diff --git a/numpy/f2py/cfuncs.py b/numpy/f2py/cfuncs.py
index ccbc9b0fb..9f5c73a45 100644
--- a/numpy/f2py/cfuncs.py
+++ b/numpy/f2py/cfuncs.py
@@ -320,10 +320,10 @@ cppmacros[
     'pyobj_from_complex_float1'] = '#define pyobj_from_complex_float1(v) (PyComplex_FromDoubles(v.r,v.i))'
 needs['pyobj_from_string1'] = ['string']
 cppmacros[
-    'pyobj_from_string1'] = '#define pyobj_from_string1(v) (PyString_FromString((char *)v))'
+    'pyobj_from_string1'] = '#define pyobj_from_string1(v) (PyUnicode_FromString((char *)v))'
 needs['pyobj_from_string1size'] = ['string']
 cppmacros[
-    'pyobj_from_string1size'] = '#define pyobj_from_string1size(v,len) (PyUString_FromStringAndSize((char *)v, len))'
+    'pyobj_from_string1size'] = '#define pyobj_from_string1size(v,len) (PyUnicode_FromStringAndSize((char *)v, len))'
 needs['TRYPYARRAYTEMPLATE'] = ['PRINTPYOBJERR']
 cppmacros['TRYPYARRAYTEMPLATE'] = """\
 /* New SciPy */
diff --git a/numpy/f2py/rules.py b/numpy/f2py/rules.py
index 56f2033ff..a14f60194 100755
--- a/numpy/f2py/rules.py
+++ b/numpy/f2py/rules.py
@@ -202,7 +202,7 @@ PyMODINIT_FUNC PyInit_#modulename#(void) {
 \tif (PyErr_Occurred())
 \t\t{PyErr_SetString(PyExc_ImportError, \"can't initialize module #modulename# (failed to import numpy)\"); return m;}
 \td = PyModule_GetDict(m);
-\ts = PyString_FromString(\"$R""" + """evision: $\");
+\ts = PyUnicode_FromString(\"$R""" + """evision: $\");
 \tPyDict_SetItemString(d, \"__version__\", s);
 \tPy_DECREF(s);
 \ts = PyUnicode_FromString(
diff --git a/numpy/f2py/src/test/foomodule.c b/numpy/f2py/src/test/foomodule.c
index caf3590d4..88ec62440 100644
--- a/numpy/f2py/src/test/foomodule.c
+++ b/numpy/f2py/src/test/foomodule.c
@@ -121,7 +121,7 @@ void initfoo() {
     m = Py_InitModule("foo", foo_module_methods);
 
     d = PyModule_GetDict(m);
-    s = PyString_FromString("This module 'foo' demonstrates the usage of fortranobject.");
+    s = PyUnicode_FromString("This module 'foo' demonstrates the usage of fortranobject.");
     PyDict_SetItemString(d, "__doc__", s);
 
     /* Fortran objects: */
diff --git a/numpy/f2py/tests/src/array_from_pyobj/wrapmodule.c b/numpy/f2py/tests/src/array_from_pyobj/wrapmodule.c
index 8b089d334..0411b62e0 100644
--- a/numpy/f2py/tests/src/array_from_pyobj/wrapmodule.c
+++ b/numpy/f2py/tests/src/array_from_pyobj/wrapmodule.c
@@ -1,14 +1,9 @@
-/* File: wrapmodule.c
- * This file is auto-generated with f2py (version:2_1330).
- * Hand edited by Pearu.
- * f2py is a Fortran to Python Interface Generator (FPIG), Second Edition,
- * written by Pearu Peterson <pearu@cens.ioc.ee>.
- * See http://cens.ioc.ee/projects/f2py2e/
- * Generation date: Fri Oct 21 22:41:12 2005
- * $Revision:$
- * $Date:$
- * Do not edit this file directly unless you know what you are doing!!!
+/*
+ * This file was auto-generated with f2py (version:2_1330) and hand edited by
+ * Pearu for testing purposes.  Do not edit this file unless you know what you
+ * are doing!!!
  */
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -149,9 +144,9 @@ PyMODINIT_FUNC PyInit_test_array_from_pyobj_ext(void) {
   if (PyErr_Occurred())
     Py_FatalError("can't initialize module wrap (failed to import numpy)");
   d = PyModule_GetDict(m);
-  s = PyString_FromString("This module 'wrap' is auto-generated with f2py (version:2_1330).\nFunctions:\n"
-"  arr = call(type_num,dims,intent,obj)\n"
-".");
+  s = PyUnicode_FromString("This module 'wrap' is auto-generated with f2py (version:2_1330).\nFunctions:\n"
+                           "  arr = call(type_num,dims,intent,obj)\n"
+                           ".");
   PyDict_SetItemString(d, "__doc__", s);
   wrap_error = PyErr_NewException ("wrap.error", NULL, NULL);
   Py_DECREF(s);
diff --git a/numpy/lib/arraysetops.py b/numpy/lib/arraysetops.py
index df9a110c5..6a2ad004c 100644
--- a/numpy/lib/arraysetops.py
+++ b/numpy/lib/arraysetops.py
@@ -278,7 +278,7 @@ def unique(ar, return_index=False, return_inverse=False,
         ar = np.moveaxis(ar, axis, 0)
     except np.AxisError:
         # this removes the "axis1" or "axis2" prefix from the error message
-        raise np.AxisError(axis, ar.ndim)
+        raise np.AxisError(axis, ar.ndim) from None
 
     # Must reshape to a contiguous 2D array for this to work...
     orig_shape, orig_dtype = ar.shape, ar.dtype
@@ -300,10 +300,10 @@ def unique(ar, return_index=False, return_inverse=False,
             # array with shape `(len(ar),)`.  Because `dtype` in this case has
             # itemsize 0, the total size of the result is still 0 bytes.
             consolidated = np.empty(len(ar), dtype=dtype)
-    except TypeError:
+    except TypeError as e:
         # There's no good way to do this for object arrays, etc...
         msg = 'The axis argument to unique is not supported for dtype {dt}'
-        raise TypeError(msg.format(dt=ar.dtype))
+        raise TypeError(msg.format(dt=ar.dtype)) from e
 
     def reshape_uniq(uniq):
         n = len(uniq)
diff --git a/numpy/lib/function_base.py b/numpy/lib/function_base.py
index 556227c0d..0db00a0f2 100644
--- a/numpy/lib/function_base.py
+++ b/numpy/lib/function_base.py
@@ -1991,8 +1991,8 @@ class vectorize:
         .. versionadded:: 1.7.0
 
     cache : bool, optional
-       If `True`, then cache the first function call that determines the number
-       of outputs if `otypes` is not provided.
+        If `True`, then cache the first function call that determines the number
+        of outputs if `otypes` is not provided.
 
         .. versionadded:: 1.7.0
 
diff --git a/numpy/lib/tests/test_financial_expired.py b/numpy/lib/tests/test_financial_expired.py
index e1d05da0c..66bb08026 100644
--- a/numpy/lib/tests/test_financial_expired.py
+++ b/numpy/lib/tests/test_financial_expired.py
@@ -3,10 +3,11 @@ import pytest
 import numpy as np
 
 
+@pytest.mark.skipif(sys.version_info[:2] < (3, 7),
+                    reason="requires python 3.7 or higher")
 def test_financial_expired():
-    if sys.version_info[:2] >= (3, 7):
-        match = 'NEP 32'
-    else:
-        match = None
-    with pytest.raises(AttributeError, match=match):
-        np.fv
+    match = 'NEP 32'
+    with pytest.warns(RuntimeWarning, match=match):
+        func = np.fv
+    with pytest.raises(RuntimeError, match=match):
+        func(1, 2, 3)
diff --git a/numpy/linalg/umath_linalg.c.src b/numpy/linalg/umath_linalg.c.src
index 59647c67d..1807aadcf 100644
--- a/numpy/linalg/umath_linalg.c.src
+++ b/numpy/linalg/umath_linalg.c.src
@@ -3665,7 +3665,7 @@ PyObject *PyInit__umath_linalg(void)
         return NULL;
     }
 
-    version = PyString_FromString(umath_linalg_version_string);
+    version = PyUnicode_FromString(umath_linalg_version_string);
     if (version == NULL) {
         return NULL;
     }
diff --git a/numpy/ma/timer_comparison.py b/numpy/ma/timer_comparison.py
index 83bd7852e..f5855efcf 100644
--- a/numpy/ma/timer_comparison.py
+++ b/numpy/ma/timer_comparison.py
@@ -100,9 +100,9 @@ class ModuleTester:
                                     header=header,
                                     names=('x', 'y'))
                 assert cond, msg
-        except ValueError:
+        except ValueError as e:
             msg = build_err_msg([x, y], err_msg, header=header, names=('x', 'y'))
-            raise ValueError(msg)
+            raise ValueError(msg) from e
 
     def assert_array_equal(self, x, y, err_msg=''):
         """
diff --git a/numpy/tests/typing/fail/fromnumeric.py b/numpy/tests/typing/fail/fromnumeric.py
index 66f8a89d0..c9156895d 100644
--- a/numpy/tests/typing/fail/fromnumeric.py
+++ b/numpy/tests/typing/fail/fromnumeric.py
@@ -124,3 +124,31 @@ np.amin(a, keepdims=1.0)  # E: No overload variant of "amin" matches argument ty
 np.amin(a, out=1.0)  # E: No overload variant of "amin" matches argument type
 np.amin(a, initial=[1.0])  # E: No overload variant of "amin" matches argument type
 np.amin(a, where=[1.0])  # E: List item 0 has incompatible type
+
+np.prod(a, axis=1.0)  # E: No overload variant of "prod" matches argument type
+np.prod(a, out=False)  # E: No overload variant of "prod" matches argument type
+np.prod(a, keepdims=1.0)  # E: No overload variant of "prod" matches argument type
+np.prod(a, initial=int)  # E: No overload variant of "prod" matches argument type
+np.prod(a, where=1.0)  # E: No overload variant of "prod" matches argument type
+
+np.cumprod(a, axis=1.0)  # E: Argument "axis" to "cumprod" has incompatible type
+np.cumprod(a, out=False)  # E: Argument "out" to "cumprod" has incompatible type
+
+np.size(a, axis=1.0)  # E: Argument "axis" to "size" has incompatible type
+
+np.around(a, decimals=1.0)  # E: No overload variant of "around" matches argument type
+np.around(a, out=type)  # E: No overload variant of "around" matches argument type
+
+np.mean(a, axis=1.0)  # E: No overload variant of "mean" matches argument type
+np.mean(a, out=False)  # E: No overload variant of "mean" matches argument type
+np.mean(a, keepdims=1.0)  # E: No overload variant of "mean" matches argument type
+
+np.std(a, axis=1.0)  # E: No overload variant of "std" matches argument type
+np.std(a, out=False)  # E: No overload variant of "std" matches argument type
+np.std(a, ddof='test')  # E: No overload variant of "std" matches argument type
+np.std(a, keepdims=1.0)  # E: No overload variant of "std" matches argument type
+
+np.var(a, axis=1.0)  # E: No overload variant of "var" matches argument type
+np.var(a, out=False)  # E: No overload variant of "var" matches argument type
+np.var(a, ddof='test')  # E: No overload variant of "var" matches argument type
+np.var(a, keepdims=1.0)  # E: No overload variant of "var" matches argument type
diff --git a/numpy/tests/typing/pass/fromnumeric.py b/numpy/tests/typing/pass/fromnumeric.py
index d9dd45c54..9e936e684 100644
--- a/numpy/tests/typing/pass/fromnumeric.py
+++ b/numpy/tests/typing/pass/fromnumeric.py
@@ -10,6 +10,7 @@ B.setflags(write=False)
 a = np.bool_(True)
 b = np.float32(1.0)
 c = 1.0
+d = np.array(1.0, dtype=np.float32)  # writeable
 
 np.take(a, 0)
 np.take(b, 0)
@@ -183,3 +184,77 @@ np.amin(A, axis=0)
 np.amin(B, axis=0)
 np.amin(A, keepdims=True)
 np.amin(B, keepdims=True)
+
+np.prod(a)
+np.prod(b)
+np.prod(c)
+np.prod(A)
+np.prod(B)
+np.prod(a, dtype=None)
+np.prod(A, dtype=None)
+np.prod(A, axis=0)
+np.prod(B, axis=0)
+np.prod(A, keepdims=True)
+np.prod(B, keepdims=True)
+np.prod(b, out=d)
+np.prod(B, out=d)
+
+np.cumprod(a)
+np.cumprod(b)
+np.cumprod(c)
+np.cumprod(A)
+np.cumprod(B)
+
+np.ndim(a)
+np.ndim(b)
+np.ndim(c)
+np.ndim(A)
+np.ndim(B)
+
+np.size(a)
+np.size(b)
+np.size(c)
+np.size(A)
+np.size(B)
+
+np.around(a)
+np.around(b)
+np.around(c)
+np.around(A)
+np.around(B)
+
+np.mean(a)
+np.mean(b)
+np.mean(c)
+np.mean(A)
+np.mean(B)
+np.mean(A, axis=0)
+np.mean(B, axis=0)
+np.mean(A, keepdims=True)
+np.mean(B, keepdims=True)
+np.mean(b, out=d)
+np.mean(B, out=d)
+
+np.std(a)
+np.std(b)
+np.std(c)
+np.std(A)
+np.std(B)
+np.std(A, axis=0)
+np.std(B, axis=0)
+np.std(A, keepdims=True)
+np.std(B, keepdims=True)
+np.std(b, out=d)
+np.std(B, out=d)
+
+np.var(a)
+np.var(b)
+np.var(c)
+np.var(A)
+np.var(B)
+np.var(A, axis=0)
+np.var(B, axis=0)
+np.var(A, keepdims=True)
+np.var(B, keepdims=True)
+np.var(b, out=d)
+np.var(B, out=d)
diff --git a/numpy/tests/typing/reveal/fromnumeric.py b/numpy/tests/typing/reveal/fromnumeric.py
index f5feb3f5f..06501f6e2 100644
--- a/numpy/tests/typing/reveal/fromnumeric.py
+++ b/numpy/tests/typing/reveal/fromnumeric.py
@@ -10,6 +10,7 @@ B.setflags(write=False)
 a = np.bool_(True)
 b = np.float32(1.0)
 c = 1.0
+d = np.array(1.0, dtype=np.float32)  # writeable
 
 reveal_type(np.take(a, 0))  # E: numpy.bool_
 reveal_type(np.take(b, 0))  # E: numpy.float32
@@ -203,3 +204,75 @@ reveal_type(np.amin(A, axis=0))  # E: Union[numpy.number, numpy.ndarray]
 reveal_type(np.amin(B, axis=0))  # E: Union[numpy.number, numpy.ndarray]
 reveal_type(np.amin(A, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
 reveal_type(np.amin(B, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+
+reveal_type(np.prod(a))  # E: numpy.number
+reveal_type(np.prod(b))  # E: numpy.float32
+reveal_type(np.prod(c))  # E: numpy.number
+reveal_type(np.prod(A))  # E: numpy.number
+reveal_type(np.prod(B))  # E: numpy.number
+reveal_type(np.prod(A, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.prod(B, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.prod(A, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.prod(B, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.prod(b, out=d))  # E: numpy.ndarray
+reveal_type(np.prod(B, out=d))  # E: numpy.ndarray
+
+reveal_type(np.cumprod(a))  # E: numpy.ndarray
+reveal_type(np.cumprod(b))  # E: numpy.ndarray
+reveal_type(np.cumprod(c))  # E: numpy.ndarray
+reveal_type(np.cumprod(A))  # E: numpy.ndarray
+reveal_type(np.cumprod(B))  # E: numpy.ndarray
+
+reveal_type(np.ndim(a))  # E: int
+reveal_type(np.ndim(b))  # E: int
+reveal_type(np.ndim(c))  # E: int
+reveal_type(np.ndim(A))  # E: int
+reveal_type(np.ndim(B))  # E: int
+
+reveal_type(np.size(a))  # E: int
+reveal_type(np.size(b))  # E: int
+reveal_type(np.size(c))  # E: int
+reveal_type(np.size(A))  # E: int
+reveal_type(np.size(B))  # E: int
+
+reveal_type(np.around(a))  # E: numpy.number
+reveal_type(np.around(b))  # E: numpy.float32
+reveal_type(np.around(c))  # E: numpy.number
+reveal_type(np.around(A))  # E: numpy.ndarray
+reveal_type(np.around(B))  # E: numpy.ndarray
+
+reveal_type(np.mean(a))  # E: numpy.number
+reveal_type(np.mean(b))  # E: numpy.number
+reveal_type(np.mean(c))  # E: numpy.number
+reveal_type(np.mean(A))  # E: numpy.number
+reveal_type(np.mean(B))  # E: numpy.number
+reveal_type(np.mean(A, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.mean(B, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.mean(A, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.mean(B, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.mean(b, out=d))  # E: numpy.ndarray
+reveal_type(np.mean(B, out=d))  # E: numpy.ndarray
+
+reveal_type(np.std(a))  # E: numpy.number
+reveal_type(np.std(b))  # E: numpy.number
+reveal_type(np.std(c))  # E: numpy.number
+reveal_type(np.std(A))  # E: numpy.number
+reveal_type(np.std(B))  # E: numpy.number
+reveal_type(np.std(A, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.std(B, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.std(A, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.std(B, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.std(b, out=d))  # E: numpy.ndarray
+reveal_type(np.std(B, out=d))  # E: numpy.ndarray
+
+reveal_type(np.var(a))  # E: numpy.number
+reveal_type(np.var(b))  # E: numpy.number
+reveal_type(np.var(c))  # E: numpy.number
+reveal_type(np.var(A))  # E: numpy.number
+reveal_type(np.var(B))  # E: numpy.number
+reveal_type(np.var(A, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.var(B, axis=0))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.var(A, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.var(B, keepdims=True))  # E: Union[numpy.number, numpy.ndarray]
+reveal_type(np.var(b, out=d))  # E: numpy.ndarray
+reveal_type(np.var(B, out=d))  # E: numpy.ndarray
diff --git a/runtests.py b/runtests.py
index 8aefab0db..ce351e3c4 100755
--- a/runtests.py
+++ b/runtests.py
@@ -108,6 +108,8 @@ def main(argv):
                         help="Start IPython shell with PYTHONPATH set")
     parser.add_argument("--shell", action="store_true",
                         help="Start Unix shell with PYTHONPATH set")
+    parser.add_argument("--mypy", action="store_true",
+                        help="Run mypy on files with NumPy on the MYPYPATH")
     parser.add_argument("--debug", "-g", action="store_true",
                         help="Debug build")
     parser.add_argument("--parallel", "-j", type=int, default=0,
@@ -131,7 +133,7 @@ def main(argv):
                               "COMMIT. Note that you need to commit your "
                               "changes first!"))
     parser.add_argument("args", metavar="ARGS", default=[], nargs=REMAINDER,
-                        help="Arguments to pass to Nose, asv, Python or shell")
+                        help="Arguments to pass to pytest, asv, mypy, Python or shell")
     args = parser.parse_args(argv)
 
     if args.durations < 0:
@@ -211,6 +213,35 @@ def main(argv):
         subprocess.call([shell] + extra_argv)
         sys.exit(0)
 
+    if args.mypy:
+        try:
+            import mypy.api
+        except ImportError as e:
+            raise RuntimeError(
+                "Mypy not found. Please install it by running "
+                "pip install -r test_requirements.txt from the repo root"
+            ) from e
+
+        os.environ['MYPYPATH'] = site_dir
+        # By default mypy won't color the output since it isn't being
+        # invoked from a tty.
+        os.environ['MYPY_FORCE_COLOR'] = '1'
+
+        config = os.path.join(
+            site_dir,
+            "numpy",
+            "tests",
+            "typing",
+            "mypy.ini",
+        )
+
+        report, errors, status = mypy.api.run(
+            ['--config-file', config] + args.args
+        )
+        print(report, end='')
+        print(errors, end='', file=sys.stderr)
+        sys.exit(status)
+
     if args.coverage:
         dst_dir = os.path.join(ROOT_DIR, 'build', 'coverage')
         fn = os.path.join(dst_dir, 'coverage_html.js')