Diffstat (limited to 'doc/source')
-rw-r--r--  doc/source/user/basics.interoperability.rst  352
-rw-r--r--  doc/source/user/basics.rst                      1
2 files changed, 353 insertions, 0 deletions
diff --git a/doc/source/user/basics.interoperability.rst b/doc/source/user/basics.interoperability.rst
new file mode 100644
index 000000000..444574e32
--- /dev/null
+++ b/doc/source/user/basics.interoperability.rst
@@ -0,0 +1,352 @@
+
+***************************
+Interoperability with NumPy
+***************************
+
+NumPy’s ndarray objects provide both a high-level API for operations on
+array-structured data and a concrete implementation of the API based on
+`strided in-RAM storage <https://numpy.org/doc/stable/reference/arrays.html>`__.
+While this API is powerful and fairly general, its concrete implementation has
+limitations. As datasets grow and NumPy becomes used in a variety of new
+environments and architectures, there are cases where the strided in-RAM storage
+strategy is inappropriate, which has caused different libraries to reimplement
+this API for their own uses. Examples include GPU arrays (CuPy_), sparse
+arrays (`scipy.sparse`, `PyData/Sparse <Sparse_>`_) and parallel arrays
+(Dask_ arrays), as well as various NumPy-like implementations in deep learning
+frameworks, like TensorFlow_ and PyTorch_. Similarly, many projects build on
+top of the NumPy API to add functionality such as labeled and indexed arrays
+(XArray_), automatic differentiation (JAX_), masked arrays (`numpy.ma`), and
+physical units (astropy.units_, pint_, unyt_).
+
+Yet, users still want to work with these arrays using the familiar NumPy API and
+re-use existing code with minimal (ideally zero) porting overhead. With this
+goal in mind, various protocols are defined for implementations of
+multi-dimensional arrays with high-level APIs matching NumPy.
+
+Using arbitrary objects in NumPy
+--------------------------------
+
+When NumPy functions encounter a foreign object, they will try (in order):
+
+1. The buffer protocol, described `in the Python C-API documentation
+ <https://docs.python.org/3/c-api/buffer.html>`__.
+2. The ``__array_interface__`` protocol, described
+ :ref:`in this page <arrays.interface>`. A precursor to Python’s buffer
+ protocol, it defines a way to access the contents of a NumPy array from other
+ C extensions.
+3. The ``__array__`` protocol, which asks an arbitrary object to convert itself
+ into an array.
+
+For both the buffer and the ``__array_interface__`` protocols, the object
+describes its memory layout and NumPy does everything else (zero-copy if
+possible). If that’s not possible, the object itself is responsible for
+returning an ``ndarray`` from ``__array__()``.
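As a quick illustration of the first conversion path, the standard-library ``array.array`` exposes its memory through the buffer protocol, so NumPy can wrap it zero-copy (a sketch; any buffer-exporting object would behave the same way):

```python
import array
import numpy as np

# array.array supports the Python buffer protocol, so np.asarray can
# create an ndarray view over its memory without copying.
buf = array.array("d", [1.0, 2.0, 3.0])
a = np.asarray(buf)

a[0] = 99.0        # mutate through the NumPy view...
print(buf[0])      # ...and the stdlib array sees the change
```

Because no copy was made, writes through the ndarray are visible in the original object.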
+
+The array interface
+~~~~~~~~~~~~~~~~~~~
+
+The :ref:`array interface <arrays.interface>` defines a protocol for array-like
+objects to re-use each other’s data buffers. Its implementation relies on the
+existence of the following attributes or methods:
+
+- ``__array_interface__``: a Python dictionary containing the shape, the
+ element type, and optionally, the data buffer address and the strides of an
+ array-like object;
+- ``__array__()``: a method returning the NumPy ndarray view of an array-like
+ object;
+- ``__array_struct__``: a ``PyCapsule`` containing a pointer to a
+ ``PyArrayInterface`` C-structure.
+
+The ``__array_interface__`` and ``__array_struct__`` attributes can be inspected
+directly:
+
+ >>> import numpy as np
+ >>> x = np.array([1, 2, 5.0, 8])
+ >>> x.__array_interface__
+ {'data': (94708397920832, False), 'strides': None, 'descr': [('', '<f8')], 'typestr': '<f8', 'shape': (4,), 'version': 3}
+ >>> x.__array_struct__
+ <capsule object NULL at 0x7f798800be40>
+
+The ``__array_interface__`` attribute can also be used to manipulate the object
+data in place:
+
+ >>> class Wrapper:
+ ...     pass
+ ...
+ >>> arr = np.array([1, 2, 3, 4])
+ >>> buf = arr.__array_interface__
+ >>> buf
+ {'data': (140497590272032, False), 'strides': None, 'descr': [('', '<i8')], 'typestr': '<i8', 'shape': (4,), 'version': 3}
+ >>> buf['shape'] = (2, 2)
+ >>> w = Wrapper()
+ >>> w.__array_interface__ = buf
+ >>> new_arr = np.array(w, copy=False)
+ >>> new_arr
+ array([[1, 2],
+ [3, 4]])
+
+We can check that ``arr`` and ``new_arr`` share the same data buffer:
+
+ >>> new_arr[0, 0] = 1000
+ >>> new_arr
+ array([[1000, 2],
+ [ 3, 4]])
+ >>> arr
+ array([1000, 2, 3, 4])
+
+
+The ``__array__`` protocol
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``__array__`` protocol acts as a dispatch mechanism and ensures that any
+NumPy-like object (an array, any object exposing the array interface, an object
+whose ``__array__`` method returns an array or any nested sequence) that
+implements it can be used as a NumPy array. If possible, this will mean using
+``__array__`` to create a NumPy ndarray view of the array-like object.
+Otherwise, this copies the data into a new ndarray object. This is not optimal,
+as coercing arrays into ndarrays may cause performance problems or create the
+need for copies and loss of metadata.
+
+To see an example of a custom array implementation including the use of the
+``__array__`` protocol, see `Writing custom array containers
+<https://numpy.org/devdocs/user/basics.dispatch.html>`__.
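To make the mechanism concrete, here is a minimal sketch of a container implementing ``__array__`` (the ``DiagonalArray`` name and its behavior are illustrative, modeled on the custom-container guide linked above, not part of the NumPy API):

```python
import numpy as np

class DiagonalArray:
    """A compact container storing only n and one value; materializes
    a full n-by-n diagonal ndarray on demand via __array__."""

    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array__(self, dtype=None):
        # Called by np.array()/np.asarray(); returns a plain ndarray.
        return self._value * np.eye(self._n, dtype=dtype)

d = DiagonalArray(3, 2)
arr = np.asarray(d)   # conversion goes through __array__
print(arr)
```

Note that the conversion materializes (copies) the data into a new ndarray, which is exactly the cost the paragraph above warns about.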
+
+Operating on foreign objects without converting
+-----------------------------------------------
+
+Consider the following function.
+
+ >>> import numpy as np
+ >>> def f(x):
+ ... return np.mean(np.exp(x))
+
+We can apply it to a NumPy ndarray object directly:
+
+ >>> x = np.array([1, 2, 3, 4])
+ >>> f(x)
+ 21.1977562209304
+
+We would like this function to work equally well with any NumPy-like array
+object. Some of this is possible today with various protocol mechanisms within
+NumPy.
+
+NumPy allows a class to indicate that it would like to handle computations in a
+custom-defined way through the following interfaces:
+
+- ``__array_ufunc__``: allows third-party objects to support and override
+ :ref:`ufuncs <ufuncs-basics>`.
+- ``__array_function__``: a catch-all for NumPy functionality that is not
+ covered by the ``__array_ufunc__`` protocol for universal functions.
+
+As long as foreign objects implement the ``__array_ufunc__`` or
+``__array_function__`` protocols, it is possible to operate on them without the
+need for explicit conversion.
+
+The ``__array_ufunc__`` protocol
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A :ref:`universal function (or ufunc for short) <ufuncs-basics>` is a
+“vectorized” wrapper for a function that takes a fixed number of specific inputs
+and produces a fixed number of specific outputs. The output of the ufunc (and
+its methods) is not necessarily an ndarray if not all input arguments are
+ndarrays. Indeed, if any input defines an ``__array_ufunc__`` method, control
+will be passed completely to that function, i.e., the ufunc is overridden.
+
+A subclass can override what happens when executing NumPy ufuncs on it by
+overriding the default ``ndarray.__array_ufunc__`` method. This method is
+executed instead of the ufunc and should return either the result of the
+operation, or ``NotImplemented`` if the operation requested is not implemented.
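A minimal sketch of a duck array intercepting ufuncs through ``__array_ufunc__`` (the ``Wrapped`` class is hypothetical; it unwraps its inputs, lets NumPy compute, and re-wraps the result, returning ``NotImplemented`` for ufunc methods it doesn't handle):

```python
import numpy as np

class Wrapped:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls are supported in this sketch; reduce,
        # accumulate, etc. fall through to NotImplemented.
        if method != "__call__" or kwargs.get("out") is not None:
            return NotImplemented
        # Unwrap any Wrapped inputs to plain ndarrays, compute, re-wrap.
        unwrapped = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*unwrapped, **kwargs))

w = Wrapped([1.0, 2.0])
result = np.exp(w)   # dispatches to Wrapped.__array_ufunc__
print(type(result).__name__, result.data)
```

Because the input defines ``__array_ufunc__``, ``np.exp`` never coerces it to an ndarray; the result stays a ``Wrapped`` instance.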
+
+The ``__array_function__`` protocol
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To achieve enough coverage of the NumPy API to support downstream projects,
+there is a need to go beyond ``__array_ufunc__`` and implement a protocol that
+allows arguments of a NumPy function to take control and divert execution to
+another function (for example, a GPU or parallel implementation) in a way that
+is safe and consistent across projects.
+
+The semantics of ``__array_function__`` are very similar to ``__array_ufunc__``,
+except the operation is specified by an arbitrary callable object rather than a
+ufunc instance and method. For more details, see `NEP 18
+<https://numpy.org/neps/nep-0018-array-function-protocol.html>`__.
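As an illustration, a class can keep a registry of per-function overrides and divert, for example, ``np.sum`` (the names ``MyArray``, ``HANDLED``, and ``implements`` are illustrative, following the decorator pattern described in NEP 18):

```python
import numpy as np

HANDLED = {}  # maps NumPy functions to our overrides

def implements(numpy_func):
    """Register an implementation for a NumPy function."""
    def decorator(impl):
        HANDLED[numpy_func] = impl
        return impl
    return decorator

class MyArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Divert only registered functions; defer on everything else.
        if func not in HANDLED:
            return NotImplemented
        return HANDLED[func](*args, **kwargs)

@implements(np.sum)
def my_sum(arr):
    return MyArray(np.sum(arr.data))

m = MyArray([1, 2, 3])
total = np.sum(m)   # diverted to my_sum via __array_function__
```

Unlike ``__array_ufunc__``, the first argument received is the NumPy function object itself (here ``np.sum``), which is why a lookup table keyed on functions works.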
+
+
+Interoperability examples
+-------------------------
+
+Example: Pandas ``Series`` objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the following:
+
+ >>> import pandas as pd
+ >>> ser = pd.Series([1, 2, 3, 4])
+ >>> type(ser)
+ <class 'pandas.core.series.Series'>
+
+Now, ``ser`` is **not** an ndarray, but because it
+`implements the __array_ufunc__ protocol
+<https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe-interoperability-with-numpy-functions>`__,
+we can apply ufuncs to it as if it were an ndarray:
+
+ >>> np.exp(ser)
+ 0 2.718282
+ 1 7.389056
+ 2 20.085537
+ 3 54.598150
+ dtype: float64
+ >>> np.sin(ser)
+ 0 0.841471
+ 1 0.909297
+ 2 0.141120
+ 3 -0.756802
+ dtype: float64
+
+We can even do operations with other ndarrays:
+
+ >>> np.add(ser, np.array([5, 6, 7, 8]))
+ 0 6
+ 1 8
+ 2 10
+ 3 12
+ dtype: int64
+ >>> f(ser)
+ 21.1977562209304
+ >>> result = ser.__array__()
+ >>> type(result)
+ <class 'numpy.ndarray'>
+
+Example: PyTorch tensors
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+`PyTorch <https://pytorch.org/>`__ is an optimized tensor library for deep
+learning using GPUs and CPUs. PyTorch arrays are commonly called *tensors*.
+Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or
+other hardware accelerators. In fact, tensors and NumPy arrays can often share
+the same underlying memory, eliminating the need to copy data.
+
+ >>> import torch
+ >>> data = [[1, 2],[3, 4]]
+ >>> x_np = np.array(data)
+ >>> x_tensor = torch.tensor(data)
+
+Note that ``x_np`` and ``x_tensor`` are different kinds of objects:
+
+ >>> x_np
+ array([[1, 2],
+ [3, 4]])
+ >>> x_tensor
+ tensor([[1, 2],
+ [3, 4]])
+
+However, we can treat PyTorch tensors as NumPy arrays without the need for
+explicit conversion:
+
+ >>> np.exp(x_tensor)
+ tensor([[ 2.7183, 7.3891],
+ [20.0855, 54.5982]], dtype=torch.float64)
+
+Also, note that the return type of this function is compatible with the initial
+data type.
+
+**Note** PyTorch does not implement ``__array_function__`` or
+``__array_ufunc__``. Under the hood, the ``Tensor.__array__()`` method returns a
+NumPy ndarray as a view of the tensor data buffer. See `this issue
+<https://github.com/pytorch/pytorch/issues/24015>`__ and the
+`__torch_function__ implementation
+<https://github.com/pytorch/pytorch/blob/master/torch/overrides.py>`__
+for details.
+
+Example: CuPy arrays
+~~~~~~~~~~~~~~~~~~~~
+
+CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing
+with Python. CuPy implements a subset of the NumPy interface by implementing
+``cupy.ndarray``, `a counterpart to NumPy ndarrays
+<https://docs.cupy.dev/en/stable/reference/ndarray.html>`__.
+
+ >>> import cupy as cp
+ >>> x_gpu = cp.array([1, 2, 3, 4])
+
+The ``cupy.ndarray`` object implements the ``__array_ufunc__`` interface. This
+enables NumPy ufuncs to be applied directly to CuPy arrays:
+
+ >>> np.mean(np.exp(x_gpu))
+ array(21.19775622)
+
+Note that the return type of these operations is still consistent with the
+initial type:
+
+ >>> arr = cp.random.randn(1, 2, 3, 4).astype(cp.float32)
+ >>> result = np.sum(arr)
+ >>> print(type(result))
+ <class 'cupy._core.core.ndarray'>
+
+See `this page in the CuPy documentation for details
+<https://docs.cupy.dev/en/stable/reference/ufunc.html>`__.
+
+``cupy.ndarray`` also implements the ``__array_function__`` interface, meaning
+it is possible to do operations such as:
+
+ >>> a = np.random.randn(100, 100)
+ >>> a_gpu = cp.asarray(a)
+ >>> qr_gpu = np.linalg.qr(a_gpu)
+
+CuPy implements many NumPy functions on ``cupy.ndarray`` objects, but not all.
+See `the CuPy documentation
+<https://docs.cupy.dev/en/stable/user_guide/difference.html>`__
+for details.
+
+Example: Dask arrays
+~~~~~~~~~~~~~~~~~~~~
+
+Dask is a flexible library for parallel computing in Python. Dask Array
+implements a subset of the NumPy ndarray interface using blocked algorithms,
+cutting up the large array into many small arrays. This allows computations on
+larger-than-memory arrays using multiple cores.
+
+Dask supports array protocols like ``__array__`` and
+``__array_ufunc__``.
+
+ >>> import dask.array as da
+ >>> x = da.random.normal(1, 0.1, size=(20, 20), chunks=(10, 10))
+ >>> np.mean(np.exp(x))
+ dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
+ >>> np.mean(np.exp(x)).compute()
+ 5.090097550553843
+
+**Note** Dask is lazily evaluated, and the result from a computation isn’t
+computed until you ask for it by invoking ``compute()``.
+
+See `the Dask array documentation
+<https://docs.dask.org/en/stable/array.html>`__
+and the `scope of Dask arrays interoperability with NumPy arrays
+<https://docs.dask.org/en/stable/array.html#scope>`__ for details.
+
+Further reading
+---------------
+
+- `The Array interface
+ <https://numpy.org/doc/stable/reference/arrays.interface.html>`__
+- `Writing custom array containers
+  <https://numpy.org/devdocs/user/basics.dispatch.html>`__
+- `Special array attributes
+ <https://numpy.org/devdocs/reference/arrays.classes.html#special-attributes-and-methods>`__
+ (details on the ``__array_ufunc__`` and ``__array_function__`` protocols)
+- `NumPy roadmap: interoperability
+ <https://numpy.org/neps/roadmap.html#interoperability>`__
+- `PyTorch documentation on the Bridge with NumPy
+ <https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#bridge-to-np-label>`__
+
+.. _CuPy: https://cupy.dev/
+.. _Sparse: https://sparse.pydata.org/
+.. _Dask: https://docs.dask.org/
+.. _TensorFlow: https://www.tensorflow.org/
+.. _PyTorch: https://pytorch.org/
+.. _XArray: http://xarray.pydata.org/
+.. _JAX: https://jax.readthedocs.io/
+.. _astropy.units: https://docs.astropy.org/en/stable/units/
+.. _pint: https://pint.readthedocs.io/
+.. _unyt: https://unyt.readthedocs.io/
diff --git a/doc/source/user/basics.rst b/doc/source/user/basics.rst
index affb85db2..c004d8978 100644
--- a/doc/source/user/basics.rst
+++ b/doc/source/user/basics.rst
@@ -20,3 +20,4 @@ fundamental NumPy ideas and philosophy.
basics.subclassing
basics.ufuncs
basics.copies
+ basics.interoperability