author    Mukulika <mukulikapahari@gmail.com>  2021-09-13 15:31:56 +0530
committer Mukulika <mukulikapahari@gmail.com>  2021-09-13 15:42:42 +0530
commit    50850c0fbb274484dada41b1e2f3567de30aa0c5 (patch)
tree      8fddc5061b122e5a49f6d1c7d4b006d3c833b22f /doc/source/dev
parent    b892ed2c7fa27b2e0d73c12d12ace4b4d4e12897 (diff)
download  numpy-50850c0fbb274484dada41b1e2f3567de30aa0c5.tar.gz
DOC: Moved NumPy Internals into Under-the-hood docs
Diffstat (limited to 'doc/source/dev')
-rw-r--r--  doc/source/dev/alignment.rst                    113
-rw-r--r--  doc/source/dev/internals.code-explanations.rst  646
-rw-r--r--  doc/source/dev/internals.rst                    175
-rw-r--r--  doc/source/dev/underthehood.rst                  10
4 files changed, 943 insertions(+), 1 deletion(-)
diff --git a/doc/source/dev/alignment.rst b/doc/source/dev/alignment.rst
new file mode 100644
index 000000000..bb1198ebf
--- /dev/null
+++ b/doc/source/dev/alignment.rst
@@ -0,0 +1,113 @@
+.. currentmodule:: numpy
+
+.. _alignment:
+
+****************
+Memory Alignment
+****************
+
+NumPy alignment goals
+=====================
+
+There are three use-cases related to memory alignment in NumPy (as of 1.14):
+
+ 1. Creating :term:`structured datatypes <structured data type>` with
+ :term:`fields <field>` aligned like in a C-struct.
+ 2. Speeding up copy operations by using :class:`uint` assignment instead of
+ ``memcpy``.
+ 3. Guaranteeing safe aligned access for ufuncs/setitem/casting code.
+
+NumPy uses two different forms of alignment to achieve these goals:
+"True alignment" and "Uint alignment".
+
+"True" alignment refers to the architecture-dependent alignment of an
+equivalent C-type in C. For example, on x64 systems :attr:`float64` is
+equivalent to ``double`` in C. On most systems, this has either an alignment of
+4 or 8 bytes (and this can be controlled in GCC by the option
+``-malign-double``). A variable is aligned in memory if its memory offset is a
+multiple of its alignment. On some systems (e.g. sparc) memory alignment is
+required; on others, it gives a speedup.
+
+"Uint" alignment depends on the size of a datatype. It is defined to be the
+"True alignment" of the uint used by NumPy's copy-code to copy the datatype, or
+undefined/unaligned if there is no equivalent uint. Currently, NumPy uses
+``uint8``, ``uint16``, ``uint32``, ``uint64``, and ``uint64`` to copy data of
+sizes 1, 2, 4, 8, and 16 bytes respectively; datatypes of any other size cannot
+be uint-aligned.
+
+For example, on a (typical Linux x64 GCC) system, the NumPy :attr:`complex64`
+datatype is implemented as ``struct { float real, imag; }``. This has "true"
+alignment of 4 and "uint" alignment of 8 (equal to the true alignment of
+``uint64``).
+
+Some cases where uint and true alignment are different (default GCC Linux):
+ ====== ========= ======== ========
+ arch type true-aln uint-aln
+ ====== ========= ======== ========
+ x86_64 complex64 4 8
+ x86_64 float128 16 8
+ x86 float96 4 \-
+ ====== ========= ======== ========
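+
+For illustration, the "true" alignment NumPy records for a datatype can be read
+from Python via :attr:`dtype.alignment`; the exact values below are
+platform-dependent (shown for a typical Linux x86_64 build), so treat this as a
+sketch rather than guaranteed output::
+
+    >>> import numpy as np
+    >>> np.dtype(np.complex64).alignment   # "true" alignment of the C struct
+    4
+    >>> np.dtype(np.complex64).itemsize    # 8 bytes, hence uint64-copyable
+    8
+    >>> np.dtype(np.float64).alignment
+    8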
+
+
+Variables in NumPy which control and describe alignment
+=======================================================
+
+There are 4 relevant uses of the word ``align`` used in NumPy:
+
+ * The :attr:`dtype.alignment` attribute (``descr->alignment`` in C). This is
+ meant to reflect the "true alignment" of the type. It has arch-dependent
+ default values for all datatypes, except for the structured types created
+ with ``align=True`` as described below.
+ * The ``ALIGNED`` flag of an ndarray, computed in ``IsAligned`` and checked
+ by :c:func:`PyArray_ISALIGNED`. This is computed from
+ :attr:`dtype.alignment`.
+ It is set to ``True`` if every item in the array is at a memory location
+ consistent with :attr:`dtype.alignment`, which is the case if the
+ ``data ptr`` and all strides of the array are multiples of that alignment.
+ * The ``align`` keyword of the dtype constructor, which only affects
+ :ref:`structured_arrays`. If the structure's field offsets are not manually
+ provided, NumPy determines offsets automatically. In that case,
+ ``align=True`` pads the structure so that each field is "true" aligned in
+ memory and sets :attr:`dtype.alignment` to be the largest of the field
+   "true" alignments. This is like what C-structs usually do. Otherwise, if
+   offsets or itemsize were manually provided, ``align=True`` simply checks that
+   all the fields are "true" aligned and that the total itemsize is a multiple
+   of the largest field alignment (see the example after this list). In either
+   case :attr:`dtype.isalignedstruct` is also set to ``True``.
+ * ``IsUintAligned`` is used to determine if an ndarray is "uint aligned" in
+ an analogous way to how ``IsAligned`` checks for true alignment.
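+
+As a rough illustration of the ``align`` keyword described above (a sketch; the
+alignment and itemsize values shown are platform-dependent)::
+
+    >>> import numpy as np
+    >>> dt = np.dtype([('a', np.uint8), ('b', np.float64)], align=True)
+    >>> dt.alignment, dt.itemsize, dt.isalignedstruct
+    (8, 16, True)
+    >>> np.dtype([('a', np.uint8), ('b', np.float64)]).itemsize   # packed layout
+    9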
+
+Consequences of alignment
+=========================
+
+Here is how the variables above are used:
+
+ 1. Creating aligned structs: To know how to offset a field when
+ ``align=True``, NumPy looks up ``field.dtype.alignment``. This includes
+ fields that are nested structured arrays.
+ 2. Ufuncs: If the ``ALIGNED`` flag of an array is False, ufuncs will
+ buffer/cast the array before evaluation. This is needed since ufunc inner
+ loops access raw elements directly, which might fail on some archs if the
+ elements are not true-aligned.
+ 3. Getitem/setitem/copyswap function: Similar to ufuncs, these functions
+ generally have two code paths. If ``ALIGNED`` is False they will
+ use a code path that buffers the arguments so they are true-aligned.
+ 4. Strided copy code: Here, "uint alignment" is used instead. If the itemsize
+ of an array is equal to 1, 2, 4, 8 or 16 bytes and the array is uint
+    aligned, then NumPy will do ``*(uintN*)dst = *(uintN*)src`` for the
+    appropriate N. Otherwise, NumPy copies by doing ``memcpy(dst, src, N)``.
+ 5. Nditer code: Since this often calls the strided copy code, it must
+ check for "uint alignment".
+ 6. Cast code: This checks for "true" alignment, as it does
+ ``*dst = CASTFUNC(*src)`` if aligned. Otherwise, it does
+ ``memmove(srcval, src); dstval = CASTFUNC(srcval); memmove(dst, dstval)``
+ where dstval/srcval are aligned.
+
+Note that the strided-copy and strided-cast code are deeply intertwined and so
+any arrays being processed by them must be both uint and true aligned, even
+though the copy-code only needs uint alignment and the cast code only true
+alignment. If there is ever a big rewrite of this code it would be good to
+allow them to use different alignments.
+
+
diff --git a/doc/source/dev/internals.code-explanations.rst b/doc/source/dev/internals.code-explanations.rst
new file mode 100644
index 000000000..b6edd61b1
--- /dev/null
+++ b/doc/source/dev/internals.code-explanations.rst
@@ -0,0 +1,646 @@
+.. currentmodule:: numpy
+
+.. _c-code-explanations:
+
+*************************
+NumPy C code explanations
+*************************
+
+ Fanaticism consists of redoubling your efforts when you have forgotten
+ your aim.
+ --- *George Santayana*
+
+ An authority is a person who can tell you more about something than
+ you really care to know.
+ --- *Unknown*
+
+This page attempts to explain the logic behind some of the new
+pieces of code. The purpose behind these explanations is to enable
+somebody to be able to understand the ideas behind the implementation
+somewhat more easily than just staring at the code. Perhaps in this
+way, the algorithms can be improved on, borrowed from, and/or
+optimized by more people.
+
+
+Memory model
+============
+
+.. index::
+ pair: ndarray; memory model
+
+One fundamental aspect of the :class:`ndarray` is that an array is seen as a
+"chunk" of memory starting at some location. The interpretation of
+this memory depends on the :term:`stride` information. For each dimension in
+an :math:`N`-dimensional array, an integer (:term:`stride`) dictates how many
+bytes must be skipped to get to the next element in that dimension.
+Unless you have a single-segment array, this :term:`stride` information must
+be consulted when traversing through an array. It is not difficult to
+write code that accepts strides; you just have to use ``char*``
+pointers because strides are in units of bytes. Keep in mind also that
+strides do not have to be unit-multiples of the element size. Also,
+remember that if the number of dimensions of the array is 0 (sometimes
+called a ``rank-0`` array), then the :term:`strides <stride>` and
+:term:`dimensions <dimension>` variables are ``NULL``.
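+
+As a small pure-Python sketch (not the C code itself), the byte offset of
+element ``(i, j)`` of a 2-D array is just the index dotted with the strides::
+
+    >>> import numpy as np
+    >>> a = np.arange(12, dtype=np.int32).reshape(3, 4)
+    >>> a.strides                  # bytes to skip per dimension (C order here)
+    (16, 4)
+    >>> i, j = 2, 1
+    >>> offset = i * a.strides[0] + j * a.strides[1]
+    >>> bool(a.reshape(-1)[offset // a.itemsize] == a[i, j])
+    True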
+
+Besides the structural information contained in the strides and
+dimensions members of the :c:type:`PyArrayObject`, the flags contain
+important information about how the data may be accessed. In particular,
+the :c:data:`NPY_ARRAY_ALIGNED` flag is set when the memory is on a
+suitable boundary according to the array's datatype. Even if you have
+a :term:`contiguous` chunk of memory, you cannot just assume it is safe to
+dereference a datatype-specific pointer to an element; this is a safe
+operation only if the :c:data:`NPY_ARRAY_ALIGNED` flag is set. On
+some platforms it will work, but on others, like Solaris, it will cause
+a bus error. The :c:data:`NPY_ARRAY_WRITEABLE` flag should also be ensured
+if you plan on writing to the memory area of the array. It is also
+possible to obtain a pointer to an unwritable memory area. Sometimes,
+writing to the memory area when the :c:data:`NPY_ARRAY_WRITEABLE` flag is not
+set will just be rude. Other times it can cause program crashes (*e.g.*
+a data-area that is a read-only memory-mapped file).
+
+
+Data-type encapsulation
+=======================
+
+.. seealso:: :ref:`arrays.dtypes`
+
+.. index::
+ single: dtype
+
+The :ref:`datatype <arrays.dtypes>` is an important abstraction of the
+:class:`ndarray`. Operations
+will look to the datatype to provide the key functionality that is
+needed to operate on the array. This functionality is provided in the
+list of function pointers pointed to by the ``f`` member of the
+:c:type:`PyArray_Descr` structure. In this way, the number of datatypes can be
+extended simply by providing a :c:type:`PyArray_Descr` structure with suitable
+function pointers in the ``f`` member. For built-in types, there are some
+optimizations that bypass this mechanism, but the point of the datatype
+abstraction is to allow new datatypes to be added.
+
+One of the built-in datatypes, the :class:`void` datatype, allows for
+arbitrary :term:`structured types <structured data type>` containing 1 or more
+fields as elements of the array. A :term:`field` is simply another datatype
+object along with an offset into the current structured type. In order to
+support arbitrarily nested fields, several recursive implementations of
+datatype access are implemented for the void type. A common idiom is to cycle
+through the elements of the fields dictionary and perform a specific operation
+based on the datatype object stored at the given offset. These offsets can be
+arbitrary numbers. Therefore, the possibility of encountering misaligned
+data must be recognized and taken into account if necessary.
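+
+For illustration, the fields dictionary of a structured (void) datatype can be
+inspected from Python; this is a sketch of the (datatype, offset) information
+the code walks over. Note that field ``b`` sits at offset 2, which is not a
+multiple of the float64 alignment, illustrating how misaligned data can arise::
+
+    >>> import numpy as np
+    >>> dt = np.dtype([('a', np.int16), ('b', np.float64)])
+    >>> sorted(dt.fields.items())
+    [('a', (dtype('int16'), 0)), ('b', (dtype('float64'), 2))]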
+
+
+N-D Iterators
+=============
+
+.. seealso:: :ref:`arrays.nditer`
+
+.. index::
+ single: array iterator
+
+A very common operation in much of NumPy code is the need to iterate
+over all the elements of a general, strided, N-dimensional array. This
+operation of a general-purpose N-dimensional loop is abstracted in the
+notion of an iterator object. To write an N-dimensional loop, you only
+have to create an iterator object from an ndarray, work with the
+:c:member:`dataptr <PyArrayIterObject.dataptr>` member of the iterator object
+structure and call the macro :c:func:`PyArray_ITER_NEXT` on the iterator
+object to move to the next element. The ``next`` element is always in
+C-contiguous order. The macro works by first special-casing the C-contiguous,
+1-D, and 2-D cases which work very simply.
+
+For the general case, the iteration works by keeping track of a list
+of coordinate counters in the iterator object. At each iteration, the
+last coordinate counter is examined (counters start from 0). If this
+counter is smaller than one less than the size of the array in that
+dimension (a pre-computed and stored value), then the counter is
+increased, the :c:member:`dataptr <PyArrayIterObject.dataptr>` member is
+increased by the stride in that
+dimension, and the macro ends. If the end of a dimension is reached,
+the counter for the last dimension is reset to zero and the
+:c:member:`dataptr <PyArrayIterObject.dataptr>` is
+moved back to the beginning of that dimension by subtracting the
+strides value times one less than the number of elements in that
+dimension (this is also pre-computed and stored in the
+:c:member:`backstrides <PyArrayIterObject.backstrides>`
+member of the iterator object). In this case, the macro does not end,
+but a local dimension counter is decremented so that the next-to-last
+dimension replaces the role that the last dimension played and the
+previously-described tests are executed again on the next-to-last
+dimension. In this way, the :c:member:`dataptr <PyArrayIterObject.dataptr>`
+is adjusted appropriately for arbitrary striding.
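+
+A pure-Python sketch of this counter-and-backstrides bookkeeping (not the
+actual C macro; the helper name ``iter_offsets`` is invented for
+illustration)::
+
+    import numpy as np
+
+    def iter_offsets(shape, strides):
+        """Yield byte offsets in C-contiguous order using counters and
+        backstrides, mimicking the PyArray_ITER_NEXT bookkeeping."""
+        ndim = len(shape)
+        counters = [0] * ndim
+        backstrides = [strides[d] * (shape[d] - 1) for d in range(ndim)]
+        offset = 0
+        for _ in range(int(np.prod(shape))):
+            yield offset
+            for d in reversed(range(ndim)):
+                if counters[d] < shape[d] - 1:
+                    counters[d] += 1
+                    offset += strides[d]       # step within this dimension
+                    break
+                counters[d] = 0
+                offset -= backstrides[d]       # rewind to the start of this dimension
+
+    a = np.arange(6, dtype=np.int32).reshape(2, 3)
+    expected = [i * a.strides[0] + j * a.strides[1]
+                for i in range(2) for j in range(3)]
+    assert list(iter_offsets(a.shape, a.strides)) == expected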
+
+The :c:member:`coordinates <PyArrayIterObject.coordinates>` member of the
+:c:type:`PyArrayIterObject` structure maintains
+the current N-d counter unless the underlying array is C-contiguous in
+which case the coordinate counting is bypassed. The
+:c:member:`index <PyArrayIterObject.index>` member of
+the :c:type:`PyArrayIterObject` keeps track of the current flat index of the
+iterator. It is updated by the :c:func:`PyArray_ITER_NEXT` macro.
+
+
+Broadcasting
+============
+
+.. seealso:: :ref:`basics.broadcasting`
+
+.. index::
+ single: broadcasting
+
+In Numeric, the ancestor of NumPy, broadcasting was implemented in several
+lines of code buried deep in ``ufuncobject.c``. In NumPy, the notion of
+broadcasting has been abstracted so that it can be performed in multiple places.
+Broadcasting is handled by the function :c:func:`PyArray_Broadcast`. This
+function requires a :c:type:`PyArrayMultiIterObject` (or something that is a
+binary equivalent) to be passed in. The :c:type:`PyArrayMultiIterObject` keeps
+track of the broadcast number of dimensions and size in each
+dimension along with the total size of the broadcast result. It also
+keeps track of the number of arrays being broadcast and a pointer to
+an iterator for each of the arrays being broadcast.
+
+The :c:func:`PyArray_Broadcast` function takes the iterators that have already
+been defined and uses them to determine the broadcast shape in each
+dimension (to create the iterators at the same time that broadcasting
+occurs, use the :c:func:`PyArray_MultiIterNew` function).
+Then, the iterators are
+adjusted so that each iterator thinks it is iterating over an array
+with the broadcast size. This is done by adjusting the iterators'
+number of dimensions and the :term:`shape` in each dimension. This works
+because the iterator strides are also adjusted. Broadcasting only
+adjusts (or adds) length-1 dimensions. For these dimensions, the
+strides variable is simply set to 0 so that the data-pointer for the
+iterator over that array doesn't move as the broadcasting operation
+operates over the extended dimension.
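+
+The 0-stride trick is visible from Python; a quick sketch (the nonzero stride
+value depends on the dtype)::
+
+    >>> import numpy as np
+    >>> row = np.arange(3, dtype=np.int64)
+    >>> b = np.broadcast_to(row, (4, 3))    # broadcast along a new length-4 axis
+    >>> b.strides                           # stride 0 for the broadcast dimension
+    (0, 8)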
+
+Broadcasting was always implemented in Numeric using 0-valued strides
+for the extended dimensions. It is done in exactly the same way in
+NumPy. The big difference is that now the array of strides is kept
+track of in a :c:type:`PyArrayIterObject`, the iterators involved in a
+broadcast result are kept track of in a :c:type:`PyArrayMultiIterObject`,
+and the :c:func:`PyArray_Broadcast` call implements the
+:ref:`general-broadcasting-rules`.
+
+
+Array Scalars
+=============
+
+.. seealso:: :ref:`arrays.scalars`
+
+.. index::
+ single: array scalars
+
+The array scalars offer a hierarchy of Python types that allow a one-to-one
+correspondence between the datatype stored in an array and the
+Python-type that is returned when an element is extracted from the
+array. An exception to this rule was made with object arrays. Object
+arrays are heterogeneous collections of arbitrary Python objects. When
+you select an item from an object array, you get back the original
+Python object (and not an object array scalar which does exist but is
+rarely used for practical purposes).
+
+The array scalars also offer the same methods and attributes as arrays
+with the intent that the same code can be used to support arbitrary
+dimensions (including 0-dimensions). The array scalars are read-only
+(immutable) with the exception of the void scalar which can also be
+written to so that structured array field setting works more naturally
+(``a[0]['f1'] = value``).
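+
+A quick sketch of both points (array scalar types, and the writeable void
+scalar used for structured field assignment)::
+
+    >>> import numpy as np
+    >>> type(np.arange(3.0)[0])             # an array scalar, not a plain float
+    <class 'numpy.float64'>
+    >>> a = np.zeros(2, dtype=[('f1', np.int32), ('f2', np.float64)])
+    >>> a[0]['f1'] = 7                      # void scalars can be written to
+    >>> a['f1']
+    array([7, 0], dtype=int32)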
+
+
+Indexing
+========
+
+.. seealso:: :ref:`basics.indexing`, :ref:`arrays.indexing`
+
+.. index::
+ single: indexing
+
+All Python indexing operations ``arr[index]`` are organized by first preparing
+the index and finding the index type. The supported index types are:
+
+* integer
+* :const:`newaxis`
+* :term:`python:slice`
+* :py:data:`Ellipsis`
+* integer arrays/array-likes (advanced)
+* boolean (single boolean array); if there is more than one boolean array as
+ the index or the shape does not match exactly, the boolean array will be
+ converted to an integer array instead.
+* 0-d boolean (and also integer); 0-d boolean arrays are a special
+ case that has to be handled in the advanced indexing code. They signal
+ that a 0-d boolean array had to be interpreted as an integer array.
+
+There is also the scalar array special case, which signals that an integer
+array was interpreted as an integer index. This is important because an
+integer array index forces a copy, but it is ignored if a scalar is returned
+(full integer index). The prepared index is guaranteed to be valid, with the
+exception of out-of-bound values and broadcasting errors for advanced indexing.
+This includes adding an :py:data:`Ellipsis` for incomplete indices, for
+example when a two-dimensional array is indexed with a single integer.
+
+The next step depends on the type of index which was found. If all
+dimensions are indexed with an integer, a scalar is returned or set. A
+single boolean indexing array will call specialized boolean functions.
+Indices containing an :py:data:`Ellipsis` or :term:`python:slice` but no
+advanced indexing will always create a view into the old array by calculating
+the new strides and memory offset. This view can then either be returned or,
+for assignments, filled using ``PyArray_CopyObject``. Note that
+``PyArray_CopyObject`` may also be called on temporary arrays in other branches
+to support complicated assignments when the array is of object :class:`dtype`.
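+
+A small sketch of the view-versus-copy distinction described above::
+
+    >>> import numpy as np
+    >>> arr = np.arange(10)
+    >>> v = arr[2:5]                        # slice: a view, shares memory
+    >>> np.shares_memory(arr, v)
+    True
+    >>> c = arr[[2, 3, 4]]                  # advanced index: forces a copy
+    >>> np.shares_memory(arr, c)
+    False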
+
+Advanced indexing
+-----------------
+
+By far the most complex case is advanced indexing, which may or may not be
+combined with typical view-based indexing. Here integer indices are
+interpreted as view-based. Before trying to understand this, you may want
+to make yourself familiar with its subtleties. The advanced indexing code
+has three different branches and one special case:
+
+* There is one indexing array and it, as well as the assignment array, can
+ be iterated trivially. For example, they may be contiguous. Also, the
+ indexing array must be of :class:`intp` type and the value array in
+ assignments should be of the correct type. This is purely a fast path.
+* There are only integer array indices so that no subarray exists.
+* View-based and advanced indexing is mixed. In this case, the view-based
+ indexing defines a collection of subarrays that are combined by the
+ advanced indexing. For example, ``arr[[1, 2, 3], :]`` is created by
+ vertically stacking the subarrays ``arr[1, :]``, ``arr[2, :]``, and
+  ``arr[3, :]`` (see the sketch after this list).
+* There is a subarray but it has exactly one element. This case can be handled
+ as if there is no subarray but needs some care during setup.
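+
+A minimal sketch of the stacking behaviour from the third bullet::
+
+    >>> import numpy as np
+    >>> arr = np.arange(20).reshape(5, 4)
+    >>> np.array_equal(arr[[1, 2, 3], :],
+    ...                np.stack([arr[1, :], arr[2, :], arr[3, :]]))
+    True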
+
+Deciding what case applies, checking broadcasting, and determining the kind
+of transposition needed are all done in :c:func:`PyArray_MapIterNew`. After
+setting up, there are two cases. If there is no subarray or it only has one
+element, no subarray iteration is necessary and an iterator is prepared
+which iterates all indexing arrays *as well as* the result or value array.
+If there is a subarray, there are three iterators prepared. One for the
+indexing arrays, one for the result or value array (minus its subarray),
+and one for the subarrays of the original and the result/assignment array.
+The first two iterators give (or allow calculation of) the pointers into
+the start of the subarray, which then allows restarting the subarray
+iteration.
+
+When advanced indices are next to each other, transposing may be necessary.
+All necessary transposing is handled by :c:func:`PyArray_MapIterSwapAxes` and
+has to be handled by the caller unless :c:func:`PyArray_MapIterNew` is asked to
+allocate the result.
+
+After preparation, getting and setting are relatively straightforward,
+although the different modes of iteration need to be considered. Unless
+there is only a single indexing array during item getting, the validity of
+the indices is checked beforehand. Otherwise, it is handled in the inner
+loop itself for optimization.
+
+.. _ufuncs-internals:
+
+Universal functions
+===================
+
+.. seealso:: :ref:`ufuncs`, :ref:`ufuncs-basics`
+
+.. index::
+ single: ufunc
+
+Universal functions are callable objects that take :math:`N` inputs
+and produce :math:`M` outputs by wrapping basic 1-D loops that work
+element-by-element into full easy-to-use functions that seamlessly
+implement :ref:`broadcasting <basics.broadcasting>`,
+:ref:`type-checking <ufuncs.casting>`,
+:ref:`buffered coercion <use-of-internal-buffers>`, and
+:ref:`output-argument handling <ufuncs-output-type>`. New universal functions
+are normally created in C, although there is a mechanism for creating ufuncs
+from Python functions (:func:`frompyfunc`). The user must supply a 1-D loop that
+implements the basic function taking the input scalar values and
+placing the resulting scalars into the appropriate output slots as
+explained in the implementation.
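+
+For example, a ufunc can be created from a Python scalar function with
+:func:`frompyfunc`; a quick sketch (note that the results are Python objects)::
+
+    >>> import numpy as np
+    >>> uhypot = np.frompyfunc(lambda x, y: (x*x + y*y) ** 0.5, 2, 1)
+    >>> out = uhypot(np.array([3.0, 5.0]), np.array([4.0, 12.0]))
+    >>> out.dtype
+    dtype('O')
+    >>> [float(v) for v in out]
+    [5.0, 13.0]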
+
+
+Setup
+-----
+
+Every :class:`ufunc` calculation involves some overhead related to setting up
+the calculation. The practical significance of this overhead is that
+even though the actual calculation of the ufunc is very fast, you will
+be able to write array and type-specific code that will work faster
+for small arrays than the ufunc. In particular, using ufuncs to
+perform many calculations on 0-D arrays will be slower than other
+Python-based solutions (the silently-imported ``scalarmath`` module exists
+precisely to give array scalars the look-and-feel of ufunc based
+calculations with significantly reduced overhead).
+
+When a :class:`ufunc` is called, many things must be done. The information
+collected from these setup operations is stored in a loop object. This
+loop object is a C-structure (that could become a Python object but is
+not initialized as such because it is only used internally). This loop
+object has the layout needed to be used with :c:func:`PyArray_Broadcast`
+so that the broadcasting can be handled in the same way as it is handled in
+other sections of code.
+
+The first thing done is to look up the current values for the buffer size,
+the error mask, and the associated error object in the thread-specific
+global dictionary. The state of the error mask controls what
+happens when an error condition is found. It should be noted that
+checking of the hardware error flags is only performed after each 1-D
+loop is executed. This means that if the input and output arrays are
+contiguous and of the correct type so that a single 1-D loop is
+performed, then the flags may not be checked until all elements of the
+array have been calculated. Looking up these values in a thread-specific
+dictionary takes time which is easily ignored for all but
+very small arrays.
+
+After checking the thread-specific global variables, the inputs are
+evaluated to determine how the ufunc should proceed and the input and
+output arrays are constructed if necessary. Any inputs which are not
+arrays are converted to arrays (using context if necessary). Which of
+the inputs are scalars (and therefore converted to 0-D arrays) is
+noted.
+
+Next, an appropriate 1-D loop is selected from the 1-D loops available
+to the :class:`ufunc` based on the input array types. This 1-D loop is selected
+by trying to match the signature of the datatypes of the inputs
+against the available signatures. The signatures corresponding to
+built-in types are stored in the :attr:`ufunc.types` member of the ufunc
+structure. The signatures corresponding to user-defined types are stored in a
+linked list of function information with the head element stored as a
+``CObject`` in the ``userloops`` dictionary keyed by the datatype number
+(the first user-defined type in the argument list is used as the key).
+The signatures are searched until a signature is found to which the
+input arrays can all be cast safely (ignoring any scalar arguments
+which are not allowed to determine the type of the result). The
+implication of this search procedure is that "lesser types" should be
+placed below "larger types" when the signatures are stored. If no 1-D
+loop is found, then an error is reported. Otherwise, the ``argument_list``
+is updated with the stored signature --- in case casting is necessary
+and to fix the output types assumed by the 1-D loop.
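+
+The built-in signatures being matched against can be inspected from Python,
+which gives a feel for this search (output abridged and version-dependent)::
+
+    >>> import numpy as np
+    >>> np.add.nin, np.add.nout
+    (2, 1)
+    >>> np.add.types[:4]
+    ['??->?', 'bb->b', 'BB->B', 'hh->h']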
+
+If the ufunc has 2 inputs and 1 output and the second input is an
+``Object`` array then a special-case check is performed so that
+``NotImplemented`` is returned if the second input is not an ndarray, has
+the :obj:`~numpy.class.__array_priority__` attribute, and has an ``__r{op}__``
+special method. In this way, Python is signaled to give the other object a
+chance to complete the operation instead of using generic object-array
+calculations. This allows (for example) sparse matrices to override
+the multiplication operator 1-D loop.
+
+For input arrays that are smaller than the specified buffer size,
+copies are made of all non-contiguous, misaligned, or out-of-byteorder
+arrays to ensure that for small arrays, a single loop is
+used. Then, array iterators are created for all the input arrays and
+the resulting collection of iterators is broadcast to a single shape.
+
+The output arguments (if any) are then processed and any missing
+return arrays are constructed. If any provided output array doesn't
+have the correct type (or is misaligned) and is smaller than the
+buffer size, then a new output array is constructed with the special
+:c:data:`NPY_ARRAY_WRITEBACKIFCOPY` flag set. At the end of the function,
+:c:func:`PyArray_ResolveWritebackIfCopy` is called so that
+its contents will be copied back into the output array.
+Iterators for the output arguments are then processed.
+
+Finally, the decision is made about how to execute the looping
+mechanism to ensure that all elements of the input arrays are combined
+to produce the output arrays of the correct type. The options for loop
+execution are one-loop (for :term:`contiguous`, aligned, and correct data
+type), strided-loop (for non-contiguous but still aligned and correct
+data type), and a buffered loop (for misaligned or incorrect data
+type situations). Depending on which execution method is called for,
+the loop is then set up and computed.
+
+
+Function call
+-------------
+
+This section describes how the basic universal function computation loop is
+set up and executed for each of the three different kinds of execution. If
+:c:data:`NPY_ALLOW_THREADS` is defined during compilation, then as long as
+no object arrays are involved, the Python Global Interpreter Lock (GIL) is
+released prior to calling the loops. It is re-acquired if necessary to
+handle error conditions. The hardware error flags are checked only after
+the 1-D loop is completed.
+
+
+One loop
+^^^^^^^^
+
+This is the simplest case of all. The ufunc is executed by calling the
+underlying 1-D loop exactly once. This is possible only when we have
+aligned data of the correct type (including byteorder) for both input
+and output and all arrays have uniform strides (either :term:`contiguous`,
+0-D, or 1-D). In this case, the 1-D computational loop is called once
+to compute the calculation for the entire array. Note that the
+hardware error flags are only checked after the entire calculation is
+complete.
+
+
+Strided loop
+^^^^^^^^^^^^
+
+When the input and output arrays are aligned and of the correct type,
+but the striding is not uniform (non-contiguous and 2-D or larger),
+then a second looping structure is employed for the calculation. This
+approach converts all of the iterators for the input and output
+arguments to iterate over all but the largest dimension. The inner
+loop is then handled by the underlying 1-D computational loop. The
+outer loop is a standard iterator loop on the converted iterators. The
+hardware error flags are checked after each 1-D loop is completed.
+
+
+Buffered loop
+^^^^^^^^^^^^^
+
+This is the code that handles the situation whenever the input and/or
+output arrays are either misaligned or of the wrong datatype
+(including being byteswapped) from what the underlying 1-D loop
+expects. The arrays are also assumed to be non-contiguous. The code
+works very much like the strided loop, except that the inner 1-D loop is
+modified so that pre-processing is performed on the inputs and post-processing
+is performed on the outputs in ``bufsize`` chunks (where
+``bufsize`` is a user-settable parameter). The underlying 1-D
+computational loop is called on data that is copied over (if it needs
+to be). The setup code and the loop code are considerably more
+complicated in this case because it has to handle:
+
+- memory allocation of the temporary buffers
+
+- deciding whether or not to use buffers on the input and output data
+ (misaligned and/or wrong datatype)
+
+- copying and possibly casting data for any inputs or outputs for which
+ buffers are necessary.
+
+- special-casing ``Object`` arrays so that reference counts are properly
+ handled when copies and/or casts are necessary.
+
+- breaking up the inner 1-D loop into ``bufsize`` chunks (with a possible
+ remainder).
+
+Again, the hardware error flags are checked at the end of each 1-D
+loop.
+
+
+Final output manipulation
+-------------------------
+
+Ufuncs allow other array-like classes to be passed seamlessly through
+the interface in that inputs of a particular class will induce the
+outputs to be of that same class. The mechanism by which this works is
+the following. If any of the inputs are not ndarrays and define the
+:obj:`~numpy.class.__array_wrap__` method, then the class with the largest
+:obj:`~numpy.class.__array_priority__` attribute determines the type of all the
+outputs (with the exception of any output arrays passed in). The
+:obj:`~numpy.class.__array_wrap__` method of the input array will be called
+with the ndarray being returned from the ufunc as its input. There are two
+calling styles of the :obj:`~numpy.class.__array_wrap__` function supported.
+The first takes the ndarray as the first argument and a tuple of "context" as
+the second argument. The context is (ufunc, arguments, output argument
+number). This is the first call tried. If a ``TypeError`` occurs, then the
+function is called with just the ndarray as the first argument.
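+
+A minimal sketch of a class participating in this mechanism, using an ndarray
+subclass for simplicity (the class name and priority value are invented for
+illustration)::
+
+    >>> import numpy as np
+    >>> class Tagged(np.ndarray):
+    ...     __array_priority__ = 10.0
+    ...     def __array_wrap__(self, obj, context=None):
+    ...         # context is (ufunc, arguments, output argument number)
+    ...         return super().__array_wrap__(obj, context)
+    ...
+    >>> t = np.arange(3.0).view(Tagged)
+    >>> np.add(t, 1).__class__.__name__     # the output is wrapped as Tagged
+    'Tagged'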
+
+
+Methods
+-------
+
+There are three methods of ufuncs that require calculation similar to
+the general-purpose ufuncs. These are :meth:`ufunc.reduce`,
+:meth:`ufunc.accumulate`, and :meth:`ufunc.reduceat`. Each of these
+methods requires a setup command followed by a
+loop. There are four loop styles possible for the methods
+corresponding to no-elements, one-element, strided-loop, and buffered-loop.
+These are the same basic loop styles as implemented for the
+general-purpose function call except for the no-element and one-element
+cases which are special-cases occurring when the input array
+objects have 0 and 1 elements respectively.
+
+
+Setup
+^^^^^
+
+The setup function for all three methods is ``construct_reduce``.
+This function creates a reducing loop object and fills it with the
+parameters needed to complete the loop. All of the methods only work
+on ufuncs that take 2 inputs and return 1 output. Therefore, the
+underlying 1-D loop is selected assuming a signature of ``[otype,
+otype, otype]`` where ``otype`` is the requested reduction
+datatype. The buffer size and error handling are then retrieved from
+(per-thread) global storage. For small arrays that are misaligned or
+have incorrect datatype, a copy is made so that the un-buffered
+section of code is used. Then, the looping strategy is selected. If
+there is 1 element or 0 elements in the array, then a simple looping
+method is selected. If the array is not misaligned and has the
+correct datatype, then strided looping is selected. Otherwise,
+buffered looping must be performed. Looping parameters are then
+established, and the return array is constructed. The output array is
+of a different :term:`shape` depending on whether the method is
+:meth:`reduce <ufunc.reduce>`, :meth:`accumulate <ufunc.accumulate>`, or
+:meth:`reduceat <ufunc.reduceat>`. If an output array is already provided, then
+its shape is checked. If the output array is not C-contiguous,
+aligned, and of the correct data type, then a temporary copy is made
+with the :c:data:`NPY_ARRAY_WRITEBACKIFCOPY` flag set. In this way, the methods
+will be able to work with a well-behaved output array but the result will be
+copied back into the true output array when
+:c:func:`PyArray_ResolveWritebackIfCopy` is called at function completion.
+Finally, iterators are set up to loop over the correct :term:`axis`
+(depending on the value of axis provided to the method) and the setup
+routine returns to the actual computation routine.
+
+
+:meth:`Reduce <ufunc.reduce>`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. index::
+ triple: ufunc; methods; reduce
+
+All of the ufunc methods use the same underlying 1-D computational
+loops with input and output arguments adjusted so that the appropriate
+reduction takes place. For example, the key to the functioning of
+:meth:`reduce <ufunc.reduce>` is that the 1-D loop is called with the output
+and the second input pointing to the same position in memory and both having
+a step-size of 0. The first input is pointing to the input array with a
+step-size given by the appropriate stride for the selected axis. In this
+way, the operation performed is
+
+.. math::
+ :nowrap:
+
+ \begin{align*}
+ o & = & i[0] \\
+ o & = & i[k]\textrm{<op>}o\quad k=1\ldots N
+ \end{align*}
+
+where :math:`N+1` is the number of elements in the input, :math:`i`,
+:math:`o` is the output, and :math:`i[k]` is the
+:math:`k^{\textrm{th}}` element of :math:`i` along the selected axis.
+This basic operation is repeated for arrays with greater than 1
+dimension so that the reduction takes place for every 1-D sub-array
+along the selected axis. An iterator with the selected dimension
+removed handles this looping.
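+
+A quick sanity-check sketch of this from the Python level::
+
+    >>> import numpy as np
+    >>> int(np.add.reduce(np.arange(1, 5)))   # ((1 + 2) + 3) + 4
+    10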
+
+For buffered loops, care must be taken to copy and cast data before
+the loop function is called because the underlying loop expects
+aligned data of the correct datatype (including byteorder). The
+buffered loop must handle this copying and casting prior to calling
+the loop function on chunks no greater than the user-specified
+``bufsize``.
+
+
+:meth:`Accumulate <ufunc.accumulate>`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. index::
+ triple: ufunc; methods; accumulate
+
+The :meth:`accumulate <ufunc.accumulate>` method is very similar to
+the :meth:`reduce <ufunc.reduce>` method in that
+the output and the second input both point to the output. The
+difference is that the second input points to memory one stride behind
+the current output pointer. Thus, the operation performed is
+
+.. math::
+ :nowrap:
+
+ \begin{align*}
+ o[0] & = & i[0] \\
+ o[k] & = & i[k]\textrm{<op>}o[k-1]\quad k=1\ldots N.
+ \end{align*}
+
+The output has the same shape as the input and each 1-D loop operates
+over :math:`N` elements when the shape in the selected axis is :math:`N+1`.
+Again, buffered loops take care to copy and cast the data before
+calling the underlying 1-D computational loop.
+
+
+:meth:`Reduceat <ufunc.reduceat>`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. index::
+ triple: ufunc; methods; reduceat
+ single: ufunc
+
+The :meth:`reduceat <ufunc.reduceat>` function is a generalization of both the
+:meth:`reduce <ufunc.reduce>` and :meth:`accumulate <ufunc.accumulate>`
+functions. It implements a :meth:`reduce <ufunc.reduce>` over ranges of
+the input array specified by indices. The extra indices argument is checked to
+be sure that every index is not too large for the input array along
+the selected dimension before the loop calculations take place. The
+loop implementation is handled using code that is very similar to the
+:meth:`reduce <ufunc.reduce>` code repeated as many times as there are elements
+in the indices input. In particular: the first input pointer passed to the
+underlying 1-D computational loop points to the input array at the
+correct location indicated by the index array. In addition, the output
+pointer and the second input pointer passed to the underlying 1-D loop
+point to the same position in memory. The size of the 1-D
+computational loop is fixed to be the difference between the current
+index and the next index (when the current index is the last index,
+then the next index is assumed to be the length of the array along the
+selected dimension). In this way, the 1-D loop will implement a
+:meth:`reduce <ufunc.reduce>` over the specified indices.
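+
+For example (a sketch of the indices semantics described above)::
+
+    >>> import numpy as np
+    >>> a = np.arange(8)
+    >>> np.add.reduceat(a, [0, 4, 6])       # sums of a[0:4], a[4:6], a[6:]
+    array([ 6,  9, 13])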
+
+Misaligned or a loop datatype that does not match the input and/or
+output datatype is handled using buffered code wherein data is
+copied to a temporary buffer and cast to the correct datatype if
+necessary prior to calling the underlying 1-D function. The temporary
+buffers are created in (element) sizes no bigger than the user-settable
+buffer-size value. Thus, the loop must be flexible enough to
+call the underlying 1-D computational loop enough times to complete
+the total calculation in chunks no bigger than the buffer-size.
diff --git a/doc/source/dev/internals.rst b/doc/source/dev/internals.rst
new file mode 100644
index 000000000..14e5f3141
--- /dev/null
+++ b/doc/source/dev/internals.rst
@@ -0,0 +1,175 @@
+.. currentmodule:: numpy
+
+.. _numpy-internals:
+
+*************************************
+Internal organization of NumPy arrays
+*************************************
+
+Understanding a bit about how NumPy arrays are handled under the covers helps
+in understanding NumPy better. This section will not go into great detail.
+Those wishing to understand the full details are requested to refer to Travis
+Oliphant's book `Guide to NumPy <http://web.mit.edu/dvp/Public/numpybook.pdf>`_.
+
+NumPy arrays consist of two major components: the raw array data (from now on,
+referred to as the data buffer), and the information about the raw array data.
+The data buffer is typically what people think of as arrays in C or Fortran,
+a :term:`contiguous` (and fixed) block of memory containing fixed-sized data
+items. NumPy also contains a significant set of data that describes how to
+interpret the data in the data buffer. This extra information contains (among
+other things):
+
+ 1) The basic data element's size in bytes.
+ 2) The start of the data within the data buffer (an offset relative to the
+ beginning of the data buffer).
+ 3) The number of :term:`dimensions <dimension>` and the size of each dimension.
+ 4) The separation between elements for each dimension (the :term:`stride`).
+ This does not have to be a multiple of the element size.
+ 5) The byte order of the data (which may not be the native byte order).
+ 6) Whether the buffer is read-only.
+ 7) Information (via the :class:`dtype` object) about the interpretation of the
+ basic data element. The basic data element may be as simple as an int or a
+ float, or it may be a compound object (e.g.,
+ :term:`struct-like <structured data type>`), a fixed character field,
+ or Python object pointers.
+ 8) Whether the array is to be interpreted as :term:`C-order <C order>`
+ or :term:`Fortran-order <Fortran order>`.
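+
+Much of this metadata is visible from Python, which can make the list above
+more concrete (a sketch; the exact values depend on platform and dtype)::
+
+    >>> import numpy as np
+    >>> a = np.zeros((3, 4), dtype=np.float64)
+    >>> a.itemsize                          # 1) element size in bytes
+    8
+    >>> a.shape, a.strides                  # 3) dimensions, 4) strides
+    ((3, 4), (32, 8))
+    >>> a.dtype                             # 5), 7) byte order and interpretation
+    dtype('float64')
+    >>> a.flags['WRITEABLE'], a.flags['C_CONTIGUOUS']   # 6), 8)
+    (True, True)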
+
+This arrangement allows for the very flexible use of arrays. One thing that it
+allows is simple changes to the metadata to change the interpretation of the
+array buffer. Changing the byteorder of the array is a simple change involving
+no rearrangement of the data. The :term:`shape` of the array can be changed very
+easily without changing anything in the data buffer or any data copying at all.
+
+Among other things that this makes possible, one can create a new array
+metadata object that uses the same data buffer, giving a new :term:`view` of
+that data buffer with a different
+interpretation of the buffer (e.g., different shape, offset, byte order,
+strides, etc.) but sharing the same data bytes. Many operations in NumPy do just
+this, such as :term:`slicing <python:slice>`. Other operations, such as
+transpose, don't move data elements around in the array, but rather change the
+information about the shape and strides so that the indexing of the array
+changes, but the data in the data buffer doesn't move.
+
+Typically these new versions of the array metadata (with the same data buffer)
+are new views into the data buffer. There is a different :class:`ndarray` object,
+but it uses the same data buffer. This is why it is necessary to force copies
+through the use of the :func:`copy` method if one really wants to make a new
+and independent copy of the data buffer.
+
+New views into arrays mean the object reference counts for the data buffer
+increase. Simply doing away with the original array object will not remove the
+data buffer if other views of it still exist.
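+
+A short sketch of views versus copies at the Python level::
+
+    >>> import numpy as np
+    >>> a = np.zeros((3, 4))
+    >>> v = a.T                             # new ndarray object, same data buffer
+    >>> np.shares_memory(a, v), v.flags['OWNDATA']
+    (True, False)
+    >>> c = a.copy()                        # forces an independent data buffer
+    >>> np.shares_memory(a, c), c.flags['OWNDATA']
+    (False, True)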
+
+Multidimensional array indexing order issues
+============================================
+
+.. seealso:: :ref:`basics.indexing`
+
+What is the right way to index
+multi-dimensional arrays? Before you jump to conclusions about the one and
+true way to index multi-dimensional arrays, it pays to understand why this is
+a confusing issue. This section will try to explain in detail how NumPy
+indexing works and why we adopt the convention we do for images, and when it
+may be appropriate to adopt other conventions.
+
+The first thing to understand is
+that there are two conflicting conventions for indexing 2-dimensional arrays.
+Matrix notation uses the first index to indicate which row is being selected and
+the second index to indicate which column is selected. This is opposite to the
+geometrically-oriented convention for images, where people generally think the
+first index represents x position (i.e., column) and the second represents y
+position (i.e., row). This alone is the source of much confusion;
+matrix-oriented users and image-oriented users expect two different things with
+regard to indexing.
+
+The second issue to understand is how indices correspond
+to the order in which the array is stored in memory. In Fortran, the first index
+is the most rapidly varying index when moving through the elements of a
+two-dimensional array as it is stored in memory. If you adopt the matrix
+convention for indexing, then this means the matrix is stored one column at a
+time (since the first index moves to the next row as it changes). Thus Fortran
+is considered a Column-major language. C has just the opposite convention. In
+C, the last index changes most rapidly as one moves through the array as
+stored in memory. Thus C is a Row-major language. The matrix is stored by
+rows. Note that in both cases it presumes that the matrix convention for
+indexing is being used, i.e., for both Fortran and C, the first index is the
+row. Note this convention implies that the indexing convention is invariant
+and that the data order changes to keep that so.
+
+But that's not the only way
+to look at it. Suppose one has large two-dimensional arrays (images or
+matrices) stored in data files. Suppose the data are stored by rows rather than
+by columns. If we are to preserve our index convention (whether matrix or
+image) that means that depending on the language we use, we may be forced to
+reorder the data if it is read into memory to preserve our indexing
+convention. For example, if we read row-ordered data into memory without
+reordering, it will match the matrix indexing convention for C, but not for
+Fortran. Conversely, it will match the image indexing convention for Fortran,
+but not for C. For C, if one is using data stored in row order, and one wants
+to preserve the image index convention, the data must be reordered when
+reading into memory.
+
+In the end, what you do for Fortran or C depends on
+which is more important, not reordering data or preserving the indexing
+convention. For large images, reordering data is potentially expensive, and
+often the indexing convention is inverted to avoid that.
+
+The situation with
+NumPy makes this issue yet more complicated. The internal machinery of NumPy
+arrays is flexible enough to accept any ordering of indices. One can simply
+reorder indices by manipulating the internal :term:`stride` information for
+arrays without reordering the data at all. NumPy will know how to map the new
+index order to the data without moving the data.
+
+So if this is true, why not choose
+the index order that matches what you most expect? In particular, why not define
+row-ordered images to use the image convention? (This is sometimes referred
+to as the Fortran convention vs the C convention, thus the 'C' and 'FORTRAN'
+order options for array ordering in NumPy.) The drawback of doing this is
+potential performance penalties. It's common to access the data sequentially,
+either implicitly in array operations or explicitly by looping over rows of an
+image. When that is done, the data will be accessed in non-optimal order.
+As the first index is incremented, what is actually happening is that elements
+spaced far apart in memory are being sequentially accessed, with usually poor
+memory access speeds. For example, consider a two-dimensional image ``im``
+defined so that ``im[0, 10]`` represents the value at ``x = 0``, ``y = 10``. To
+be consistent with usual Python behavior, ``im[0]`` would then represent a
+column at ``x = 0``. Yet that data would be spread over the whole array since
+the data are stored in row order. Despite the flexibility of NumPy's indexing,
+it can't really paper over the fact that basic operations are rendered
+inefficient because of data order, or that getting contiguous subarrays is
+still awkward (e.g., ``im[:, 0]`` for the first row, vs ``im[0]``). Thus one
+can't use an idiom such as ``for row in im``; ``for col in im`` does work, but
+doesn't yield contiguous column data.
+
+As it turns out, NumPy is
+smart enough when dealing with :ref:`ufuncs <ufuncs-internals>` to determine
+which index is the most rapidly varying one in memory and uses that for the
+innermost loop. Thus for ufuncs, there is no large intrinsic advantage to
+either approach in most cases. On the other hand, use of :attr:`ndarray.flat`
+with a FORTRAN ordered array will lead to non-optimal memory access as adjacent
+elements in the flattened array (iterator, actually) are not contiguous in
+memory.
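+
+A tiny sketch of the two orderings (strides shown for ``float64``; the values
+are platform-dependent)::
+
+    >>> import numpy as np
+    >>> c = np.zeros((2, 3), order='C')
+    >>> f = np.zeros((2, 3), order='F')
+    >>> c.strides, f.strides                # last vs. first index varies fastest
+    ((24, 8), (8, 16))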
+
+Indeed, the fact is that Python
+indexing on lists and other sequences naturally leads to an outside-to-inside
+ordering (the first index gets the largest grouping, the next largest,
+and the last gets the smallest element). Since image data are normally stored
+in rows, this corresponds to the position within rows being the last item
+indexed.
+
+If you do want to use Fortran ordering, realize that
+there are two approaches to consider: 1) accept that the first index is just not
+the most rapidly changing in memory and have all your I/O routines reorder
+your data when going from memory to disk or vice versa, or 2) use NumPy's
+mechanism for mapping the first index to the most rapidly varying data. We
+recommend the former if possible. The disadvantage of the latter is that many
+of NumPy's functions will yield arrays without Fortran ordering unless you are
+careful to use the ``order`` keyword. Doing this would be highly inconvenient.
+
+Otherwise, we recommend simply learning to reverse the usual order of indices
+when accessing elements of an array. Granted, it goes against the grain, but
+it is more in line with Python semantics and the natural order of the data.
+
+
diff --git a/doc/source/dev/underthehood.rst b/doc/source/dev/underthehood.rst
index 4dae48689..c0f37fd5b 100644
--- a/doc/source/dev/underthehood.rst
+++ b/doc/source/dev/underthehood.rst
@@ -4,4 +4,12 @@
Under-the-hood Documentation for developers
===========================================
-To be completed.
+These documents are intended as a low-level look into NumPy, aimed at
+developers.
+
+.. toctree::
+ :maxdepth: 1
+
+ internals
+ internals.code-explanations
+ alignment