.. _parallel-tutorial:

=================================
Writing parallel code with Cython
=================================

.. include::
    ../two-syntax-variants-used

One method of speeding up your Cython code is parallelization:
you write code that can be run on multiple cores of your CPU simultaneously.
For code that lends itself to parallelization this can produce quite
dramatic speed-ups, up to the number of cores your CPU has (for example
a 4× speed-up on a 4-core CPU).

This tutorial assumes that you are already familiar with Cython's
:ref:`"typed memoryviews"<memoryviews>` (since code using memoryviews is often
the sort of code that's easy to parallelize with Cython), and also that you're
somewhat familiar with the pitfalls of writing parallel code in general
(it aims to be a Cython tutorial rather than a complete introduction
to parallel programming).

Before starting, a few notes:

- Not all code can be parallelized - for some code the algorithm simply
  relies on being executed in order and you should not attempt to
  parallelize it. A cumulative sum is a good example.

- Not all code is worth parallelizing. There's a reasonable amount of
  overhead in starting a parallel section and so you need to make sure
  that you're operating on enough data to make this overhead worthwhile.
  Additionally, make sure that you are doing actual work on the data!
  Multiple threads simply reading the same data tends not to parallelize
  well. If in doubt, time it.

- Cython requires the contents of parallel blocks to be ``nogil``. If
  your algorithm requires access to Python objects then it may not be
  suitable for parallelization.

- Cython's inbuilt parallelization uses the OpenMP constructs
  ``omp parallel for`` and ``omp parallel``. These are ideal
  for parallelizing relatively small, self-contained blocks of code
  (especially loops). However, if you want to use other models of
  parallelization, such as spawning and waiting for tasks, or
  off-loading some "side work" to a continuously running secondary
  thread, then you may be better off using other methods (such as
  Python's ``threading`` module).

- Actually implementing your parallel Cython code should probably be
  one of the last steps in your optimization: you should start with
  working serial code first. However, it's worth planning for
  early since it may affect your choice of algorithm.

This tutorial does not aim to explore all the options available to
customize parallelization. See the
:ref:`main parallelism documentation<parallel>` for details.
You should also be aware that a lot of the choices Cython makes
about how your code is parallelized are fairly fixed, and if you want
specific OpenMP behaviour that Cython doesn't provide by default you
may be better off writing it in C yourself.

Compilation
===========

OpenMP requires support from your C/C++ compiler. This support is
usually enabled through a special command-line argument:
on GCC this is ``-fopenmp`` while on MSVC it is
``/openmp``. If your compiler doesn't support OpenMP (or if you
forget to pass the argument) then the code will usually still
compile but will not run in parallel.

The following ``setup.py`` file can be used to compile the
examples in this tutorial:

.. literalinclude:: ../../examples/tutorial/parallelization/setup.py
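
For reference, a minimal ``setup.py`` along these lines should work. This
is only a sketch: it assumes a GCC-compatible compiler (with MSVC you would
pass ``/openmp`` for both arguments) and uses an illustrative module name
rather than reproducing the example file exactly:

.. code-block:: python

    from setuptools import Extension, setup
    from Cython.Build import cythonize

    # "parallel_sin" is an illustrative name; list whichever modules you build.
    extensions = [
        Extension(
            "parallel_sin",
            ["parallel_sin.pyx"],
            extra_compile_args=["-fopenmp"],  # enable OpenMP in the C compiler
            extra_link_args=["-fopenmp"],     # link against the OpenMP runtime
        )
    ]

    setup(ext_modules=cythonize(extensions))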

Element-wise parallel operations
================================

The easiest and most common parallel operation in Cython is to
iterate across an array element-wise, performing the same
operation on each array element. In the simple example
below we calculate the ``sin`` of every element in an array:

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/parallel_sin.pyx

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/parallel_sin.py
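
If you don't have the example files to hand, a rough sketch of such a
function in pure Python syntax looks like the following (the names and
signature are assumptions for illustration, not the verbatim example):

.. code-block:: python

    import numpy as np
    import cython
    from cython.parallel import prange
    from cython.cimports.libc.math import sin

    def do_sine(input: cython.double[:, :]):
        output: cython.double[:, :] = np.empty_like(np.asarray(input))
        i: cython.Py_ssize_t
        j: cython.Py_ssize_t

        # Parallelize the outer loop; the loop body must not need the GIL.
        for i in prange(input.shape[0], nogil=True):
            for j in range(input.shape[1]):
                output[i, j] = sin(input[i, j])
        return np.asarray(output)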

We parallelize the outermost loop. This is usually a good idea
since there is some overhead to entering and leaving a parallel block.
However, you should also consider the likely size of your arrays.
If ``input`` usually had a size of ``(2, 10000000)`` then parallelizing
over the dimension of length ``2`` would likely be a worse choice.

The body of the loop itself is ``nogil``, i.e. you cannot perform
"Python" operations. This is a fairly strong limitation and if you
find that you need to use the GIL then it is likely that Cython's
parallelization features are not suitable for you. It is possible,
however, to raise exceptions from within the loop: Cython simply
regains the GIL, raises the exception, and then terminates the loop
on all threads.

It's necessary to explicitly type the loop variable ``i`` as a
C integer. For a non-parallel loop Cython can infer this, but it
does not currently infer the loop variable for parallel loops,
so not typing ``i`` will lead to compile errors since it will be
a Python object and so unusable without the GIL.

The C code generated is shown below, for the benefit of experienced
users of OpenMP. It is simplified a little for readability here:

.. code-block:: C

    #pragma omp parallel
    {
        #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j)
        for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
            i = __pyx_t_8;
            /* body goes here */
        }
    }

Private variables
-----------------

One useful point to note from the generated C code above: variables
used in the loops, like ``i`` and ``j``, are marked as ``firstprivate``
and ``lastprivate``. Within the loop each thread has its own copy of
the data, the data is initialized according to its value before the
loop, and after the loop the "global" copy is set equal to the last
iteration (i.e. as if the loop were run in serial).

The basic rules that Cython applies are:

- C scalar variables within a ``prange`` block are made
  ``firstprivate`` and ``lastprivate``,

- C scalar variables assigned within a
  :ref:`parallel block<parallel-block>`
  are ``private`` (which means they can't be used to pass data in
  and out of the block),

- array variables (e.g. memoryviews) are not made private. Instead
  Cython assumes that you have structured your loop so that each iteration
  is acting on different data,

- Python objects are also not made private, although access to them
  is controlled via Python's GIL.

Cython does not currently provide much opportunity to override these
choices.

Reductions
==========

The second most common parallel operation in Cython is the "reduction"
operation. A common example is to accumulate a sum over the whole
array, such as in the calculation of a vector norm below:

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/norm.pyx

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/norm.py

Cython is able to infer reductions for ``+=``, ``*=``, ``-=``,
``&=``, ``|=``, and ``^=``. These apply only to C scalar variables,
so you cannot easily reduce a 2D memoryview to a 1D memoryview,
for example.

The C code generated is approximately:

.. code-block:: C

    #pragma omp parallel reduction(+:total)
    {
        #pragma omp for firstprivate(i) lastprivate(i)
        for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
            i = __pyx_t_2;
            total = total + /* some indexing code */;
        }
    }
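
Concretely, a reduction of this kind might be written roughly as follows
(a sketch in pure Python syntax; the names are illustrative rather than
the verbatim example files):

.. code-block:: python

    import math
    import cython
    from cython.parallel import prange

    def norm(x: cython.double[:]):
        total: cython.double = 0
        i: cython.Py_ssize_t

        # Cython spots the in-place "+=" on a C scalar inside prange
        # and generates an OpenMP reduction(+:total) from it.
        for i in prange(x.shape[0], nogil=True):
            total += x[i] * x[i]
        return math.sqrt(total)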

.. _parallel-block:

``parallel`` blocks
===================

Much less frequently used than ``prange`` is Cython's ``parallel``
operator. ``parallel`` generates a block of code that is run on
multiple threads at once. Unlike ``prange``, however, work is
not automatically divided between threads.

Here we present three common uses for the ``parallel`` block:

Stringing together prange blocks
--------------------------------

There is some overhead in entering and leaving a parallel section.
Therefore, if you have multiple parallel sections with small
serial sections in between it can be more efficient to
write one large parallel block. Any small serial
sections are duplicated on every thread, but the overhead is reduced.

In the example below we do an in-place normalization of a vector.
The first parallel loop calculates the norm, the second parallel
loop applies the norm to the vector, and we avoid jumping in and out
of serial code in between.

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/normalize.pyx

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/normalize.py

The C code is approximately:

.. code-block:: C

    #pragma omp parallel private(norm) reduction(+:total)
    {
        /* some calculations of array size... */
        #pragma omp for firstprivate(i) lastprivate(i)
        for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
            /* ... */
        }
        norm = sqrt(total);
        #pragma omp for firstprivate(i) lastprivate(i)
        for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
            /* ... */
        }
    }

Allocating "scratch space" for each thread
------------------------------------------

Suppose that each thread requires a small amount of scratch space
to work in. The threads cannot share scratch space because that would
lead to data races. In this case the allocation and deallocation
are done in a parallel section (so they occur on a per-thread basis)
surrounding a loop which then uses the scratch space.

Our example here uses C++ to find the median of each column in
a 2D array (just a parallel version of ``numpy.median(x, axis=0)``).
We must reorder each column to find its median, but we don't want
to modify the input array. Therefore, we allocate a C++ vector per
thread to use as scratch space, and work in that. For efficiency
the vector is allocated outside the ``prange`` loop.

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/median.pyx

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/median.py

.. note::

    The pure Python and classic syntax examples are not quite identical,
    since the pure Python syntax does not support C++ ``new``, so we
    allocate the scratch space slightly differently.

In the generated code the ``scratch`` variable is marked as
``private`` in the outer parallel block. A rough outline is:

.. code-block:: C++

    #pragma omp parallel private(scratch)
    {
        scratch = new std::vector<double>(x.shape[0]);
        #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) lastprivate(median_it)
        for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_10; __pyx_t_9++){
            i = __pyx_t_9;
            /* implementation goes here */
        }
        /* some exception handling detail omitted */
        delete scratch;
    }

Performing different tasks on each thread
-----------------------------------------

Finally, if you manually specify the number of threads and
then identify each thread using ``openmp.omp_get_thread_num()``
you can split the work between threads yourself. This is
a fairly rare use-case in Cython, and probably suggests
that the ``threading`` module is more suitable for what
you're trying to do. However, it is an option.

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/manual_work.pyx
            :lines: 2-

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/manual_work.py
            :lines: 2-

The utility of this kind of block is limited by the fact that
variables assigned to in the block are ``private`` to each thread,
so they cannot be accessed from the serial section afterwards.

The generated C code for the example above is fairly simple:

.. code-block:: C

    #pragma omp parallel private(thread_num)
    {
        thread_num = omp_get_thread_num();
        switch (thread_num) {
            /* ... */
        }
    }
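
As a minimal sketch, this pattern might look like the following in pure
Python syntax (all names here are assumptions for illustration, and
``cython.parallel.threadid()`` is used in place of the raw OpenMP call).
Writing each thread's result to its own element of a shared array
sidesteps the privacy limitation described above, since array variables
are not made private:

.. code-block:: python

    import numpy as np
    import cython
    from cython.parallel import parallel, threadid

    def split_work():
        results = np.zeros(2)
        r: cython.double[:] = results
        thread_num: cython.int

        with cython.nogil, parallel(num_threads=2):
            thread_num = threadid()  # this thread's id: 0 or 1
            if thread_num == 0:
                r[0] = 1.0  # the first thread's task goes here
            else:
                r[1] = 2.0  # the second thread's task goes here
        return results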