.. _parallel-tutorial:

=================================
Writing parallel code with Cython
=================================

.. include::
    ../two-syntax-variants-used

One method of speeding up your Cython code is parallelization:
you write code that can be run on multiple cores of your CPU simultaneously.
For code that lends itself to parallelization this can produce quite
dramatic speed-ups, at best roughly equal to the number of cores your
CPU has (for example a 4× speed-up on a 4-core CPU).

This tutorial assumes that you are already familiar with Cython's 
:ref:`"typed memoryviews"<memoryviews>` (since code using memoryviews is often
the sort of code that's easy to parallelize with Cython), and also that you're
somewhat familiar with the pitfalls of writing parallel code in general
(it aims to be a Cython tutorial rather than a complete introduction
to parallel programming).

Before starting, a few notes:

- Not all code can be parallelized - for some code the algorithm simply
  relies on being executed in order and you should not attempt to
  parallelize it.  A cumulative sum is a good example.
  
- Not all code is worth parallelizing.  There's a reasonable amount of
  overhead in starting a parallel section and so you need to make sure
  that you're operating on enough data to make this overhead worthwhile.
  Additionally, make sure that you are doing actual work on the data!
  Multiple threads simply reading the same data tends not to parallelize
  too well.  If in doubt, time it.

- Cython requires the contents of parallel blocks to be ``nogil``.  If
  your algorithm requires access to Python objects then it may not be
  suitable for parallelization.
  
- Cython's inbuilt parallelization uses the OpenMP constructs
  ``omp parallel for`` and ``omp parallel``.  These are ideal
  for parallelizing relatively small, self-contained blocks of code
  (especially loops).  However, if you want to use other models of
  parallelization, such as spawning and waiting for tasks, or
  off-loading some "side work" to a continuously running secondary
  thread, then you might be better off using other methods (such as
  Python's ``threading`` module).
  
- Actually implementing your parallel Cython code should probably be 
  one of the last steps in your optimization.  You should start with
  some working serial code first.  However, it's worth planning for
  early since it may affect your choice of algorithm.

This tutorial does not aim to explore all the options available to
customize parallelization.  See the
:ref:`main parallelism documentation<parallel>` for details.
You should also be aware that a lot of the choices Cython makes
about how your code is parallelized are fairly fixed, and if you want
specific OpenMP behaviour that Cython doesn't provide by default you
may be better off writing it in C yourself.

Compilation
===========

OpenMP requires support from your C/C++ compiler. This support is
usually enabled through a special command-line argument:
on GCC this is ``-fopenmp`` while on MSVC it is
``/openmp``. If your compiler doesn't support OpenMP (or if you
forget to pass the argument) then the code will usually still
compile but will not run in parallel.

The following ``setup.py`` file can be used to compile the
examples in this tutorial:

.. literalinclude:: ../../examples/tutorial/parallelization/setup.py
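
For reference, a minimal sketch of such a file (assuming a GCC-style
compiler; substitute ``/openmp`` for MSVC) might look like:

.. code-block:: python

    from setuptools import Extension, setup
    from Cython.Build import cythonize

    ext_modules = [
        Extension(
            "parallel_sin",
            ["parallel_sin.pyx"],
            extra_compile_args=["-fopenmp"],  # enable OpenMP at compile time
            extra_link_args=["-fopenmp"],     # and at link time
        ),
    ]

    setup(ext_modules=cythonize(ext_modules))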

Element-wise parallel operations
================================

The easiest and most common parallel operation in Cython is to
iterate across an array element-wise, performing the same
operation on each array element.  In the simple example
below we calculate the ``sin`` of every element in an array:

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/parallel_sin.pyx

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/parallel_sin.py

We parallelize the outermost loop.  This is usually a good idea
since there is some overhead to entering and leaving a parallel block.
However, you should also consider the likely size of your arrays.
If ``input`` usually had a size of ``(2, 10000000)`` then parallelizing
over the dimension of length ``2`` would likely be a worse choice.

The body of the loop itself is ``nogil`` - i.e. you cannot perform
"Python" operations.  This is a fairly strong limitation, and if you
find that you need to use the GIL then it is likely that Cython's
parallelization features are not suitable for you.  It is possible
to raise exceptions from within the loop, however: Cython simply
regains the GIL, raises the exception, and then terminates the loop
on all threads.

It's necessary to explicitly type the loop variable ``i`` as a
C integer.  For a non-parallel loop Cython can infer this, but it
does not currently infer the type of the loop variable for parallel
loops, so leaving ``i`` untyped will lead to compile errors: it would
be a Python object, and therefore unusable without the GIL.
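
As a rough sketch (the names here are illustrative rather than the
exact example file above), such an element-wise loop in Cython syntax
looks something like this:

.. code-block:: cython

    import numpy as np

    cimport cython
    from cython.parallel cimport prange
    from libc.math cimport sin

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def do_sine(double[:, :] input):
        output = np.zeros((input.shape[0], input.shape[1]))
        cdef double[:, :] output_view = output
        cdef Py_ssize_t i, j  # loop variables must be typed C integers

        # Parallelize the outer loop; the inner loop stays serial.
        for i in prange(input.shape[0], nogil=True):
            for j in range(input.shape[1]):
                output_view[i, j] = sin(input[i, j])

        return output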

The C code generated is shown below, for the benefit of experienced
users of OpenMP. It is simplified a little for readability here:

.. code-block:: C

    #pragma omp parallel
    {
        #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j)
        for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
            i = __pyx_t_8;
            /* body goes here */
        }
    }
    
Private variables
-----------------

One useful point to note from the generated C code above is that
variables used in the loops, like ``i`` and ``j``, are marked as
``firstprivate`` and ``lastprivate``.  Within the loop each thread has
its own copy of the data; that copy is initialized from the variable's
value before the loop, and after the loop the "global" copy is set to
the value from the last iteration (i.e. as if the loop had been run
in serial).

The basic rules that Cython applies are:

- C scalar variables within a ``prange`` block are made 
  ``firstprivate`` and ``lastprivate``,

- C scalar variables assigned within a  
  :ref:`parallel block<parallel-block>`
  are ``private`` (which means they can't be used to pass data in
  and out of the block),

- array variables (e.g. memoryviews) are not made private.  Instead
  Cython assumes that you have structured your loop so that each iteration
  is acting on different data,

- Python objects are also not made private, although access to them
  is controlled via Python's GIL.

Cython does not currently provide much opportunity to override these
choices.  The sketch below illustrates how the first two rules affect
what data is visible after a loop or block.
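
Here is a minimal sketch (a hypothetical function, not one of the
example files) showing the effect of the first two rules:

.. code-block:: cython

    from cython.parallel cimport parallel, prange

    def illustrate_private_rules():
        cdef Py_ssize_t i
        cdef double scale = 2.0
        cdef double last = 0, tmp = 0

        for i in prange(100, nogil=True):
            last = scale * i   # firstprivate and lastprivate

        # "last" now holds the value from the final iteration,
        # exactly as if the loop had been run in serial.

        with nogil, parallel():
            tmp = scale * 3.0  # private: each thread has its own copy,
                               # and the value is lost after the block

        return last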
  
Reductions
==========

The second most common parallel operation in Cython is the "reduction"
operation.  A common example is to accumulate a sum over the whole
array, such as in the calculation of a vector norm below:

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/norm.pyx

    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/norm.py

Cython is able to infer reductions for ``+=``, ``*=``, ``-=``,
``&=``, ``|=``, and ``^=``.  These only apply to C scalar variables,
so you cannot easily reduce a 2D memoryview to a 1D memoryview, for
example.
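
In Cython syntax, a minimal sketch of such a reduction (illustrative,
not the exact example file above) is:

.. code-block:: cython

    from cython.parallel cimport prange

    def total_sum(double[:] x):
        cdef double total = 0
        cdef Py_ssize_t i

        # "total += ..." on a C scalar is recognised as a reduction:
        # each thread accumulates its own partial sum and the partial
        # sums are combined at the end of the loop.
        for i in prange(x.shape[0], nogil=True):
            total += x[i]

        return total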

The C code generated is approximately:

.. code-block:: C

    #pragma omp parallel reduction(+:total)
    {
        #pragma omp for firstprivate(i) lastprivate(i)
        for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
            i = __pyx_t_2;
            total = total + /* some indexing code */;
            
        }
    }

.. _parallel-block:
    
``parallel`` blocks
===================

Much less frequently used than ``prange`` is Cython's ``parallel``
operator.  ``parallel`` generates a block of code that is run
on multiple threads at once.  Unlike ``prange``, however, work is
not automatically divided between threads.

Here we present three common uses for the ``parallel`` block:

Stringing together prange blocks
--------------------------------

There is some overhead in entering and leaving a parallelized section.
Therefore, if you have multiple parallel sections with small
serial sections in between, it can be more efficient to
write one large parallel block.  Any small serial
sections are then executed by every thread (i.e. duplicated), but the
overall overhead is reduced.

In the example below we do an in-place normalization of a vector.
The first parallel loop calculates the norm, the second parallel
loop applies the norm to the vector, and we avoid jumping in and out of serial
code in between.

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/normalize.pyx
        
    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/normalize.py
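
The overall shape of this pattern in Cython syntax is roughly as
follows (a sketch along the lines of the example above, not the
exact file):

.. code-block:: cython

    from cython.parallel cimport parallel, prange
    from libc.math cimport sqrt

    def normalize_in_place(double[:] x):
        cdef Py_ssize_t i
        cdef double total = 0
        cdef double norm

        with nogil, parallel():
            # First loop: accumulate the sum of squares (a reduction).
            for i in prange(x.shape[0]):
                total += x[i] * x[i]

            # Small "serial" section, duplicated on every thread.
            norm = sqrt(total)

            # Second loop: apply the normalization.
            for i in prange(x.shape[0]):
                x[i] /= norm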

The C code is approximately:

.. code-block:: C

    #pragma omp parallel private(norm) reduction(+:total)
    {
        /* some calculations of array size... */
        #pragma omp for firstprivate(i) lastprivate(i)
        for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
            /* ... */
        }
        norm = sqrt(total);
        #pragma omp for firstprivate(i) lastprivate(i)
        for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
            /* ... */
        }
    }

Allocating "scratch space" for each thread
------------------------------------------

Suppose that each thread requires a small amount of scratch space
to work in.  The threads cannot share scratch space because that would
lead to data races.  In this case the allocation and deallocation
are done in a parallel section (so they occur on a per-thread basis)
surrounding a loop which then uses the scratch space.

Our example here uses C++ to find the median of each column in
a 2D array (just a parallel version of ``numpy.median(x, axis=0)``).
We must reorder each column to find its median, but we don't want
to modify the input array.  Therefore, we allocate a C++ vector per
thread to use as scratch space, and work in that.  For efficiency
the vector is allocated outside the ``prange`` loop.

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/median.pyx
        
    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/median.py

.. note::

    The pure Python and classic syntax examples are not quite
    identical, since the pure Python syntax does not support C++
    ``new``, so we allocate the scratch space slightly differently.

In the generated code the ``scratch`` variable is marked as
``private`` in the outer parallel block.  A rough outline is:

.. code-block:: C++

    #pragma omp parallel private(scratch)
    {
        scratch = new std::vector<double> ((x.shape[0]))
        #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) lastprivate(median_it)
        for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_10; __pyx_t_9++){
            i = __pyx_t_9;
            /* implementation goes here */
        }
        /* some exception handling detail omitted */
        delete scratch;
    }
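
The same per-thread allocation pattern can also be written without
C++.  A simplified sketch (not one of the example files) using
``malloc``/``free`` from ``libc.stdlib`` looks like this:

.. code-block:: cython

    from cython.parallel cimport parallel, prange
    from libc.stdlib cimport malloc, free

    def process_columns(double[:, :] x):
        cdef Py_ssize_t i, j
        cdef double *scratch

        with nogil, parallel():
            # One scratch buffer per thread, allocated once.
            # (Checking for allocation failure is omitted here.)
            scratch = <double *> malloc(sizeof(double) * x.shape[0])
            try:
                for i in prange(x.shape[1]):
                    # Copy column i into this thread's scratch space
                    # and work on the copy, leaving "x" untouched.
                    for j in range(x.shape[0]):
                        scratch[j] = x[j, i]
                    # ... do something with "scratch" here ...
            finally:
                free(scratch)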

Performing different tasks on each thread
-----------------------------------------

Finally, if you manually specify the number of threads and
then identify each thread using ``omp.get_thread_num()``,
you can split work between threads manually.  This is
a fairly rare use-case in Cython, and probably suggests
that the ``threading`` module is more suitable for what
you're trying to do.  However, it is an option.

.. tabs::

    .. group-tab:: Cython

        .. literalinclude:: ../../examples/tutorial/parallelization/manual_work.pyx
           :lines: 2-
        
    .. group-tab:: Pure Python

        .. literalinclude:: ../../examples/tutorial/parallelization/manual_work.py
           :lines: 2-

The utility of this kind of block is limited by the fact that
variables assigned to in the block are ``private`` to each thread,
so cannot be accessed in the serial section afterwards.

The generated C code for the example above is fairly simple:

.. code-block:: C

    #pragma omp parallel private(thread_num) 
    {
        thread_num = omp_get_thread_num();
        switch (thread_num) {
            /* ... */
        }
    }
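
For reference, a minimal Cython sketch that produces code of this
shape (the task functions are placeholders, not part of the example
above) might look like:

.. code-block:: cython

    from cython.parallel cimport parallel
    cimport openmp

    cdef void long_running_task1() noexcept nogil:
        pass  # placeholder for real work

    cdef void long_running_task2() noexcept nogil:
        pass  # placeholder for real work

    def do_two_tasks():
        cdef int thread_num
        with nogil, parallel(num_threads=2):
            # Each thread asks for its own id and picks its task.
            thread_num = openmp.omp_get_thread_num()
            if thread_num == 0:
                long_running_task1()
            elif thread_num == 1:
                long_running_task2()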