.. highlight:: cython
.. py:module:: cython.parallel
.. _parallel:
**********************************
Using Parallelism
**********************************
.. include:: ../two-syntax-variants-used
Cython supports native parallelism through the :py:mod:`cython.parallel`
module. To use this kind of parallelism, the GIL must be released
(see :ref:`Releasing the GIL <nogil>`).
OpenMP is currently the only supported backend; more backends might be added later.
.. NOTE:: Functionality in this module may only be used from the main thread
or parallel regions due to OpenMP restrictions.
.. function:: prange([start,] stop[, step][, nogil=False][, schedule=None[, chunksize=None]][, num_threads=None])
This function can be used for parallel loops. OpenMP automatically
starts a thread pool and distributes the work according to the schedule
used.
Thread-locality and reductions are automatically inferred for variables.
If you assign to a variable in a ``prange`` block, it becomes lastprivate, meaning that the
variable will contain the value from the last iteration. If you use an
in-place operator on a variable, it becomes a reduction, meaning that the
values from the thread-local copies of the variable will be reduced with
the operator and assigned to the original variable after the loop. The
index variable is always lastprivate.
Variables assigned to in a parallel with block will be private and unusable
after the block, as there is no concept of a sequentially last value.
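
The inference rules above can be modelled in plain Python. The sketch below is
not Cython code and does not use OpenMP; ``simulate_prange_sum`` and its
parameters are hypothetical names chosen for illustration, and the round-robin
split is an arbitrary assumption. It gives each simulated thread its own copy of
the reduction variable and combines the copies after the loop, which is roughly
what happens for ``total += values[i]`` inside a ``prange`` block.

.. code-block:: python

    def simulate_prange_sum(values, num_threads=4):
        """Model how ``total += values[i]`` behaves inside a prange loop."""
        n = len(values)
        # Each simulated thread starts with its own private copy of the
        # reduction variable, initialised to the identity of ``+``.
        partial_sums = [0] * num_threads
        last_i = None  # models the lastprivate index variable

        for thread_id in range(num_threads):
            # Round-robin split of the iteration space (an arbitrary choice
            # for this model; the real split depends on the schedule).
            for i in range(thread_id, n, num_threads):
                partial_sums[thread_id] += values[i]  # in-place op: reduction
                if i == n - 1:
                    last_i = i  # value from the sequentially last iteration

        # After the loop, the thread-local copies are reduced with the
        # operator and the result is assigned to the original variable.
        total = sum(partial_sums)
        return total, last_i


    total, last_i = simulate_prange_sum(list(range(10)))
    print(total, last_i)  # 45 9

Whichever simulated thread runs the final index, ``last_i`` ends up holding the
value from the sequentially last iteration, which is the lastprivate behaviour
described above.
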
:param start:
The index indicating the start of the loop (same as the start argument in range).
:param stop:
The index indicating when to stop the loop (same as the stop argument in range).
:param step:
An integer giving the step of the sequence (same as the step argument in range).
It must not be 0.
:param nogil:
This function can only be used with the GIL released.
If ``nogil`` is true, the loop will be wrapped in a nogil section.
:param schedule:
The ``schedule`` is passed to OpenMP and can be one of the following:
static:
If a chunksize is provided, iterations are distributed to all
threads ahead of time in blocks of the given chunksize. If no
chunksize is given, the iteration space is divided into chunks that
are approximately equal in size, and at most one chunk is assigned
to each thread in advance.
This is most appropriate when the scheduling overhead matters and
the problem can be cut down into equally sized chunks that are
known to have approximately the same runtime.
dynamic:
The iterations are distributed to threads as they request them,
with a default chunk size of 1.
This is suitable when the runtime of each chunk differs and is not
known in advance and therefore a larger number of smaller chunks
is used in order to keep all threads busy.
guided:
As with dynamic scheduling, the iterations are distributed to
threads as they request them, but with decreasing chunk size. The
size of each chunk is proportional to the number of unassigned
iterations divided by the number of participating threads,
decreasing to 1 (or the chunksize if provided).
This has an advantage over pure dynamic scheduling when it turns
out that the last chunks take more time than expected or are
otherwise being badly scheduled, so that most threads sit idle
while the last chunks are being worked on by only a small
number of threads.
runtime:
The schedule and chunk size are taken from the runtime scheduling
variable, which can be set through the ``openmp.omp_set_schedule()``
function call, or the ``OMP_SCHEDULE`` environment variable. Note that
this essentially disables any static compile time optimisations of
the scheduling code itself and may therefore show a slightly worse
performance than when the same scheduling policy is statically
configured at compile time.
The default schedule is implementation defined. For more information consult
the OpenMP specification [#]_.
.. auto The decision regarding scheduling is delegated to the
.. compiler and/or runtime system. The programmer gives
.. the implementation the freedom to choose any possible
.. mapping of iterations to threads in the team.
:param num_threads:
The ``num_threads`` argument indicates how many threads the team should consist of. If not given,
OpenMP will decide how many threads to use. Typically this is the number of cores available on
the machine. However, this may be controlled through the ``omp_set_num_threads()`` function, or
through the ``OMP_NUM_THREADS`` environment variable.
:param chunksize:
The ``chunksize`` argument indicates the chunksize to be used for dividing the iterations among threads.
This is only valid for ``static``, ``dynamic`` and ``guided`` scheduling, and is optional. Different chunksizes
may give substantially different performance results, depending on the schedule, the load balance it provides,
the scheduling overhead and the amount of false sharing (if any).
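
To make the difference between the schedules more concrete, here is a small
plain-Python model of how iterations might be divided. It is only a sketch under
stated assumptions: ``static_chunks`` and ``guided_chunk_sizes`` are hypothetical
helper names, and the guided formula (remaining iterations divided by the number
of threads, rounded up) is one plausible reading of "proportional"; the exact
sizes are up to the OpenMP implementation.

.. code-block:: python

    def static_chunks(n_iterations, num_threads, chunksize=None):
        """Return, per thread, the list of iteration indices it would run."""
        if chunksize is None:
            # One roughly equal chunk per thread (ceiling division).
            chunksize = max(1, -(-n_iterations // num_threads))
        assignment = [[] for _ in range(num_threads)]
        for chunk_no, start in enumerate(range(0, n_iterations, chunksize)):
            stop = min(start + chunksize, n_iterations)
            # Chunks are handed out round-robin in thread order.
            assignment[chunk_no % num_threads].extend(range(start, stop))
        return assignment


    def guided_chunk_sizes(n_iterations, num_threads, chunksize=1):
        """Return the sequence of chunk sizes a guided schedule might use."""
        sizes = []
        remaining = n_iterations
        while remaining > 0:
            # Proportional to the unassigned iterations per thread, but never
            # below the minimum chunk size (except for the final chunk).
            size = max(chunksize, -(-remaining // num_threads))
            size = min(size, remaining)
            sizes.append(size)
            remaining -= size
        return sizes


    print(static_chunks(10, 3))
    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    print(static_chunks(10, 3, chunksize=2))
    # [[0, 1, 6, 7], [2, 3, 8, 9], [4, 5]]
    print(guided_chunk_sizes(20, 4))
    # [5, 4, 3, 2, 2, 1, 1, 1, 1]

A dynamic schedule uses the same fixed-size chunks as the static case with a
chunksize, but which thread receives which chunk is decided at run time, so it
cannot be tabulated ahead of time like this.
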
Example with a reduction:
.. tabs::
.. group-tab:: Pure Python
.. literalinclude:: ../../examples/userguide/parallelism/simple_sum.py
.. group-tab:: Cython
.. literalinclude:: ../../examples/userguide/parallelism/simple_sum.pyx
Example with a :term:`typed memoryview<Typed memoryview>` (e.g. a NumPy array):
.. tabs::
.. group-tab:: Pure Python
.. literalinclude:: ../../examples/userguide/parallelism/memoryview_sum.py
.. group-tab:: Cython
.. literalinclude:: ../../examples/userguide/parallelism/memoryview_sum.pyx
.. function:: parallel(num_threads=None)
This directive can be used as part of a ``with`` statement to execute code
sequences in parallel. This is currently useful to set up thread-local
buffers used by a prange. A contained prange will be a worksharing loop
that is not parallel, so any variable assigned to in the parallel section
is also private to the prange. Variables that are private in the parallel
block are unavailable after the parallel block.
Example with thread-local buffers:
.. tabs::
.. group-tab:: Pure Python
.. literalinclude:: ../../examples/userguide/parallelism/parallel.py
.. group-tab:: Cython
.. literalinclude:: ../../examples/userguide/parallelism/parallel.pyx
In the future, sections might be supported in parallel blocks to distribute
separate sections of work among threads.
.. function:: threadid()
Returns the id of the thread. For n threads, the ids will range from 0 to
n-1.
Compiling
=========
To actually use the OpenMP support, you need to tell the C or C++ compiler to
enable OpenMP. For gcc this can be done as follows in a ``setup.py``:
.. tabs::
.. group-tab:: Pure Python
.. literalinclude:: ../../examples/userguide/parallelism/setup_py.py
.. group-tab:: Cython
.. literalinclude:: ../../examples/userguide/parallelism/setup_pyx.py
For the Microsoft Visual C++ compiler, use ``'/openmp'`` instead of ``'-fopenmp'``.
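
If one ``setup.py`` has to serve both compilers, the flag can be chosen at build
time. The following is a minimal sketch, not the project's recommended setup:
``openmp_flags`` and ``build`` are hypothetical helpers, the module name
``hello`` is a placeholder, and treating ``win32`` as "Microsoft Visual C++" and
everything else as "gcc-compatible" is a simplifying assumption that does not
hold for every toolchain.

.. code-block:: python

    import sys


    def openmp_flags(platform=sys.platform):
        """Return (extra_compile_args, extra_link_args) for enabling OpenMP."""
        if platform == "win32":
            # Assumes the Microsoft Visual C++ compiler; no link flag is used.
            return ["/openmp"], []
        # Assumes gcc or a compiler that accepts the same flag.
        return ["-fopenmp"], ["-fopenmp"]


    def build():
        """Hypothetical setup.py body; call it at module level in setup.py."""
        from setuptools import Extension, setup
        from Cython.Build import cythonize

        compile_args, link_args = openmp_flags()
        ext = Extension(
            "hello",        # placeholder module name
            ["hello.pyx"],  # placeholder source file
            extra_compile_args=compile_args,
            extra_link_args=link_args,
        )
        setup(ext_modules=cythonize([ext]))

The imports of ``setuptools`` and ``Cython.Build`` are kept inside ``build()`` so
that the flag selection can be checked on its own without either package being
installed.
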
Breaking out of loops
=====================
The ``parallel`` with blocks and ``prange`` blocks support the statements ``break``, ``continue`` and
``return`` in nogil mode. Additionally, it is valid to use a ``with gil`` block
inside these blocks, and have exceptions propagate from them.
However, because the blocks use OpenMP, they cannot simply be exited, so the
exiting procedure is best-effort. For ``prange()`` this means that the loop
body is skipped after the first break, return or exception for any subsequent
iteration in any thread. It is undefined which value shall be returned if
multiple different values may be returned, as the iterations are in no
particular order:
.. tabs::
.. group-tab:: Pure Python
.. literalinclude:: ../../examples/userguide/parallelism/breaking_loop.py
.. group-tab:: Cython
.. literalinclude:: ../../examples/userguide/parallelism/breaking_loop.pyx
In the example above it is undefined whether an exception shall be raised,
whether it will simply break or whether it will return 2.
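
The best-effort exit can be pictured with a plain-Python model. This is a sketch
only: ``simulate_prange_return`` is a hypothetical name, and real iterations run
concurrently rather than in one interleaved sequence. The ``order`` argument
stands in for the unspecified order in which iterations happen to execute.

.. code-block:: python

    def simulate_prange_return(order, returns):
        """Model a prange body that may ``return`` a value.

        ``order`` is the (arbitrary) order in which iterations are executed
        and ``returns`` maps an iteration index to the value it would return.
        """
        result = None
        stopped = False
        executed = []
        for i in order:
            if stopped:
                # After the first exit, the body of every later iteration
                # is skipped, but the loop itself still runs to completion.
                continue
            executed.append(i)
            if i in returns:
                result = returns[i]
                stopped = True
        return result, executed


    returns = {2: "a", 7: "b"}
    print(simulate_prange_return([0, 1, 2, 3, 7], returns))  # ('a', [0, 1, 2])
    print(simulate_prange_return([7, 3, 2, 1, 0], returns))  # ('b', [7])

The same loop yields a different result depending only on the execution order,
which is why the returned value is undefined when several iterations may exit.
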
Using OpenMP Functions
======================
OpenMP functions can be used by cimporting ``openmp``:
.. tabs::
.. group-tab:: Pure Python
.. literalinclude:: ../../examples/userguide/parallelism/cimport_openmp.py
:lines: 3-
.. group-tab:: Cython
.. literalinclude:: ../../examples/userguide/parallelism/cimport_openmp.pyx
:lines: 3-
.. rubric:: References
.. [#] https://www.openmp.org/mp-documents/spec30.pdf