GFOR: Parallel For-Loops {#page_gfor}
========================

[TOC]

Run many independent loops simultaneously on the GPU or device.

Introduction {#gfor_intro}
============

The gfor-loop construct may be used to simultaneously launch all of
the iterations of a for-loop on the GPU or device, as long as the
iterations are independent. While the standard for-loop performs each
iteration sequentially, ArrayFire's gfor-loop performs each iteration
at the same time (in parallel). ArrayFire does this by tiling out the
values of all loop iterations and then performing computation on those
tiles in one pass.

You can think of `gfor` as performing auto-vectorization of your
code, e.g. you write a gfor-loop that increments every element of a
vector but behind the scenes ArrayFire rewrites it to operate on
the entire vector in parallel.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
for (int i = 0; i < n; ++i)
    A(i) = A(i) + 1;

gfor (seq i, n)
    A(i) = A(i) + 1;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Behind the scenes, ArrayFire rewrites your code into this
equivalent and faster version:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = A + 1;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is best to vectorize computation as much as possible to avoid
the overhead in both for-loops and gfor-loops.

To see another example, you could run an FFT on every 2D slice of a
volume in a for-loop, or you could "vectorize" and simply do it all
in one gfor-loop operation:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
for (int i = 0; i < N; ++i)
    A(span,span,i) = fft2(A(span,span,i)); // runs each FFT in sequence

gfor (seq i, N)
    A(span,span,i) = fft2(A(span,span,i)); // runs N FFTs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are three formats for instantiating gfor-loops.
-# gfor(var,n) Creates a sequence _{0, 1, ..., n-1}_
-# gfor(var,first,last) Creates a sequence _{first, first+1, ..., last}_
-# gfor(var,first,incr,last) Creates a sequence _{first, first+incr, first+2*incr, ..., last}_

So all of the following represent the same sequence: _0,1,2,3,4_

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, 5)
gfor (seq i, 0, 4)
gfor (seq i, 0, 1, 4)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
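
As a host-side illustration of the three formats, the sketch below expands each form into the indices it covers. It is plain C++ rather than ArrayFire code: `gfor_indices` is a hypothetical helper, not part of the library, and it assumes the `(first, incr, last)` argument order and the inclusive `last` described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: lists first, first+incr, first+2*incr, ... up to and including last.
std::vector<int> gfor_indices(int first, int incr, int last)
{
    assert(incr > 0); // this sketch only models increasing sequences
    std::vector<int> out;
    for (int i = first; i <= last; i += incr) out.push_back(i);
    return out;
}

// Models the gfor(var, first, last) form: unit increment.
std::vector<int> gfor_indices(int first, int last) { return gfor_indices(first, 1, last); }

// Models the gfor(var, n) form: starts at 0 and stops at n-1.
std::vector<int> gfor_indices(int n) { return gfor_indices(0, 1, n - 1); }
```

Under these assumptions, `gfor_indices(5)`, `gfor_indices(0, 4)` and `gfor_indices(0, 1, 4)` all yield the same five indices.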

More examples:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1, n, n);
array B = constant(1, 1, n);
gfor (seq k, 0, n-1) {
    B(span, k) = sum(A(span, k) * A(span, k)); // inner product
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = randu(n,m);
array B = constant(0,n,m);
gfor (seq k, 0, m-1) {
    B(span,k) = fft(A(span,k));
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Usage {#gfor}
=====

User Functions called within GFOR {#gfor_user_functions}
---------------------------------

If you have defined a function that you want to call within a GFOR loop, then
that function must meet all the conditions described on this page in order
to work as expected.

Consider the (trivial) example below. The function compute() has to satisfy all
requirements for GFOR usage, so you cannot use if-else conditions inside
it.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array compute(array A, array B, float ep)
{
    array H;
    if (ep > 0) H = (A * B) / ep; // BAD
    else        H = A * 0;
    return H;
}

int m = 2, n = 3;
array A = randu(m, n);
array B = randu(m, n);
float ep = 2.35f;
array H = constant(0,m,n);
gfor (seq ii, n)
    H(span,ii) = compute(A(span,ii), B(span,ii), ep);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Iterator {#gfor_iterator}
------------

The iterator can be involved in expressions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1,n,n,m);
array B = constant(1,n,n);
gfor (seq k, m)
    A(span,span,k) = (k+1)*B + sin(k+1); // expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Iterator definitions can include arithmetic expressions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1,n,n,m);
array B = constant(1,n,n);
gfor (seq k, m/4, m-m/4)
    A(span,span,k) = k*B + sin(k+1); // expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Subscripting {#gfor_subscripting}
------------

More complicated subscripting is supported.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1,n,n,m);
array B = constant(1,n,10);
gfor (seq k, m)
    A(span,seq(10),k) = k*B; // subscripting, seq(10) generates indices [0,9]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Iterators can be combined with arithmetic in subscripts.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = randu(n,m);
array B = constant(1,n,m);
gfor (seq k, 1, m-1)
    B(span,k) = A(span,k-1);

A = randu(n,2*m);
B = constant(1,n,m);
gfor (seq k, m)
    B(span,k) = A(span,2*(k+1)-1);

A = randu(n,2*m);
B = constant(1,n,m);
gfor (seq k, m)
    B(span,k) = A(span,floor(k+.2));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
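
To make the three subscript expressions above concrete, the following host-side sketch lists which source column each iterator value reads. It is plain C++, not ArrayFire code; `source_columns` and the three mapping functions are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical helper: for iterator values first..last (inclusive),
// record which source column each one reads.
std::vector<int> source_columns(int first, int last, const std::function<int(int)>& subscript)
{
    assert(first <= last);
    std::vector<int> cols;
    for (int k = first; k <= last; ++k) cols.push_back(subscript(k));
    return cols;
}

// The three subscript expressions used in the examples above.
int shifted(int k)  { return k - 1; }                                  // A(span, k-1)
int odd_only(int k) { return 2 * (k + 1) - 1; }                        // A(span, 2*(k+1)-1)
int floored(int k)  { return static_cast<int>(std::floor(k + 0.2)); } // A(span, floor(k+.2))
```

With `m = 4`, the second mapping reads columns 1, 3, 5 and 7, all of which lie inside the `2*m = 8` columns of `A`.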

In-Loop Reuse {#gfor_in_loop}
-------------

Within the loop, you can use a result you just computed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    A(span,k) = 4 * B(span,k);
    C(span,k) = 4 * A(span,k); // use it again
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, it is more efficient to store the value in a temporary variable:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    array a = 4 * B(span,k);
    A(span,k) = a;
    C(span,k) = 4 * a;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In-Place Computation {#gfor_in_place_computation}
--------------------

In some cases, GFOR behaves differently than the typical sequential
FOR-loop. For example, you can read and modify a result in place as long as
the accesses are independent.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n);
gfor (seq k, n)
    A(span,k) = sin(k) + A(span,k);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The subscripting behaviors of arrays, such as collapsing trailing dimensions,
also work with GFOR.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m,k);
m = m * k; // precompute the combined extent of the last two dimensions
gfor (seq k, m)
    A(span,span,k) = 4 * A(span,span,k); // collapse last dimension
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Random Data Generation {#gfor_random}
----------------------

Random data should always be generated outside the GFOR loop, because GFOR
only passes over the body of the loop once. Therefore, any call to randu()
inside the body of the loop will result in the same random matrix being
assigned to every iteration of the loop.

For example, in the following trivial code, all columns of `B` are identical
because `A` is only evaluated once:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq ii, n) {
    array A = randu(3,1);
    B(span,ii) = A;
}
print(B);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    B =
    0.1209 0.1209 0.1209
    0.6432 0.6432 0.6432
    0.8746 0.8746 0.8746

This can be rectified by bringing the random number generation outside the
loop, as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = randu(3,n);
gfor (seq ii, n)
    B(span,ii) = A(span,ii);
print(B);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    B =
    0.0892 0.1655 0.7807
    0.5626 0.5173 0.2932
    0.5664 0.5898 0.1391

This is a trivial example, but it demonstrates the principle that random numbers
should be generated outside the loop in most cases.
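
The effect can be modeled on the host with the C++ standard library. In this sketch, which is not ArrayFire code, `fill_single_pass` is a hypothetical stand-in for a loop body that runs once and has its result tiled across all iterations, while `fill_pregenerated` draws a distinct value for every element up front. The `Matrix` alias, both function names and the `seed` parameter are assumptions made for the illustration.

```cpp
#include <cassert>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // indexed as Matrix[column][row]

// Models a body evaluated once: one random column is drawn and copied to every column.
Matrix fill_single_pass(int rows, int cols, unsigned seed)
{
    assert(rows > 0 && cols > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> column(rows);
    for (float& v : column) v = dist(gen);
    return Matrix(cols, column);
}

// Models generating all random data before the loop: every element gets its own draw.
Matrix fill_pregenerated(int rows, int cols, unsigned seed)
{
    assert(rows > 0 && cols > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    Matrix m(cols, std::vector<float>(rows));
    for (auto& column : m)
        for (float& v : column) v = dist(gen);
    return m;
}
```

In the first function every column is a copy of the same draw, which mirrors the identical columns printed above; in the second, each column is filled independently.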

Restrictions {#gfor_restrictions}
============

This preliminary implementation of GFOR has the following restrictions.

Iteration independence {#gfor_iteration_independence}
----------------------

The most important property of the loop body is that each iteration must be
independent of the other iterations. Note that accessing the result of a
separate iteration produces undefined behavior.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array B = randu(3);
gfor (seq k, n)
    B = B + k; // bad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No conditional statements {#gfor_no_cond}
-------------------------

The body of the loop cannot contain conditional statements (i.e. no
branching). However, you can often find ways to overcome this
restriction. Consider the following two examples:

Example 1: Problem

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,m);
gfor (seq k, m) {
    if (k > 10) A(span,k) = k + 1; // bad
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can overcome this restriction by expressing the conditional statement
as a multiplication by logical values. For instance, the block of code above
can be converted to run as follows, without error:

Example 1: Solution

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, m) {
    array condition = (k > 10); // good
    A(span,k) = (!condition).as(f32) * A(span,k) + condition.as(f32) * (k + 1);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
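
The equivalence between the branch and the multiplication by logical values can be checked on the host. The sketch below is plain C++, not ArrayFire code; `update_branching`, `update_masked` and the `threshold` parameter are hypothetical names, and each vector element stands in for one column of `A`.

```cpp
#include <cassert>
#include <vector>

// Branching version: what the if-statement in the problem example expresses.
std::vector<float> update_branching(const std::vector<float>& a, int threshold)
{
    assert(threshold >= 0);
    std::vector<float> out(a);
    for (int k = 0; k < static_cast<int>(a.size()); ++k)
        if (k > threshold) out[k] = static_cast<float>(k + 1);
    return out;
}

// Branch-free version: the condition becomes a 0/1 weight on each alternative.
std::vector<float> update_masked(const std::vector<float>& a, int threshold)
{
    assert(threshold >= 0);
    std::vector<float> out(a.size());
    for (int k = 0; k < static_cast<int>(a.size()); ++k) {
        const float condition = static_cast<float>(k > threshold); // 1.0f or 0.0f
        out[k] = (1.0f - condition) * a[k] + condition * static_cast<float>(k + 1);
    }
    return out;
}
```

One design consequence is that the masked form evaluates both alternatives for every element, so a NaN or infinity in the unselected alternative still propagates, because multiplying it by zero does not give zero.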

Another example of overcoming the conditional statement restriction in GFOR is
as follows:

Example 2: Problem

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1,n,n,m);
array B = randu(n,n);
gfor (seq k, 4) {
    if ((k % 2) != 0)
        A(span,span,k) = B + k; // bad
    else
        A(span,span,k) = B * k;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead, you can make two passes over the same data, each pass performing one
branch.

Example 2: Solution

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m);
B = randu(n,n);
gfor (seq k, 0, 2, 3) // even k: 0, 2
    A(span,span,k) = B * k;
gfor (seq k, 1, 2, 3) // odd k: 1, 3
    A(span,span,k) = B + k;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nested loop restrictions {#gfor_nested_loop}
------------------------

Nesting GFOR-loops within GFOR-loops is unsupported. You may interleave
FOR-loops as long as they are completely independent of the GFOR-loop
iterator.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    gfor (seq j, m) { // bad
        // ...
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nesting FOR-loops within GFOR-loops is supported, as long as the GFOR iterator
is not used in the FOR-loop iterator, as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    for (int j = 0; j < (m+k); j++) { // bad
        // ...
    }
}

gfor (seq k, n) {
    for (int j = 0; j < m; j++) { // good
        // ...
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nesting GFOR-loops inside of FOR-loops is fully supported.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
for (int k = 0; k < n; k++) {
    gfor (seq j, m) { // good
        // ...
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No logical indexing {#gfor_no_logical}
-------------------

Logical indexing like the following is not supported:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, n) {
    array B = A(span,i);
    array tmp = B(B > .5); // bad
    D(i) = sum(tmp);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The problem is that each GFOR tile may end up with a different number of
elements, which GFOR cannot yet handle.

Similar to the workaround for conditional statements, it might work to use
masked arithmetic:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, n) {
    array B = A(span,i);
    array mask = B > .5;
    D(i) = sum(mask * B);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
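
The reason masked arithmetic sidesteps the restriction can be seen in a host-side model. The sketch below is plain C++, not ArrayFire code, and `sum_filtered` and `sum_masked` are hypothetical names: the filtered version builds a temporary whose length depends on the data, while the masked version always touches every element.

```cpp
#include <cassert>
#include <vector>

// Logical indexing: gather the elements above the threshold, then sum them.
// The length of tmp varies with the data.
float sum_filtered(const std::vector<float>& b, float threshold)
{
    std::vector<float> tmp;
    for (float v : b)
        if (v > threshold) tmp.push_back(v);
    float total = 0.0f;
    for (float v : tmp) total += v;
    return total;
}

// Masked arithmetic: every element contributes, weighted by a 0/1 mask,
// so the amount of work is the same for every input.
float sum_masked(const std::vector<float>& b, float threshold)
{
    float total = 0.0f;
    for (float v : b) {
        const float mask = static_cast<float>(v > threshold); // 1.0f or 0.0f
        total += mask * v;
    }
    return total;
}
```

For finite inputs both functions return the same sum, since the masked-out elements contribute exactly zero.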

Sub-assignment with scalars and logical masks is supported:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, n) {
    array a = A(span,i);
    a(isNaN(a)) = 0;
    A(span,i) = a;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory considerations {#gfor_memory}
=====================

Since each computation is done in parallel for all iterator values,
you need to have enough device memory available to do all iterations
simultaneously. If the problem exceeds available memory, it will trigger
"out of memory" errors.

You can work around the memory limitations of your GPU or device by
breaking the GFOR loop up into segments; however, you might want to
consider using a GPU or device with more memory.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// BEFORE
gfor (seq k, 400) {
    array B = A(span,k);
    C(span,span,k) = matmulNT(B, B); // outer product expansion runs out of memory
}

// AFTER
for (int kk = 0; kk < 400; kk += 100) {
    gfor (seq k, kk, kk+99) { // four batches of 100
        array B = A(span,k);
        C(span,span,k) = matmulNT(B, B); // now several smaller problems fit in device memory
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
435 }
436 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~