GFOR: Parallel For-Loops {#page_gfor}
========================

Run many independent loops simultaneously on the GPU or device.

Introduction {#gfor_intro}
============

The gfor-loop construct may be used to simultaneously launch all of
the iterations of a for-loop on the GPU or device, as long as the
iterations are independent. While the standard for-loop performs each
iteration sequentially, ArrayFire's gfor-loop performs each iteration
at the same time (in parallel). ArrayFire does this by tiling out the
values of all loop iterations and then performing computation on those
tiles in one pass.

You can think of `gfor` as performing auto-vectorization of your
code: you write a gfor-loop that increments every element of a
vector, and behind the scenes ArrayFire rewrites it to operate on
the entire vector in parallel.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
for (int i = 0; i < n; ++i) A(i) = A(i) + 1; // sequential for-loop
gfor (seq i, n)             A(i) = A(i) + 1; // all n iterations in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Behind the scenes, ArrayFire rewrites your code into this
equivalent and faster version:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = A + 1; // one vectorized operation over the whole array
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is best to vectorize computation as much as possible to avoid
the overhead in both for-loops and gfor-loops.
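
To make the loop-versus-vectorized equivalence concrete without a GPU, here is a minimal host-side sketch in standard C++. It uses `std::valarray` purely as a stand-in for an ArrayFire array, and the helper names `increment_loop` and `increment_vectorized` are hypothetical; nothing here is ArrayFire API.

```cpp
#include <cstddef>
#include <valarray>

// Element-by-element version: mirrors the sequential for-loop.
std::valarray<float> increment_loop(std::valarray<float> a) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = a[i] + 1.0f;
    return a;
}

// Whole-array version: mirrors a single vectorized "A = A + 1".
std::valarray<float> increment_vectorized(std::valarray<float> a) {
    a += 1.0f; // one operation applied to every element
    return a;
}
```

Both functions return the same values; the second expresses the work as a single whole-array operation, which is the form that avoids per-iteration overhead.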

To see another example, you could run an FFT on every 2D slice of a
volume in a for-loop, or you could "vectorize" and simply do it all
in one gfor-loop operation:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
for (int i = 0; i < N; ++i)
    A(span,span,i) = fft2(A(span,span,i)); // runs each FFT in sequence

gfor (seq i, N)
    A(span,span,i) = fft2(A(span,span,i)); // runs N FFTs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are three formats for instantiating gfor-loops.
-# gfor(var,n) Creates a sequence _{0, 1, ..., n-1}_
-# gfor(var,first,last) Creates a sequence _{first, first+1, ..., last}_
-# gfor(var,first,incr,last) Creates a sequence _{first, first+incr, first+2*incr, ..., last}_

So all of the following represent the same sequence: _0,1,2,3,4_

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, 5)
gfor (seq i, 0, 4)
gfor (seq i, 0, 1, 4)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
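
The three formats can be modeled on the host to check which indices a given loop header visits. The sketch below is plain C++, not ArrayFire code. The helper name `gfor_indices` is hypothetical, and it assumes a positive increment and an inclusive upper bound, stopping at the largest value that does not exceed `last`.

```cpp
#include <vector>

// Hypothetical helper: indices visited by gfor(var, first, incr, last).
// Assumes incr > 0 and an inclusive upper bound.
std::vector<int> gfor_indices(int first, int incr, int last) {
    std::vector<int> out;
    for (int v = first; v <= last; v += incr)
        out.push_back(v);
    return out;
}

// gfor(var, first, last): unit stride, inclusive bounds.
std::vector<int> gfor_indices(int first, int last) {
    return gfor_indices(first, 1, last);
}

// gfor(var, n): 0, 1, ..., n-1.
std::vector<int> gfor_indices(int n) {
    return gfor_indices(0, 1, n - 1);
}
```

Under these assumptions, `gfor_indices(5)`, `gfor_indices(0, 4)` and `gfor_indices(0, 1, 4)` all produce the indices 0 through 4.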

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1, n, n);
array B = constant(1, 1, n);
gfor (seq k, 0, n-1) {
    B(span, k) = sum(A(span, k) * A(span, k)); // inner product
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
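
As a reference for what the loop above computes, here is a small host-side sketch in standard C++. The `Matrix` alias and the function name `column_self_dot` are hypothetical and chosen only for illustration; the point is that each column's result depends on that column alone.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // Matrix[col][row]

// For each column k, compute the inner product of the column with itself.
// No iteration reads or writes data belonging to another column.
std::vector<float> column_self_dot(const Matrix& a) {
    std::vector<float> b(a.size(), 0.0f);
    for (std::size_t k = 0; k < a.size(); ++k)
        for (float v : a[k])
            b[k] += v * v;
    return b;
}
```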

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1,n,m); // n-by-m input, one column per iteration
array B = constant(0,n,m);
gfor (seq k, 0, m-1) {
    B(span,k) = fft(A(span,k));
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

User Functions called within GFOR {#gfor_user_functions}
---------------------------------

If you have defined a function that you want to call within a GFOR loop, then
that function has to meet all the conditions described on this page in
order to work as expected.

Consider the (trivial) example below. The function compute() has to satisfy all
requirements for GFOR usage, so you cannot use if-else conditions inside
it.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array compute(array A, array B, float ep)
{
    array H;
    if (ep > 0) H = (A * B) / ep; // BAD
    else        H = A * 0;
    return H;
}

array A = randu(m, n);
array B = randu(m, n);
float ep = 2.5f;
array H = constant(0,m,n);
gfor (seq ii, n)
    H(span,ii) = compute(A(span,ii), B(span,ii), ep);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Iterator {#gfor_iterator}
------------

The iterator can be involved in expressions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m);
B = constant(1,n,n);
gfor (seq k, m)
    A(span,span,k) = (k+1)*B + sin(k+1); // expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Iterator definitions can include arithmetic expressions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m);
B = constant(1,n,n);
gfor (seq k, m/4, m-m/4)
    A(span,span,k) = k*B + sin(k+1); // expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Subscripting {#gfor_subscripting}
------------

More complicated subscripting is supported.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m);
B = constant(1,n,10);
gfor (seq k, m)
    A(span,seq(10),k) = k*B; // subscripting, seq(10) generates index [0,9]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Iterators can be combined with arithmetic in subscripts.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = randu(n,m);
array B = constant(1,n,m);
gfor (seq k, 1, m-1)
    B(span,k) = A(span,k-1);         // k starts at 1 so that k-1 stays in range

A = randu(n,2*m);                    // twice as many columns for the reads below
gfor (seq k, m)
    B(span,k) = A(span,2*(k+1)-1);   // odd columns 1, 3, 5, ...
gfor (seq k, m)
    B(span,k) = A(span,floor(k+.2));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
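
When a subscript mixes the iterator with arithmetic, it helps to list which source columns each iteration reads so that the source array can be sized accordingly. The following host-side sketch in standard C++ does that for the two expressions `2*(k+1)-1` and `floor(k+.2)` used above. The function names are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Source column read by iteration k for the subscript 2*(k+1)-1.
int strided_source(int k) { return 2 * (k + 1) - 1; }

// Source column read by iteration k for the subscript floor(k+.2).
int floored_source(int k) { return static_cast<int>(std::floor(k + 0.2)); }

// Collect the source columns for iterations 0..m-1 of a mapping.
std::vector<int> source_columns(int m, int (*mapping)(int)) {
    std::vector<int> cols;
    for (int k = 0; k < m; ++k)
        cols.push_back(mapping(k));
    return cols;
}
```

For `m` iterations the strided mapping reads columns up to `2*m-1`, which is why a source with `2*m` columns is large enough.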

In-Loop Reuse {#gfor_in_loop}
-------------

Within the loop, you can use a result you just computed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    A(span,k) = 4 * B(span,k);
    C(span,k) = 4 * A(span,k); // use it again
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is more efficient, however, to store the value in a temporary variable:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    array a = 4 * B(span,k); // computed once
    A(span,k) = a;
    C(span,k) = 4 * a;       // reuse the temporary instead of reading A back
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In-Place Computation {#gfor_in_place_computation}
--------------------

In some cases, GFOR behaves differently than the typical sequential
FOR-loop. For example, you can read and modify a result in place as long as
the accesses are independent.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n);
gfor (seq k, n)
    A(span,k) = sin(k) + A(span,k);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Array subscripting behaviors, such as collapsing trailing dimensions, also work with GFOR.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m,p);
m = m * p; // precompute the combined extent of the last two dimensions
gfor (seq k, m)
    A(span,span,k) = 4 * A(span,span,k); // collapse last dimension
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Random Data Generation {#gfor_random}
----------------------

Random data should always be generated outside the GFOR loop. This is because
GFOR only passes over the body of the loop once. Therefore,
any calls to randu() inside the body of the loop will result in the same
random matrix being assigned to every iteration of the loop.

For example, in the following trivial code, all columns of `B` are identical
because `A` is only evaluated once:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq ii, n) {
    array A = randu(3,1); // evaluated once, not once per iteration
    B(span,ii) = A;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This can be rectified by bringing the random number generation outside the
loop, as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = randu(3,n);
gfor (seq ii, n) {
    B(span,ii) = A(span,ii);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a trivial example, but it demonstrates the principle that random numbers
should be generated outside the loop in most cases.
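
The effect of a loop body that is evaluated only once can be imitated on the host. The sketch below is a simplified model in standard C++, not ArrayFire's implementation. `Matrix`, `random_inside_body` and `random_outside_loop` are hypothetical names, and `std::mt19937` stands in for the device random number generator.

```cpp
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // Matrix[col][row]

// Models a body that is evaluated once: a single random column is drawn
// and then tiled across all n columns, so every column is identical.
Matrix random_inside_body(int rows, int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> column(rows);
    for (float& v : column) v = dist(gen);
    return Matrix(n, column);
}

// Models generating all the data up front: each column gets its own draws.
Matrix random_outside_loop(int rows, int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    Matrix m(n, std::vector<float>(rows));
    for (auto& column : m)
        for (float& v : column) v = dist(gen);
    return m;
}
```

In the first function every column is a copy of the same draw; in the second, each column is filled from fresh draws.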

Restrictions {#gfor_restrictions}
============

This preliminary implementation of GFOR has the following restrictions.

Iteration independence {#gfor_iteration_independence}
----------------------

The most important property of the loop body is that each iteration must be
independent of the other iterations. Note that accessing the result of a
separate iteration produces undefined behavior.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, 1, n-1)
    A(span,k) = A(span,k-1) + 1; // bad: reads the column written by iteration k-1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
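
To see why a dependence between iterations is a problem, consider a loop in which iteration `k` reads the value written by iteration `k-1`. The host-side sketch below, in standard C++, contrasts sequential execution with a simplified "all at once" model in which every read sees the data as it was before the loop started. This model is an assumption made for illustration and shows only one possible outcome; the actual behavior is undefined. The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sequential semantics: iteration k sees the value written by iteration k-1.
std::vector<int> sequential(std::vector<int> a) {
    for (std::size_t k = 1; k < a.size(); ++k)
        a[k] = a[k - 1] + 1;
    return a;
}

// Simplified parallel model: all reads come from the original values.
std::vector<int> all_at_once(const std::vector<int>& a) {
    std::vector<int> out = a;
    for (std::size_t k = 1; k < a.size(); ++k)
        out[k] = a[k - 1] + 1;
    return out;
}
```

Starting from four zeros, the sequential version builds up a running count while the simplified parallel model does not, so the two disagree as soon as one iteration depends on another.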

No conditional statements {#gfor_no_cond}
-------------------------

The body of the loop may not contain conditional statements (i.e. no
branching). However, you can often find ways to overcome this
restriction. Consider the following two examples:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,m);
gfor (seq k, m) {
    if (k > 10) A(span,k) = k + 1; // bad
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One such trick is to express the conditional statement as a multiplication
by logical values. For instance, the block of code above can be converted
to run as follows, without error:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, m) {
    array condition = (k > 10); // good
    A(span,k) = (!condition).as(f32) * A(span,k) + condition.as(f32) * (k + 1);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
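
The same idea can be checked on the host. The sketch below, in standard C++ with hypothetical function names, compares a branching loop with a version that turns the condition into a 0/1 weight, in the spirit of the masked expression above.

```cpp
#include <vector>

// Branching reference: what the `if` in the loop body is meant to compute.
std::vector<float> with_branch(std::vector<float> a, int threshold) {
    for (int k = 0; k < static_cast<int>(a.size()); ++k)
        if (k > threshold) a[k] = static_cast<float>(k + 1);
    return a;
}

// Branch-free version: the condition becomes a 0/1 weight.
std::vector<float> with_mask(std::vector<float> a, int threshold) {
    for (int k = 0; k < static_cast<int>(a.size()); ++k) {
        const float c = static_cast<float>(k > threshold); // 1 if true, else 0
        a[k] = (1.0f - c) * a[k] + c * static_cast<float>(k + 1);
    }
    return a;
}
```

Because every element goes through the same arithmetic, the masked form has no control flow that depends on the iterator.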

Another example of overcoming the conditional statement restriction in GFOR is
as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array A = constant(1,n,n,m);
array B = randu(n,n);
gfor (seq k, 4) {
    if ((k % 2) == 0)
        A(span,span,k) = B + k; // bad: branch on the iterator
    else
        A(span,span,k) = B * k;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead, you can make two passes over the same data, each pass performing one
branch.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
A = constant(1,n,n,m);
B = randu(n,n);
gfor (seq k, 0, 2, 3)
    A(span,span,k) = B + k; // even iterations: k = 0, 2
gfor (seq k, 1, 2, 3)
    A(span,span,k) = B * k; // odd iterations: k = 1, 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
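
A quick host-side check of the two-pass idea, written in standard C++ with hypothetical function names and a scalar `b` standing in for the matrix `B`: splitting the index range by stride gives the same result as branching on the parity of the index.

```cpp
#include <vector>

// Branching reference over k = 0..count-1: even k adds, odd k multiplies.
std::vector<float> branching(float b, int count) {
    std::vector<float> out(count);
    for (int k = 0; k < count; ++k)
        out[k] = (k % 2 == 0) ? b + k : b * k;
    return out;
}

// Two branch-free passes with stride 2, one per branch.
std::vector<float> two_passes(float b, int count) {
    std::vector<float> out(count);
    for (int k = 0; k < count; k += 2) out[k] = b + k; // even indices
    for (int k = 1; k < count; k += 2) out[k] = b * k; // odd indices
    return out;
}
```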

Nested loop restrictions {#gfor_nested_loop}
------------------------

Nesting GFOR-loops within GFOR-loops is unsupported. You may interleave
FOR-loops as long as they are completely independent of the GFOR-loop
iterator.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    gfor (seq j, m) { // bad
        // ...
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nesting FOR-loops within GFOR-loops is supported, as long as the GFOR iterator
is not used in the FOR loop iterator, as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, n) {
    for (int j = 0; j < (m+k); j++) { // bad
        // ...
    }
}

gfor (seq k, n) {
    for (int j = 0; j < m; j++) { // good
        // ...
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nesting GFOR-loops inside of FOR-loops is fully supported.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
for (int k = 0; k < n; k++) {
    gfor (seq j, m) { // good
        // ...
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No logical indexing {#gfor_no_logical}
-------------------

Logical indexing like the following is not supported:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, n) {
    array B = A(span,i);
    array tmp = B(B > .5); // bad
    D(i) = sum(tmp);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The problem is that every GFOR tile may select a different number of elements,
something which GFOR cannot yet handle.

Similar to the workaround for conditional statements, it might work to use
masked arithmetic:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq i, n) {
    array B = A(span,i);
    array mask = B > .5;
    D(i) = sum(mask * B); // elements that fail the test contribute 0
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
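
The difference between the two approaches can be illustrated on the host. In this standard C++ sketch (function names are hypothetical), the first version gathers a data-dependent number of elements before summing, while the second keeps every element and weights it by a 0/1 mask, so the shape of the work is the same for every input.

```cpp
#include <vector>

// Logical indexing: gather the elements above the threshold, then sum.
// The number of gathered elements depends on the data.
float sum_filtered(const std::vector<float>& b, float threshold) {
    std::vector<float> tmp;
    for (float v : b)
        if (v > threshold) tmp.push_back(v);
    float total = 0.0f;
    for (float v : tmp) total += v;
    return total;
}

// Masked arithmetic: every element takes part, weighted by a 0/1 mask.
float sum_masked(const std::vector<float>& b, float threshold) {
    float total = 0.0f;
    for (float v : b)
        total += static_cast<float>(v > threshold) * v;
    return total;
}
```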

Sub-assignment with scalars and logical masks is supported:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory considerations {#gfor_memory}
=====================

Since each computation is done in parallel for all iterator values,
you need to have enough card memory available to do all iterations
simultaneously. If the problem exceeds memory, it will trigger "out of
memory" errors.

You can work around the memory limitations of your GPU or device by
breaking the GFOR loop up into segments; however, you might want to
consider using a GPU or device with more memory.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
gfor (seq k, 400) {
    array B = A(span,k);
    C(span,span,k) = matmulNT(B, B); // outer product expansion runs out of memory
}

for (int kk = 0; kk < 400; kk += 100) {
    gfor (seq k, kk, kk+99) { // four batches of 100
        array B = A(span,k);
        C(span,span,k) = matmulNT(B, B); // now several smaller problems fit in card memory
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
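
When the iteration count is not a neat multiple of the batch size, the segment bounds need a little care. The following standard C++ sketch computes inclusive `[first, last]` ranges that could be passed to the `gfor(var, first, last)` form. The helper name `batch_ranges` is hypothetical, and choosing a batch size that fits your device is left to you.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: split iterations 0..total-1 into inclusive
// [first, last] ranges of at most `batch` iterations each.
std::vector<std::pair<int, int>> batch_ranges(int total, int batch) {
    std::vector<std::pair<int, int>> ranges;
    for (int first = 0; first < total; first += batch) {
        // Clamp the final range so it never runs past the last iteration.
        const int last = std::min(first + batch, total) - 1;
        ranges.emplace_back(first, last);
    }
    return ranges;
}
```

For 400 iterations in batches of 100, this yields the same four ranges as the loop above: 0 to 99, 100 to 199, 200 to 299 and 300 to 399.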