Introduction to Vectorization {#vectorization}
===================

Programmers and data scientists want to take advantage of fast and parallel
computational devices. Writing vectorized code is necessary to get
the best performance out of current-generation parallel hardware and
scientific computing software. However, writing vectorized code is not always
intuitive. ArrayFire provides many ways to vectorize a given code
segment. In this tutorial, we present several methods to vectorize code
using ArrayFire and discuss the benefits and drawbacks associated with each method.

# Generic/Default vectorization

By its very nature, ArrayFire is a vectorized library. Most functions operate on
arrays as a whole -- on all elements in parallel. Wherever possible, existing
vectorized functions should be used instead of manually indexing into arrays.
For example, consider the following code:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array a = af::range(10); // [0, 9]
for (int i = 0; i < a.dims(0); ++i)
{
    a(i) = a(i) + 1; // [1, 10]
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although completely valid, this code is very inefficient, as it results in
ten separate kernels, each operating on a single datum.
Instead, the developer should use ArrayFire's overload of the + operator:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array a = af::range(10); // [0, 9]
a = a + 1; // [1, 10]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This code will result in a single kernel that operates on all 10 elements
of `a` in parallel.

Most ArrayFire functions are vectorized. A small subset of these include:

Operator Category | Functions
------------------------------------------------------------|--------------------------
[Arithmetic operations](\ref arith_mat) | [+](\ref arith_func_add), [-](\ref arith_func_sub), [*](\ref arith_func_mul), [/](\ref arith_func_div), [%](\ref arith_func_mod), [>>](\ref arith_func_shiftr), [<<](\ref arith_func_shiftl)
[Logical operations](\ref logic_mat) | [&&](\ref arith_func_and), [\|\|](\ref arith_func_or), [<](\ref arith_func_lt), [>](\ref arith_func_gt), [==](\ref arith_func_eq), [!=](\ref arith_func_neq), etc.
[Numeric functions](\ref numeric_mat) | abs(), floor(), round(), min(), max(), etc.
[Complex operations](\ref complex_mat) | real(), imag(), conj(), etc.
[Exponential and logarithmic functions](\ref explog_mat) | exp(), log(), expm1(), log1p(), etc.
[Trigonometric functions](\ref trig_mat) | sin(), cos(), tan(), etc.
[Hyperbolic functions](\ref hyper_mat) | sinh(), cosh(), tanh(), etc.

In addition to element-wise operations, many other functions are also
vectorized in ArrayFire.

Notice that even functions that perform some form of aggregation (e.g. `sum()`
or `min()`), signal processing (e.g. `convolve()`), and image processing
functions (e.g. `rotate()`) all support vectorization over different columns or
images. For example, if we have `NUM` images of size `WIDTH` by `HEIGHT`, one
could convolve each image in a vectorized fashion as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
float g_coef[] = { 1, 2, 1,
                   2, 4, 2,
                   1, 2, 1 };

af::array filter = 1.f/16 * af::array(3, 3, g_coef);

af::array signal = af::randu(WIDTH, HEIGHT, NUM);
af::array conv = af::convolve2(signal, filter);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similarly, one can rotate 100 images by 45 degrees in a single call using
code like the following:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Construct an array of 100 WIDTH x HEIGHT images of random numbers
af::array imgs = af::randu(WIDTH, HEIGHT, 100);
// Rotate all of the images in a single command
af::array rot_imgs = af::rotate(imgs, 45);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although *most* functions in ArrayFire support vectorization, some do not;
most notably, the linear algebra functions. Even though they are not vectorized,
linear algebra operations still execute in parallel on your hardware.

Using the built-in vectorized operations should be the first
and preferred method of vectorizing any code written with ArrayFire.

# GFOR: Parallel for-loops

Another novel method of vectorization present in ArrayFire is the GFOR loop
replacement construct. GFOR allows launching all iterations of a loop in parallel
on the GPU or device, as long as the iterations are independent. Whereas the
standard for-loop performs each iteration sequentially, ArrayFire's gfor-loop
performs all iterations at the same time (in parallel). ArrayFire does this by
tiling out the values of all loop iterations and then performing computation on
those tiles in one pass. You can think of gfor as performing auto-vectorization
of your code: you write a gfor-loop that increments every element of a vector,
and behind the scenes ArrayFire rewrites it to operate on the entire vector in
parallel.

The original for-loop example at the beginning of this document could be
rewritten using GFOR as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array a = af::range(10); // [0, 9]
gfor(seq i, 10)
    a(i) = a(i) + 1; // [1, 10]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, each iteration of the gfor-loop is independent, so ArrayFire
will automatically tile out the `a` array in device memory and execute the
increment kernels in parallel.

To see another example, you could run an accum() on every slice of a matrix in a
for-loop, or you could "vectorize" it and simply do it all in one gfor-loop operation:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// runs each accum() in sequence
for (int i = 0; i < N; ++i)
    B(span, i) = accum(A(span, i));

// runs N accums in parallel
gfor (seq i, N)
    B(span, i) = accum(A(span, i));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, returning to our previous vectorization technique, accum() is already
vectorized, and the whole operation can be replaced with simply:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
B = accum(A);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is best to vectorize computation as much as possible to avoid the overhead of
both for-loops and gfor-loops. However, the gfor-loop construct is most effective
in the narrow case of broadcast-style operations. Consider the case where we have
a vector of constants that we wish to apply to a collection of variables, such as
evaluating a linear combination for multiple vectors. Broadcasting one set of
constants to many vectors works well with gfor-loops:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
const static int p = 4, n = 1000;
af::array consts = af::randu(p);
af::array var_terms = af::randn(p, n);
af::array combination(p, n);

gfor(seq i, n)
    combination(span, i) = consts * var_terms(span, i);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using GFOR requires following several rules and guidelines for optimal
performance. The details of this vectorization method can be found in the
[GFOR documentation](\ref gfor).

# Batching

The batchFunc() function allows the broad application of existing ArrayFire
functions to multiple sets of data. Effectively, batchFunc() allows ArrayFire
functions to execute in "batch processing" mode. In this mode, a function will
find the dimension that contains "batches" of data to be processed and will
parallelize the procedure.

Consider the following example. Here we create a filter that we would like
to apply to each of the weight vectors. The naive solution would be to use a
for-loop, as we have seen previously:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Create the filter and the weight vectors
af::array filter = af::randn(5, 1);
af::array weights = af::randu(5, 5);

// Apply the filter using a for-loop
af::array filtered_weights = af::constant(0, 5, 5);
for (int i = 0; i < weights.dims(1); ++i) {
    filtered_weights.col(i) = filter * weights.col(i);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, as we have discussed above, this solution will be very inefficient.
One may be tempted to implement a vectorized solution as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Create the filter and the weight vectors
af::array filter = af::randn(5, 1);
af::array weights = af::randu(5, 5);

af::array filtered_weights = filter * weights; // fails due to dimension mismatch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, the dimensions of `filter` and `weights` do not match, so ArrayFire
will generate a runtime error.

`batchFunc()` was created to solve this specific problem.
The signature of the function is as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array batchFunc(const array &lhs, const array &rhs, batchFunc_t func);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

where `batchFunc_t` is a function pointer of the form:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
typedef array (*batchFunc_t)(const array &lhs, const array &rhs);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So, to use batchFunc(), we need to provide the function we wish to apply as a
batch operation. For illustration's sake, let's implement a multiplication
function following this format:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array my_mult(const af::array &lhs, const af::array &rhs) {
    return lhs * rhs;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Our final batch call is not much more difficult than the ideal
syntax we imagined:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Create the filter and the weight vectors
af::array filter = af::randn(5, 1);
af::array weights = af::randu(5, 5);

// Apply the batch function
af::array filtered_weights = batchFunc(filter, weights, my_mult);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The batch function works with many of the previously mentioned vectorized
ArrayFire functions. It can even work with a combination of those functions if
they are wrapped inside a helper function matching the `batchFunc_t` signature.
One present limitation of `batchFunc()` is that it cannot be used from within a
`gfor()` loop.

# Advanced Vectorization

We have seen the different methods ArrayFire provides to vectorize code. Tying
them all together is a slightly more involved process that must consider data
dimensionality and layout, memory usage, nesting order, etc. An excellent example
and discussion of these factors can be found on our blog:

http://arrayfire.com/how-to-write-vectorized-code/

It's worth noting that the content discussed in the blog has since been turned
into a convenient af::nearestNeighbour() function. Before writing something from
scratch, check that ArrayFire doesn't already have an implementation. The default
vectorized nature of ArrayFire and its extensive collection of functions will
speed things up in addition to replacing dozens of lines of code!
247