Introduction to Vectorization {#vectorization}
===============================================

Programmers and data scientists want to take advantage of fast and parallel
computational devices. Writing vectorized code is necessary to get
the best performance out of the current generation of parallel hardware and
scientific computing software. However, writing vectorized code may not be
immediately intuitive. ArrayFire provides many ways to vectorize a given code
segment. In this tutorial, we present several methods to vectorize code
using ArrayFire and discuss the benefits and drawbacks associated with each method.

# Generic/Default vectorization

By its very nature, ArrayFire is a vectorized library. Most functions operate on
arrays as a whole -- on all elements in parallel. Wherever possible, existing
vectorized functions should be used instead of manually indexing into arrays.
For example, consider the following code:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array a = af::range(10); // [0, 9]
for(int i = 0; i < a.dims(0); ++i)
    a(i) = a(i) + 1;         // [1, 10]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although completely valid, the code is very inefficient as it results in
ten kernel launches, each of which operates on a single datum.
Instead, the developer should have used ArrayFire's overload of the + operator:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array a = af::range(10); // [0, 9]
a = a + 1;                   // [1, 10]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This code will result in a single kernel that operates on all 10 elements of the
array in parallel.

Most ArrayFire functions are vectorized. A small subset of these functions
includes (a short combined example follows the table):

Operator Category                                             | Functions
--------------------------------------------------------------|--------------------------
[Arithmetic operations](\ref arith_mat)                       | [+](\ref arith_func_add), [-](\ref arith_func_sub), [*](\ref arith_func_mul), [/](\ref arith_func_div), [%](\ref arith_func_mod), [>>](\ref arith_func_shiftr), [<<](\ref arith_func_shiftl)
[Logical operations](\ref logic_mat)                          | [&&](\ref arith_func_and), [\|\|](\ref arith_func_or), [<](\ref arith_func_lt), [>](\ref arith_func_gt), [==](\ref arith_func_eq), [!=](\ref arith_func_neq) etc.
[Numeric functions](\ref numeric_mat)                         | abs(), floor(), round(), min(), max(), etc.
[Complex operations](\ref complex_mat)                        | real(), imag(), conj(), etc.
[Exponential and logarithmic functions](\ref explog_mat)      | exp(), log(), expm1(), log1p(), etc.
[Trigonometric functions](\ref trig_mat)                      | sin(), cos(), tan(), etc.
[Hyperbolic functions](\ref hyper_mat)                        | sinh(), cosh(), tanh(), etc.

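As a quick sketch (the array size below is arbitrary and chosen only for
illustration), several of the element-wise functions listed above can be chained
into a single expression without any explicit loop:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Each call below operates on every element of the array in parallel.
af::array x = af::randu(1000);           // 1000 random values in [0, 1]
af::array y = af::sin(x) + af::cos(x);   // trigonometric functions, element-wise
af::array z = af::abs(y) * af::exp(-x);  // numeric and exponential functions, element-wise
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
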
In addition to element-wise operations, many other functions are also
vectorized in ArrayFire.

Notice that even functions that perform some form of aggregation (e.g. `sum()` or `min()`),
signal processing (like `convolve()`), and even image processing functions
(i.e. `rotate()`) all support vectorization on different columns or images.
For example, if we have `NUM` images of size `WIDTH` by `HEIGHT`, one could
convolve each image in a vectorized fashion as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
float g_coef[] = { 1, 2, 1,
                   2, 4, 2,
                   1, 2, 1 };

af::array filter = 1.f/16 * af::array(3, 3, g_coef);

af::array signal = randu(WIDTH, HEIGHT, NUM);
af::array conv   = convolve2(signal, filter);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similarly, one can rotate 100 images by 45 degrees in a single call using
code like the following:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Construct an array of 100 WIDTH x HEIGHT images of random numbers
af::array imgs = randu(WIDTH, HEIGHT, 100);

// Rotate all of the images in a single command
af::array rot_imgs = rotate(imgs, 45);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although *most* functions in ArrayFire do support vectorization, some do not.
Most notable are the linear algebra functions. Even though they are not vectorized,
linear algebra operations still execute in parallel on your hardware.

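For instance, a single matrix multiply runs in parallel on the device even though
it is not batched over extra dimensions; to apply it across a stack of matrices,
one can loop over the slices. This is a minimal sketch with arbitrary sizes:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// One large matrix multiply: a single call, executed in parallel on the device.
af::array A = af::randu(512, 512);
af::array B = af::randu(512, 512);
af::array C = af::matmul(A, B);

// Applying matmul to each slice of a batch with an explicit loop over slices.
af::array As = af::randu(64, 64, 10);
af::array Bs = af::randu(64, 64, 10);
af::array Cs = af::constant(0, 64, 64, 10);
for (int i = 0; i < 10; ++i)
    Cs(af::span, af::span, i) =
        af::matmul(As(af::span, af::span, i), Bs(af::span, af::span, i));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
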
Using the built-in vectorized operations should be the first
and preferred method of vectorizing any code written with ArrayFire.

# GFOR: Parallel for-loops

Another novel method of vectorization present in ArrayFire is the GFOR loop
replacement construct. GFOR allows launching all iterations of a loop in parallel
on the GPU or device, as long as the iterations are independent. While the
standard for-loop performs each iteration sequentially, ArrayFire's gfor-loop
performs each iteration at the same time (in parallel). ArrayFire does this by
tiling out the values of all loop iterations and then performing computation on
those tiles in one pass. You can think of gfor as performing auto-vectorization
of your code, e.g. you write a gfor-loop that increments every element of a vector,
but behind the scenes ArrayFire rewrites it to operate on the entire vector in
a single pass.

The original for-loop example at the beginning of this document could be
rewritten using GFOR as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array a = af::range(10);   // [0, 9]
gfor(seq i, a.dims(0))
    a(i) = a(i) + 1;           // [1, 10]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, each iteration of the gfor-loop is independent, thus ArrayFire
will automatically tile out the `a` array in device memory and execute the
increment kernels in parallel.

To see another example, you could run an accum() on every slice of a matrix in a
for-loop, or you could "vectorize" and simply do it all in one gfor-loop operation:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// runs each accum() in sequence
for (int i = 0; i < N; ++i)
    B(span,i) = accum(A(span,i));

// runs N accums in parallel
gfor (seq i, N)
    B(span,i) = accum(A(span,i));
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, returning to our previous vectorization technique, accum() is already
vectorized and the operation could be completely replaced with merely:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
B = accum(A);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is best to vectorize computation as much as possible to avoid the overhead in
both for-loops and gfor-loops. However, the gfor-loop construct is most effective
in the narrow case of broadcast-style operations. Consider the case when we have
a vector of constants that we wish to apply to a collection of variables, such as
expressing the values of a linear combination for multiple vectors. The broadcast
of one set of constants to many vectors works well with gfor-loops:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
const static int p=4, n=1000;
af::array consts = af::randu(p);
af::array var_terms = randn(p, n);

af::array combination = constant(0, p, n);
gfor (seq i, n)
    combination(span, i) = consts * var_terms(span, i);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

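The same broadcast can also be written without a loop at all. One possibility
(a sketch, not the only option) is to explicitly tile the constants so the
element-wise multiply lines up:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Replicate the p-element constant vector across n columns, then multiply
// element-wise; this produces the same result as the gfor-loop above.
af::array combination2 = af::tile(consts, 1, n) * var_terms;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
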
Using GFOR requires following several rules and multiple guidelines for optimal
performance. The details of this vectorization method can be found in the
[GFOR documentation](\ref gfor).

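The most important rule is the one already mentioned above: iterations must be
independent of one another. As a rough sketch (sizes are arbitrary), a gfor over
the slices of a batch satisfies that rule because each slice is computed without
reading any other slice:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Each iteration touches only its own slice, so all iterations may run at once.
af::array imgs   = af::randu(128, 128, 16);
af::array scaled = af::constant(0, 128, 128, 16);
gfor (af::seq i, 16)
    scaled(af::span, af::span, i) = 2.f * imgs(af::span, af::span, i);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
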
# batchFunc()

The batchFunc() function allows the broad application of existing ArrayFire
functions to multiple sets of data. Effectively, batchFunc() allows ArrayFire
functions to execute in "batch processing" mode. In this mode, functions will
find a dimension which contains "batches" of data to be processed and will
parallelize the procedure.

Consider the following example. Here we create a filter which we would like
to apply to each of the weight vectors. The naive solution would be using a
for-loop as we have seen previously:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Create the filter and the weight vectors
af::array filter = randn(1, 5);
af::array weights = randu(5, 5);

// Apply the filter using a for-loop
af::array filtered_weights = constant(0, 5, 5);
for(int i=0; i<weights.dims(1); ++i){
    filtered_weights.col(i) = filter * weights.col(i);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, as we have discussed above, this solution will be very inefficient.
One may be tempted to implement a vectorized solution as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Create the filter and the weight vectors
af::array filter = randn(1, 5);
af::array weights = randu(5, 5);

af::array filtered_weights = filter * weights; // fails due to dimension mismatch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, the dimensions of `filter` and `weights` do not match, thus ArrayFire
will generate a runtime error.

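If you want the mismatch to surface as a catchable error rather than terminate
the program, the failing operation can be wrapped in a try/catch block. This is a
small sketch that assumes the `filter` and `weights` arrays defined above:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
try {
    // 1x5 times 5x5: the element-wise multiply cannot line up the dimensions
    af::array filtered_weights = filter * weights;
} catch (const af::exception &e) {
    // af::exception carries the ArrayFire error message describing the mismatch
    std::cerr << e.what() << std::endl;   // assumes <iostream> is included
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
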
batchFunc() was created to solve this specific problem.
The signature of the function is as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
array batchFunc(const array &lhs, const array &rhs, batchFunc_t func);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

where `batchFunc_t` is a function pointer of the form:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
typedef array (*batchFunc_t) (const array &lhs, const array &rhs);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So, to use batchFunc(), we need to provide the function we wish to apply as a
batch operation. For illustration's sake, let's "implement" a multiplication
function following this format:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
af::array my_mult (const af::array &lhs, const af::array &rhs){
    return lhs * rhs;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Our final batch call is not much more difficult than the ideal vectorized
version shown above:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Create the filter and the weight vectors
af::array filter = randn(1, 5);
af::array weights = randu(5, 5);

// Apply the batch function
af::array filtered_weights = batchFunc( filter, weights, my_mult );
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The batch function will work with many of the previously mentioned vectorized ArrayFire
functions. It can even work with a combination of those functions if they are
wrapped inside a helper function matching the `batchFunc_t` signature.
One limitation of batchFunc() is that it cannot be used from within a
`gfor()` loop at the present time.

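As a sketch of that idea, the helper below (a hypothetical name, not part of the
ArrayFire API) chains several vectorized functions and is then handed to
batchFunc() just like `my_mult` was:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Combine several element-wise functions inside one batchFunc_t-compatible helper.
af::array scaled_exp_diff(const af::array &lhs, const af::array &rhs){
    return af::exp(lhs - rhs) * 0.5f;   // subtraction, exponential, and scaling
}

// Broadcast the 1x5 filter against the 5x5 weights using the combined helper.
af::array result = batchFunc(filter, weights, scaled_exp_diff);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
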
# Advanced Vectorization

We have seen the different methods ArrayFire provides to vectorize our code. Tying
them all together is a slightly more involved process that needs to consider data
dimensionality and layout, memory usage, nesting order, etc. An excellent example
and discussion of these factors can be found on our blog:

http://arrayfire.com/how-to-write-vectorized-code/

It's worth noting that the content discussed in the blog has since been transformed
into a convenient af::nearestNeighbour() function. Before writing something from
scratch, check that ArrayFire doesn't already have an implementation. The default
vectorized nature of ArrayFire and an extensive collection of functions will
speed things up in addition to replacing dozens of lines of code!

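As a closing sketch, a typical nearest-neighbour call looks roughly like the
following; the sizes are arbitrary, the default distance metric is assumed, and
the exact overloads should be checked against the af::nearestNeighbour reference
for your version:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// Find, for every query vector, the index of (and distance to) its closest
// training vector; features live along dimension 0 in this sketch.
af::array train = af::randu(3, 1000);   // 1000 training vectors of length 3
af::array query = af::randu(3, 100);    // 100 query vectors of length 3

af::array idx, dist;
af::nearestNeighbour(idx, dist, query, train, 0 /* feature dim */, 1 /* matches */);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~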