Interoperability with CUDA {#interop_cuda}
========

Although ArrayFire is quite extensive, there remain many cases in which you
may want to write custom kernels in CUDA or [OpenCL](\ref interop_opencl).
For example, you may wish to add ArrayFire to an existing code base to increase
your productivity, or you may need to supplement ArrayFire's functionality
with your own custom implementation of specific algorithms.

ArrayFire manages its own memory, runs within its own CUDA stream, and
creates custom IDs for devices. As such, most of the interoperability functions
focus on reducing potential synchronization conflicts between ArrayFire and CUDA.
# Basics

It is fairly straightforward to interface ArrayFire with your own custom CUDA
code. ArrayFire provides several functions to ease this process, including:

| Function              | Purpose                                                    |
|-----------------------|------------------------------------------------------------|
| af::array(...)        | Construct an ArrayFire Array from device memory            |
| af::array.device()    | Obtain a pointer to the device memory (implies `lock()`)   |
| af::array.lock()      | Removes ArrayFire's control of a device memory pointer     |
| af::array.unlock()    | Restores ArrayFire's control over a device memory pointer  |
| af::getDevice()       | Gets the current ArrayFire device ID                       |
| af::setDevice()       | Switches ArrayFire to the specified device                 |
| afcu::getNativeId()   | Converts an ArrayFire device ID to a CUDA device ID        |
| afcu::setNativeId()   | Switches ArrayFire to the specified CUDA device ID         |
| afcu::getStream()     | Gets the current CUDA stream used by ArrayFire             |

Below we provide two worked examples of how ArrayFire can be integrated
into new and existing projects.
# Adding custom CUDA kernels to an existing ArrayFire application

By default, ArrayFire manages its own memory and operates in its own CUDA
stream. Thus there is a slight amount of bookkeeping that needs to be done
in order to integrate your custom CUDA kernel.

If your kernels can share the ArrayFire CUDA stream, you should:
1. Include the `af/cuda.h` header in your source code
2. Use ArrayFire as normal
3. Ensure any JIT kernels have executed using `af::eval()`
4. Obtain device pointers from ArrayFire array objects using `array::device()`
5. Determine ArrayFire's CUDA stream
6. Set arguments and run your kernel in ArrayFire's stream
7. Return control of af::array memory to ArrayFire
8. Compile with `nvcc`, linking with the `afcuda` library.

Notice that since ArrayFire and your kernels are sharing the same CUDA
stream, there is no need to perform any synchronization operations, as
operations within a stream are executed in order.

This process is best illustrated with a fully worked example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// 1. Add includes
#include <arrayfire.h>
#include <af/cuda.h>

// Example kernel used in step 6: adds one to each element
__global__ void increment(float *values) {
    values[threadIdx.x] += 1.0f;
}

int main() {

    // 2. Use ArrayFire as normal
    size_t num = 10;
    af::array x = af::constant(0, num);

    // ... many ArrayFire operations here

    // 3. Ensure any JIT kernels have executed
    x.eval();
    af_print(x);

    // Run a custom CUDA kernel in the ArrayFire CUDA stream

    // 4. Obtain device pointers from ArrayFire array objects using
    // the array::device() function:
    float *d_x = x.device<float>();

    // 5. Determine ArrayFire's CUDA stream
    int af_id = af::getDevice();
    int cuda_id = afcu::getNativeId(af_id);
    cudaStream_t af_cuda_stream = afcu::getStream(cuda_id);

    // 6. Set arguments and run your kernel in ArrayFire's stream
    // Here we launch one block of 10 threads
    increment<<<1, num, 0, af_cuda_stream>>>(d_x);

    // 7. Return control of af::array memory to ArrayFire using
    // the array::unlock() function:
    x.unlock();

    // ... resume ArrayFire operations
    af_print(x);

    // Because the device pointer d_x was returned to ArrayFire's
    // control by the unlock function, there is no need to free it using
    // cudaFree()

    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your kernels need to operate in their own CUDA stream, the process is
essentially identical, except that you must instruct ArrayFire to complete
its computations using the af::sync() function prior to launching your
own kernel, and ensure your kernels have finished (using
`cudaDeviceSynchronize()` or similar commands) prior to returning control
of the memory to ArrayFire. A minimal sketch of this variant follows the
list below.

1. Include the `af/cuda.h` header in your source code
2. Use ArrayFire as normal
3. Ensure any JIT kernels have executed using `af::eval()`
4. Instruct ArrayFire to finish operations using af::sync()
5. Obtain device pointers from ArrayFire array objects using `array::device()`
6. Determine ArrayFire's CUDA stream
7. Set arguments and run your kernel in your custom stream
8. Ensure CUDA operations have finished using `cudaDeviceSynchronize()`
   or similar commands
9. Return control of af::array memory to ArrayFire
10. Compile with `nvcc`, linking with the `afcuda` library.
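
The sketch below is not a complete program; `my_kernel` and `custom_stream`
are illustrative placeholders (your own `__global__` function and a stream
created with `cudaStreamCreate()`), and it only highlights the extra
synchronization steps:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
#include <arrayfire.h>
#include <af/cuda.h>

// Placeholder kernel: adds one to each element
__global__ void my_kernel(float *values) {
    values[threadIdx.x] += 1.0f;
}

void run_in_custom_stream(af::array &x, cudaStream_t custom_stream) {
    // 3. Ensure any pending JIT kernels have executed
    x.eval();

    // 4. Wait for ArrayFire's stream to finish, since our custom stream
    //    is not synchronized with it
    af::sync();

    // 5. Obtain the device pointer (this implies lock())
    float *d_x = x.device<float>();

    // 6. & 7. Launch the kernel in the custom stream
    //         (assumes x is small enough to fit in a single block)
    my_kernel<<<1, (unsigned)x.elements(), 0, custom_stream>>>(d_x);

    // 8. Ensure our kernel has finished before ArrayFire touches the data
    cudaStreamSynchronize(custom_stream); // or cudaDeviceSynchronize()

    // 9. Return control of the memory to ArrayFire
    x.unlock();
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~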

# Adding ArrayFire to an existing CUDA application

Adding ArrayFire to an existing CUDA application is slightly more involved
and can be somewhat tricky due to several optimizations we implement. The
most important are as follows:

* ArrayFire assumes control of all memory provided to it.
* ArrayFire does not (in general) support in-place memory transactions.

We will discuss the implications of these items below. To add ArrayFire
to existing code you need to:
1. Include `arrayfire.h` and `af/cuda.h` in your source file
2. Finish any pending CUDA operations
   (e.g. use cudaDeviceSynchronize() or similar stream functions)
3. Create ArrayFire arrays from existing CUDA pointers
4. Perform operations on ArrayFire arrays
5. Instruct ArrayFire to finish operations using af::eval() and af::sync()
6. Obtain pointers to important memory
7. Continue your CUDA application
8. Free non-managed memory
9. Compile and link with the appropriate paths and the `-lafcuda` flag.

To create the af::array objects, you should use one of the following
constructors with `src=afDevice`:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// 1D - 4D af::array constructors
af::array (dim_t dim0, const T *pointer, af::source src=afHost)
af::array (dim_t dim0, dim_t dim1, const T *pointer, af::source src=afHost)
af::array (dim_t dim0, dim_t dim1, dim_t dim2, const T *pointer, af::source src=afHost)
af::array (dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3, const T *pointer, af::source src=afHost)

// af::array constructor using a dim4 object
af::array (const dim4 &dims, const T *pointer, af::source src=afHost)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*NOTE*: With all of these constructors, ArrayFire's memory manager automatically
assumes responsibility for any memory provided to it. Thus ArrayFire could free
or reuse the memory at any later time. If this behavior is not desired, you
may call `array::lock()` and manage the memory yourself. However, if you do
so, please be cautious not to free memory when ArrayFire might be using it!
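
For illustration, a minimal sketch of this pattern (the buffer size and
variable names here are illustrative, not taken from the documentation):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
#include <arrayfire.h>
#include <cuda_runtime.h>

int main() {
    // A device buffer we want to keep managing ourselves
    float *d_ptr = nullptr;
    cudaMalloc((void **)&d_ptr, 10 * sizeof(float));

    // Hand the buffer to ArrayFire ...
    af::array A(10, d_ptr, afDevice);

    // ... but keep control of it so the memory manager will not free
    // or reuse it behind our back
    A.lock();

    // ... use A in ArrayFire operations here ...

    // Once we are sure ArrayFire has finished with the buffer, we are
    // responsible for freeing it ourselves
    af::sync();
    cudaFree(d_ptr);

    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~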

These steps are best illustrated using a fully worked example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
// 1. Add includes
#include <arrayfire.h>
#include <af/cuda.h>

int main() {

    // Create and populate CUDA memory objects
    const int elements = 100;
    size_t size = elements * sizeof(float);
    float *cuda_A;
    cudaMalloc((void**) &cuda_A, size);

    // ... perform many CUDA operations here

    // 2. Finish any pending CUDA operations
    cudaDeviceSynchronize();

    // 3. Create ArrayFire arrays from existing CUDA pointers.
    // Be sure to specify that the memory type is afDevice.
    af::array d_A(elements, cuda_A, afDevice);

    // NOTE: ArrayFire now manages cuda_A

    // 4. Perform operations on the ArrayFire arrays.
    d_A = d_A * 2;

    // NOTE: ArrayFire does not perform the above operation in place,
    // so the device pointer underlying d_A has likely changed.

    // 5. Instruct ArrayFire to finish pending operations using eval and sync.
    af::eval(d_A);
    af::sync();

    // 6. Get pointers to important memory objects.
    // Once device() is called, ArrayFire will no longer manage the memory.
    float *outputValue = d_A.device<float>();

    // 7. Continue the CUDA application as normal

    // 8. Free non-managed memory.
    // Because device() removed outputValue from ArrayFire's control,
    // we need to free it ourselves.
    cudaFree(outputValue);

    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Using multiple devices

If you are using multiple devices with ArrayFire and CUDA kernels, there is
one "gotcha" of which you should be aware. ArrayFire maintains its own internal
ordering of compute devices, so a CUDA device ID may not be the same as an
ArrayFire device ID. When switching between devices it is therefore important
that you use our interoperability functions to get and set the correct device
IDs. Below is a quick listing of the functions needed to switch between
devices, along with the type of device identifier each one expects; a short
sketch of switching devices follows the table.

| Function            | ID Type     | Purpose                                             |
|---------------------|-------------|-----------------------------------------------------|
| cudaGetDevice()     | CUDA        | Gets the current CUDA device ID                     |
| cudaSetDevice()     | CUDA        | Sets the current CUDA device                        |
| af::getDevice()     | AF          | Gets the current ArrayFire device ID                |
| af::setDevice()     | AF          | Sets the current ArrayFire device                   |
| afcu::getNativeId() | AF -> CUDA  | Converts an ArrayFire device ID to a CUDA device ID |
| afcu::setNativeId() | CUDA -> AF  | Sets the current ArrayFire device from a CUDA ID    |
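
As an illustration, here is a short sketch of keeping the two ID spaces in
agreement (the device index 1 is only an example and assumes a second device
is present):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
#include <arrayfire.h>
#include <af/cuda.h>
#include <cuda_runtime.h>

int main() {
    // Make ArrayFire use its device 1
    int af_id = 1;
    af::setDevice(af_id);

    // The matching CUDA ID may differ from the ArrayFire ID, so convert
    // it before issuing raw CUDA runtime calls on the same device
    int cuda_id = afcu::getNativeId(af_id);
    cudaSetDevice(cuda_id);

    // Going the other way: point ArrayFire at whichever device is
    // currently active in the CUDA runtime
    int current_cuda_id = 0;
    cudaGetDevice(&current_cuda_id);
    afcu::setNativeId(current_cuda_id);

    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~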