While testing the new FluxG GPU service we tried matlab/2013a and its support for GPUs in the Parallel Computing Toolbox. Below are some examples showing how I sped up code by using a GPU.
In my examples I make a small GPU call first to wake up the GPU. The first time you use a GPU the startup time is long, about 12 seconds; GPU operations after that no longer pay this startup cost. I also use the hwloc-bind command to control MATLAB's built-in threading.
The easiest way to use a GPU with MATLAB is to use gpuArray() to move data to the GPU and then call a GPU-enabled MATLAB function on that data.
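To see the wake-up cost on your own system, a minimal sketch is to time the same tiny GPU call twice. The variable names and the `exist` guard are my own additions for illustration; the guard only lets the script run to completion on a machine without the toolbox or without a GPU.

```matlab
% Sketch: compare the first (cold) GPU call with a second (warm) one.
% Assumption: hasGpu, tFirst, tSecond are illustrative names, not from the post.
hasGpu = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

if hasGpu
    tic; II = gpuArray.eye(10, 'int32'); tFirst = toc;   % includes GPU startup
    tic; JJ = gpuArray.eye(10, 'int32'); tSecond = toc;  % GPU is already awake
    fprintf('first call: %.2fs, second call: %.4fs\n', tFirst, tSecond);
else
    disp('No usable GPU found, nothing to time.');
end
```

Run this in a fresh MATLAB session; once the GPU has been initialized in a session, both calls will be fast.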
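Before timing anything, it can be worth checking that the GPU result matches the CPU result. The following is a small sketch of that round trip under the assumption that `fft` is the function of interest; the array size and variable names are arbitrary choices of mine.

```matlab
% Sketch: round-trip a small vector through the GPU and compare with the CPU.
% Assumption: names and sizes here are illustrative only.
a = rand(1, 1024);            % host (CPU) data
cpuOut = fft(a);              % reference result computed on the CPU

hasGpu = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;
if hasGpu
    Ga = gpuArray(a);         % copy host data to GPU memory
    Gout = fft(Ga);           % fft accepts gpuArray input and runs on the GPU
    gpuOut = gather(Gout);    % copy the result back to host memory
    % the two results should agree to within floating-point round-off
    fprintf('max abs difference: %g\n', max(abs(gpuOut - cpuOut)));
else
    disp('No usable GPU found, only the CPU result was computed.');
end
```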
%Brock Palen brockp@umich.edu 1/2014
%To compare against a single core run:
% hwloc-bind core:0 matlab -r caenfft\(1e7\)
function caenfft(dim)
if ischar(dim)
    dim=str2num(dim);
end
%wakeup the gpu takes ~12 seconds
II = gpuArray.eye(10,'int32');
i=1:dim;
a=2.0*pi*i/10;
disp('CPU fft call')
tic; fft(a); toc
disp(' ')
disp('Time to move array to GPU and run the GPU fft copy results back to host')
tic;
Ga=gpuArray(a);
Gout = fft(Ga);
out = gather(Gout);
toc
1 CPU: 0.59s 1 GPU: 0.12s Speedup: 4.9x
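The GPU time above lumps together the copy to the GPU, the FFT itself, and the copy back. A rough way to see how the time splits is to time each step separately. This is a sketch, not part of the original benchmark: it assumes your MATLAB release provides `wait` on the device object returned by `gpuDevice` (used here so each timer stops only after the GPU has finished), and the variable names are mine.

```matlab
% Sketch: time transfer, compute, and copy-back separately.
% Assumptions: wait(dev) is available in your release; names are illustrative.
hasGpu = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

if hasGpu
    dev = gpuDevice;                    % handle to the currently selected GPU
    II = gpuArray.eye(10, 'int32');     % wake up the GPU before timing
    a = 2.0*pi*(1:1e7)/10;              % same input as caenfft(1e7)

    tic; Ga = gpuArray(a);   wait(dev); tUp = toc;    % host -> GPU copy
    tic; Gout = fft(Ga);     wait(dev); tFft = toc;   % FFT on the GPU
    tic; out = gather(Gout);            tDown = toc;  % GPU -> host copy

    fprintf('copy in: %.3fs  fft: %.3fs  copy out: %.3fs\n', tUp, tFft, tDown);
else
    disp('No usable GPU found, nothing to time.');
end
```

If the two copies dominate, that is a hint that more of the work should stay on the GPU between `gpuArray()` and `gather()`.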
The next step is to do more of your computation on the GPU, including data generation. The more computation that can stay on the GPU between moving data with gpuArray() and gather(), the greater your performance benefit will be. This example takes an FFT of a 2D function.
%Brock Palen brockp@umich.edu 1/2014
%To compare against a single core run:
% hwloc-bind core:0 matlab -r gen\(1e4\) #first core
% hwloc-bind socket:0 matlab -r gen\(1e4\) #first socket all cores
function gen(dim)
if ischar(dim)
    dim=str2num(dim);
end
%preallocate
a = zeros(dim);
%wakeup the gpu takes ~12 seconds
II = gpuArray.eye(10,'int32');
%slow way, omitted
%disp('Time to use loop for allocation')
%tic;
%for x = 1:dim,
%    for y = 1:dim,
%        a(x, y) = sin(6 * pi * (x^2 + y^2));
%    end
%end
%toc
disp('Vector Form')
tic;
x = 1:dim;
y = 1:dim;
[X, Y] = meshgrid(x,y);
a = sin(6 * pi * (X.^2 + Y.^2));
fftn(a);
toc
disp('GPU Form')
tic;
Gx = gpuArray(1:dim);
Gy = gpuArray(1:dim);
[GX, GY] = meshgrid(Gx, Gy);
Ga = sin(6 * pi * (GX.^2 + GY.^2));
clear GX GY Gx Gy;
fftn(Ga);
toc
1 CPU: 25.66s 1 GPU: 1.38s Speedup: 18.59x
We used the vector form of the MATLAB operations, which should be optimal. Vector form has another benefit besides being faster than nested loops (uncomment the loop code in the second example if you are curious): MATLAB can also use multiple cores for vectorized functions when the data are large enough. How does this compare to a single GPU?
1 CPU: 25.66s
1 GPU: 1.38s Speedup: 18.59x
8 CPU: 4.61s Speedup: 5.56x
16 CPU: 2.65s Speedup: 9.68x
Based on the Flux rates as of 1/2014, $60/GPU-Month and $6.60/CPU-Month, codes need a speedup greater than 9x from a single GPU to make up for the cost. Even this is not exactly correct, as each Flux GPU comes with 2 host CPU cores.
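The break-even figure follows from the ratio of the two rates, and the same arithmetic can be applied to the timings in this post. The sketch below uses only numbers quoted here; treating the cost of a run as rate times cores times wall time is my simplification, and it ignores the 2 host cores that come with each GPU.

```matlab
% Sketch: cost-adjusted comparison using the rates and timings from this post.
% Assumption: cost of a run is taken as (monthly rate) x (units) x (wall time).
gpuRate = 60.00;                       % $/GPU-month (Flux, 1/2014)
cpuRate = 6.60;                        % $/CPU-month (Flux, 1/2014)
breakEven = gpuRate / cpuRate;         % speedup one GPU needs to match one core

tCpu1 = 25.66; tGpu = 1.38;            % seconds, second example
tCpu8 = 4.61;  tCpu16 = 2.65;

% cost of each run relative to the single-core run (lower is cheaper)
base      = cpuRate * 1 * tCpu1;
relGpu    = (gpuRate * 1 * tGpu)    / base;
relCpu8   = (cpuRate * 8 * tCpu8)   / base;
relCpu16  = (cpuRate * 16 * tCpu16) / base;

fprintf('break-even speedup: %.2fx\n', breakEven);
fprintf('relative cost  1 GPU: %.2f  8 CPU: %.2f  16 CPU: %.2f\n', ...
        relGpu, relCpu8, relCpu16);
```

Under this simplification, a relative cost below 1 means the run is cheaper than the single-core run, which for these timings is the case only for the GPU.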