Best Practices for MATLAB GPU Coding

The following notes were gathered from our experiences converting existing MATLAB code to run on the GPU.

Profiling Code

The first step in determining whether code is a good candidate for GPU processing is to identify which functions consume the majority of the run time. Use the MATLAB profiler to help determine the characteristics of the code.
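For example, a minimal profiling session (myAnalysis stands in for your own entry point):
profile on
myAnalysis(data);     % run a representative workload
profile viewer        % inspect time per function and per line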

Good indicators include:

  • Custom functions that consume the majority of time
  • Custom lines of code that consume the majority of time
  • Use of a parfor loop
  • Custom MEX functions that consume the majority of time

Bad indicators include:

  • Built-in functions that consume the majority of time and do not have a corresponding GPUArray function (e.g. svd)
  • High function call rate that is not isolated to an individual function
  • Short execution time

The next step is to use the MATLAB debugger to insert breakpoints in the functions/lines of code that use the most processing time, to get a sense of what types/sizes of data are being processed.
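For example, a quick way to inspect the data at a hot spot (the function name and line number are placeholders):
dbstop in costlyFunction at 15    % break at the expensive line
costlyFunction(data);             % at the K>> prompt, check size(x), class(x), whos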

Good indicators include:

  • Large input/output data (several thousand rows/columns/etc…)
  • Simple data types

Bad indicators include:

  • Small data volume
  • Complicated data types (e.g. large, multi-level structs)

Make sure to test a variety of types/sizes of inputs to understand the range of possible bottlenecks.

Strategies for Using GPU Capability

  • Convert directly to gpuArray built-in functions
    The easiest conversion is to replace an existing built-in function with its GPUArray counterpart. This is easy to test: take an existing input array, turn it into a gpuArray, and then run the code to see if the desired function works (i.e., is overloaded) for the gpuArray datatype, as in the sketch below. If it blows up, you can step through the code to find where the problem lies; it might be that other data structures need to be updated to accommodate a gpuArray, or there might be places where you need to convert back to a non-gpuArray to perform a built-in operation that does not support the gpuArray type. In some cases this will provide an immediate speed-up; in other cases it will make the code run slower. Slower code is not a bad sign (at least it works!), and restructuring the code may then lead to better performance. The MATLAB debugger with breakpoints is an excellent tool for accomplishing this task.
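    A minimal version of that test (myFilter is a placeholder for your own function):
    x = rand(2048);        % a representative input
    gx = gpuArray(x);      % move it to the device
    y = myFilter(gx);      % does this code path accept a gpuArray?
    y = gather(y);         % bring the result back to the host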
  • Restructure code to eliminate loops
    A modest speed-up can often be found by reducing or eliminating the number of loops/calls made in the code. One example is code that divides each row of a matrix by its row index:
    A = rand(4);
    output = zeros(4);
    for n = 1:4
        output(n,:) = A(n,:) / n;
    end

    This could be:
    A = rand(4);
    output = A ./ repmat((1:4).', [1,4]);

    This process is also known as “code vectorization.” A good explanation of this concept can be found in the MATLAB online documentation.

    Through a combination of restructuring your code and using GPUArray built-in functions, you can often reduce the amount of processing time. In general, you want to minimize the number of calls you make to the GPUArray functions, to reduce the overhead of converting to, transferring to, and retrieving GPU data. This will usually involve a pattern of aggregating many inputs together, calling GPUArray built-in functions once on the aggregate, and unpacking the aggregated results, as in the sketch below.
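    A minimal sketch of that pattern (fft2 stands in for whatever GPU-supported operation you need):
    blocks = rand(64, 64, 500);    % 500 small problems aggregated into one array
    g = gpuArray(blocks);          % one host-to-device transfer
    r = fft2(g);                   % one call transforms every 2-D page
    out = gather(r);               % one device-to-host transfer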
  • Stay within GPU memory
    Sooner or later when using the GPU you will get an error like:
    ??? Error using ==> gpuArray at 28
    Out of memory on device. You requested: 762.94Mb, device has 1.31Gb free.


    It is not always possible to completely fill device memory, due to memory fragmentation and the need to create temporary work arrays.

    There are two options at this point: free GPU memory using “clear”, or try to divide your problem into smaller chunks. For example, if you have a 4 GB array you can break it into 1 GB chunks and do four operations instead of one, as sketched below. This is completely dependent on whether the problem can be divided. Host memory can be used as a buffer for data that is processed on the GPU. Matrix manipulation tends to be faster on the GPU, so there can be a nice speed-up from moving operations like subsref, reshape, etc. onto the GPU and then returning the results back to host memory.
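    A minimal sketch of chunked processing (the chunk size and the element-wise operation are placeholders to tune for your device and algorithm):
    A = rand(1e4, 4e4);                      % too large to move to the device at once
    chunk = 1e4;                             % columns per chunk
    out = zeros(size(A));
    for c = 1:chunk:size(A,2)
        cols = c : min(c+chunk-1, size(A,2));
        g = gpuArray(A(:,cols));             % move one chunk to the device
        out(:,cols) = gather(exp(g));        % compute, then buffer the result in host memory
    end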
  • Recast for arrayfun
    It is possible to use arrayfun to perform operations across a large number of scalar values and, to a certain extent, to handle bulk processing of small arrays.

    Handling scalar calculations is covered in the arrayfun documentation. But let's say you need to perform repetitive vector or matrix computations that MATLAB has not yet implemented for the GPU. In that case, you can still get good results with a little creative thinking. For example, let's say you want to take the determinants of 100 2x2 matrices. First, define a little function that takes the elements of one matrix as input, like this:
    function out = littledetfun(a11,a21,a12,a22)
    out = a11*a22 - a21*a12;    % determinant of [a11 a12; a21 a22]

    Then call it 100 times, in vectorized style, on the GPU using arrayfun (each row of A holds the elements of one 2x2 matrix, in column-major order):
    >> A = gpuArray(rand(100,4));
    >> output = arrayfun(@littledetfun,A(:,1),A(:,2),A(:,3),A(:,4));

    This is only practical for a small number of arguments, but can be useful under certain circumstances.

    Another use of arrayfun is to combine statements that are already vectorized into a single arrayfun function, e.g. averaging three numbers:

    Original:
    >> result = (a1 + a2 + a3) ./ 3;


    arrayfun:
    function out = littleaaafun(a1,a2,a3)
    out = (a1+a2+a3) / 3;

    >> result = arrayfun(@littleaaafun, a1, a2, a3);

    In some cases this can result in significant speedup.
  • Create a CUDA kernel
    Creating a CUDA kernel can provide a big speed-up and a big headache. Simple functions that are task-independent are good candidates for a CUDA kernel. Bad candidates include complex code with a lot of branching, task dependencies, and/or serialized output. A good starting point is to translate an existing, very simple MEX function to run on the GPU. In some cases it will make sense to convert a for-loop to use parallel threads, but in general dealing with parallel threads can be tricky.
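    A minimal sketch of invoking a kernel from MATLAB (the file names and kernel are hypothetical; the PTX file would be produced beforehand with nvcc -ptx addone.cu):
    % addone.cu contains:
    %   __global__ void addone(double *x, int n) {
    %       int i = blockIdx.x * blockDim.x + threadIdx.x;
    %       if (i < n) x[i] += 1.0;
    %   }
    k = parallel.gpu.CUDAKernel('addone.ptx', 'addone.cu');
    k.ThreadBlockSize = 256;
    k.GridSize = ceil(4096/256);
    x = gpuArray(zeros(4096,1));
    x = feval(k, x, 4096);    % non-const pointer arguments come back as outputs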
  • Use GPUs from a matlabpool on CPUs
    Calling a function that uses GPU functions inside a parfor loop runs the risk that worker processes will conflict with each other. The easiest option is to ensure that GPU actions occur only within the main MATLAB process and never inside a parfor loop. This can work effectively: use the GPU to expedite collecting the data inputs, then run the code that cannot be run on the GPU in a parfor loop, as in the sketch below.
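    For example (rawData and cpuOnlyStep are placeholders for your own data and code):
    % GPU work happens only in the client process...
    prepped = gather(fft2(gpuArray(rawData)));
    % ...then the workers run CPU-only code on the result.
    results = zeros(size(prepped, 3), 1);
    parfor i = 1:numel(results)
        results(i) = cpuOnlyStep(prepped(:,:,i));    % hypothetical CPU-only function
    end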

    To run GPU functions in a parfor loop, one option is to allocate one GPU per worker as explained here: http://www.mathworks.com/support/solutions/en/data/1-DOIZVT/index.html?product=DM. This will pin a specific GPU to each worker process and avoid conflicts or out-of-memory errors.

    Another alternative would be the following:
    function testGPUInParfor()
    % Select a GPU for each worker up front; labindex is well defined
    % inside spmd, and the selection persists on each worker (see below).
    spmd
        selectGPUDeviceForLab();
    end
    
    parfor i = 1:10
        % Each iteration will generate some data A
        A = rand(5);
        if selectGPUDeviceForLab()
            A = gpuArray(A);
            disp( 'Do it on the GPU' )
        else
            disp( 'Do it on the host' )
        end
        % replace the following line with whatever task you need to do
        S = sum(A,1);
        
        % Maybe collect back from GPU (gather is a no-op if not on the GPU)
        S = gather(S);
    end
    
     
     
    function ok = selectGPUDeviceForLab()
    % The persistent flag means the device selected during the initial
    % spmd call is simply reused when this is later called from parfor.
    persistent hasGPU;
    
    if isempty( hasGPU )
        % Spread workers across the available devices, round-robin
        devIdx = mod(labindex-1,gpuDeviceCount())+1;
        try
            dev = gpuDevice( devIdx );
            hasGPU = dev.DeviceSupported;
        catch %#ok
            hasGPU = false;
        end
    end
    ok = hasGPU;
  • Test code changes
    A good habit when adding GPU code is to test the outputs from the GPU code against the CPU version to ensure that numerical values have not changed. Due to slight differences in the way the GPU and CPU handle arithmetic, you may see a small amount of variance in the results (a difference on the order of 1e-7 percent may be reasonable); a minimal check is sketched below. Another good habit is to add conditionals or detection code to switch between the GPU and non-GPU versions of the code.
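    A minimal sketch of such a check (myAlgorithm and x are placeholders, and the tolerance should be tuned to your algorithm):
    refOut = myAlgorithm(x);                      % CPU reference result
    gpuOut = gather(myAlgorithm(gpuArray(x)));    % the same computation on the GPU
    relErr = max(abs(gpuOut(:) - refOut(:))) / max(abs(refOut(:)));
    assert(relErr < 1e-7, 'GPU result diverged: relErr = %g', relErr);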

    Verify that functions called on a gpuArray actually return the expected output, especially functions that are not explicitly listed as supported GPUArray built-in functions. shiftdim is an example of a function that doesn’t throw an error when called with a gpuArray, but also doesn’t necessarily return the expected result: in R2011a and earlier, the first call below should return a 1x4 array but comes back unchanged as 4x1:
    >> shiftdim(gpuArray(ones(4,1)),1)
    
    ans =
    
    parallel.gpu.GPUArray:
    ---------------------
                    Size: [4 1]
        ClassUnderlying: 'double'
                Complexity: 'real'
    
    >> shiftdim(gpuArray(ones(1,4)),1)
    
    ans =
    
    parallel.gpu.GPUArray:
    ---------------------
                    Size: [4 1]
        ClassUnderlying: 'double'
                Complexity: 'real'


    Other functions may work on a gpuArray but not natively, so that the operation is actually carried out by the CPU; the result is then not a gpuArray. This behavior may show up as a performance anomaly in the profiler.
  • Find and fix new bottlenecks
    After optimizing code to use GPU functions, you can start scaling up the size of the input data, which can lead to the discovery of new bottlenecks. It is best to iterate through the process: start with a fresh profile to establish the new bottleneck, then determine whether additional GPU functions can be used.

    Sometimes, after optimizing the code, it becomes obvious that you are performing many small calculations. In some cases it may not be possible to optimize any further, especially if you are operating in a loop that is not task independent.