
OpenCL Cookbook: Multi-device utilisation strategies

A quick note on the different ways in which one can utilise multiple devices in OpenCL. This overlaps with my previous post on the same subject in the series but takes a higher level view.

Here are the different ways one can load balance across multiple devices.

  • Use CPU and GPU devices.
  • Use multiple GPU devices.

Each of the above can be utilised in the following ways.

  • Single context – Use one context in total across all devices and one command queue per device.
  • Multiple contexts single-host-thread – Use one context and one command queue per device.
  • Multiple contexts multi-host-threaded – Same as multiple contexts above, but use one host thread per device to enqueue to that device’s command queue. This can also take the form of starting one host process per device, which is essentially the same thing.

For an elaboration on the latter strategies see my previous post on the subject.

OpenCL Cookbook: How to leverage multiple devices in OpenCL

So far, in the OpenCL Cookbook series, we’ve only looked at utilising a single device for computation. But what happens when you install more than one card in your host machine? How do you scale your computation across multiple GPUs? Will your code automatically scale to multiple devices or does it require you to consciously think about how to distribute the load of the computation across all available devices and change your code to apply that strategy? Here I look at answers to these questions.

Decide on how you want to use the host binding to support multiple devices

There are two ways in which a given host binding can support multiple devices.

  • A single context shared across all devices and one command queue per device.
  • One context and one command queue per device.

Let’s look at these in more detail with skeletal C++ implementations using the OpenCL C API.

Creating a single context across all devices and one command queue per device

In this arrangement we create a single context and share it across one command queue per device. So if we have, say, two devices, we’ll have one context and two command queues, each of which shares that one context.

#include <CL/opencl.h>

int main () {

    cl_int err;
    
    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    
    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);
    
    // get all devices
    cl_device_id* devices;
    devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);
    
    // create a single context for all devices
    cl_context context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
    
    // for each device create a separate queue
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (cl_uint i = 0; i < deviceCount; i++) {
        queues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
    }

    /*
     * Here you have one context across all devices and one command queue per device.
     * You can choose to send your tasks to any of these queues depending on which
     * device you want to execute the task on.
     */

    // cleanup
    for (cl_uint i = 0; i < deviceCount; i++) {
        clReleaseCommandQueue(queues[i]);
        clReleaseDevice(devices[i]);
    }
    
    clReleaseContext(context);

    delete[] devices;
    delete[] queues;
    
    return 0;
    
}
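To make the shared context concrete, here’s a small fragment (not part of the original listing; the data size is made up and it assumes at least two devices) showing that a buffer created once against the shared context can be used from any of the queues.

// Fragment (assumes at least two devices and a made up data size). Because both
// queues share one context, a single cl_mem can be written via one device's
// queue and read back via another's.
const size_t bytes = 1024 * sizeof(float);
float hostData[1024] = {0};

cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);

// blocking write through the first device's queue
err = clEnqueueWriteBuffer(queues[0], buffer, CL_TRUE, 0, bytes, hostData, 0, NULL, NULL);

// blocking read back through the second device's queue
err = clEnqueueReadBuffer(queues[1], buffer, CL_TRUE, 0, bytes, hostData, 0, NULL, NULL);

clReleaseMemObject(buffer);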

Creating one context and one command queue per device

Here I create one context and one command queue per device, so each queue has its own context rather than sharing one.

#include <CL/opencl.h>

int main () {

    cl_int err;
    
    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    
    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);
    
    // get all devices
    cl_device_id* devices;
    devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);
    
    // for each device create a separate context AND queue
    cl_context* contexts = new cl_context[deviceCount];
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (cl_uint i = 0; i < deviceCount; i++) {
        // each context wraps only its own device
        contexts[i] = clCreateContext(NULL, 1, &devices[i], NULL, NULL, &err);
        queues[i] = clCreateCommandQueue(contexts[i], devices[i], 0, &err);
    }
    }

    /*
     * Here you have one context and one command queue per device.
     * You can choose to send your tasks to any of these queues.
     */

    // cleanup
    for (cl_uint i = 0; i < deviceCount; i++) {
        clReleaseCommandQueue(queues[i]);
        clReleaseContext(contexts[i]);
        clReleaseDevice(devices[i]);
    }
    
    delete[] devices;
    delete[] contexts;
    delete[] queues;
    
    return 0;

}
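One consequence of this layout is worth sketching (the buffer size below is hypothetical): memory objects and programs are tied to a context, so with per-device contexts each resource has to be created once per context rather than shared.

// Fragment (hypothetical size): each context needs its own copy of the buffer.
const size_t bytes = 1024 * sizeof(float);
cl_mem* buffers = new cl_mem[deviceCount];
for (cl_uint i = 0; i < deviceCount; i++) {
    buffers[i] = clCreateBuffer(contexts[i], CL_MEM_READ_WRITE, bytes, NULL, &err);
}

// ... use buffers[i] only with queues[i] ...

for (cl_uint i = 0; i < deviceCount; i++) {
    clReleaseMemObject(buffers[i]);
}
delete[] buffers;

This duplication is one reason the single context option tends to use less memory, as noted further below.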

How do you scale your computation across multiple devices?

Sadly, utilising multiple devices for your computation is not something the binding does automatically when new devices are detected, nor is it possible for it to do so. It requires active thought from the host programmer. When using a single device you send all your kernel invocations to the command queue associated with that device. In order to use multiple devices you must have one command queue per device, either sharing a context or each having its own. Then you must decide how to distribute your kernel calls across all available queues. It may be as simple as a round-robin strategy across all queues, or it may be more complex.
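As a rough sketch of the round-robin idea (illustrative only: the kernels array, task count and work size below are hypothetical and assumed to have been set up per device already):

// Round-robin sketch: each task goes to the next queue in turn. The kernels
// array, taskCount and globalSize are hypothetical and assumed to be set up
// per device already. The enqueues are non-blocking; we only wait at the end.
for (int task = 0; task < taskCount; task++) {
    int q = task % (int) deviceCount;
    size_t globalSize = 1024;
    err = clEnqueueNDRangeKernel(queues[q], kernels[q], 1, NULL,
                                 &globalSize, NULL, 0, NULL, NULL);
}

// let every device drain its queue
for (cl_uint i = 0; i < deviceCount; i++) {
    clFinish(queues[i]);
}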

Bear in mind that if your computation entails reading back a result synchronously then a round-robin strategy across queues won’t work. This is because each call will block and complete before you send work to the next queue, which essentially makes the distribution across queues serial. Obviously this defeats the whole purpose of having multiple devices operating in parallel. What you really need is one host thread per device, each sending computations to its own command queue. That way each queue receives and processes computations in parallel with the other queues, and you achieve true hardware parallelism.

Which of the two ways should you use?

It depends. I would try the single context option first as it’s likely to use less memory and be faster. If you encounter instability or problems I would switch to the multiple-context method. That’s the general rule. There is, however, another reason you may opt for the multiple-context method. If you are using multiple host threads which all require access to a context, it is preferable for each thread to have its own context, as the OpenCL host binding is not guaranteed to be thread-safe. If you try to access a single context across multiple threads you may get serious system crashes and reboots, so always keep your OpenCL structures thread-confined.

Using a single context across multiple host threads

You may want to use one thread per device to send tasks to the command queue associated with each device. In this case you will have multiple host threads, but here you have to be careful. In my experience it has not been safe to use a single context across multiple host threads. The last time I tried this was in C# using the Cloo host binding. Using a single context across multiple host threads resulted in a Windows 7 blue screen, Windows dumping memory to a file and then rebooting, after which Windows failed to come back up until the machine was physically rebooted once more. The solution is to use the multiple-context option outlined above. Keep your OpenCL resources thread-confined and you’ll be fine.
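Here’s a minimal sketch of that thread-confined arrangement, assuming the contexts, queues and devices arrays from the multiple-context listing above and a hypothetical runTasksForDevice function that builds its programs, kernels and buffers against contexts[i] and enqueues only to queues[i].

#include <thread>
#include <vector>

// One host thread per device; every OpenCL object a given thread touches
// belongs exclusively to that thread. runTasksForDevice is a hypothetical
// function confined to contexts[i] and queues[i].
std::vector<std::thread> workers;
for (cl_uint i = 0; i < deviceCount; i++) {
    workers.emplace_back([=]() {
        runTasksForDevice(contexts[i], queues[i], devices[i]);
    });
}
for (std::thread& worker : workers) {
    worker.join();
}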