Tag Archives: gpu

OpenCL Cookbook: 10 tips for high performance kernels

Today we adorn our performance hat in the OpenCL Cookbook series. OpenCL kernel optimisation is a black art and very very hard. Recently when I came across a list of 10 tips for high performance kernels in Matthew Scarpino’s excellent book: OpenCL in Action I just had to share it as it was a true gem. Bear in mind that this is an abridged account – if you want more detail read the appropriate section in the book.

  1. Unroll loops: Comparison operations are expensive – if you know how many iterations you need simply perform them individually.
  2. Disable processing of denormalised numbers: The processing of denormalised numbers takes time. Disable them using the -cl-denorms-are-zero option. You can also disable processing of infinite values and NaNs using -cl-finite-math-only.
  3. Transfer constant primitive values to the kernel with compiler defines instead of private memory parameters. Example: clBuildProgram(program, 0, NULL, "-DSIZE=128", NULL, NULL);.
  4. Store small variable values in private memory instead of local memory: Private memory is faster and should be used unless you need to share data between work items in which case use local memory.
  5. Avoid local memory bank conflicts by accessing local memory sequentially: Successive 32-bit elements are stored in successive local memory banks. Sequential bank access is parallel whereas contending on same bank access is serial.
  6. Avoid using modulo operator: This operator is slow and alternatives should be used instead.
  7. Reuse private variables throughout the kernel but use macros to distinguish each separate use.
  8. For multiply-and-add operations use the fma function if available.
  9. Inline non-kernel functions in a program file: This uses more memory but removes the context switches and stack operations associated with regular function calls.
  10. Avoid branch miss penalties by coding conditional statements to be true more often than false: Processors often optimise for the true case and this can result in a penalty for the false case known as the branch miss penalty. Code your conditionals to evaluate to true as often as possible.

It’s easy to list rules for optimising kernels but the truth is that optimising kernels is hard; very hard. The best way to approach it is, in my opinion, to profile your application and kernels as well as to experiment with various changes. There will certainly be optimisations that you’ll apply that will turn out to be ineffectual or even slower so trial and error is key.

8 GPU watercooled computation rig using Power Color Devil 13 HD 7990 cards

Workstation motherboards provide 4 or more PCIE x16 gen 3 slots with a good distribution of dedicated links direct to multiple onboard CPUs. With dual cards such as 7990 which are in fact 2×7970 each one can in theory load up a 4 slot motherboard with 8 gpus in total. Alas, in practice, there can be obstacles booting with 7 or 8 of them with certain motherboads as they tend to run out of PCIE resources after around 6 cards which could be considered to be surprising for a workstation class motherboard. Incidentally, I think it’s worth noting that not only are these the only model of 7990 made but they are practically impossible to get hold of so very rare gems indeed.

Watercooling four dual gpus on a standard workstation motherboard can also be a challenge due to a severe shortage of space between them. Dual slot spacing between PCIE x16 slots is a tight fit as the tubes can take up almost three slots worth of space between cards. In this configuration they are installed on alternate slots so if you had 7 slots in total you’d only install on 1, 3, 5 and 7 leaving 2, 4 and 6 empty. Though 2, 4 and 6 will usually be PCIE x8 slots anyway as opposed to the rest being PCIE x16.

As you can see below these cards have been completely stripped down of their air cooling and heatsink apparatus prior to attaching waterblocks and tubing for coolant to pass through them. The tubing is secured using barb fittings which are stronger than the alternative: compression fittings though they do lack the aesthetic appeal of compression fittings. Compression fittings can come apart under tension and that can create a real mess as I realised the hard way one night.

If not using full card waterblocks (which these aren’t) individual adhesive heatsinks for all the ram chips (known as ram sinks) are required for sufficient cooling. There may be upto 12 of these tiny little ram sinks on each face of a card. I don’t have any photos of that right now but I’ll try and get some. Though, ram sinks, can be flimsy and easily become dislodged and fall off the cards if they are knocked which is why some people prefer full cover blocks. Full cover blocks are more robust but also more expensive.

In terms of power the system is supplied with 2400 watts of power composed of two 1200W power supplies chained by an adapter for the first to kickstart the other on boot. Half the gpus are powered by one and half by the other. This particular machine has 128GB of RAM and dual xeons with a combined total of 32 cores. Update: As rightly pointed out by the commenter below I meant hardware threads not physical cores here.

Note: I do not own the hardware or the photographs but consent has been acquired to publish them here. Also this system is not conceived or assembled by me though I do have the pleasure of loading it with OpenCL benchmarks and computations.

OpenCL Cookbook: How to leverage multiple devices in OpenCL

So far, in the OpenCL Cookbook series, we’ve only looked at utilising a single device for computation. But what happens when you install more than one card in your host machine? How do you scale your computation across multiple GPUs? Will your code automatically scale to multiple devices or does it require you to consciously think about how to distribute the load of the computation across all available devices and change your code to apply that strategy? Here I look at answers to these questions.

Decide on how you want to use the host binding to support multiple devices

There are two ways in which a given host binding can support multiple devices.

  • A single context across all device and one command queue per device.
  • One context and command queue per device

Let’s look at these in more detail with skeletal implementations in C.

Creating a single context across all devices and one command queue per device

For this particular way of the binding supporting multiple devices we create only one context and share it across one command queue per device. So if we have say two devices we’ll have one context and two command queues each of which share that one context.

#include <iostream>
#include <CL/cl.hpp>
#include <CL/opencl.h>

int main () {

    cl_int err;
    
    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    
    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);
    
    // get all devices
    cl_device_id* devices;
    devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);
    
    // create a single context for all devices
    cl_context context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
    
    // for each device create a separate queue
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (int i = 0; i < deviceCount; i++) {
        queues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
    }

    /*
     * Here you have one context across all devices and one command queue per device.
     * You can choose to send your tasks to any of these queues depending on which
     * device you want to execute the task on.
     */

    // cleanup
    for(int i = 0; i < deviceCount; i++) {
        clReleaseDevice(devices[i]);
        clReleaseCommandQueue(queues[i]);
    }
    
    clReleaseContext(context);

    delete[] devices;
    delete[] queues;
    
    return 0;
    
}

Creating one context and one command queue per device

Here I create one context and one command queue per device each of which have their own context rather than sharing one.

#include <iostream>
#include <CL/cl.hpp>
#include <CL/opencl.h>

int main () {

    cl_int err;
    
    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    
    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);
    
    // get all devices
    cl_device_id* devices;
    devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);
    
    // for each device create a separate context AND queue
    cl_context* contexts = new cl_context[deviceCount];
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (int i = 0; i < deviceCount; i++) {
        contexts[i] = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
        queues[i] = clCreateCommandQueue(contexts[i], devices[i], 0, &err);
    }

    /*
     * Here you have one context and one command queue per device.
     * You can choose to send your tasks to any of these queues.
     */

    // cleanup
    for(int i = 0; i < deviceCount; i++) {
        clReleaseDevice(devices[i]);
        clReleaseContext(contexts[i]);
        clReleaseCommandQueue(queues[i]);
    }
    
    delete[] devices;
    delete[] contexts;
    delete[] queues;
    
    return 0;

}

How do you scale your computation across multiple devices?

The process of utilising multiple devices for your computation is not done automatically by the binding when new devices are detected sadly. Nor is it possible for it do so. Doing this requires active thought from the host programmer. When using a single device you send all your kernel invocations to the command queue associated with that device. In order to use multiple devices you must have one command queue per device either sharing a context or each queue having its own context. Then you must decide how to distribute your kernel calls across all available queues. It may be as simple as a round robin strategy across all queues for all your computations or it may be more complex.

Bear in mind that if your computation entails reading back a result synchronously then a round robin strategy across queues won’t work. This is because each current call will block and complete prior to you sending to the next queue which will essentially make the process of distributing across queues serial. Obviously this defeats the whole purpose of having multiple devices operating in parallel. What you really need is one host thread per device each sending computations to its own command queue. That way each queue is receiving and processing computations in parallel with other queues. Then you effectively achieve true hardware parallelism.

Which of the two ways should you use?

It depends. I would try the single context option first as it’s likely to use less memory and be faster. If you encounter instability or problems I would switch to the multiple context method. That’s the general rule. There is, however, another reason you may opt for a multiple context method. If you are using multiple threads which all require access to a context it is preferable for each thread to have its own context as the opencl host binding is not guaranteed to be thread safe. If you try to access a single context across multiple threads you may get serious system crashes and reboots so always have thread confined opencl structures.

Using a single context across multiple host threads

You may want to use one thread per device to send tasks to the command queue associated with each device. In this case you will have multiple host threads. But here have to be careful. In my experience it has not been safe to use a single context across multiple host threads. The last time I tried this was in C# using the Cloo host binding. Using a single context across multiple host threads resulted in a Windows 7 blue screen, Windows dumping memory to a file and then rebooting after which Windows failed to come back up until physically rebooted once more from the machine. The solution is to use the multi context option outlined above. Have thread confined separation for opencl resources and you’ll be fine.

OpenCL Cookbook: Hello World using C# Cloo host binding

So far I’ve used the C and C++ bindings in the OpenCL Cookbook series. This time I provide a quick and simple example of how to use Cloo – the C# OpenCL host binding. However, since Cloo, for whatever reason, didn’t work as expected with a char array I will use an integer array instead. In other words – instead of sending a “Hello World!” message to the kernel I will send five integers instead. My guess is that there is some sort of bug with Cloo and char arrays.

Device code using Cloo’s variant of the OpenCL language

kernel void helloWorld(global read_only int* message, int messageSize) {
	for (int i = 0; i < messageSize; i++) {
		printf("%d", message[i]);
	}
}

The kernel above is merely illustrative in that it simply receives an integer array and its size and prints the array.

Note that the OpenCL syntax here is not the same as in C/C++. It has additional keywords to say whether the arguments are read only or write or read write and the kernel keyword is not prefixed with two underscores. The Cloo author must have decided that the original OpenCL syntax was for whatever reason unsuitable for adoption which IMO was a mistake. The OpenCL language syntax should be standard for portability, reusability and also so that there is only a single learning curve.

Host code using Cloo API

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.IO;
using Cloo;

namespace test
{
    class Program
    {
        static void Main(string[] args)
        {
            // pick first platform
            ComputePlatform platform = ComputePlatform.Platforms[0];

            // create context with all gpu devices
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                new ComputeContextPropertyList(platform), null, IntPtr.Zero);

            // create a command queue with first gpu found
            ComputeCommandQueue queue = new ComputeCommandQueue(context,
                context.Devices[0], ComputeCommandQueueFlags.None);

            // load opencl source
            StreamReader streamReader = new StreamReader("..\..\kernels.cl");
            string clSource = streamReader.ReadToEnd();
            streamReader.Close();

            // create program with opencl source
            ComputeProgram program = new ComputeProgram(context, clSource);

            // compile opencl source
            program.Build(null, null, null, IntPtr.Zero);

            // load chosen kernel from program
            ComputeKernel kernel = program.CreateKernel("helloWorld");

            // create a ten integer array and its length
            int[] message = new int[] { 1, 2, 3, 4, 5 };
            int messageSize = message.Length;

            // allocate a memory buffer with the message (the int array)
            ComputeBuffer<int> messageBuffer = new ComputeBuffer<int>(context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, message);

            kernel.SetMemoryArgument(0, messageBuffer); // set the integer array
            kernel.SetValueArgument(1, messageSize); // set the array size

            // execute kernel
            queue.ExecuteTask(kernel, null);

            // wait for completion
            queue.Finish();
        }
    }
}

The C# program above uses the Cloo object oriented api to interface with the underlying low level opencl implementation. It’s pretty self explanatory if you’ve been following the series so far. The output of the program is 12345.

OpenCL Cookbook: Parallelise your host loops using OpenCL

Continuing on in our series – this time we look at possibly the most important topic of all in OpenCL. It is the reason why we use OpenCL and it is also the most compelling benefit that OpenCL offers. It is, of course, parallelism. But how do we exploit the vast amount of parallelism that GPUs offer? At the simplest level we can do so by exploiting latent areas of parallelism in our host code the simplest of which are loops. In other words – if we can port loops in our host code to the GPU they become parallel and get faster by a factor of the total number of iterations. I demonstrate using a small example.

Host loop

void cpu_3d_loop (int x, int y, int z) {

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%dn", i, j, k);
            }
        }
    }

}

Imagine the loop above in our C++ host code. This is not one loop but in fact three. In other words it has three dimensions. The total number of iterations in this combined loop is x*y*z. If x=4, y=3 and z=2 the total number of iterations would be 4x3x2=24. On the CPU these loops execute serially which is fine for a small number of iterations but for large numbers it becomes a fundamental bottleneck. If this set of loops was ported to the GPU each iteration would run in parallel and the total number of threads in use would be 24 for the previous example.

A small scale example may not seem impressive at first. You could argue that you could just as well run 24 threads on the CPU. But consider this: what happens when you have the above set of loops in your host code performing thousands or even millions of iterations? How are you going to achieve hardware parallelism in this case on the CPU? The answer is you can’t. GPUs each have hundreds of cores and offer a far greater degree of parallelism so loops with a large number of iterations becomes easy work for the GPU which can run thousands or even millions of threads effectively. Below I demonstrate how to port such a loop to OpenCL.

Host binding code

#define __NO_STD_VECTOR
#define __CL_ENABLE_EXCEPTIONS

#include <fstream>
#include <iostream>
#include <iterator>
#include <CL/cl.hpp>
#include <CL/opencl.h>

using namespace cl;

void cpu_3d_loop (int x, int y, int z) {

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%dn", i, j, k);
            }
        }
    }

}

int main () {

    // CPU 3d loop

    int x = 4;
    int y = 3;
    int z = 2;
    cpu_3d_loop(x, y, z);
    std::cout << std::endl;

    // GPU 3d loop

    vector<Platform> platforms;
    vector<Device> devices;
    vector<Kernel> kernels;
    
    try {
    
        // create platform, context and command queue
        Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        Context context(devices);
        CommandQueue queue(context, devices[0]);

        // load opencl source
        std::ifstream cl_file("kernels.cl");
        std::string cl_string(std::istreambuf_iterator<char>(cl_file),
            (std::istreambuf_iterator<char>()));
        Program::Sources source(1, std::make_pair(cl_string.c_str(), 
            cl_string.length() + 1));

        // create program and kernel and set kernel arguments
        Program program(context, source);
        program.build(devices);
        Kernel kernel(program, "ndrange_parallelism");

        // execute kernel and wait for completion
        NDRange global_work_size(x, y, z);
        queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);
        queue.finish();

    } catch (Error e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }

    return 0;
    
}

The above program runs the cpu loop and then runs the equivalent logic on the gpu. Both cpu and gpu runs produce output to show which iteration they are processing. The key lines of code that demonstrate how to port the loop are below.

NDRange global_work_size(x, y, z);
queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);

Here we set three upper bounds – one for each loop – this is known as the global work size. The kernel can then retrieve values for the currently executing iteration within the kernel itself as shown below. It can then use these indices to do whatever work is inside the loop. In this case we just print the indices for illustration.

Kernel code

The kernel you see below is executed x*y*z times with different values for i, j and k. See? No loops! 🙂

__kernel void ndrange_parallelism () {

	int i = get_global_id(0);
	int j = get_global_id(1);
	int k = get_global_id(2);

	printf("GPU %d,%d,%dn", i, j, k);
	
}

The output of running the above host code is as follows.

CPU 0,0,0
CPU 0,0,1
CPU 0,1,0
CPU 0,1,1
CPU 0,2,0
CPU 0,2,1
CPU 1,0,0
CPU 1,0,1
CPU 1,1,0
CPU 1,1,1
CPU 1,2,0
CPU 1,2,1
CPU 2,0,0
CPU 2,0,1
CPU 2,1,0
CPU 2,1,1
CPU 2,2,0
CPU 2,2,1
CPU 3,0,0
CPU 3,0,1
CPU 3,1,0
CPU 3,1,1
CPU 3,2,0
CPU 3,2,1

GPU 0,0,0
GPU 1,0,0
GPU 2,0,0
GPU 3,0,0
GPU 0,1,0
GPU 1,1,0
GPU 2,1,0
GPU 3,1,0
GPU 0,2,0
GPU 1,2,0
GPU 2,2,0
GPU 3,2,0
GPU 0,0,1
GPU 1,0,1
GPU 2,0,1
GPU 3,0,1
GPU 0,1,1
GPU 1,1,1
GPU 2,1,1
GPU 3,1,1
GPU 0,2,1
GPU 1,2,1
GPU 2,2,1
GPU 3,2,1

NOTE: Although there may appear to be a sequence in the order in which the GPU processes the iterations this is only due to the use of printf(). In reality when not using printf() the order of iterations is completely arbitrary and random. Therefore one must not rely on the order of iterations when porting loops to the GPU. If you need loops to be in a certain order then you can either keep your loops on the host or port only those parts of the loop that do not need to be sequential.

Why use GPU computing?

Although this example is fairly simple it does illustrate the most important value add of GPU computing and OpenCL. Hardware parallelism is the essence of what GPU computing offers and it is the most compelling reason to use it. If you imagine a legacy codebase and all the latent areas of parallelism that are currently running sequentially you can imagine the vast untapped power of GPGPU. Later on in the series we will look at techniques to port existing host code to the GPU. That process can be very difficult but can provide dramatic gains in performance far beyond the limits of CPU computing. Till next time.

OpenCL Cookbook: Hello World using C++ host binding

Last time, in the OpenCL Cookbook series, I presented a hello world example using OpenCL and C for the host binding language. This time I present a very similar example but using the C++ host binding language. As you already know from previous posts the host language that interfaces with an OpenCL device can be any number of languages such as C, C++, Java, C# and Python.

So far I’ve been using the C API but I’ve decided to switch to the C++ API for two reasons: (1) it’s considerably less lines of code being more succinct and (2) it supports exceptions meaning that you do not have to check error codes for every line of binding code that you write. So, here follows, a brief primer of the C++ OpenCL binding. It’s a very simple example but trust me – we’ll be getting to more complex examples soon (time is the issue).

OpenCL kernel

__kernel void hello_world (__global char* message, int messageSize) {
	for (int i =0; i < messageSize; i++) {
		printf("%s", message[i]);
	}
}

The kernel (OpenCL function) above receives a char array (in essence a string) from the host as well as the size of the char array (as there is no way to derive an array's size from the array itself (Java programmers gasp in shock and disgust). The kernel simply iterates over all the letters in the char array and prints them one at a time to standard output thereby printing the message: "Hello World!". Now let's look at the C++ code that interfaces with this kernel.

C++ host binding

#define __CL_ENABLE_EXCEPTIONS

#include <fstream>
#include <iostream>
#include <iterator>
#include <CL/cl.hpp>
#include <CL/opencl.h>

using namespace std;

int main () {

    vector<cl::Platform> platforms;
    vector<cl::Device> devices;
    vector<cl::Kernel> kernels;
    
    try {
    
        // create platform
        cl::Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

        // create context
        cl::Context context(devices);

        // create command queue
        cl::CommandQueue queue(context, devices[0]);

        // load opencl source
        ifstream cl_file("opencl_hello_world.cl");
        string cl_string(istreambuf_iterator<char>(cl_file), (istreambuf_iterator<char>()));
        cl::Program::Sources source(1, make_pair(cl_string.c_str(), 
            cl_string.length() + 1));

        // create program
        cl::Program program(context, source);

        // compile opencl source
        program.build(devices);

        // load named kernel from opencl source
        cl::Kernel kernel(program, "hello_world");

        // create a message to send to kernel
        char* message = "Hello World!";
        int messageSize = 12;

        // allocate device buffer to hold message
        cl::Buffer buffer(CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 
            sizeof(char) * messageSize, message);

        // set message as kernel argument
        kernel.setArg(0, buffer);
        kernel.setArg(1, sizeof(int), &messageSize);

        // execute kernel
        queue.enqueueTask(kernel);

        // wait for completion
        queue.finish();

        cout << endl;
        
    } catch (cl::Error e) {
        cout << endl << e.what() << " : " << e.err() << endl;
    }
    
    return 0;
    
}

The above C++ host binding code is annotated to say what it's doing at each step but I'll provide a brief overview. Initially it's creating a platform, a context and a command queue which are basic opencl binding data structures that are required to interface with an opencl device. It then loads the opencl source from a separate file and with it creates a program. The program is built which compiles the opencl source. It then loads a specific kernel (function) from that source by a given name. It creates a string message on the host side but in order to send it to the device it must create a buffer of the same size as the message. The buffer is created and set as a kernel argument along with the size of the message we are sending.

The kernel is then executed and we wait for its completion on the host. The finish command flushes all outstanding tasks to the device and waits for them to finish. Note the clean exception handling using a try/catch wrap around the entire code instead of having to check error codes produced by each statement. I much prefer the C++ api to the C API. I think you'll agree that it's more concise and cleaner. Till next time.

AMD and Oracle to collaborate on Heterogenous Computing in Java

In August John Coomes from Oracle made a proposal to add GPU support to Java. One month later, on Sep 10, he proposed the creation of a new project called Sumatra to continue with this endeavour. On Sep 24 this project was approved by a 100% vote in favour. During the recent JavaOne 2012 AMD officially announced its participation in OpenJDK Project Sumatra in collaboration with Oracle and OpenJDK to bring heterogenous computing to Java for server and cloud environments. The Inquirer also reports on this subject.

This is very exciting news indeed. Although there are already two libraries for GPU programming in Java – namely rootbeer and aparapi, having GPU support built in to the Java language, the Java API and most importantly the JVM will provide an alternative more compelling than the use of any external library. And to be quite frank there could not be a collaborator than AMD given their vast contribution to date to OpenCL and OpenCL development tools. And unlike Nvidia, they are wholly committed to OpenCL and not working on their own proprietary alternative.

Although it’ll be a while before this project sees any substantial contribution I cannot wait to see this take form over the next year or two. OpenCL and, in general, the GPU programming paradigm is hard; very hard; and even more importantly porting existing code is even harder; and if anyone can make this domain accessible to the mainstream it’s Java. Once Sumatra is ready hopefully we won’t have to write OpenCL anymore. We’ll be able to write normal Java, compile it and at either compile time or runtime the byte code will get translated into OpenCL and compiled. At execution time we won’t have to worry about what hardware we’re running because with any luck it’ll be write once run anywhere!

OpenCL Cookbook: Hello World using C host binding

In our OpenCL Cookbook series so far we’ve looked at some preliminary data structures in OpenCL host programming using the C language. This time – we finally arrive at a complete end-to-end example – the customary Hello World!

What this example does is simple. The host program in C passes a character array to the GPU into which the GPU writes the characters of the phrase: “Hello, World!”. The host program then reads the contents of the character array back and prints them on the screen. The output should be “Hello, World!”.

The code is annotated using brief comments. There are some aspects of OpenCL that are new that I have not yet been through in previous articles but don’t worry I’ll go through a full dissection after presenting the complete code.

Note that error handling has been taken out completely to keep the program short for easy viewing. The most important aspect of error handling in the program below is checking the build error, status and log for a failed program build which you can find further details of in my previous post.

Host source

#include
#include
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main() {

    cl_platform_id platform; cl_device_id device; cl_context context;
    cl_program program; cl_kernel kernel; cl_command_queue queue;
    cl_mem kernelBuffer;

    FILE* programHandle; char *programBuffer; char *programLog;
    size_t programSize; char hostBuffer[32];

    // get first available sdk and gpu and create context
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    // get size of kernel source
    programHandle = fopen("helloWorld.cl", "r");
    fseek(programHandle, 0, SEEK_END);
    programSize = ftell(programHandle);
    rewind(programHandle);

    // read kernel source into buffer
    programBuffer = (char*) malloc(programSize + 1);
    programBuffer[programSize] = '\0';
    fread(programBuffer, sizeof(char), programSize, programHandle);
    fclose(programHandle);

    // create and build program
    program = clCreateProgramWithSource(context, 1,
            (const char**) &programBuffer, &programSize, NULL);
    free(programBuffer);
    clBuildProgram(program, 1, &device, "-Werror -cl-std=CL1.1", NULL, NULL);

    // create kernel and command queue
    kernel = clCreateKernel(program, "hello", NULL);
    queue = clCreateCommandQueue(context, device, 0, NULL);

    // create kernel argument buffer and set it into kernel
    kernelBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
            32 * sizeof(char), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &kernelBuffer);

    // execute kernel, read back the output and print to screen
    clEnqueueTask(queue, kernel, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, kernelBuffer, CL_TRUE, 0,
            32 * sizeof(char), hostBuffer, 0, NULL, NULL);
    puts(hostBuffer);

    clFlush(queue);
    clFinish(queue);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(kernelBuffer);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;

}

The host source runs on the CPU and is written in C in this case though you could also write it in C++, Python or Java whereas the kernel source runs on a device which could be one or more CPUs, GPUs or accelerators. The host source must be written in a host language whereas the kernel source must be written in OpenCL.

Host source by dissection

Here I describe what the host source is doing by dissecting it. A hello world example should ideally be entirely self contained and not rely on other articles to complement the reader’s understanding. With the exception of error handling and particularly how to debug a failed program build which I address elsewhere this example is self contained.

Below I present one snippet of code at a time followed by its dissection.

Creating platforms, devices and contexts

// get first available sdk and gpu and create context
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

Here I first get a platform (an OpenCL SDK/framework). As I know I only have the Apple OpenCL framework installed on my Mac it will always be the one selected. However, if you have multiple SDKs installed such as AMD, Nvidia and Intel then you may want to select one explicitly. Next I ask for a GPU device. Once again, my machine only has one GPU so it will always be the one that’s selected but if you have multiple GPUs installed you may want to choose one in particular. Finally I create a context which is an incredibly important OpenCL data structure as it is required for the creation of numerous other structures such as programs, command queues and kernel buffers.

Loading kernel sources

// get size of kernel source
programHandle = fopen("helloWorld.cl", "r");
fseek(programHandle, 0, SEEK_END);
programSize = ftell(programHandle);
rewind(programHandle);

// read kernel source into buffer
programBuffer = (char*) malloc(programSize + 1);
programBuffer[programSize] = '\0';
fread(programBuffer, sizeof(char), programSize, programHandle);
fclose(programHandle);

As this is a host source file it has the responsibility of involving the kernel source. Generally speaking the kernel source is usually compiled at runtime as part of the execution of the host source. Therefore, the host source file must pull in the kernel source and compile it. Above I first calculate the size of the kernel source file and then read the source in into a buffer of that calculated size.

Creating a program and compiling kernel sources

// create and build program
program = clCreateProgramWithSource(context, 1,
        (const char**) &programBuffer, &programSize, NULL);
free(programBuffer);
clBuildProgram(program, 1, &device, "-Werror -cl-std=CL1.1", NULL, NULL);

Here I construct a program structure by passing in a context and the buffer containing the kernel source. Then I build the program which essentially compiles the kernel source based on supplied build options. Note that a program can contain numerous kernel sources containing multiple OpenCL functions potentially drawn in from a number of files. This program build steps builds the sum total of all kernels sources read in. At this point the build could fail for a variety of reasons and it’s critically important to be able to narrow the cause easily. Here I’ve skipped this error handling but I address this subject in detail on my previous post in the series.

Creating kernels and command queues

// create kernel and command queue
kernel = clCreateKernel(program, "hello", NULL);
queue = clCreateCommandQueue(context, device, 0, NULL);

Here I create a kernel and a command queue structure. Let’s look at what each one means in turn.

A kernel is an OpenCL function that executes on one or more devices. The program structure above may contain numerous functions so the purpose of creating the kernel structure above is to pinpoint one particular one called ‘hello’. I need a reference to this kernel in order to pass it an argument later on in the process.

A command queue is exactly what the name implies. The host program invokes a device by sending it a command. The sending mechanism for that command is a queue. Commands are by default processed in FIFO order but that can be changed by a configuration option. Sending a command, also known as a task, to a command queue is a way of requesting its execution.

Setting kernel arguments

// create kernel argument buffer and set it into kernel
kernelBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
        32 * sizeof(char), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &kernelBuffer);

Here, I create an OpenCL memory object. There are two types of memory objects – image and buffer. Here I am not dealing with image data so I choose a buffer memory object. Our goal here is to provide the kernel with a character array big enough to hold the phrase ‘Hello, World!’. However I cannot pass a character array into a kernel directly. I must create an OpenCL buffer memory object of a given size and then set it as the first kernel argument and that’s what I’m doing above. You’ll notice that I set the memory object to be write only as the device only needs to write to it.

Executing kernels and reading output data

// execute kernel, read back the output and print to screen
clEnqueueTask(queue, kernel, 0, NULL, NULL);
clEnqueueReadBuffer(queue, kernelBuffer, CL_TRUE, 0,
        32 * sizeof(char), hostBuffer, 0, NULL, NULL);
puts(hostBuffer);

This is the final step. Earlier I created a command queue and a kernel structure for the hello function and passed in a buffer memory object as the first argument. Here I complete the entire process by enqueuing the kernel for execution as a task into the command queue and reading back the output by passing in a character array of the same size as the original kernel buffer memory object. I then print the contents of that array onto the screen to prove that it contains what the GPU originally wrote into it.

Cleaning up

clFlush(queue);
clFinish(queue);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(kernelBuffer);
clReleaseCommandQueue(queue);
clReleaseContext(context);

Above I first ensure that all commands have been issed to the device associated with the command queue by calling clFlush(). Then I block until all commands have been issued and completed by calling clFinish(). The rest of the functions above simple deallocate their own respective named structures.

Kernel source

__kernel void hello(__global char* message){
message[0] = 'H';
message[1] = 'e';
message[2] = 'l';
message[3] = 'l';
message[4] = 'o';
message[5] = ',';
message[6] = ' ';
message[7] = 'W';
message[8] = 'o';
message[9] = 'r';
message[10] = 'l';
message[11] = 'd';
message[12] = '!';
message[13] = '';
}

The kernel source is fairly self explanatory. It simply receives a character array called message and writes its message into it. Kernel functions get infinitely more complex than this one but this one has been kept deliberately simple.

Compile and run as follows keeping both source files in the same directory.

clang -framework OpenCL helloWorld.c -o helloWorld && ./helloWorld

As always if you have any feedback or if this helped you let me know in the comments!

Oracle and AMD propose GPU support in Java

A very exciting development indeed has come to my attention albeit four days late. This news is, without exaggeration, in my humble opinion nothing short of groundbreaking, not only because it pushes the boundaries and capabilities of Java even further in terms of hardware support and performance but also because, as I’m gradually beginning to realise, the future of high performance computing is in the GPU.

John Coomes (OpenJDK HotSpot Group Lead) and Gary Frost (AMD) have proposed a new OpenJDK project to implement GPU support in Java with a native JVM (full thread).

This project intends to enable Java applications to seamlessly take advantage of a GPU–whether it is a discrete device or integrated with a CPU–with the objective to improve the application performance.

Their focus will be on code generation, garbage collection and runtimes.

We propose to use the Hotspot JVM, and will concentrate on code generation, garbage collection, and
runtimes. Performance will be improved, while preserving compile time, memory consumption and code generation quality.

The project will also use the new Java 8 lambda language and may eventually sprout further platform enhancements.

We will start exploring leveraging the new Java 8 Lambda language and library features. As this project progress, we may identify challenges with the Java API and constructs which may lead to new language, JVM and library extensions that will need standardization under the JCP process.

John Coomes (Oracle) will lead the project and Gary Frost (AMD) has gladly offered resources from AMD. As the mail thread got underway two already existing related projects were brought to the list’s attention: rootbeer (PDF, slashdot) and Aparapi (AMD page). Rootbeer is a particularly interesting project as it performs static analysis of Java code and generates CUDA code automatically – quite different from compile time OpenCL bindings from Java. The developer of rootbeer has also shown interest in joining the openjdk project. John Rose, Oracle HotSpot developer, also posted a talk he recently gave on Arrays 2.0 which I’m yet to watch.

The missing link in deriving value from GPUs, from what I understand in a domain that I’m still fairly new to, is that GPU hardware and programming need to be made accessible to the mainstream and that’s the purpose I hope this project will serve. I hope also that, once underway, it also raises wider awareness of how we must make a mental shift from concurrency to parallelism and from data parallelism to task parallelism and how the next generational leap in parallelism will come, not from cpu cores or concurrency frameworks, but from harnessing the mighty GPU.

Hacker news has some discussion. What do you think? Let me know.

OpenCL Cookbook: Creating contexts and reference counting

Following on from my previous articles on platforms and devices in the OpenCL Cookbook series, in this instalment, I move onto the next most critical host programming data structure in OpenCL – the context.

Contexts

A context in OpenCL requires a platform and one or more devices to function and is used to create command queues which are the structures that allow hosts to send kernels to devices. That’s a loaded sentence so let’s break it down. The program such as the one written below in C (the host) may want the CPU or the GPU (devices) to execute a calculation (a kernel i.e. a function). In order for that to happen a command queue for that device must be created and the calculation enqueued onto it. That, in essence, is how a task is relayed to a device and execution is triggered in OpenCL.

Context creation

In creating a context – the platform and devices do not necessarily have to be created and supplied to the context creating method. For example a context can be created simply by choosing a device type as below.

context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

In the example above, the platform that is selected is implementation defined but if you only have one platform like me then that’s automatically selected. The device selected will be the first available one of the type you specify. So, on my Macbook Air, the Apple SDK and my only GPU – the NVidia card are automatically selected.

Alternatively, as in the snippet below you can create and provide the platform and device explicitly.

context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

Controlling context lifetime using its reference count

In the complete program below I also introduce another important concept related to contexts – its reference count. A context when created starts with a reference count of 1 and when the function you are creating it in goes out of scope it is deallocated automatically. But this may be undesirable – maybe you want to continue accessing the context after the current function has gone out of scope. For this reason a context’s reference count can be incremented and decremented and it is only deallocated when its reference count reaches zero.

The general guideline is that if you are writing a function that uses an already created context you should increment the reference count at the start and decrement it at the end of your function. If, however, you are creating a context then at the end of your function you must simply decrement the reference count as I’m doing below.

#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main() {

    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_uint refCount;

    // get first available platform
    clGetPlatformIDs(1, &platform, NULL);

    // get first available gpu device
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // create context
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    // get context reference count
    clGetContextInfo(context, CL_CONTEXT_REFERENCE_COUNT,
            sizeof(refCount), &refCount, NULL);
    printf("Ref count: %u ", refCount);

    // increment reference count
    clRetainContext(context);
    clGetContextInfo(context, CL_CONTEXT_REFERENCE_COUNT,
            sizeof(refCount), &refCount, NULL);
    printf(">> %u ", refCount);

    // decrement reference count
    clReleaseContext(context);
    clGetContextInfo(context, CL_CONTEXT_REFERENCE_COUNT,
            sizeof(refCount), &refCount, NULL);
    printf(">> %u ", refCount);

    // finally release context
    clReleaseContext(context);
    printf(">> 0n");
    return 0;

}

Compile and run on the Mac as follows. If you don’t have the clang, g++ or gcc commands install them.

$ clang -framework OpenCL contexts.c -o contexts && ./contexts

The output produced on my machine is as below.

Ref count: 1 >> 2 >> 1 >> 0

As always error handling has been omitted for brevity and the code is only tested on my Macbook Air but should work on other platforms. If you have any issues or have suggestions for improvements to the code do let me know.

Did this help you? Let me know in the comments!