OpenCL Cookbook: 10 tips for high performance kernels

Today we don our performance hat in the OpenCL Cookbook series. OpenCL kernel optimisation is a black art and very, very hard. Recently, when I came across a list of 10 tips for high performance kernels in Matthew Scarpino’s excellent book OpenCL in Action, I just had to share it as it was a true gem. Bear in mind that this is an abridged account – if you want more detail read the appropriate section in the book.

  1. Unroll loops: Comparison operations are expensive – if you know how many iterations you need simply perform them individually (tips 1 and 3 are illustrated in a sketch after this list).
  2. Disable processing of denormalised numbers: The processing of denormalised numbers takes time. Disable them using the -cl-denorms-are-zero option. You can also disable processing of infinite values and NaNs using -cl-finite-math-only.
  3. Transfer constant primitive values to the kernel with compiler defines instead of private memory parameters. Example: clBuildProgram(program, 0, NULL, "-DSIZE=128", NULL, NULL);.
  4. Store small variable values in private memory instead of local memory: Private memory is faster and should be used unless you need to share data between work items in which case use local memory.
  5. Avoid local memory bank conflicts by accessing local memory sequentially: Successive 32-bit elements are stored in successive local memory banks. Sequential bank access is parallel whereas contending on same bank access is serial.
  6. Avoid using the modulo operator: This operator is slow and alternatives should be used instead.
  7. Reuse private variables throughout the kernel but use macros to distinguish each separate use.
  8. For multiply-and-add operations use the fma function if available.
  9. Inline non-kernel functions in a program file: This uses more memory but removes the context switches and stack operations associated with regular function calls.
  10. Avoid branch miss penalties by coding conditional statements to be true more often than false: Processors often optimise for the true case and this can result in a penalty for the false case known as the branch miss penalty. Code your conditionals to evaluate to true as often as possible.
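
To make tips 1 and 3 concrete, here is a minimal kernel sketch (the kernel and its names are hypothetical, not from the book). SIZE is supplied at build time with clBuildProgram(program, 0, NULL, "-DSIZE=4", NULL, NULL); and, being a compile-time constant, it allows the loop to be written out by hand.

__kernel void scale(__global float* data, float factor) {
	int base = get_global_id(0) * SIZE;
	// with SIZE known at compile time the loop
	// for (int i = 0; i < SIZE; i++) data[base + i] *= factor;
	// can be unrolled by hand, removing the per-iteration comparison
	data[base + 0] *= factor;
	data[base + 1] *= factor;
	data[base + 2] *= factor;
	data[base + 3] *= factor;
}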

It’s easy to list rules for optimising kernels but the truth is that optimising kernels is hard; very hard. The best way to approach it is, in my opinion, to profile your application and kernels as well as to experiment with various changes. There will certainly be optimisations that you’ll apply that will turn out to be ineffectual or even slower so trial and error is key.

OpenCL Cookbook: Multi-device utilisation strategies

A quick note on all the different ways in which one can utilise multiple devices in OpenCL. This overlaps with my previous post on the same subject but serves as a higher level overview.

Here are the different ways one can load balance across multiple devices.

  • Use CPU and GPU devices.
  • Use multiple GPU devices.

For each of the above one can utilise them in the following ways.

  • Single context – Use one context in total across all devices and one command queue per device (see the sketch after this list).
  • Multiple contexts single-host-thread – Use one context and one command queue per device.
  • Multiple contexts multi-host-threaded – Same as multiple contexts above but use one host thread per device to enqueue to the command queue for that device. This can also be in the form of starting one host process instance per device which in essence is the same.
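
To make the single context strategy concrete, here is a minimal C sketch (error handling omitted and the cap of 16 devices arbitrary) that creates one context spanning all GPUs and one command queue per device.

#include <CL/cl.h>

#define MAX_DEVICES 16

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[MAX_DEVICES];
    cl_command_queue queues[MAX_DEVICES];
    cl_uint numDevices;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEVICES, devices, &numDevices);

    // one context in total spanning all devices
    cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, NULL);

    // one command queue per device
    for (cl_uint i = 0; i < numDevices; i++)
        queues[i] = clCreateCommandQueue(context, devices[i], 0, NULL);

    // work would be enqueued across queues[0..numDevices-1] here

    for (cl_uint i = 0; i < numDevices; i++)
        clReleaseCommandQueue(queues[i]);
    clReleaseContext(context);
    return 0;
}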

For an elaboration on the latter strategies see my previous post on the subject.

OpenCL Cookbook: Running remote multi-gpu OpenCL computations over SSH on Ubuntu or Debian using AMD Catalyst drivers

Continuing on in the OpenCL cookbook series, here I present a post not about code but about environmental setup, further diversifying the scope of the cookbook. Here is a no-nonsense guide to running OpenCL programs via ssh remote login into linux. This is intended as a follow up to my previous guide to compiling OpenCL on linux, so have a look at that if you haven’t done so already, but this account should be fairly self contained nevertheless.

Important: Note that this is not a guide to setting up OpenCL on linux in a headless fashion. It requires GUI linux, it requires X and it requires that X be running on screen. Although there are reports of people getting headless setups to work I haven’t yet had time to try these things out so I cannot comment on them.

Install Linux

I went for Ubuntu and Debian for their easy and well known package management. I installed them in adjacent partitions so that I could test my setup on both and compare them easily. Eventually I settled on Debian because I strongly dislike Unity and because of several known issues with RAID on boot in Ubuntu. Regarding desktop environments and window managers, I installed an assortment of them because I wanted fully featured ones for development, lightweight ones for benchmarking, and a variety of environments in which to test OpenCL. I installed GNOME, KDE, Blackbox, Fluxbox, XFCE and E17 on each distro. It’s good to have options. 🙂 Note that you should NOT install any drivers automatically through the ubuntu/debian package management tools!

Install AMD Catalyst 12.11 beta drivers and AMD APP SDK 2.7

Installing the above is as simple as downloading the linux tarballs, extracting them and running the executable script within each as root. I don’t bother building distro specific packages. Just install the damn things. They work. If you want a bit more elaboration on installing and setting these up look at my previous guide. The reason I’m suggesting the 12.11 beta drivers is that the 12.10 stable driver doesn’t work with X on ubuntu or debian whereas 12.11 beta does. Plus I haven’t noticed any instability with 12.11 beta at all. It seems to work just fine.

Use amdconfig to generate a new Xorg configuration using all adapters

Amdconfig (also aliased as aticonfig) is a powerful command line tool provided by the AMD catalyst driver that you can use for an assortment of tasks such as listing your adapters, querying clocks and temperatures of your gpus, setting overclocks and, most importantly for our purposes, generating a variety of Xorg configurations with different settings. Here we need to generate an X configuration that will enable all our adapters in Xorg.

sudo amdconfig --initial -f --adapter=all

This will back up your existing xorg.conf in /etc/X11/xorg.conf and create a new one in its place containing device and screen sections for all your gpus. Though if you’ve just installed ubuntu or debian you won’t have an xorg config to begin with anyway.

Reboot to start using the new AMD fglrx driver and your new X configuration

A reboot is required to unload the old open source driver and load the new proprietary AMD driver, fglrx; this cannot be done cleanly by simply using modprobe -r to remove the old driver, modprobe to load the new one into the kernel and then restarting X. Once you’ve rebooted you’ll also start to use your new X configuration with all adapters enabled. When you see a login manager screen (gdm/xdm/kdm/slim) log into it.

Test your new local OpenCL environment

Start up a terminal. First check that amdconfig sees all your adapters. If you have multiple gpus installed you should see all of them with one of them (typically the first) selected as default.

dhruba@debian:~$ amdconfig --list-adapters
* 0. 05:00.0 AMD Radeon HD 7900 Series  
  1. 06:00.0 AMD Radeon HD 7900 Series  
  2. 09:00.0 AMD Radeon HD 7900 Series  
  3. 0a:00.0 AMD Radeon HD 7900 Series  
  4. 85:00.0 AMD Radeon HD 7900 Series  
  5. 86:00.0 AMD Radeon HD 7900 Series  

* - Default adapter

Next, check that all your devices are being picked up for computation by the OpenCL runtime. As you can see below, on my workstation it shows 7 devices – 6 of them being gpus and 1 being the cpu. I’m using 3 7990s, each of which is a dual 7970.

dhruba@debian:~$ clinfo | grep 'Number of devices'
Number of devices:				 7

If you see all your devices in the count above your environment should be ready to compile and run OpenCL programs. The next step is to get it ready for remote ssh multi-gpu OpenCL computations.

Enable multiple GPUs in the OpenCL runtime

At this point you may find that although amdconfig lists all your adapters the OpenCL runtime only sees one gpu or sees fewer gpus than you have installed. You may also find that some window managers/desktop environments see all devices and others see only one. To enable multiple gpus you may need to do two things.

First you need to disable crossfire. You can do this either from within the catalyst control panel, which you start by running sudo amdcccle, or with the powerful amdconfig tool by running sudo amdconfig --crossfire=off. See my post about amdconfig for more detail on how to use this amazing tool.

Secondly you may need to pass a hint to the OpenCL runtime to tell it to use the current display for computation as follows.

export COMPUTE=:0

As an alternative to this you can also do the following.

unset DISPLAY

However I’d recommend the first as it is an override flag that does not require any change to the existing DISPLAY variable. Implementing the above two tweaks should enable multi-gpu support in your OpenCL programs. Check either by using clinfo | grep 'Number of devices' or by using a C program as in my previous article.
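
If you’d rather check from C, a minimal device count program might look like this (a sketch in the spirit of my previous article, not the same file).

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_uint numGpus = 0;

    clGetPlatformIDs(1, &platform, NULL);
    // passing 0/NULL asks the runtime only for the device count
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numGpus);
    printf("GPUs visible to the OpenCL runtime: %u\n", numGpus);
    return 0;
}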

Install SSH and set up your server for remote computations

Install SSH on ubuntu/debian by running sudo apt-get install openssh-server. This should install the ssh server and start it automatically. Log in to your OpenCL server from a remote client and run clinfo again to check the number of devices. If this shows all your gpus then you’re done. However, chances are that it won’t yet show all your devices; it will show only your cpu and no gpus at all. The reason for this is that the remote client has not yet been granted access to the X runtime. You can grant access as follows.

dhruba@debian:~$ xhost +
access control disabled, clients can connect from any host

This disables access control altogether. Once you’ve done the above run clinfo again from your ssh session; you should now see all your gpus and your system is ready for remote OpenCL computations. However, if you are concerned about security you can be more restrictive and grant access only to specific usernames, hostnames, ip addresses, networks or protocols. See man xhost for details. Basically you append the hostname, username or ip to the plus sign, as in the example below.
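
For example, granting access to just one client machine (the IP here is made up):

xhost +192.168.0.10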

Automating setup of remote computation environment

In a true datacentre environment you don’t really want to physically log in to X and set environment variables, particularly because your server may get rebooted. What you really want is for it to be ready to run OpenCL via remote sessions on a fresh boot without any manual intervention. If you wish to do this there is a guide available on the amd forum for ubuntu specifically which I suppose could be adapted to Debian. I’ve tried this on ubuntu and it does work though it’s a bit hackish. I’ll leave that guide to your discretion.

AMD OpenCL Limitations

Note that this is not a headless setup. In this guide we installed a GUI linux, ran X, logged into X and kept X displayed on screen. All of these are essential. If you either do not run X or run X but switch away from it to one of the other virtual terminals you will not be able to run OpenCL either from the virtual terminals or remotely. If you try to run OpenCL without X running and showing on screen the execution will just hang indefinitely until you start X and display it on screen at which point that hanging computation will continue and complete.

This seems to be an inherent limitation of the AMD driver. AMD say they’re working on separating X from the OpenCL runtime but who knows when this will actually be done. There are quite a few reports of people succeeding at running truly headless OpenCL setups – without any peripherals attached – but I have not had time to research these methods yet so I cannot comment on them.

OpenCL Cookbook: Managing your GPUs using amdconfig/aticonfig – a powerful utility in the AMD OpenCL toolset on Linux

When you install AMD Catalyst drivers (I’m using 12.11 beta) on linux you gain access to a command line utility called amdconfig. This is better known by its legacy name aticonfig but in this article we’ll stick with the new name. This tool provides a wealth of functionality for querying and configuring your AMD gpus, making it a very powerful utility in your OpenCL toolset on Linux.

Here we explore some basic yet powerful uses of this tool to query and manage the state of our gpus. In the following examples I use the long form of the arguments to the command so that it is easier to remember and understand for those new to the command. Note that, in general, write commands can only be run as root due to the inherent risks involved; read only commands can be run by normal users.

List all gpu adapters

Note – just because you see multiple adapters here does not mean they will be enabled in the OpenCL runtime. For that you have to generate a new X configuration with all adapters (see the command further down).

dhruba@debian:~$ amdconfig --list-adapters
* 0. 05:00.0 AMD Radeon HD 7900 Series  
  1. 06:00.0 AMD Radeon HD 7900 Series  
  2. 09:00.0 AMD Radeon HD 7900 Series  
  3. 0a:00.0 AMD Radeon HD 7900 Series  
  4. 85:00.0 AMD Radeon HD 7900 Series  
  5. 86:00.0 AMD Radeon HD 7900 Series  

* - Default adapter

Generate a fresh X config with all adapters enabled

Generating a new config in this way will backup your old config.

dhruba@debian:~$ sudo amdconfig --initial --force --adapter=all
Uninitialised file found, configuring.
Using xorg.conf
Saving back-up to xorg.conf.fglrx-0

Or you can specify your old and new config files explicitly.

dhruba@debian:~$ sudo amdconfig --initial --force --adapter=all --input=foo --output=bar
Uninitialised file found, configuring.
Using bar
Saving back-up to bar.original-0

Query current clocks, clock ranges and load for all adapters

Note that by default, in idle mode, adapters on linux are clocked at 300/150 (core/memory) but under load the clocks automatically increase to 925/1375 (core/memory), which is nice.

dhruba@debian:~$ amdconfig --od-getclocks --adapter=all

Adapter 0 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    99%

Adapter 1 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 2 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 3 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 4 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 5 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Query temperatures for all adapters

This is handy to keep an eye on your gpus under load to check they are not overheating. The following temperatures were taken under load and you can see that adapter 3 has reached 70C despite all cards being aggressively water cooled.

dhruba@debian:~$ amdconfig --odgt --adapter=all

Adapter 0 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 65.00 C

Adapter 1 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 50.00 C

Adapter 2 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 60.00 C

Adapter 3 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 70.00 C

Adapter 4 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 58.00 C

Adapter 5 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 54.00 C

List crossfire candidates and crossfire status

For OpenCL it is essential that you have crossfire disabled. You can disable it either using amdconfig --crossfire=off or through catalyst control centre which you start by running amdcccle.

dhruba@debian:~$ amdconfig --list-crossfire-candidates

Master adapter:  0. 05:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  1. 06:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  2. 09:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  3. 0a:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  4. 85:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  5. 86:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
dhruba@debian:~$ amdconfig --list-crossfire-status
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART

You can also use amdconfig to set core and memory clocks but this will be covered in a separate article. I do not want to run these commands on my system as I’m happy with my current clocks. But here’s a snippet from the man page which is fairly self explanatory. Bear in mind that to tweak clocks you need to enable overdrive using --od-enable; an illustrative sequence follows the snippet.

  --od-enable
        Unlocks the ability to change core or memory clock values by
        acknowledging that you have read and understood the AMD Overdrive (TM)
        disclaimer and accept responsibility for and recognize the potential
        dangers posed to your hardware by changing the default core or memory
        clocks
  --od-disable
        Disables AMD Overdrive(TM) set related aticonfig options.  Previously
        committed core and memory clock values will remain, but will not be set
        on X Server restart.
  --odsc, --od-setclocks={NewCoreClock|0,NewMemoryClock|0}
        Sets the core and memory clock to the values specified in MHz
        The new clock values must be within the theoretical ranges provided
        by --od-getclocks.  If a 0 is passed as either the NewCoreClock or
        NewMemoryClock it will retain the previous value and not be changed.
        There is no guarantee that the attempted clock values will succeed
        even if they lay inside the theoretical range.  These newly set
        clock values will revert to the default values if they are not
        committed using the "--od-commitclocks" command before X is
        restarted
  --odrd, --od-restoredefaultclocks
        Sets the core and memory clock to the default values.
        Warning X needs to be restarted before these clock changes will take
        effect
  --odcc, --od-commitclocks
        Once the stability of a new set of custom clocks has been proven this
        command will ensure that the Adapter will attempt to run at these new
        values whenever X is restarted
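
Putting those options together, a tweaking session might look something like this (illustrative values chosen from within the ranges reported by --od-getclocks earlier; I have not run these on my system). The final command persists the clocks across X restarts once you’ve proven them stable.

sudo amdconfig --od-enable
sudo amdconfig --od-setclocks=1000,1425 --adapter=0
sudo amdconfig --od-commitclocks --adapter=0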

OpenCL Cookbook: Compiling OpenCL with Ubuntu 12.10, Unity, AMD 12.11 beta drivers & AMD APP SDK 2.7

Continuing on in the OpenCL cookbook series, here I present a post not about code but about environmental setup, further diversifying the scope of the cookbook. It can be a real challenge for the uninitiated to install all of the above and compile an OpenCL C or C++ program on linux. Here’s a short guide. First download and install ubuntu (duh!).

Install ubuntu build tools and linux kernel extras

Then install the following packages which are a prerequisite to the amd installers and the subsequent c/c++ compilation.

sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install linux-source
sudo apt-get install linux-headers-generic

Then download AMD 12.11 beta drivers (amd-driver-installer-catalyst-12.11-beta-x86.x86_64.zip) and AMD APP SDK 2.7 (AMD-APP-SDK-v2.7-lnx64.tgz). Obviously download either 32bit or 64bit based on what your system supports.

AMD 12.11 beta drivers installation

Once you’ve done that install the AMD 12.11 beta drivers as root first. Installation is as simple as extracting the tarball, marking the script inside as executable and running the script as root. Reboot. After the reboot unity should start using the new AMD 12.11 beta driver and you’ll know it’s the beta because you’ll see a watermark at the bottom left of the screen saying ‘AMD Testing use only’. Note that the reason we’re using the beta here is that unity does not work with earlier versions of the driver. You get a problem where you see the desktop background and a mouse pointer but there’s no toolbar or status bar. The 12.11 beta driver works, which is great.

AMD APP SDK 2.7 installation

Then install the AMD APP SDK 2.7, also as root. Again installation is very simple and exactly the same as for the beta driver above. The AMD beta drivers install a video driver and the OpenCL runtime. The AMD APP SDK installs the SDK and also OpenCL and OpenGL runtimes. However if you’ve installed the video driver first you’ll already have the OpenCL runtime on your system in /usr/lib/libamdocl64.so so the APP SDK won’t install another copy in its location of /opt/AMDAPP/lib/x86_64/libOpenCL.so. You’ll see some messages during installation that it’s skipping the opencl runtime and that’s absolutely fine for now.

Test your OpenCL environment

Now you should test your OpenCL environment by compiling and running an example C OpenCL program. Get my C file to list all devices on your system as an example, calling it devices.c, and compile it using one of the commands below.
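
If you don’t have that file to hand, a minimal stand-in along these lines prints the same fields (an abbreviated sketch, not the exact file from the earlier article; error handling omitted).

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint numDevices, units;
    char buf[256];

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &numDevices);

    for (cl_uint i = 0; i < numDevices; i++) {
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(buf), buf, NULL);
        printf("%u. Device: %s\n", i + 1, buf);
        clGetDeviceInfo(devices[i], CL_DEVICE_VERSION, sizeof(buf), buf, NULL);
        printf(" %u.1 Hardware version: %s\n", i + 1, buf);
        clGetDeviceInfo(devices[i], CL_DRIVER_VERSION, sizeof(buf), buf, NULL);
        printf(" %u.2 Software version: %s\n", i + 1, buf);
        clGetDeviceInfo(devices[i], CL_DEVICE_OPENCL_C_VERSION, sizeof(buf), buf, NULL);
        printf(" %u.3 OpenCL C version: %s\n", i + 1, buf);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
        printf(" %u.4 Parallel compute units: %u\n", i + 1, units);
    }
    return 0;
}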

gcc -L/usr/lib -I/opt/AMDAPP/include devices.c -lamdocl64 -o devices.o # for c
g++ -L/usr/lib -I/opt/AMDAPP/include devices.c -lamdocl64 -o devices.o # for c++

Once compiled, run the output file (devices.o) and if it works you should see output similar to that below.

1. Device: Tahiti
 1.1 Hardware version: OpenCL 1.2 AMD-APP (923.1)
 1.2 Software version: CAL 1.4.1741 (VM)
 1.3 OpenCL C version: OpenCL C 1.2 
 1.4 Parallel compute units: 32
2. Device: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
 2.1 Hardware version: OpenCL 1.2 AMD-APP (923.1)
 2.2 Software version: 2.0 (sse2,avx)
 2.3 OpenCL C version: OpenCL C 1.2 
 2.4 Parallel compute units: 32

Enabling multiple gpus for OpenCL

You may find that you are only seeing one gpu in your opencl programs. There are two things you need to do to enable multiple gpus in the OpenCL runtime. The first is to disable all crossfire. You can do this either in the amd catalyst control centre under performance, which you start by running amdcccle, or using the awesome amdconfig tool by running amdconfig --crossfire=off. See my post on amdconfig to find out more about this incredibly powerful tool.

The second thing you may or may not need to do is to enable COMPUTE mode as follows.

export COMPUTE=:0

Once you’ve done the above you should see program output from the program above similar to below.

dhruba@debian:~$ ./source/devices.o 
1. Device: Tahiti
 1.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 1.2 Software version: 1084.2 (VM)
 1.3 OpenCL C version: OpenCL C 1.2 
 1.4 Parallel compute units: 32
2. Device: Tahiti
 2.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 2.2 Software version: 1084.2 (VM)
 2.3 OpenCL C version: OpenCL C 1.2 
 2.4 Parallel compute units: 32
3. Device: Tahiti
 3.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 3.2 Software version: 1084.2 (VM)
 3.3 OpenCL C version: OpenCL C 1.2 
 3.4 Parallel compute units: 32
4. Device: Tahiti
 4.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 4.2 Software version: 1084.2 (VM)
 4.3 OpenCL C version: OpenCL C 1.2 
 4.4 Parallel compute units: 32
5. Device: Tahiti
 5.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 5.2 Software version: 1084.2 (VM)
 5.3 OpenCL C version: OpenCL C 1.2 
 5.4 Parallel compute units: 32
6. Device: Tahiti
 6.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 6.2 Software version: 1084.2 (VM)
 6.3 OpenCL C version: OpenCL C 1.2 
 6.4 Parallel compute units: 32
7. Device: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
 7.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 7.2 Software version: 1084.2 (sse2,avx)
 7.3 OpenCL C version: OpenCL C 1.2 
 7.4 Parallel compute units: 32

Standardising the OpenCL runtime library path

Now – it may be that you wish for the OpenCL runtime library to be installed in the standard AMD APP SDK location of /opt/AMDAPP/lib/x86_64/libOpenCL.so as opposed to the non-standard location of /usr/lib/libamdocl64.so which is where the beta driver installation puts it. The proper way to do this would probably be to install the AMD APP SDK first and then the video driver or simply skip the video driver installation (I haven’t tried either of these options so they may need verification).

However, I used a little trick to make this easier since I’d already installed the video driver followed by the APP SDK. I renamed /usr/lib/libamdocl64.so to /usr/lib/libamdocl64.so.x and reinstalled the APP SDK. This time it detected that the runtime wasn’t present and installed another runtime in /opt/AMDAPP/lib/x86_64/libOpenCL.so – the standard SDK runtime path. With the new APP SDK OpenCL runtime in place I was able to compile the same program using the new runtime as below depending on whether you want the c or c++ compiler.

gcc -L/opt/AMDAPP/lib/x86_64/ -I/opt/AMDAPP/include devices.c -lOpenCL -o devices.o # for c
g++ -L/opt/AMDAPP/lib/x86_64/ -I/opt/AMDAPP/include devices.c -lOpenCL -o devices.o # for c++

Summary

And there you have it – an opencl compiler working on ubuntu 12.10 using the AMD 12.11 beta drivers and the AMD APP 2.7 SDK. Sometimes you just need someone else to have done it first and written a guide and I hope this serves to help someone out there.

OpenCL Cookbook: Hello World using C# Cloo host binding

So far I’ve used the C and C++ bindings in the OpenCL Cookbook series. This time I provide a quick and simple example of how to use Cloo – the C# OpenCL host binding. However, since Cloo, for whatever reason, didn’t work as expected with a char array, I will use an integer array instead. In other words – instead of sending a “Hello World!” message to the kernel I will send five integers. My guess is that there is some sort of bug with Cloo and char arrays.

Device code using Cloo’s variant of the OpenCL language

kernel void helloWorld(global read_only int* message, int messageSize) {
	for (int i = 0; i < messageSize; i++) {
		printf("%d", message[i]);
	}
}

The kernel above is merely illustrative in that it simply receives an integer array and its size and prints the array.

Note that the OpenCL syntax here is not the same as in C/C++. It has additional keywords to say whether the arguments are read only, write only or read write, and the kernel keyword is not prefixed with two underscores. The Cloo author must have decided that the original OpenCL syntax was for whatever reason unsuitable for adoption, which IMO was a mistake. The OpenCL language syntax should be standard for portability and reusability, and also so that there is only a single learning curve.
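
For comparison, the same kernel written in the standard syntax used elsewhere in this series would be:

__kernel void helloWorld(__global int* message, int messageSize) {
	for (int i = 0; i < messageSize; i++) {
		printf("%d", message[i]);
	}
}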

Host code using Cloo API

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.IO;
using Cloo;

namespace test
{
    class Program
    {
        static void Main(string[] args)
        {
            // pick first platform
            ComputePlatform platform = ComputePlatform.Platforms[0];

            // create context with all gpu devices
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                new ComputeContextPropertyList(platform), null, IntPtr.Zero);

            // create a command queue with first gpu found
            ComputeCommandQueue queue = new ComputeCommandQueue(context,
                context.Devices[0], ComputeCommandQueueFlags.None);

            // load opencl source
            StreamReader streamReader = new StreamReader(@"..\..\kernels.cl");
            string clSource = streamReader.ReadToEnd();
            streamReader.Close();

            // create program with opencl source
            ComputeProgram program = new ComputeProgram(context, clSource);

            // compile opencl source
            program.Build(null, null, null, IntPtr.Zero);

            // load chosen kernel from program
            ComputeKernel kernel = program.CreateKernel("helloWorld");

            // create a five integer array and its length
            int[] message = new int[] { 1, 2, 3, 4, 5 };
            int messageSize = message.Length;

            // allocate a memory buffer with the message (the int array)
            ComputeBuffer<int> messageBuffer = new ComputeBuffer<int>(context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, message);

            kernel.SetMemoryArgument(0, messageBuffer); // set the integer array
            kernel.SetValueArgument(1, messageSize); // set the array size

            // execute kernel
            queue.ExecuteTask(kernel, null);

            // wait for completion
            queue.Finish();
        }
    }
}

The C# program above uses the Cloo object oriented api to interface with the underlying low level opencl implementation. It’s pretty self explanatory if you’ve been following the series so far. The output of the program is 12345.

OpenCL Cookbook: Parallelise your host loops using OpenCL

Continuing on in our series – this time we look at possibly the most important topic of all in OpenCL. It is the reason why we use OpenCL and it is also the most compelling benefit that OpenCL offers. It is, of course, parallelism. But how do we exploit the vast amount of parallelism that GPUs offer? At the simplest level we can do so by exploiting latent areas of parallelism in our host code, the simplest of which are loops. In other words – if we can port loops in our host code to the GPU they become parallel and can speed up by a factor approaching the total number of iterations. I demonstrate using a small example.

Host loop

void cpu_3d_loop (int x, int y, int z) {

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%dn", i, j, k);
            }
        }
    }

}

Imagine the loop above in our C++ host code. This is not one loop but in fact three. In other words it has three dimensions. The total number of iterations in this combined loop is x*y*z. If x=4, y=3 and z=2 the total number of iterations would be 4x3x2=24. On the CPU these loops execute serially which is fine for a small number of iterations but for large numbers it becomes a fundamental bottleneck. If this set of loops was ported to the GPU each iteration would run in parallel and the total number of threads in use would be 24 for the previous example.

A small scale example may not seem impressive at first. You could argue that you could just as well run 24 threads on the CPU. But consider this: what happens when you have the above set of loops in your host code performing thousands or even millions of iterations? How are you going to achieve hardware parallelism in this case on the CPU? The answer is you can’t. GPUs have hundreds of cores each and offer a far greater degree of parallelism, so loops with large numbers of iterations become easy work for the GPU, which can run thousands or even millions of threads effectively. Below I demonstrate how to port such a loop to OpenCL.

Host binding code

#define __NO_STD_VECTOR
#define __CL_ENABLE_EXCEPTIONS

#include <fstream>
#include <iostream>
#include <iterator>
#include <CL/cl.hpp>
#include <CL/opencl.h>

using namespace cl;

void cpu_3d_loop (int x, int y, int z) {

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%dn", i, j, k);
            }
        }
    }

}

int main () {

    // CPU 3d loop

    int x = 4;
    int y = 3;
    int z = 2;
    cpu_3d_loop(x, y, z);
    std::cout << std::endl;

    // GPU 3d loop

    vector<Platform> platforms;
    vector<Device> devices;
    vector<Kernel> kernels;
    
    try {
    
        // create platform, context and command queue
        Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        Context context(devices);
        CommandQueue queue(context, devices[0]);

        // load opencl source
        std::ifstream cl_file("kernels.cl");
        std::string cl_string(std::istreambuf_iterator<char>(cl_file),
            (std::istreambuf_iterator<char>()));
        Program::Sources source(1, std::make_pair(cl_string.c_str(), 
            cl_string.length() + 1));

        // create program and kernel and set kernel arguments
        Program program(context, source);
        program.build(devices);
        Kernel kernel(program, "ndrange_parallelism");

        // execute kernel and wait for completion
        NDRange global_work_size(x, y, z);
        queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);
        queue.finish();

    } catch (Error e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }

    return 0;
    
}

The above program runs the cpu loop and then runs the equivalent logic on the gpu. Both cpu and gpu runs produce output to show which iteration they are processing. The key lines of code that demonstrate how to port the loop are below.

NDRange global_work_size(x, y, z);
queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);

Here we set three upper bounds – one for each loop – this is known as the global work size. The kernel can then retrieve values for the currently executing iteration within the kernel itself as shown below. It can then use these indices to do whatever work is inside the loop. In this case we just print the indices for illustration.

Kernel code

The kernel you see below is executed x*y*z times with different values for i, j and k. See? No loops! 🙂

__kernel void ndrange_parallelism () {

	int i = get_global_id(0);
	int j = get_global_id(1);
	int k = get_global_id(2);

	printf("GPU %d,%d,%dn", i, j, k);
	
}

The output of running the above host code is as follows.

CPU 0,0,0
CPU 0,0,1
CPU 0,1,0
CPU 0,1,1
CPU 0,2,0
CPU 0,2,1
CPU 1,0,0
CPU 1,0,1
CPU 1,1,0
CPU 1,1,1
CPU 1,2,0
CPU 1,2,1
CPU 2,0,0
CPU 2,0,1
CPU 2,1,0
CPU 2,1,1
CPU 2,2,0
CPU 2,2,1
CPU 3,0,0
CPU 3,0,1
CPU 3,1,0
CPU 3,1,1
CPU 3,2,0
CPU 3,2,1

GPU 0,0,0
GPU 1,0,0
GPU 2,0,0
GPU 3,0,0
GPU 0,1,0
GPU 1,1,0
GPU 2,1,0
GPU 3,1,0
GPU 0,2,0
GPU 1,2,0
GPU 2,2,0
GPU 3,2,0
GPU 0,0,1
GPU 1,0,1
GPU 2,0,1
GPU 3,0,1
GPU 0,1,1
GPU 1,1,1
GPU 2,1,1
GPU 3,1,1
GPU 0,2,1
GPU 1,2,1
GPU 2,2,1
GPU 3,2,1

NOTE: Although there may appear to be a sequence in the order in which the GPU processes the iterations this is only due to the use of printf(). In reality when not using printf() the order of iterations is completely arbitrary and random. Therefore one must not rely on the order of iterations when porting loops to the GPU. If you need loops to be in a certain order then you can either keep your loops on the host or port only those parts of the loop that do not need to be sequential.

Why use GPU computing?

Although this example is fairly simple it does illustrate the most important value add of GPU computing and OpenCL. Hardware parallelism is the essence of what GPU computing offers and it is the most compelling reason to use it. If you imagine a legacy codebase and all the latent areas of parallelism that are currently running sequentially you can imagine the vast untapped power of GPGPU. Later on in the series we will look at techniques to port existing host code to the GPU. That process can be very difficult but can provide dramatic gains in performance far beyond the limits of CPU computing. Till next time.

OpenCL Cookbook: Hello World using C++ host binding

Last time, in the OpenCL Cookbook series, I presented a hello world example using OpenCL and C for the host binding language. This time I present a very similar example but using the C++ host binding language. As you already know from previous posts the host language that interfaces with an OpenCL device can be any number of languages such as C, C++, Java, C# and Python.

So far I’ve been using the C API but I’ve decided to switch to the C++ API for two reasons: (1) it’s considerably fewer lines of code, being more succinct, and (2) it supports exceptions, meaning that you do not have to check error codes for every line of binding code that you write. So here follows a brief primer of the C++ OpenCL binding. It’s a very simple example but trust me – we’ll be getting to more complex examples soon (time is the issue).

OpenCL kernel

__kernel void hello_world (__global char* message, int messageSize) {
	for (int i = 0; i < messageSize; i++) {
		printf("%c", message[i]);
	}
}

The kernel (OpenCL function) above receives a char array (in essence a string) from the host as well as the size of the char array, as there is no way to derive an array's size from the array itself (Java programmers gasp in shock and disgust). The kernel simply iterates over all the letters in the char array and prints them one at a time to standard output thereby printing the message: "Hello World!". Now let's look at the C++ code that interfaces with this kernel.

C++ host binding

#define __CL_ENABLE_EXCEPTIONS

#include <fstream>
#include <iostream>
#include <iterator>
#include <CL/cl.hpp>
#include <CL/opencl.h>

using namespace std;

int main () {

    vector<cl::Platform> platforms;
    vector<cl::Device> devices;
    vector<cl::Kernel> kernels;
    
    try {
    
        // create platform
        cl::Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

        // create context
        cl::Context context(devices);

        // create command queue
        cl::CommandQueue queue(context, devices[0]);

        // load opencl source
        ifstream cl_file("opencl_hello_world.cl");
        string cl_string(istreambuf_iterator<char>(cl_file), (istreambuf_iterator<char>()));
        cl::Program::Sources source(1, make_pair(cl_string.c_str(), 
            cl_string.length() + 1));

        // create program
        cl::Program program(context, source);

        // compile opencl source
        program.build(devices);

        // load named kernel from opencl source
        cl::Kernel kernel(program, "hello_world");

        // create a message to send to kernel
        const char* message = "Hello World!";
        int messageSize = 12;

        // allocate device buffer to hold message
        cl::Buffer buffer(CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 
            sizeof(char) * messageSize, (void*) message);

        // set message as kernel argument
        kernel.setArg(0, buffer);
        kernel.setArg(1, sizeof(int), &messageSize);

        // execute kernel
        queue.enqueueTask(kernel);

        // wait for completion
        queue.finish();

        cout << endl;
        
    } catch (cl::Error e) {
        cout << endl << e.what() << " : " << e.err() << endl;
    }
    
    return 0;
    
}

The above C++ host binding code is annotated to say what it's doing at each step but I'll provide a brief overview. Initially it's creating a platform, a context and a command queue which are basic opencl binding data structures that are required to interface with an opencl device. It then loads the opencl source from a separate file and with it creates a program. The program is built which compiles the opencl source. It then loads a specific kernel (function) from that source by a given name. It creates a string message on the host side but in order to send it to the device it must create a buffer of the same size as the message. The buffer is created and set as a kernel argument along with the size of the message we are sending.

The kernel is then executed and we wait for its completion on the host. The finish command flushes all outstanding tasks to the device and waits for them to finish. Note the clean exception handling using a try/catch wrap around the entire code instead of having to check error codes produced by each statement. I much prefer the C++ api to the C API. I think you'll agree that it's more concise and cleaner. Till next time.

OpenCL Cookbook: Series Reference

Recently I’d been writing a number of primer articles on OpenCL programming under the common reference name of ‘OpenCL Cookbook’ but, caught up in the content, I had completely forgotten to provide a single point of reference to all articles in the series. Here it is finally. This page will always be kept up-to-date, in chronological order, with all new articles in the series.

  1. OpenCL Cookbook: Listing all platforms and their attributes
  2. OpenCL Cookbook: Listing all devices and their critical attributes
  3. OpenCL Cookbook: Creating contexts and reference counting
  4. OpenCL Cookbook: Creating programs and reading kernels from a file
  5. OpenCL Cookbook: Building a program and debugging failures
  6. OpenCL Cookbook: Hello World using C host binding
  7. OpenCL Cookbook: Hello World using C++ host binding
  8. OpenCL Cookbook: Parallelise your host loops using OpenCL
  9. OpenCL Cookbook: Hello World using C# Cloo host binding
  10. OpenCL Cookbook: How to leverage multiple devices in OpenCL
  11. OpenCL Cookbook: Compiling OpenCL with Ubuntu 12.10, Unity, AMD 12.11 beta drivers & AMD APP SDK 2.7
  12. OpenCL Cookbook: Using amdconfig/aticonfig – a powerful utility in the AMD OpenCL toolset on Linux
  13. OpenCL Cookbook: Running remote multi-gpu OpenCL computations over SSH on Ubuntu or Debian using AMD Catalyst drivers
  14. OpenCL Cookbook: Multi device utilisation strategies
  15. OpenCL Cookbook: 10 tips for high performance kernels

OpenCL Cookbook: Hello World using C host binding

In our OpenCL Cookbook series so far we’ve looked at some preliminary data structures in OpenCL host programming using the C language. This time – we finally arrive at a complete end-to-end example – the customary Hello World!

What this example does is simple. The host program in C passes a character array to the GPU into which the GPU writes the characters of the phrase: “Hello, World!”. The host program then reads the contents of the character array back and prints them on the screen. The output should be “Hello, World!”.

The code is annotated using brief comments. There are some aspects of OpenCL that are new that I have not yet been through in previous articles but don’t worry I’ll go through a full dissection after presenting the complete code.

Note that error handling has been taken out completely to keep the program short for easy viewing. The most important aspect of error handling in the program below is checking the build error, status and log for a failed program build which you can find further details of in my previous post.
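
For reference, the core of that check is only a few lines. Here is a sketch of it (using the otherwise unused programLog variable you’ll see declared in the source below):

// after a failed clBuildProgram, fetch and print the build log
size_t logSize;
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
programLog = (char*) malloc(logSize + 1);
programLog[logSize] = '\0';
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, programLog, NULL);
printf("%s\n", programLog);
free(programLog);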

Host source

#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main() {

    cl_platform_id platform; cl_device_id device; cl_context context;
    cl_program program; cl_kernel kernel; cl_command_queue queue;
    cl_mem kernelBuffer;

    FILE* programHandle; char *programBuffer; char *programLog;
    size_t programSize; char hostBuffer[32];

    // get first available sdk and gpu and create context
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    // get size of kernel source
    programHandle = fopen("helloWorld.cl", "r");
    fseek(programHandle, 0, SEEK_END);
    programSize = ftell(programHandle);
    rewind(programHandle);

    // read kernel source into buffer
    programBuffer = (char*) malloc(programSize + 1);
    programBuffer[programSize] = '\0';
    fread(programBuffer, sizeof(char), programSize, programHandle);
    fclose(programHandle);

    // create and build program
    program = clCreateProgramWithSource(context, 1,
            (const char**) &programBuffer, &programSize, NULL);
    free(programBuffer);
    clBuildProgram(program, 1, &device, "-Werror -cl-std=CL1.1", NULL, NULL);

    // create kernel and command queue
    kernel = clCreateKernel(program, "hello", NULL);
    queue = clCreateCommandQueue(context, device, 0, NULL);

    // create kernel argument buffer and set it into kernel
    kernelBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
            32 * sizeof(char), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &kernelBuffer);

    // execute kernel, read back the output and print to screen
    clEnqueueTask(queue, kernel, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, kernelBuffer, CL_TRUE, 0,
            32 * sizeof(char), hostBuffer, 0, NULL, NULL);
    puts(hostBuffer);

    clFlush(queue);
    clFinish(queue);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(kernelBuffer);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;

}

The host source runs on the CPU and is written in C in this case, though you could also write it in C++, Python or Java. The kernel source, in contrast, runs on a device, which could be one or more CPUs, GPUs or accelerators. The host source must be written in a host language whereas the kernel source must be written in OpenCL.

Host source by dissection

Here I describe what the host source is doing by dissecting it. A hello world example should ideally be entirely self contained and not rely on other articles to complement the reader’s understanding. With the exception of error handling, and particularly how to debug a failed program build, which I address elsewhere, this example is self contained.

Below I present one snippet of code at a time followed by its dissection.

Creating platforms, devices and contexts

// get first available sdk and gpu and create context
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

Here I first get a platform (an OpenCL SDK/framework). As I know I only have the Apple OpenCL framework installed on my Mac it will always be the one selected. However, if you have multiple SDKs installed such as AMD, Nvidia and Intel then you may want to select one explicitly. Next I ask for a GPU device. Once again, my machine only has one GPU so it will always be the one that’s selected but if you have multiple GPUs installed you may want to choose one in particular. Finally I create a context which is an incredibly important OpenCL data structure as it is required for the creation of numerous other structures such as programs, command queues and kernel buffers.
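
If you do have multiple SDKs installed, a small sketch along these lines lists them so you can pick one explicitly (the array size of 8 is arbitrary):

cl_uint numPlatforms;
cl_platform_id platforms[8];
char name[128];

clGetPlatformIDs(8, platforms, &numPlatforms);
for (cl_uint i = 0; i < numPlatforms; i++) {
    // CL_PLATFORM_NAME identifies the SDK, e.g. AMD, Nvidia, Intel or Apple
    clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
    printf("%u: %s\n", i, name);
}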

Loading kernel sources

// get size of kernel source
programHandle = fopen("helloWorld.cl", "r");
fseek(programHandle, 0, SEEK_END);
programSize = ftell(programHandle);
rewind(programHandle);

// read kernel source into buffer
programBuffer = (char*) malloc(programSize + 1);
programBuffer[programSize] = '\0';
fread(programBuffer, sizeof(char), programSize, programHandle);
fclose(programHandle);

As this is a host source file it has the responsibility of involving the kernel source. Generally speaking the kernel source is compiled at runtime as part of the execution of the host source. Therefore, the host source file must pull in the kernel source and compile it. Above I first calculate the size of the kernel source file and then read the source into a buffer of that calculated size.

Creating a program and compiling kernel sources

// create and build program
program = clCreateProgramWithSource(context, 1,
        (const char**) &programBuffer, &programSize, NULL);
free(programBuffer);
clBuildProgram(program, 1, &device, "-Werror -cl-std=CL1.1", NULL, NULL);

Here I construct a program structure by passing in a context and the buffer containing the kernel source. Then I build the program which essentially compiles the kernel source based on supplied build options. Note that a program can contain numerous kernel sources containing multiple OpenCL functions potentially drawn in from a number of files. This build step compiles the sum total of all kernel sources read in. At this point the build could fail for a variety of reasons and it’s critically important to be able to narrow down the cause easily. Here I’ve skipped this error handling but I address the subject in detail in my previous post in the series.

Creating kernels and command queues

// create kernel and command queue
kernel = clCreateKernel(program, "hello", NULL);
queue = clCreateCommandQueue(context, device, 0, NULL);

Here I create a kernel and a command queue structure. Let’s look at what each one means in turn.

A kernel is an OpenCL function that executes on one or more devices. The program structure above may contain numerous functions so the purpose of creating the kernel structure is to pinpoint the particular one called ‘hello’. I need a reference to this kernel in order to pass it an argument later on in the process.

A command queue is exactly what the name implies. The host program invokes a device by sending it a command. The sending mechanism for that command is a queue. Commands are by default processed in FIFO order but that can be changed by a configuration option. Sending a command, also known as a task, to a command queue is a way of requesting its execution.
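
For instance, if you wanted out-of-order processing instead of the default FIFO behaviour, a sketch of the queue creation would be:

queue = clCreateCommandQueue(context, device,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);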

Setting kernel arguments

// create kernel argument buffer and set it into kernel
kernelBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
        32 * sizeof(char), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &kernelBuffer);

Here, I create an OpenCL memory object. There are two types of memory objects – image and buffer. Here I am not dealing with image data so I choose a buffer memory object. Our goal here is to provide the kernel with a character array big enough to hold the phrase ‘Hello, World!’. However I cannot pass a character array into a kernel directly. I must create an OpenCL buffer memory object of a given size and then set it as the first kernel argument and that’s what I’m doing above. You’ll notice that I set the memory object to be write only as the device only needs to write to it.

Executing kernels and reading output data

// execute kernel, read back the output and print to screen
clEnqueueTask(queue, kernel, 0, NULL, NULL);
clEnqueueReadBuffer(queue, kernelBuffer, CL_TRUE, 0,
        32 * sizeof(char), hostBuffer, 0, NULL, NULL);
puts(hostBuffer);

This is the final step. Earlier I created a command queue and a kernel structure for the hello function and passed in a buffer memory object as the first argument. Here I complete the entire process by enqueuing the kernel for execution as a task into the command queue and reading back the output by passing in a character array of the same size as the original kernel buffer memory object. I then print the contents of that array onto the screen to prove that it contains what the GPU originally wrote into it.

Cleaning up

clFlush(queue);
clFinish(queue);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(kernelBuffer);
clReleaseCommandQueue(queue);
clReleaseContext(context);

Above I first ensure that all commands have been issued to the device associated with the command queue by calling clFlush(). Then I block until all commands have been issued and completed by calling clFinish(). The rest of the functions above simply deallocate their own respective named structures.

Kernel source

__kernel void hello(__global char* message){
message[0] = 'H';
message[1] = 'e';
message[2] = 'l';
message[3] = 'l';
message[4] = 'o';
message[5] = ',';
message[6] = ' ';
message[7] = 'W';
message[8] = 'o';
message[9] = 'r';
message[10] = 'l';
message[11] = 'd';
message[12] = '!';
message[13] = '\0';
}

The kernel source is fairly self explanatory. It simply receives a character array called message and writes its message into it. Kernel functions get infinitely more complex than this one but this one has been kept deliberately simple.

Compile and run as follows keeping both source files in the same directory.

clang -framework OpenCL helloWorld.c -o helloWorld && ./helloWorld

As always if you have any feedback or if this helped you let me know in the comments!