Tag Archives: parallelism

Concurrent producer consumer pattern using C# 4.0, BlockingCollection & Tasks

Here is a simple, annotated example of the producer-consumer pattern using C#, .NET 4.0, BlockingCollection and Tasks. The example sets up one producer thread which produces 100 integers and two consumer threads which each read that data off a common concurrent blocking queue. Although I use a BlockingCollection here, it is backed internally by a concurrent queue.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

namespace test
{
    class Program
    {
        static void Main(string[] args)
        {
            // declare blocking collection backed by concurrent queue
            BlockingCollection<int> b = new BlockingCollection<int>();

            // start consumer 1 which waits for data until available
            var consumer1 = Task.Factory.StartNew(() =>
            {
                foreach (int data in b.GetConsumingEnumerable())
                    Console.Write("c1=" + data + ", ");
            });

            // start consumer 2 which waits for data until available
            var consumer2 = Task.Factory.StartNew(() =>
            {
                foreach (int data in b.GetConsumingEnumerable())
                    Console.Write("c2=" + data + ", ");
            });

            // produce 100 integers on a producer thread
            var producer = Task.Factory.StartNew(() =>
            {
                for (int i = 0; i < 100; i++)
                    b.Add(i);
            });

            // wait for producer to finish producing
            producer.Wait();

            // signal consumers that no more data will be added
            b.CompleteAdding();

            // wait for all consumers to finish consuming
            Task.WaitAll(consumer1, consumer2);
        }
    }
}

The output from running the program above is as follows. Consumers 1 and 2 interleave as they read integers off the queue while they are being produced.

c1=0, c1=2, c1=3, c1=4, c1=5, c1=6, c1=7, c1=8, c1=9, c1=10, c1=11, c1=12, c1=13, c1=14, c1=15, c1=16, c1=17, c1=18, c1=19, c2=1, c2=21, c2=22, c2=23, c2=24, c2=25, c2=26, c2=27, c2=28, c2=29, c2=30, c2=31, c2=32, c2=33, c2=34, c2=35, c2=36, c2=37, c2=38, c2=39, c2=40, c2=41, c2=42, c2=43, c2=44, c2=45, c2=46, c2=47, c2=48, c2=49, c2=50, c2=51, c2=52, c2=53, c2=54, c2=55, c2=56, c2=57, c2=58, c2=59, c2=60, c2=61, c2=62, c2=63, c2=64, c2=65, c2=66, c2=67, c2=68, c2=69, c2=70, c2=71, c2=72, c2=73, c2=74, c2=75, c2=76, c2=77, c2=78, c2=79, c2=80, c2=81, c2=82, c2=83, c2=84, c2=85, c2=86, c2=87, c2=88, c2=89, c2=90, c2=91, c2=92, c2=93, c2=94, c2=95, c2=96, c2=97, c2=98, c2=99, c1=20,

It’s somewhat sad and disheartening to see just how far ahead of Java C# is in terms of the expressiveness and richness of the language. I personally really enjoyed using the lambdas above and I hope that with Java 8 the SDK libraries will grow to develop a more fluid and expressive style.

Credit: [1, 2]

OpenCL Cookbook: Parallelise your host loops using OpenCL

Continuing our series, this time we look at possibly the most important topic of all in OpenCL. It is the reason we use OpenCL and the most compelling benefit it offers: parallelism. But how do we exploit the vast amount of parallelism that GPUs offer? At the simplest level we can do so by exploiting latent areas of parallelism in our host code, the simplest of which are loops. In other words, if we can port loops in our host code to the GPU they become parallel and can speed up by up to a factor of the total number of iterations. I demonstrate using a small example.

Host loop

void cpu_3d_loop (int x, int y, int z) {

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%d\n", i, j, k);
            }
        }
    }
}

Imagine the loop above in our C++ host code. This is not one loop but in fact three; in other words, it has three dimensions. The total number of iterations in this combined loop is x*y*z. If x=4, y=3 and z=2, the total number of iterations is 4x3x2=24. On the CPU these loops execute serially, which is fine for a small number of iterations, but for large numbers it becomes a fundamental bottleneck. If this set of loops were ported to the GPU, each iteration would run in parallel, and the total number of threads in use would be 24 for the previous example.

A small scale example may not seem impressive at first. You could argue that you could just as well run 24 threads on the CPU. But consider this: what happens when the above set of loops in your host code performs thousands or even millions of iterations? How are you going to achieve hardware parallelism in this case on the CPU? The answer is: you can’t. GPUs have hundreds of cores each and offer a far greater degree of parallelism, so loops with a large number of iterations become easy work for the GPU, which can run thousands or even millions of threads effectively. Below I demonstrate how to port such a loop to OpenCL.

Host binding code

#define __CL_ENABLE_EXCEPTIONS
#define __NO_STD_VECTOR

#include <cstdio>
#include <fstream>
#include <iostream>
#include <iterator>
#include <CL/cl.hpp>

using namespace cl;

void cpu_3d_loop (int x, int y, int z) {

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%d\n", i, j, k);
            }
        }
    }
}

int main () {

    // CPU 3d loop

    int x = 4;
    int y = 3;
    int z = 2;
    cpu_3d_loop(x, y, z);
    std::cout << std::endl;

    // GPU 3d loop

    vector<Platform> platforms;
    vector<Device> devices;

    try {
        // create platform, context and command queue
        Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        Context context(devices);
        CommandQueue queue(context, devices[0]);

        // load opencl source
        std::ifstream cl_file("kernels.cl");
        std::string cl_string(std::istreambuf_iterator<char>(cl_file),
            (std::istreambuf_iterator<char>()));
        Program::Sources source(1, std::make_pair(cl_string.c_str(),
            cl_string.length() + 1));

        // create and build program and create kernel
        Program program(context, source);
        program.build(devices);
        Kernel kernel(program, "ndrange_parallelism");

        // execute kernel and wait for completion
        NDRange global_work_size(x, y, z);
        queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);
        queue.finish();

    } catch (Error e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }

    return 0;
}

The above program runs the CPU loop and then runs the equivalent logic on the GPU. Both the CPU and GPU runs produce output showing which iteration they are processing. The key lines of code that demonstrate how to port the loop are below.

NDRange global_work_size(x, y, z);
queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);

Here we set three upper bounds, one for each loop; this is known as the global work size. The kernel can then retrieve the indices of the currently executing iteration from within the kernel itself, as shown below, and use them to do whatever work is inside the loop. In this case we just print the indices for illustration.

Kernel code

The kernel you see below is executed x*y*z times with different values for i, j and k. See? No loops! 🙂

__kernel void ndrange_parallelism () {

	int i = get_global_id(0);
	int j = get_global_id(1);
	int k = get_global_id(2);

	printf("GPU %d,%d,%d\n", i, j, k);
}

The output of running the above host code is as follows.

CPU 0,0,0
CPU 0,0,1
CPU 0,1,0
CPU 0,1,1
CPU 0,2,0
CPU 0,2,1
CPU 1,0,0
CPU 1,0,1
CPU 1,1,0
CPU 1,1,1
CPU 1,2,0
CPU 1,2,1
CPU 2,0,0
CPU 2,0,1
CPU 2,1,0
CPU 2,1,1
CPU 2,2,0
CPU 2,2,1
CPU 3,0,0
CPU 3,0,1
CPU 3,1,0
CPU 3,1,1
CPU 3,2,0
CPU 3,2,1

GPU 0,0,0
GPU 1,0,0
GPU 2,0,0
GPU 3,0,0
GPU 0,1,0
GPU 1,1,0
GPU 2,1,0
GPU 3,1,0
GPU 0,2,0
GPU 1,2,0
GPU 2,2,0
GPU 3,2,0
GPU 0,0,1
GPU 1,0,1
GPU 2,0,1
GPU 3,0,1
GPU 0,1,1
GPU 1,1,1
GPU 2,1,1
GPU 3,1,1
GPU 0,2,1
GPU 1,2,1
GPU 2,2,1
GPU 3,2,1

NOTE: Although there may appear to be a sequence in the order in which the GPU processes the iterations, this is only an artifact of using printf(). In reality, when not using printf(), the order of iterations is arbitrary. Therefore you must not rely on the order of iterations when porting loops to the GPU. If you need iterations to run in a certain order, either keep those loops on the host or port only the parts of the loop that do not need to be sequential.

Why use GPU computing?

Although this example is fairly simple it does illustrate the most important value add of GPU computing and OpenCL. Hardware parallelism is the essence of what GPU computing offers and it is the most compelling reason to use it. If you imagine a legacy codebase and all the latent areas of parallelism that are currently running sequentially you can imagine the vast untapped power of GPGPU. Later on in the series we will look at techniques to port existing host code to the GPU. That process can be very difficult but can provide dramatic gains in performance far beyond the limits of CPU computing. Till next time.

Oracle and AMD propose GPU support in Java

A very exciting development has come to my attention, albeit four days late. This news is, without exaggeration, in my humble opinion nothing short of groundbreaking: not only does it push the boundaries and capabilities of Java even further in terms of hardware support and performance, but, as I’m gradually beginning to realise, the future of high performance computing is in the GPU.

John Coomes (OpenJDK HotSpot Group Lead) and Gary Frost (AMD) have proposed a new OpenJDK project to implement GPU support in Java with a native JVM (full thread).

This project intends to enable Java applications to seamlessly take advantage of a GPU–whether it is a discrete device or integrated with a CPU–with the objective to improve the application performance.

Their focus will be on code generation, garbage collection and runtimes.

We propose to use the Hotspot JVM, and will concentrate on code generation, garbage collection, and
runtimes. Performance will be improved, while preserving compile time, memory consumption and code generation quality.

The project will also use the new Java 8 lambda language and may eventually sprout further platform enhancements.

We will start exploring leveraging the new Java 8 Lambda language and library features. As this project progress, we may identify challenges with the Java API and constructs which may lead to new language, JVM and library extensions that will need standardization under the JCP process.

John Coomes (Oracle) will lead the project and Gary Frost (AMD) has gladly offered resources from AMD. As the mail thread got underway two already existing related projects were brought to the list’s attention: rootbeer (PDF, slashdot) and Aparapi (AMD page). Rootbeer is a particularly interesting project as it performs static analysis of Java code and generates CUDA code automatically – quite different from compile time OpenCL bindings from Java. The developer of rootbeer has also shown interest in joining the openjdk project. John Rose, Oracle HotSpot developer, also posted a talk he recently gave on Arrays 2.0 which I’m yet to watch.

The missing link in deriving value from GPUs, from what I understand in a domain that I’m still fairly new to, is that GPU hardware and programming need to be made accessible to the mainstream, and that’s the purpose I hope this project will serve. I also hope that, once underway, it raises wider awareness of how we must make a mental shift from concurrency to parallelism and from data parallelism to task parallelism, and of how the next generational leap in parallelism will come not from CPU cores or concurrency frameworks but from harnessing the mighty GPU.

Hacker news has some discussion. What do you think? Let me know.

OpenCL Cookbook: Listing all devices and their critical attributes

Last time, in our OpenCL Cookbook series, we looked at how to list all platforms and their attributes. This time, we take the next step and list all devices that a platform provides access to and their critical attributes. This program is very useful in that it provides a quick and easy way of introspecting a given system’s OpenCL capabilities. Note that error handling has been omitted for brevity and the host language as before is C.

To recap on terminology: a platform is an OpenCL SDK such as an Apple, Intel, Nvidia or AMD SDK. A device, on the other hand, may be a CPU, GPU or accelerator, and as a result it’s highly likely a system will have multiple devices. For each device we list its critical attributes: hardware OpenCL version, software driver version, the OpenCL C version supported by the compiler for the device and, finally, the number of parallel compute units (cores) it possesses, which indicates the extent of task-based parallelism we can achieve.

#include <stdio.h>
#include <stdlib.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main() {

    int i, j;
    char* value;
    size_t valueSize;
    cl_uint platformCount;
    cl_platform_id* platforms;
    cl_uint deviceCount;
    cl_device_id* devices;
    cl_uint maxComputeUnits;

    // get all platforms
    clGetPlatformIDs(0, NULL, &platformCount);
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * platformCount);
    clGetPlatformIDs(platformCount, platforms, NULL);

    for (i = 0; i < platformCount; i++) {

        // get all devices
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
        devices = (cl_device_id*) malloc(sizeof(cl_device_id) * deviceCount);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);

        // for each device print critical attributes
        for (j = 0; j < deviceCount; j++) {

            // print device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, 0, NULL, &valueSize);
            value = (char*) malloc(valueSize);
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, valueSize, value, NULL);
            printf("%d. Device: %s\n", j+1, value);
            free(value);

            // print hardware device version
            clGetDeviceInfo(devices[j], CL_DEVICE_VERSION, 0, NULL, &valueSize);
            value = (char*) malloc(valueSize);
            clGetDeviceInfo(devices[j], CL_DEVICE_VERSION, valueSize, value, NULL);
            printf(" %d.%d Hardware version: %s\n", j+1, 1, value);
            free(value);

            // print software driver version
            clGetDeviceInfo(devices[j], CL_DRIVER_VERSION, 0, NULL, &valueSize);
            value = (char*) malloc(valueSize);
            clGetDeviceInfo(devices[j], CL_DRIVER_VERSION, valueSize, value, NULL);
            printf(" %d.%d Software version: %s\n", j+1, 2, value);
            free(value);

            // print c version supported by compiler for device
            clGetDeviceInfo(devices[j], CL_DEVICE_OPENCL_C_VERSION, 0, NULL, &valueSize);
            value = (char*) malloc(valueSize);
            clGetDeviceInfo(devices[j], CL_DEVICE_OPENCL_C_VERSION, valueSize, value, NULL);
            printf(" %d.%d OpenCL C version: %s\n", j+1, 3, value);
            free(value);

            // print parallel compute units
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(maxComputeUnits), &maxComputeUnits, NULL);
            printf(" %d.%d Parallel compute units: %d\n", j+1, 4, maxComputeUnits);

        }

        free(devices);

    }

    free(platforms);

    return 0;

}

Compile and run on the Mac as follows. If you don’t have clang, g++ or gcc, install one of them; any of those compilers should work.

$ clang -framework OpenCL devices.c -o devices && ./devices

The output produced on my machine is as follows but may differ on your system.

1. Device: Intel(R) Core(TM)2 Duo CPU     U9600  @ 1.60GHz
 1.1 Hardware version: OpenCL 1.2 
 1.2 Software version: 1.1
 1.3 OpenCL C version: OpenCL C 1.2 
 1.4 Parallel compute units: 2
2. Device: GeForce 320M
 2.1 Hardware version: OpenCL 1.0 
 2.2 Software version: CLH 1.0
 2.3 OpenCL C version: OpenCL C 1.1 
 2.4 Parallel compute units: 6

As you can see my Macbook Air shows rather feeble and outdated metadata being an old slimline laptop. As always, the code is only tested on my Macbook Air but, in theory, should run on Windows and Linux though the way you compile and run will differ from above slightly. If you have any issues or would like to critique and improve my code (given I’m not a C programmer) by all means leave a comment.

Did this help you? Let me know in the comments!