Tag Archives: linkedin

Have you donated to Wikipedia?

Do you use wikipedia? Have you ever considered yourself to be privileged to be able to use wikipedia whenever you wanted for whatever you wanted totally free of charge? How many times has it helped you avoid an embarrassing admission of ignorance? How many times has it helped you prepare for a job interview? Considering that – have you ever donated to wikipedia? I just did. And this is the email I received in return.

Dear Dhruba,

Thank you for donating to the Wikimedia Foundation. You are wonderful!

It’s easy to ignore our fundraising banners, and I’m really glad you didn’t. This is how Wikipedia pays its bills — people like you giving us money, so we can keep the site freely available for everyone around the world.

People tell me they donate to Wikipedia because they find it useful, and they trust it because even though it’s not perfect, they know it’s written for them. Wikipedia isn’t meant to advance somebody’s PR agenda or push a particular ideology, or to persuade you to believe something that’s not true. We aim to tell the truth, and we can do that because of you. The fact that you fund the site keeps us independent and able to deliver what you need and want from Wikipedia. Exactly as it should be.

You should know: your donation isn’t just covering your own costs. The average donor is paying for his or her own use of Wikipedia, plus the costs of hundreds of other people. Your donation keeps Wikipedia available for an ambitious kid in Bangalore who’s teaching herself computer programming. A middle-aged homemaker in Vienna who’s just been diagnosed with Parkinson’s disease. A novelist researching 1850s Britain. A 10-year-old in San Salvador who’s just discovered Carl Sagan.

On behalf of those people, and the half-billion other readers of Wikipedia and its sister sites and projects, I thank you for joining us in our effort to make the sum of all human knowledge available for everyone. Your donation makes the world a better place. Thank you.

Most people don’t know Wikipedia’s run by a non-profit. Please consider sharing this e-mail with a few of your friends to encourage them to donate too. And if you’re interested, you should try adding some new information to Wikipedia. If you see a typo or other small mistake, please fix it, and if you find something missing, please add it. There are resources here that can help you get started. Don’t worry about making a mistake: that’s normal when people first start editing and if it happens, other Wikipedians will be happy to fix it for you.

I appreciate your trust in us, and I promise you we’ll use your money well.

Thanks,
Sue

Sue Gardner
Executive Director,
Wikimedia Foundation
https://donate.wikimedia.org

If you’ve used wikipedia you owe them and all contributors a debt of gratitude. And a donation symbolises a repayment of that debt of gratitude. Donate today. It doesn’t have to be much. £5, £10 – anything that you can afford.

Collapsing two arrays into one in C and C++

A quick tip – if you have two arrays and you want to merge/collapse/concatenate/append/combine (whatever you want to call it) them into one here is one way to do it in C and C++. The code also prints the combined array to check it’s correct.

Note the subtle differences between C and C++. One of the (innumerable!) problems with C++ is that you can write as much C as you want in it and leave the next developer wondering why you were writing C instead of C++ and whether you had specific justifiable reasons to do so in each instance. I personally also consider it bad practice. This is one of the reasons I dislike hybrid languages – they allow you to blend two languages together in arbitrary proportions making the end result unreadable and unmaintainable and just raising more questions than the problems that it solves.

C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main () {

   double a[4] = {1,2,3,4};
   double b[4] = {5,6,7,8};

   double* c = malloc(8 * sizeof(double));

   memcpy(c,     a, 4 * sizeof(double));
   memcpy(c + 4, b, 4 * sizeof(double));

   for (int i = 0; i < 8; i++) {
      printf("%d %fn", i, c[i]);
   }

   free(c);

}

C++

#include <iostream>
#include <cstring>
#include <iomanip>
       
using namespace std;
       
int main () {
       
   double a[4] = {1,2,3,4};
   double b[4] = {5,6,7,8};
       
   double* c = new double[8];
       
   copy(a, a + 4, c); 
   copy(b, b + 4, c + 4); 
       
   for (int i = 0; i < 8; i++) {
      cout << i << " " << c[i] << endl;
   }
       
   delete[] c;
       
}   

Note that in the C++ program:

  • I’m not using includes from C
  • I’m not using memcpy from C
  • I’m not using printf from C
  • I’m not using malloc from C
  • I’m not using sizeof from C
  • I’m not using free from C

8 GPU watercooled computation rig using Power Color Devil 13 HD 7990 cards

Workstation motherboards provide 4 or more PCIE x16 gen 3 slots with a good distribution of dedicated links direct to multiple onboard CPUs. With dual cards such as 7990 which are in fact 2×7970 each one can in theory load up a 4 slot motherboard with 8 gpus in total. Alas, in practice, there can be obstacles booting with 7 or 8 of them with certain motherboads as they tend to run out of PCIE resources after around 6 cards which could be considered to be surprising for a workstation class motherboard. Incidentally, I think it’s worth noting that not only are these the only model of 7990 made but they are practically impossible to get hold of so very rare gems indeed.

Watercooling four dual gpus on a standard workstation motherboard can also be a challenge due to a severe shortage of space between them. Dual slot spacing between PCIE x16 slots is a tight fit as the tubes can take up almost three slots worth of space between cards. In this configuration they are installed on alternate slots so if you had 7 slots in total you’d only install on 1, 3, 5 and 7 leaving 2, 4 and 6 empty. Though 2, 4 and 6 will usually be PCIE x8 slots anyway as opposed to the rest being PCIE x16.

As you can see below these cards have been completely stripped down of their air cooling and heatsink apparatus prior to attaching waterblocks and tubing for coolant to pass through them. The tubing is secured using barb fittings which are stronger than the alternative: compression fittings though they do lack the aesthetic appeal of compression fittings. Compression fittings can come apart under tension and that can create a real mess as I realised the hard way one night.

If not using full card waterblocks (which these aren’t) individual adhesive heatsinks for all the ram chips (known as ram sinks) are required for sufficient cooling. There may be upto 12 of these tiny little ram sinks on each face of a card. I don’t have any photos of that right now but I’ll try and get some. Though, ram sinks, can be flimsy and easily become dislodged and fall off the cards if they are knocked which is why some people prefer full cover blocks. Full cover blocks are more robust but also more expensive.

In terms of power the system is supplied with 2400 watts of power composed of two 1200W power supplies chained by an adapter for the first to kickstart the other on boot. Half the gpus are powered by one and half by the other. This particular machine has 128GB of RAM and dual xeons with a combined total of 32 cores. Update: As rightly pointed out by the commenter below I meant hardware threads not physical cores here.

Note: I do not own the hardware or the photographs but consent has been acquired to publish them here. Also this system is not conceived or assembled by me though I do have the pleasure of loading it with OpenCL benchmarks and computations.

OpenCL Cookbook: Managing your GPUs using amdconfig/aticonfig – a powerful utility in the AMD OpenCL toolset on Linux

When you install AMD Catalyst drivers (I’m using 12.11 beta) on linux you gain access to a command line utility called amdconfig. This is better known by its legacy name aticonfig but in this article we’ll stick with the new name. This tool provides a wealth of functionality in querying and configuring your AMD gpu making it a very powerful utility in your OpenCL toolset on Linux.

Here we explore some basic yet powerful uses of this tool to query and manage the state of our gpus. In the following examples I use the long form of the arguments to the command so that it is easier to remember and understand for those new to the command. Note that in general write commands due to the inherent risks involved can only be run as root. Read only commands can be run by normal users.

List all gpu adapters

Note – just because you see multiple adapters here does not mean they will be enabled in the OpenCL runtime. For that you have to generate a new X configuration with all adapters (see command for further down).

dhruba@debian:~$ amdconfig --list-adapters
* 0. 05:00.0 AMD Radeon HD 7900 Series  
  1. 06:00.0 AMD Radeon HD 7900 Series  
  2. 09:00.0 AMD Radeon HD 7900 Series  
  3. 0a:00.0 AMD Radeon HD 7900 Series  
  4. 85:00.0 AMD Radeon HD 7900 Series  
  5. 86:00.0 AMD Radeon HD 7900 Series  

* - Default adapter

Generate a fresh X config with all adapters enabled

Generating a new config in this way will backup your old config.

dhruba@debian:~$ sudo amdconfig --initial --force --adapter=all
Uninitialised file found, configuring.
Using xorg.conf
Saving back-up to xorg.conf.fglrx-0

Or you can specify your old and new config files explicitly.

dhruba@debian:~$ sudo amdconfig --initial --force --adapter=all --input=foo --output=bar
Uninitialised file found, configuring.
Using bar
Saving back-up to bar.original-0

Query current clocks, clock ranges and load for all adapters

Note here that by default, in idle mode, on linux adapters are clocked at 300/150 (core/memory) but under load the clocks automatically increase to 925/1375 (core/memory) which is nice.

dhruba@debian:~$ amdconfig --od-getclocks --adapter=all

Adapter 0 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    99%

Adapter 1 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 2 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 3 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 4 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Adapter 5 - AMD Radeon HD 7900 Series  
                            Core (MHz)    Memory (MHz)
           Current Clocks :    925           1375
             Current Peak :    925           1375
  Configurable Peak Range : [300-1125]     [150-1575]
                 GPU load :    98%

Query temperatues for all adapters

This is handy to keep an eye on your gpus under load to check they are not overheating. The following temperatures were taken under load and you can see that adapter 3 has reached 70C despite all cards being aggressively water cooled.

dhruba@debian:~$ amdconfig --odgt --adapter=all

Adapter 0 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 65.00 C

Adapter 1 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 50.00 C

Adapter 2 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 60.00 C

Adapter 3 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 70.00 C

Adapter 4 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 58.00 C

Adapter 5 - AMD Radeon HD 7900 Series  
            Sensor 0: Temperature - 54.00 C

List crossfire candidates and crossfire status

For OpenCL it is essential that you have crossfire disabled. You can disable it either using amdconfig --crossfire=off or through catalyst control centre which you start by running amdcccle.

dhruba@debian:~$ amdconfig --list-crossfire-candidates

Master adapter:  0. 05:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  1. 06:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  2. 09:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  3. 0a:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  4. 85:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
Master adapter:  5. 86:00.0 AMD Radeon HD 7900 Series  
    Candidates:  none
dhruba@debian:~$ amdconfig --list-crossfire-status
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART
    Candidate Combination: 
    Master: 0:0:0 
    Slave: 0:0:0 
    CrossFire is disabled on current device
    CrossFire Diagnostics:
    CrossFire can work with P2P mapping through GART

You can also use amdconfig to set core and memory clocks but this will be covered in a separate article. I do not want to run these commands on my system as I’m happy with current clocks. But here’s a snippet from the man page which is fairly self explanatory. Bear in mind that to tweak clocks you need to enable overdrive using –od-enable.

  --od-enable
        Unlocks the ability to change core or memory clock values by
        acknowledging that you have read and understood the AMD Overdrive (TM)
        disclaimer and accept responsibility for and recognize the potential
        dangers posed to your hardware by changing the default core or memory
        clocks
  --od-disable
        Disables AMD Overdrive(TM) set related aticonfig options.  Previously
        commited core and memory clock values will remain, but will not be set
        on X Server restart.
  --odsc, --od-setclocks={NewCoreClock|0,NewMemoryClock|0}
        Sets the core and memory clock to the values specified in MHz
        The new clock values must be within the theoretical ranges provided
        by --od-getclocks.  If a 0 is passed as either the NewCoreClock or
        NewMemoryClock it will retain the previous value and not be changed.
        There is no guarantee that the attempted clock values will succeed
        even if they lay inside the theoretical range.  These newly set
        clock values will revert to the default values if they are not
        committed using the "--od-commitclocks" command before X is
        restarted
  --odrd, --od-restoredefaultclocks
        Sets the core and memory clock to the default values.
        Warning X needs to be restarted before these clock changes will take
        effect
  --odcc, --od-commitclocks
        Once the stability of a new set of custom clocks has been proven this
        command will ensure that the Adapter will attempt to run at these new
        values whenever X is restarted

OpenCL Cookbook: Compiling OpenCL with Ubuntu 12.10, Unity, AMD 12.11 beta drivers & AMD APP SDK 2.7

Continuing on in the OpenCL cookbook series here I present a post not about code but about environmental setup further diversifying the scope of the cookbook. It can be a real challenge for the uninitiated to install all the above and compile an opencl c or c++ program on linux. Here’s a short guide. First download and install ubuntu (duh!).

Install ubuntu build tools and linux kernel extras

Then install the following packages which are a prerequisite to the amd installers and the subsequent c/c++ compilation.

sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install linux-source
sudo apt-get install linux-headers-generic

Then download AMD 12.11 beta drivers (amd-driver-installer-catalyst-12.11-beta-x86.x86_64.zip) and AMD APP SDK 2.7 (AMD-APP-SDK-v2.7-lnx64.tgz). Obviously download either 32bit or 64bit based on what your system supports.

AMD 12.11 beta drivers installation

Once you’ve done that install the AMD 12.11 beta drivers as root first. Installation is as simple as extracting the tarball, marking the script inside as executable and running the script as root. Reboot. After the reboot unity should start using the new AMD 12.11 beta driver and you’ll know it’s the beta because you’ll see a watermark at the bottom left of the screen saying ‘AMD Testing use only’. Note that the reason we’re using the beta here is because unity does not work with earlier versions of the driver. You get a problem where you see the desktop background and a mouse pointer but there’s no toolbar or status bar. But the 12.11 beta driver works which is great.

AMD APP SDK 2.7 installation

Then install the AMD APP SDK 2.7 also as root. Again installation is very simple and exactly the same as for the beta driver above. The AMD beta drivers install a video driver and the OpenCL runtime. The AMD APP SDK install the SDK and also OpenCL and OpenGL runtimes. However if you’ve already installed the video driver first you’ll already have the OpenCL runtime on your system in /usr/lib/libamdocl64.so so the APP SDK won’t install another copy in its location of /opt/AMDAPP/lib/x86_64/libOpenCL.so. You’ll see some messages during installation that it’s skipping the opencl runtime and that’s absolutely fine for now.

Test your OpenCL environment

Now you should test your OpenCL environment by compiling and running an example c opencl program. Get my C file to list all devices on your system as an example calling it devices.c and compile as follows.

gcc -L/usr/lib -I/opt/AMDAPP/include devices.c -lamdocl64 -o devices.o # for c
g++ -L/usr/lib -I/opt/AMDAPP/include devices.c -lamdocl64 -o devices.o # for c++

Once compiled run the output file (devices.o) and if it works then you should output similar to that below.

1. Device: Tahiti
 1.1 Hardware version: OpenCL 1.2 AMD-APP (923.1)
 1.2 Software version: CAL 1.4.1741 (VM)
 1.3 OpenCL C version: OpenCL C 1.2 
 1.4 Parallel compute units: 32
2. Device: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
 2.1 Hardware version: OpenCL 1.2 AMD-APP (923.1)
 2.2 Software version: 2.0 (sse2,avx)
 2.3 OpenCL C version: OpenCL C 1.2 
 2.4 Parallel compute units: 32

Enabling multiple gpus for OpenCL

You may find that you are only seeing one gpu in your opencl programs. There are two things you need to do to enable multiple gpus in the OpenCL runtime. The first is to disable all crossfire. You can do this either in the amd catalyst control centre > performance which you start by running amdcccle or you can do it using the awesome amdconfig tool by running amdconfig --crossfire=off. See my post on amdconfig to find out more about this incredibly powerful tool.

The second thing you may or may not need to do is to enable COMPUTE mode as follows.

export COMPUTE=:0

Once you’ve done the above you should see program output from the program above similar to below.

dhruba@debian:~$ ./source/devices.o 
1. Device: Tahiti
 1.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 1.2 Software version: 1084.2 (VM)
 1.3 OpenCL C version: OpenCL C 1.2 
 1.4 Parallel compute units: 32
2. Device: Tahiti
 2.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 2.2 Software version: 1084.2 (VM)
 2.3 OpenCL C version: OpenCL C 1.2 
 2.4 Parallel compute units: 32
3. Device: Tahiti
 3.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 3.2 Software version: 1084.2 (VM)
 3.3 OpenCL C version: OpenCL C 1.2 
 3.4 Parallel compute units: 32
4. Device: Tahiti
 4.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 4.2 Software version: 1084.2 (VM)
 4.3 OpenCL C version: OpenCL C 1.2 
 4.4 Parallel compute units: 32
5. Device: Tahiti
 5.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 5.2 Software version: 1084.2 (VM)
 5.3 OpenCL C version: OpenCL C 1.2 
 5.4 Parallel compute units: 32
6. Device: Tahiti
 6.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 6.2 Software version: 1084.2 (VM)
 6.3 OpenCL C version: OpenCL C 1.2 
 6.4 Parallel compute units: 32
7. Device: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
 7.1 Hardware version: OpenCL 1.2 AMD-APP (1084.2)
 7.2 Software version: 1084.2 (sse2,avx)
 7.3 OpenCL C version: OpenCL C 1.2 
 7.4 Parallel compute units: 32

Standardising the OpenCL runtime library path

Now – it may be that you wish for the OpenCL runtime library to be installed in the standard AMD APP SDK location of /opt/AMDAPP/lib/x86_64/libOpenCL.so as opposed to the non-standard location of /usr/lib/libamdocl64.so which is where the beta driver installation puts it. The proper way to do this would probably be to install the AMD APP SDK first and then the video driver or simply skip the video driver installation (I haven’t tried either of these options so they may need verification).

However, I used a little trick to make this easier since I’d already installed the video driver followed by the APP SDK. I renamed /usr/lib/libamdocl64.so to /usr/lib/libamdocl64.so.x and reinstalled the APP SDK. This time it detected that the runtime wasn’t present and installed another runtime in /opt/AMDAPP/lib/x86_64/libOpenCL.so – the standard SDK runtime path. With the new APP SDK OpenCL runtime in place I was able to compile the same program using the new runtime as below depending on whether you want the c or c++ compiler.

gcc -L/opt/AMDAPP/lib/x86_64/ -I/opt/AMDAPP/include devices.c -lOpenCL -o devices.o # for c
g++ -L/opt/AMDAPP/lib/x86_64/ -I/opt/AMDAPP/include devices.c -lOpenCL -o devices.o # for c++

Summary

And there you have it – an opencl compiler working on ubuntu 12.10 using the AMD 12.11 beta drivers and the AMD APP 2.7 SDK. Sometimes you just need someone else to have done it first and written a guide and I hope this serves to help someone out there.

C++ error LNK2001: unresolved external symbol

A quick Visual C++ tip to help those who get stuck on this problem like I did. If you find yourself getting errors like this:

foo.obj : error LNK2001: unresolved external symbol "public: __thiscall MyMatrix::~MyMatrix(void)" (??1MyMatrix@@QAE@XZ)
foo.obj : error LNK2001: unresolved external symbol "public: __thiscall MyMatrix::MyMatrix(class MyMatrix const &)" (??0MyMatrix@@QAE@ABV0@@Z)
fooData.obj : error LNK2001: unresolved external symbol "public: __thiscall MyMatrix::MyMatrix(unsigned int,unsigned int)" (??0MyMatrix@@QAE@II@Z)
bar.obj : error LNK2001: unresolved external symbol "public: class MyMatrix & __thiscall MyMatrix::resize(unsigned int,unsigned int)" (?resize@MyMatrix@@QAEAAV1@II@Z)
.Release2MCProject-32bit-noxlw.exe : fatal error LNK1120: 4 unresolved externals

it could mean that certain methods are declared but not defined. In other words there may be methods in the header files that have no implementations in cpp files. Above we can see that a destructor, two constructors and a resize call are what the compiler’s complaining about. This was because those methods had not been implemented but were in the header file.

Though the really odd thing is that this project was compiling in 32 bit mode and when I tried to port it to 64 bit mode suddenly these errors crept up. I don’t know why these errors didn’t occur in 32 bit mode. Must be yet another compiling intricacy that I’m not aware of.

C++ can be a nightmare compared to higher level languages.

msvcprtd.lib(MSVCP100D.dll) : fatal error LNK1112: module machine type ‘X86’ conflicts with target machine type ‘x64’

If you get the above error when trying to build a project in 64 bits or port a 32 bits project to 64 bits in Visual Studio 2010 go to Project Property Pages > Configuration Properties > VC++ Directories > Library Directories and make sure you have the appropriate 64 bit directories in there and not the 32 bit equivalents.

I had to change my Library Directories entry from:

$(VCInstallDir)lib;$(VCInstallDir)atlmfclib;$(WindowsSdkDir)lib;$(FrameworkSDKDir)lib;

to:

$(VCInstallDir)libamd64;$(VCInstallDir)atlmfclibamd64;$(WindowsSdkDir)libx64;

In fact check all entries under Project Property Pages > Configuration Properties > VC++ Directories to see they’re 64 bit paths. If the above doesn’t work try creating a new project in VS, converting it from 32bit to 64bit and if it builds use it as a reference and compare with the real project you’re trying to convert to 64bit. Sync the differences and see where you get to. That’s what I did.

This has wasted so much of my time and has been so difficult to track down that I just had to blog it in case it saved other hours, days or even weeks. There are many variants of the above error each of which could be due to different causes. This post only relates to the error prefixed with msvcprtd.lib(MSVCP100D.dll).

Did this help you out? Let me know in the comments.

Java pitfall: How to prevent Runtime.getRuntime().exec() from hanging

Runtime.getRuntime().exec() is used to execute a command line program from within the Java program as below.

import java.io.File;
import java.io.IOException;

public class ProcessExecutor {

    public static void main(String[] args) throws IOException, InterruptedException {

        String command = "c:\my.exe";
        String workingDir = "c:\myworkingdir";

        // start execution
        Process process = Runtime.getRuntime().exec(command, null, new File(workingDir));

        // wait for completion
        process.waitFor();

    }

}

However the command line program being run above may block/deadlock as it did for me on Windows 7. I was trying to run a program that produced a lot of output. I could run the program standalone but through Java it hung indefinitely. Thread dumps showed nothing.

After being quite puzzled for a while as to why this was happening finally I found the answer in Java 7 api docs for Process.

Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, or even deadlock.

So, in fact, the fix for the above program is as follows.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;

public class ProcessExecutor {

    public static void main(String[] args) throws IOException, InterruptedException {

        String command = "c:\my.exe";
        String workingDir = "c:\myworkingdir";

        // start execution
        Process process = Runtime.getRuntime().exec(command, null, new File(workingDir));

        // exhaust input stream
        BufferedInputStream in = new BufferedInputStream(process.getInputStream());
        byte[] bytes = new byte[4096];
        while (in.read(bytes) != -1) {}

        // wait for completion
        process.waitFor();

    }

}

This is so bad. Not only is this unexpected but it is also undocumented in the exec call. Also another problem is that if you are timing the total execution time for a given command and don’t care about the output you need to read the output anyway and probably subtract the reading time from the total execution time. I’m not sure how accurate that will be.

Surely there could have been a better way to handle this for the user in the api internals. So windows 7 must be one of those OSs with small buffer sizes then. Anyway, at least you know now. Obviously you don’t have to read it into nothing as I’m doing above. You can write it to stdout or a file.

Update: A commenter made a good point that I’d forgotten to read the error stream above. Don’t forget to do so in your own code!

OpenCL Cookbook: How to leverage multiple devices in OpenCL

So far, in the OpenCL Cookbook series, we’ve only looked at utilising a single device for computation. But what happens when you install more than one card in your host machine? How do you scale your computation across multiple GPUs? Will your code automatically scale to multiple devices or does it require you to consciously think about how to distribute the load of the computation across all available devices and change your code to apply that strategy? Here I look at answers to these questions.

Decide on how you want to use the host binding to support multiple devices

There are two ways in which a given host binding can support multiple devices.

  • A single context across all device and one command queue per device.
  • One context and command queue per device

Let’s look at these in more detail with skeletal implementations in C.

Creating a single context across all devices and one command queue per device

For this particular way of the binding supporting multiple devices we create only one context and share it across one command queue per device. So if we have say two devices we’ll have one context and two command queues each of which share that one context.

#include <iostream>
#include <CL/cl.hpp>
#include <CL/opencl.h>

int main () {

    cl_int err;
    
    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    
    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);
    
    // get all devices
    cl_device_id* devices;
    devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);
    
    // create a single context for all devices
    cl_context context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
    
    // for each device create a separate queue
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (int i = 0; i < deviceCount; i++) {
        queues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
    }

    /*
     * Here you have one context across all devices and one command queue per device.
     * You can choose to send your tasks to any of these queues depending on which
     * device you want to execute the task on.
     */

    // cleanup
    for(int i = 0; i < deviceCount; i++) {
        clReleaseDevice(devices[i]);
        clReleaseCommandQueue(queues[i]);
    }
    
    clReleaseContext(context);

    delete[] devices;
    delete[] queues;
    
    return 0;
    
}

Creating one context and one command queue per device

Here I create one context and one command queue per device each of which have their own context rather than sharing one.

#include <iostream>
#include <CL/cl.hpp>
#include <CL/opencl.h>

int main () {

    cl_int err;
    
    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    
    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);
    
    // get all devices
    cl_device_id* devices;
    devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);
    
    // for each device create a separate context AND queue
    cl_context* contexts = new cl_context[deviceCount];
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (int i = 0; i < deviceCount; i++) {
        contexts[i] = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
        queues[i] = clCreateCommandQueue(contexts[i], devices[i], 0, &err);
    }

    /*
     * Here you have one context and one command queue per device.
     * You can choose to send your tasks to any of these queues.
     */

    // cleanup
    for(int i = 0; i < deviceCount; i++) {
        clReleaseDevice(devices[i]);
        clReleaseContext(contexts[i]);
        clReleaseCommandQueue(queues[i]);
    }
    
    delete[] devices;
    delete[] contexts;
    delete[] queues;
    
    return 0;

}

How do you scale your computation across multiple devices?

The process of utilising multiple devices for your computation is not done automatically by the binding when new devices are detected sadly. Nor is it possible for it do so. Doing this requires active thought from the host programmer. When using a single device you send all your kernel invocations to the command queue associated with that device. In order to use multiple devices you must have one command queue per device either sharing a context or each queue having its own context. Then you must decide how to distribute your kernel calls across all available queues. It may be as simple as a round robin strategy across all queues for all your computations or it may be more complex.

Bear in mind that if your computation entails reading back a result synchronously then a round robin strategy across queues won’t work. This is because each current call will block and complete prior to you sending to the next queue which will essentially make the process of distributing across queues serial. Obviously this defeats the whole purpose of having multiple devices operating in parallel. What you really need is one host thread per device each sending computations to its own command queue. That way each queue is receiving and processing computations in parallel with other queues. Then you effectively achieve true hardware parallelism.

Which of the two ways should you use?

It depends. I would try the single context option first as it’s likely to use less memory and be faster. If you encounter instability or problems I would switch to the multiple context method. That’s the general rule. There is, however, another reason you may opt for a multiple context method. If you are using multiple threads which all require access to a context it is preferable for each thread to have its own context as the opencl host binding is not guaranteed to be thread safe. If you try to access a single context across multiple threads you may get serious system crashes and reboots so always have thread confined opencl structures.

Using a single context across multiple host threads

You may want to use one thread per device to send tasks to the command queue associated with each device. In this case you will have multiple host threads. But here have to be careful. In my experience it has not been safe to use a single context across multiple host threads. The last time I tried this was in C# using the Cloo host binding. Using a single context across multiple host threads resulted in a Windows 7 blue screen, Windows dumping memory to a file and then rebooting after which Windows failed to come back up until physically rebooted once more from the machine. The solution is to use the multi context option outlined above. Have thread confined separation for opencl resources and you’ll be fine.

OpenCL Cookbook: Hello World using C# Cloo host binding

So far I’ve used the C and C++ bindings in the OpenCL Cookbook series. This time I provide a quick and simple example of how to use Cloo – the C# OpenCL host binding. However, since Cloo, for whatever reason, didn’t work as expected with a char array I will use an integer array instead. In other words – instead of sending a “Hello World!” message to the kernel I will send five integers instead. My guess is that there is some sort of bug with Cloo and char arrays.

Device code using Cloo’s variant of the OpenCL language

kernel void helloWorld(global read_only int* message, int messageSize) {
	for (int i = 0; i < messageSize; i++) {
		printf("%d", message[i]);
	}
}

The kernel above is merely illustrative in that it simply receives an integer array and its size and prints the array.

Note that the OpenCL syntax here is not the same as in C/C++. It has additional keywords to say whether the arguments are read only or write or read write and the kernel keyword is not prefixed with two underscores. The Cloo author must have decided that the original OpenCL syntax was for whatever reason unsuitable for adoption which IMO was a mistake. The OpenCL language syntax should be standard for portability, reusability and also so that there is only a single learning curve.

Host code using Cloo API

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.IO;
using Cloo;

namespace test
{
    class Program
    {
        static void Main(string[] args)
        {
            // pick first platform
            ComputePlatform platform = ComputePlatform.Platforms[0];

            // create context with all gpu devices
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                new ComputeContextPropertyList(platform), null, IntPtr.Zero);

            // create a command queue with first gpu found
            ComputeCommandQueue queue = new ComputeCommandQueue(context,
                context.Devices[0], ComputeCommandQueueFlags.None);

            // load opencl source
            StreamReader streamReader = new StreamReader("..\..\kernels.cl");
            string clSource = streamReader.ReadToEnd();
            streamReader.Close();

            // create program with opencl source
            ComputeProgram program = new ComputeProgram(context, clSource);

            // compile opencl source
            program.Build(null, null, null, IntPtr.Zero);

            // load chosen kernel from program
            ComputeKernel kernel = program.CreateKernel("helloWorld");

            // create a ten integer array and its length
            int[] message = new int[] { 1, 2, 3, 4, 5 };
            int messageSize = message.Length;

            // allocate a memory buffer with the message (the int array)
            ComputeBuffer<int> messageBuffer = new ComputeBuffer<int>(context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, message);

            kernel.SetMemoryArgument(0, messageBuffer); // set the integer array
            kernel.SetValueArgument(1, messageSize); // set the array size

            // execute kernel
            queue.ExecuteTask(kernel, null);

            // wait for completion
            queue.Finish();
        }
    }
}

The C# program above uses the Cloo object oriented api to interface with the underlying low level opencl implementation. It’s pretty self explanatory if you’ve been following the series so far. The output of the program is 12345.