OpenCL Cookbook: 10 tips for high performance kernels

Today we adorn our performance hat in the OpenCL Cookbook series. OpenCL kernel optimisation is a black art and very very hard. Recently when I came across a list of 10 tips for high performance kernels in Matthew Scarpino’s excellent book: OpenCL in Action I just had to share it as it was a true gem. Bear in mind that this is an abridged account – if you want more detail read the appropriate section in the book.

  1. Unroll loops: Comparison operations are expensive – if you know how many iterations you need simply perform them individually.
  2. Disable processing of denormalised numbers: The processing of denormalised numbers takes time. Disable them using the -cl-denorms-are-zero option. You can also disable processing of infinite values and NaNs using -cl-finite-math-only.
  3. Transfer constant primitive values to the kernel with compiler defines instead of private memory parameters. Example: clBuildProgram(program, 0, NULL, "-DSIZE=128", NULL, NULL);.
  4. Store small variable values in private memory instead of local memory: Private memory is faster and should be used unless you need to share data between work items in which case use local memory.
  5. Avoid local memory bank conflicts by accessing local memory sequentially: Successive 32-bit elements are stored in successive local memory banks. Sequential bank access is parallel whereas contending on same bank access is serial.
  6. Avoid using modulo operator: This operator is slow and alternatives should be used instead.
  7. Reuse private variables throughout the kernel but use macros to distinguish each separate use.
  8. For multiply-and-add operations use the fma function if available.
  9. Inline non-kernel functions in a program file: This uses more memory but removes the context switches and stack operations associated with regular function calls.
  10. Avoid branch miss penalties by coding conditional statements to be true more often than false: Processors often optimise for the true case and this can result in a penalty for the false case known as the branch miss penalty. Code your conditionals to evaluate to true as often as possible.

It’s easy to list rules for optimising kernels but the truth is that optimising kernels is hard; very hard. The best way to approach it is, in my opinion, to profile your application and kernels as well as to experiment with various changes. There will certainly be optimisations that you’ll apply that will turn out to be ineffectual or even slower so trial and error is key.

2 thoughts on “OpenCL Cookbook: 10 tips for high performance kernels”

Leave a Reply