Tag Archives: performance

OpenCL Cookbook: 10 tips for high performance kernels

Today we don our performance hat in the OpenCL Cookbook series. OpenCL kernel optimisation is a black art and very, very hard. Recently, when I came across a list of 10 tips for high performance kernels in Matthew Scarpino’s excellent book, OpenCL in Action, I just had to share it as it was a true gem. Bear in mind that this is an abridged account – if you want more detail read the appropriate section in the book.

  1. Unroll loops: Comparison operations are expensive – if you know how many iterations you need simply perform them individually.
  2. Disable processing of denormalised numbers: The processing of denormalised numbers takes time. Disable them using the -cl-denorms-are-zero option. You can also disable processing of infinite values and NaNs using -cl-finite-math-only.
  3. Transfer constant primitive values to the kernel with compiler defines instead of private memory parameters. Example: clBuildProgram(program, 0, NULL, "-DSIZE=128", NULL, NULL);.
  4. Store small variable values in private memory instead of local memory: Private memory is faster and should be used unless you need to share data between work items in which case use local memory.
  5. Avoid local memory bank conflicts by accessing local memory sequentially: Successive 32-bit elements are stored in successive local memory banks. Sequential bank access is parallel whereas contending on same bank access is serial.
  6. Avoid using modulo operator: This operator is slow and alternatives should be used instead.
  7. Reuse private variables throughout the kernel but use macros to distinguish each separate use.
  8. For multiply-and-add operations use the fma function if available.
  9. Inline non-kernel functions in a program file: This uses more memory but removes the context switches and stack operations associated with regular function calls.
  10. Avoid branch miss penalties by coding conditional statements to be true more often than false: Processors often optimise for the true case and this can result in a penalty for the false case known as the branch miss penalty. Code your conditionals to evaluate to true as often as possible.

It’s easy to list rules for optimising kernels but the truth is that optimising kernels is hard; very hard. The best way to approach it is, in my opinion, to profile your application and kernels as well as to experiment with various changes. There will certainly be optimisations that you’ll apply that will turn out to be ineffectual or even slower so trial and error is key.

Performance Pattern: Multiplication is ~20x faster than Math.pow()

Last time, in the performance pattern series, I wrote about how a bitwise equivalent of the modulo operator was a significantly faster alternative (around 5x faster). This time, I look at a possible ~20x speedup simply by using multiplication in place of Math.pow(). As a reference implementation I use the cumulative distribution function from the Black-Scholes algorithm (and the Greeks) because it uses Math.pow() quite a bit. First, let’s look at the pow based implementation.

Cumulative distribution function – Math.pow() based implementation

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.pow;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingPow {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = B1 * pow(t, 1);
        double t2 = B2 * pow(t, 2);
        double t3 = B3 * pow(t, 3);
        double t4 = B4 * pow(t, 4);
        double t5 = B5 * pow(t, 5);
        double b = t1 + t2 + t3 + t4 + t5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * pow(x, 2));
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}

As you can see there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

Cumulative distribution function – Multiplication based implementation

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingMult {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = t;
        double t2 = t * t;
        double t3 = t2 * t;
        double t4 = t3 * t;
        double t5 = t4 * t;
        double b1 = B1 * t1;
        double b2 = B2 * t2;
        double b3 = B3 * t3;
        double b4 = B4 * t4;
        double b5 = B5 * t5;
        double b = b1 + b2 + b3 + b4 + b5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * x * x);
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}

Benchmark – Math.pow() versus Multiplication

Benchmarks are, generally speaking, remarkably difficult to do correctly. Here I’ve done a best effort implementation. The first thing this test does is run both implementations for 10m iterations to force JIT compilation. It normally takes around 10k invocations for JIT compilation to kick in but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts ranging from 10k to 10m, each time increasing the iteration count by a factor of 10. For each iteration count it stores the results but does not print any output until the very end, so as not to involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

package name.dhruba.black.scholes;

import java.util.Random;

import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult;
import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow;

import org.junit.Test;

public class TestCumulativeDistributionCalculator {

    @Test
    public void test() {

        Random r = new Random();

        // 10m iteration warmup to ensure jit is underway
        for (int i = 0; i < Math.pow(10, 7); i++) {
            double d = r.nextDouble();
            CumulativeDistributionUsingPow.cumulativeDistribution(d);
            CumulativeDistributionUsingMult.cumulativeDistribution(d);
        }

        // execute both for a variety of iterations and store results
        long[][] times = new long[4][4];
        for (int i = 0; i < 4; i++) {
            int iterations = (int) Math.pow(10, i + 4);
            times[i] = testHelper(iterations, r);
        }

        // computations complete; print times
        for (int i = 0; i < times.length; i++) {
            System.out.format("Over %1$9s iterations pow took %2$5s ms and "
                    + "mult took %3$4s ms; mult was %4$3s faster.n",
                    times[i][0], times[i][1], times[i][2], times[i][3]);
        }

    }

    private long[] testHelper(int iterations, Random r) {

        // generate data for given iterations
        double[] data = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            data[i] = r.nextDouble();
        }

        // benchmark pow for given iterations
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingPow.cumulativeDistribution(data[i]);
        }
        t1 = System.currentTimeMillis() - t1;

        // benchmark multiplication for given iterations
        long t2 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingMult.cumulativeDistribution(data[i]);
        }
        t2 = System.currentTimeMillis() - t2;

        // save times taken
        return new long[] { iterations, t1, t2, t1 / t2 };

    }

}

Running times and speedup factors

Running the above benchmark test prints the following output.

Over     10000 iterations pow took     8 ms and mult took    2 ms; mult was   4 faster.
Over    100000 iterations pow took    82 ms and mult took    4 ms; mult was  20 faster.
Over   1000000 iterations pow took   794 ms and mult took   36 ms; mult was  22 faster.
Over  10000000 iterations pow took  9116 ms and mult took  422 ms; mult was  21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20x faster than equivalent Math.pow() operations and this, I believe, is fairly universal. It will be similar for C++ and probably other languages too. This is because Math.pow() deals with the generic case of raising a number to any given power. The work involved in computing that general result (via a native call in Java) is what takes the additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers I don’t want to hear any of it. If you’re writing high latency code such as webapps or data access code this approach of remaining oblivious to performance may be acceptable, but low latency requires an opposite mindset. In low latency you have to preempt slow implementations with faster ones and you have to question everything you see. It’s a dark art and one that takes a lifetime to learn and here I try to share with you whatever I learn, one tip at a time.

jsr166e: Upcoming java.util.concurrent APIs for Java 8

Jsr166e is to Java 8 as jsr166y was to Java 7. Jsr166y introduced the fork/join framework and Phaser to Java 7, both of which are worthy of blog posts of their own. The fork/join framework enables a fine grained inversion of control over concurrency whereby we can code logic without really needing to think about, or implement, how that logic will perform on arbitrary hardware.

Now that Java 7 has been released jsr166e has emerged as a repository of utilities that are intended for inclusion into Java 8 next year. Having followed the discussion on the concurrency mailing list I’ve become absolutely fascinated by the work going on in jsr166e for the simple reason that it is catering for use cases that have been directly relevant to my recent work. So without further delay here is an abridged guide to the current state of jsr166e.

Collections

  • ConcurrentHashMapV8: A candidate replacement for java.util.concurrent.ConcurrentHashMap with a lower memory footprint. I’ve yet to explore the exact improvements of this over the old implementation. Here’s the mail thread announcement and discussion.
  • ConcurrentHashMapV8.MappingFunction: A (well overdue) mechanism for automatically computing a value for a key that doesn’t already have one. I’ve waited a long time for this and in my opinion this is the most basic requirement of a concurrent map as without this you always end up locking to create a new mapping.
  • LongAdderTable: A concurrent counting map where a key is associated with an efficient concurrent primitive counter. This provides significantly improved performance over AtomicLong under high contention as it utilises striping across multiple values. I’ve desperately needed this in my job and I am overjoyed that this has finally been written by the experts group. I’ve been exploring and coding up various implementations of my own of such a class recently but I’d rather have this provided by the JDK. Again a very basic requirement and a class that’s well overdue.
  • ReadMostlyVector: Same as Vector but with reduced contention and better throughput for concurrent reads. I’m a little surprised about this one. Does anyone even use Vector anymore? Why not just replace the underlying implementations of Hashtable and Vector with more performant ones? Is there any backward compatibility constraint that’s restricting this?

Adders

The following adders are essentially high performance concurrent primitive counters that dynamically adapt to growing contention to reduce it. The key value add here is achieved by striping writes across multiple values and aggregating across the stripes on read.

Again, high performance primitive counters, are something I’ve desperately needed in my work lately. Imagine if you are implementing client server protocols. You may need message sequence numbers to ensure you can discard out of order/older messages. You might also need request response id correlation for which id generation is necessary. For any such id generation I wanted to use primitive longs for efficiency and as a result needed a high performance primitive long counter and now I have one!

Important: It’s important to note one limitation of these counting APIs. There are no compound methods like incrementAndGet() or addAndGet(), which significantly reduces the utility of such an API. I can see why this is the case: although the writes can stripe across values, the read must act across all striped values and as a result is quite expensive. I therefore need to think about how much this will compromise the use of this API for the use case of an efficient id generator.

  • DoubleAdder: A high performance concurrent primitive double counter.
  • LongAdder: A high performance concurrent primitive long counter.
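To make the write/read asymmetry concrete, here is a minimal sketch of a LongAdder based request counter (shown with the java.util.concurrent.atomic package the class eventually landed in; under jsr166e the import differs, and the RequestMetrics class itself is purely illustrative):

import java.util.concurrent.atomic.LongAdder; // jsr166e.LongAdder before Java 8

public class RequestMetrics {

    // Writes stripe across internal cells under contention, so many threads
    // can increment cheaply without all contending on a single value.
    private final LongAdder requestCount = new LongAdder();

    // Illustrative request handler: called concurrently by many threads.
    public void onRequest() {
        requestCount.increment(); // cheap, contention-friendly write
    }

    // Reads sum across all the stripes, so they are comparatively expensive and
    // only weakly consistent with concurrent increments. Fine for periodic
    // monitoring, but not suitable for generating unique ids.
    public long snapshot() {
        return requestCount.sum();
    }

}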

MaxUpdaters

The following exhibit similar performance characteristics to the adders above but instead of maintaining a count or sum they maintain a maximum value. These also use striped values for writes and reading across striped values to compute aggregate values.

  • DoubleMaxUpdater: A high performance primitive double maximum value maintainer.
  • LongMaxUpdater: A high performance primitive long maximum value maintainer.

Synchronizers

  • SequenceLock: Finally, jsr166e adds an additional synchronisation utility. This is an interesting class which took me two or three reviews of the javadoc example to understand its value add. Essentially it offers a more accommodating conversation between you and the lock: you can choose not to lock and still retain consistent visibility, and you can detect when other threads have been active simultaneously with your logic, allowing you to retry your read until the state you observe is completely consistent at that moment in time. I can see what value this adds and how to use it but I need to think about real world use cases for this utility.

What is still missing?

Sadly, despite the above, Java shows no signs of addressing a number of other real world use cases of mine.

  • Concurrent primitive key maps
  • Concurrent primitive value maps
  • Concurrent primitive key value maps
  • Externalised (inverted) striping utilities that allow you to hash an incoming key to a particular lock across a distribution of locks. This means that you no longer have to lock entire collections but just the lock relevant to the input you are working with. This is absolutely fundamental and essential in my opinion and has already been written by EhCache for their own use but this should ideally be provided as a building block by the JDK.
  • There’s also been a lot of talk about core-striping as opposed to lock striping which I suppose is an interesting need. In other words instead of the distribution of contention being across lock instances they are across representations (IDs) of physical processor cores. Check the mailing list for details.

Summary

I’m very excited indeed by the additions in jsr166e, not only because they directly address a number of my real world use cases but also because they give an early peek at what’s to come in Java 8. The additional support for primitives is welcome as it will eliminate reliance on the ghastly autoboxing and GC churn of primitive wrappers. I’ll certainly be using these utilities for my own purposes. Keep up the great work! However, I’d love to hear why the use cases under ‘What is still missing?’ above still haven’t seen any activity in Java.

LMAX disruptor framework and whitepaper

This is really old news now as I’m very late in posting it but since I’m still coming across people who have remained blissfully unaware I thought this was worth re-iterating. If you haven’t come across this yet drop everything else and read about the LMAX Disruptor framework and the associated whitepaper titled Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads. There is also an associated (and rather dated now) infoq presentation titled How to Do 100K TPS at Less than 1ms Latency.

In the beginning there was a main thread of execution, then came two and then thousands. Once we had scaled threads to the point of starvation, along came SEDA and the concept of queues, hierarchical topologies of queues and lots of writers and readers operating on queues, with threads now relegated to second class citizen status. For a while the industry rested in the assurance that it had achieved equilibrium in its innovation on latency. Then – out of the blue – LMAX happened. LMAX (London Multi Asset eXchange) is the highest performance financial exchange in the world.

Read the whitepaper to find out just how outdated conventional wisdom on concurrent queuing in Java actually is, and how a lack of awareness of how your financial code performs end-to-end, from hardware to VM, could be creating bottlenecks for your platform. The essence of the disruptor framework is a strikingly simple concept but at the same time profound, not only in its effectiveness in attaining its goal – reducing latency – but also in the extent to which it leverages knowledge of the hardware and the Java virtual machine that it runs on.

It proves wrong beyond doubt the rather outdated mindset that questions employing Java for financial low latency use cases. Ever since Java 5, and particularly Java 6, the JVM has dwarfed the Java language in its importance, capabilities and scope, and as a result utilising Java is now fundamentally synonymous with utilising the JVM, which is what makes the language so compelling.

It isn’t about the code that you write. It’s about the code that’s interpreted and then runs natively. It is naive to consider only the language, as many seem to be doing in the light of the imminent release of Java 7. It’s important to bear in mind that whilst language sugar is important, if runtime performance matters to you then you’ll want to focus on: (1) the VM, (2) writing wholly non-idiomatic Java and (3) opposing conventional wisdom at every level of abstraction, every step of the way.

Concurrency pattern: Concurrent multimaps in Java

Preamble

Maintaining state is one of the most important ways in which we can increase the performance of our code. I don’t mean using a database or some goliath system somewhere. I mean local memory caches. Often they can be more than adequate to allow us to maintain state and provide a dramatic improvement in performance.

Before I start with the content of this post, let me state the obvious, or at least what should be obvious if it isn’t already: the whole domain of caching is an incredibly difficult and inexhaustible area of study and work, which is why dedicated distributed cache providers have sprung up all over the place and why companies normally resort to them in favour of in-memory caches.

However, there is often a combination of in-memory and distributed caches in use, and this post focuses on one aspect of in-memory caches – concurrent multimaps in Java. It is the resource that I wish I had had when I tackled this problem numerous times in the past. The post focuses exclusively on copy on write multimap implementations as that allows the read operation to be lock free, which can be a significant advantage depending on what you want to do on read.

Singular value caches

When an in-memory cache is desired one almost always resorts to some kind of map structure. If you’re storing singular key value pairs then creating a cache can be as easy as picking a map implementation, though you still have the check-then-act operation of checking whether a value exists, returning it if so, and otherwise populating it and then returning it, which can result in some blocking. Nonetheless, these problems have already been solved a thousand times over.

For example, Google Guava’s MapMaker provides an excellent implementation of the memoization pattern for a cache, as follows, which is probably the most complex case of a simple singular key value pair cache.

package name.dhruba.kb.concurrency.mapmaker;

import java.util.concurrent.ConcurrentMap;

import com.google.common.base.Function;
import com.google.common.collect.MapMaker;

public class CacheWithExpensiveValues<K, V> {

    private final ConcurrentMap<K, V> cache = new MapMaker().makeComputingMap(new Function<K, V>() {
        @Override
        public V apply(K input) {
            return acquireExpensiveCacheValue();
        }
    });

    public V get(K key) { return cache.get(key); }
    private V acquireExpensiveCacheValue() { return null; }

}

This implementation, similar in concept to the memoization pattern put forward by Brian Goetz in Java Concurrency In Practice, guarantees that the value for a given key is only acquired/resolved once during the lifetime of the cache, which can be very useful if creating/computing the value is an expensive call. Threads which request an uncomputed value while it is being computed wait until the computation already in progress has finished.

This can be said to be strongly consistent in its guarantees. If you are willing to compromise on the guarantees that a cache makes, making it a weakly consistent cache, you may be able to achieve faster performance in some cases by relying solely on atomic CAS operations.
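If weak consistency is acceptable, a minimal sketch (my own, not from Guava or the original post) of such a CAS-only cache might look like the following: reads stay lock free, but a value may occasionally be computed more than once under a race.

package name.dhruba.kb.concurrency.mapmaker;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical class name, purely for illustration.
public abstract class WeaklyConsistentCache<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<K, V>();

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            // Under a race two threads may both compute the value here;
            // putIfAbsent guarantees that only one result is ever published.
            V computed = acquireExpensiveCacheValue(key);
            V previous = cache.putIfAbsent(key, computed);
            value = previous != null ? previous : computed;
        }
        return value;
    }

    // Must return a non-null value; duplicated work is possible but harmless
    // as long as computing the value has no side effects.
    protected abstract V acquireExpensiveCacheValue(K key);

}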

Multi-value caches

A standard singular key value pair cache is fairly straightforward these days. But what happens if you suddenly realise that you actually need multiple values per key? There’s a little bit more to it than first meets the eye. If you are well informed about what’s out there then, as a knee jerk reaction, you might immediately think of Google Guava multimaps and create something similar to the example below.

package name.dhruba.kb.concurrency.guava;

import java.util.List;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ListMultimap;
import com.google.common.collect.Multimaps;

public class MultiMapCacheExample<K, V> {

    private final ListMultimap<K, V> cache = Multimaps.synchronizedListMultimap(ArrayListMultimap
            .<K, V> create());

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        return cache.removeAll(k);
    }

    public void put(K k, V v) {
        cache.put(k, v);
    }

    public boolean remove(K k, V v) {
        return cache.remove(k, v);
    }

}

However, the astute programmer soon realises the inadequacies of such a solution. The synchronised wrapper pattern here is very similar to that used in the JDK in that it synchronises every call in the interface. It also synchronises the entirety of each method, meaning that all paths through any given method will need to contend for and acquire a lock. To put it another way, no path of execution through any method is non-blocking.

As a result, this implementation is likely to perform very poorly under heavy concurrent load. There will be a lot of contention and the cache will only be able to serve one operation at a time. So where do we go from here? Googling didn’t bring much success on the subject of concurrent multimaps in Java when I was looking for what was out there already, so I decided to explore this area from first principles. Below I present the process of iteratively developing an efficient concurrent multimap, making it, over the course of a few implementations, as non-blocking as possible.

It’s interesting to read why Google Guava have not and will not implement a concurrent multimap though I’m not sure I agree. I think a couple of general purpose concurrent multimaps or at the very least a copy on write multimap would be of value to the public as I’ve seen this pattern quite a lot over the years. But admittedly it wouldn’t just be one implementation. It would need to support a range of backing collections.

Concurrent multimaps

In the following example implementations I present only the most important mutative calls in the map interface as they are the most challenging and the best calls for illustration. Bear in mind also when reading through the implementations the following design considerations.

  • Mutative versus an immutable copy on write approach
  • Size of critical sections and thereby the degree of blocking
  • Strongly consistent or weakly consistent in mutual exclusion guarantees
  • When removing the last value for a key should the multimap remove the key and associated empty collection?

Weakly consistent implementations are very common in the industry but are prone to interleaving. For example, if a put() is in progress and someone calls remove() on the key altogether then, after the remove has completed, the put() may put the key value association back in, which may not be desirable at all. Or the put may add to a value collection that is no longer referenced because the key has been removed. These methods should ideally be mutually exclusive, and the final implementation achieves this quality. Bear in mind though that for certain use cases weakly consistent guarantees are acceptable, and it is for you to say what is and isn’t acceptable for your use case.

Fully blocking multimap

The fully blocking implementation is equivalent to the synchronised wrapper approach because it synchronises the entirety of all the methods. This is without doubt the poorest performing implementation though on the plus side it has minimal allocation unlike the copy on write implementations that follow.

Advantages
  • Strongly consistent.
  • Doesn’t allocate any more than it needs to (unlike the copy on write pattern).
Disadvantages
  • Very poor performance.
  • Uses a HashMap, which isn’t thread safe in itself and offers no visibility guarantees; safety rests entirely on the method level synchronisation.
  • All calls – reads/writes are blocking.
  • All paths through the blocking calls are blocking.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FullyBlockingMutativeArrayListMultiMap<K, V> {

    private final Map<K, List<V>> cache = new HashMap<K, List<V>>();

    public synchronized List<V> get(K k) {
        return cache.get(k);
    }

    public synchronized List<V> remove(K k) {
        return cache.remove(k);
    }

    public synchronized void put(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            list = new ArrayList<V>();
            cache.put(k, list);
        }
        list.add(v);
    }

    public synchronized boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        if (list.isEmpty()) {
            cache.remove(k);
            return false;
        }
        boolean removed = list.remove(v);
        if (removed && list.isEmpty()) {
            cache.remove(k);
        }
        return removed;
    }

}

Copy on write multimap using synchronisation

This is an initial implementation of a copy on write approach but without using the jdk copy on write collections. It is strongly consistent but it still synchronises too much on writes.

Advantages
  • Strongly consistent.
  • Uses concurrent hash map so we can have non-blocking read.
Disadvantages
  • The synchronisation lock blocks on the entire cache.
  • The blocking calls are entirely blocking so all paths through them will block.
  • Concurrent hash map is blocking itself although at a fine grained level using stripes.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CopyOnWriteArrayListMultiMap<K, V> {

    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public synchronized List<V> remove(K k) {
        return cache.remove(k);
    }

    public synchronized void put(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null || list.isEmpty()) {
            list = new ArrayList<V>();
        } else {
            list = new ArrayList<V>(list);
        }
        list.add(v);
        cache.put(k, list);
    }

    public synchronized boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        if (list.isEmpty()) {
            cache.remove(k);
            return false;
        }
        boolean removed = list.remove(v);
        if (removed) {
            if (list.isEmpty()) {
                cache.remove(k);
            } else {
                list = new ArrayList<V>(list);
                cache.put(k, list);
            }
        }
        return removed;
    }

}

Copy on write multimap but using the JDK CopyOnWriteArrayList

Here we opt to use the copy on write array list from the JDK. There is no synchronisation in the class itself (only within the backing structures) but it is dangerously prone to interleaving and therefore weakly consistent. Personally I wouldn’t be happy about put() and remove() not being mutually exclusive and interleaving through each other. That to me would be unacceptable. Amazingly I’ve seen this implementation all too often at work.

Advantages
  • Uses ConcurrentHashMap for thread safety and visibility.
  • Uses CopyOnWriteArrayList for list thread safety and visibility.
  • No blocking in class itself. Instead the backing jdk classes handle blocking for us.
  • Blocking has been reduced to key level granularity instead of being at the cache level.
Disadvantages
  • Prone to interleaving. It is weakly consistent and does not guarantee mutually exclusive and atomic calls. The remove(K) call can interleave through the lines of the put method, and key value pairs can potentially be added back in if a remove(K) is called part way through the put(K,V) call. To be strongly consistent, remove(K) and put(K,V) need to be mutually exclusive.
package name.dhruba.kb.concurrency.multimap;

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class JdkCopyOnWriteArrayListMultiMap<K, V> {

    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        return cache.remove(k);
    }

    public void put(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            list = new CopyOnWriteArrayList<V>();
            List<V> oldList = cache.putIfAbsent(k, list);
            if (oldList != null) {
                list = oldList;
            }
        }
        list.add(v);
    }

    public boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        if (list.isEmpty()) {
            cache.remove(k);
            return false;
        }
        boolean removed = list.remove(v);
        if (removed && list.isEmpty()) {
            cache.remove(k);
        }
        return removed;
    }

}

Partially blocking copy on write multimap

So, from the previous implementation, we return to a strongly consistent implementation, but this time we block only on certain paths through the put() and remove() methods at the cost of a little additional allocation. However, the lock it uses is still a global one, which means that operations on different keys become sequential, which is obviously not desirable.

Advantages
  • Strongly consistent.
  • Use of ConcurrentHashMap for thread safety and visibility guarantees.
  • The get(K) and remove(K) calls don’t block at all in this class.
  • The put(K,V) and remove(K,V) methods do block but only for certain paths. There are paths through these methods which won’t block at all. The put(K,V) method only blocks if ConcurrentHashMap.putIfAbsent() fails and remove(K,V) only blocks if there is something there to remove.
Disadvantages
  • We allocate a list initially in put(K,V) which may not be needed.
  • ConcurrentHashMap still blocks although at a finer level using stripes.
  • The blocking synchronisation we are using is still blocking the entire cache. What we really want is to block only the keys that hash to the value bucket that we are currently working with. A more fine grained blocking strategy is called for, which we’ll see in the next implementation.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PartiallyBlockingCopyOnWriteArrayListMultiMap<K, V> {

    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        synchronized (cache) {
            return cache.remove(k);
        }
    }

    public void put(K k, V v) {
        List<V> list = Collections.singletonList(v);
        List<V> oldList = cache.putIfAbsent(k, list);
        if (oldList != null) {
            synchronized (cache) {
                list = cache.get(k);
                if (list == null || list.isEmpty()) {
                    list = new ArrayList<V>();
                } else {
                    list = new ArrayList<V>(list);
                }
                list.add(v);
                cache.put(k, list);
            }
        }
    }

    public boolean remove(K k, K v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        synchronized (cache) {
            list = cache.get(k);
            if (list == null) {
                return false;
            }
            if (list.isEmpty()) {
                cache.remove(k);
                return false;
            }
            boolean removed = list.remove(v);
            if (removed) {
                if (list.isEmpty()) {
                    cache.remove(k);
                } else {
                    list = new ArrayList<V>(list);
                    cache.put(k, list);
                }
            }
            return removed;
        }
    }

}

Striped lock copy on write multimap

The final implementation – strongly consistent, with a non-blocking backing structure, fine grained locking at key level and locking only on necessary paths. This example uses a striped lock provider. Its purpose is to take a key as input and provide a lock as output to lock on. It is consistent in that it always provides the same lock for the same key, guaranteeing the mutual exclusion that is necessary.

It takes the number of locks desired as a constructor input (by default 2048), which means we can decide how many locks we want to make available in the distribution, and it then consistently hashes keys across that distribution of locks. It also provides a better key distribution than a plain non-concurrent hash map. The concept behind the striped lock provider and its implementation is a very interesting topic and will form a new post of its own in the future! Stay tuned!
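The provider itself is left for that future post, but as a rough sketch of the idea, assuming a power-of-two lock count, the getLockForKey() signature used in the code below and ConcurrentHashMap style supplemental hashing, it could look something like this (the real IntrinsicStripedLockProvider may well differ):

package name.dhruba.kb.concurrency.striping;

public class IntrinsicStripedLockProvider {

    private final Object[] locks;

    public IntrinsicStripedLockProvider() {
        this(2048); // default lock count mentioned above
    }

    public IntrinsicStripedLockProvider(int numLocks) {
        // assumes a power of two so a cheap bit mask can replace modulo
        locks = new Object[numLocks];
        for (int i = 0; i < locks.length; i++) {
            locks[i] = new Object();
        }
    }

    public Object getLockForKey(Object key) {
        // spread the hash code so that poor hashCode() implementations
        // still map evenly across the stripes
        int h = key.hashCode();
        h ^= (h >>> 20) ^ (h >>> 12);
        h ^= (h >>> 7) ^ (h >>> 4);
        // the same key always yields the same lock, giving per-key mutual exclusion
        return locks[h & (locks.length - 1)];
    }

}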

Advantages
  • Strongly consistent. Implements correct mutual exclusion of calls.
  • Uses NonBlockingHashMap instead of ConcurrentHashMap so the backing cache member does not block at all. Far more efficient and scalable than ConcurrentHashMap.
  • The read calls are completely non-blocking even at the cache structure level.
  • The put(K,V) and remove(K,V) methods do block but only for certain paths. There are paths through these methods which won’t block at all. The put(K,V) method only blocks if NonBlockingHashMap.putIfAbsent() fails and remove(K,V) only blocks if there is something there to remove.
  • And to save the best for last – there is no longer any blocking at the cache level. We now apply mutual exclusion only at the key level.

This implementation has the best of all worlds really as long as the copy on write approach is acceptable to you.

Disadvantages
  • Fundamentally being a copy on write approach it does more allocation than a mutative approach.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import name.dhruba.kb.concurrency.striping.IntrinsicStripedLockProvider;

import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class StripedLockArrayListMultiMap<K, V> {

    private final IntrinsicStripedLockProvider stripedLockProvider = new IntrinsicStripedLockProvider();
    private final ConcurrentMap<K, List<V>> cache = new NonBlockingHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        Object lock = stripedLockProvider.getLockForKey(k);
        synchronized (lock) {
            return cache.remove(k);
        }
    }

    public void put(K k, V v) {
        List<V> list = Collections.singletonList(v);
        List<V> oldList = cache.putIfAbsent(k, list);
        if (oldList != null) {
            Object lock = stripedLockProvider.getLockForKey(k);
            synchronized (lock) {
                list = cache.get(k);
                if (list == null || list.isEmpty()) {
                    list = new ArrayList<V>();
                } else {
                    list = new ArrayList<V>(list);
                }
                list.add(v);
                cache.put(k, list);
            }
        }
    }

    public boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        Object lock = stripedLockProvider.getLockForKey(k);
        synchronized (lock) {
            list = cache.get(k);
            if (list == null) {
                return false;
            }
            if (list.isEmpty()) {
                cache.remove(k);
                return false;
            }
            boolean removed = list.remove(v);
            if (removed) {
                if (list.isEmpty()) {
                    cache.remove(k);
                } else {
                    list = new ArrayList<V>(list);
                    cache.put(k, list);
                }
            }
            return removed;
        }
    }

}

Conclusion

When designing concurrent structures it is important not to resort blindly to what’s already out there; for custom concurrent data structures there will be no ready made solution. For those cases concurrent patterns such as this are invaluable, and practices such as reducing critical sections, restricting blocking to the paths that need it, reducing the granularity of the locks being used and selecting the right backing structures are absolutely key to an efficient concurrent data structure. If you have any feedback on how to do better than the above, or if I’ve made any mistakes, please do let me know. Enjoy and thanks for reading.

Links

FYI – it seems Joe Kearney has done an alternative implementation that does not rely on copy on write.

Update [05/07/2011]: Code updated with bug fixes for edge cases.
Update [17/04/2013]: After a long wait I finally fixed the bug that Charlie reported below. Thanks Charlie!

Java SE 6 1.6.0_18 update worthy of note

Compared to other Java 6 releases, the release of Java SE 6 update 18 (1.6.0_18) is definitely worthy of note.

Major changes

  • VisualVM 1.2
  • Performance improvements
  • 20% faster jar file creation
  • Java Hotspot VM 16.0
    • Improved NUMA-aware allocation giving a 30% (for 32-bit) to 40% (for 64-bit) increase in performance (which is also due to arrive in JDK 7)
    • Garbage collection improvements including new default java heap configuration in client and server vms
    • Classloading optimisations for faster startup
    • Code generation improvements including elision of needless conversions between integer primitive types and optimisations of common string concatenation patterns
    • Garbage First (G1) garbage collector improved reliability and performance
    • New options to request a heap dump or class histogram before or after a full GC
    • Escape analysis based optimisation which appeared in 6u14 has been disabled in 6u18 to be restored in a future update

Minor changes

  • Support for Windows 7
  • JavaDB 10.5.3.0
  • Application startup improvements
  • Runtime performance improvements for UI applications
  • Ability to read larger zip files of size up to 4GB

As the community eagerly awaits Java 7 in Q4 of 2010, releases such as this one serve to provide a little excitement in the meantime. It’s interesting to see that certain features that are due to arrive with Java 7, such as NUMA-aware allocation, the G1 collector and escape analysis, are also being offered in Java 6. Escape analysis has been disabled in this release but no reason has been given – it would be interesting to know why.

Spring AOP timing aspect

At some point or another everyone writes a timing aspect. It’s simple yet fun. Here’s mine, with a small difference: there is an annotation called @Timed which you can apply at class level or method level, and accordingly the matching invocations are timed and the output printed to stdout. Although it is intended to be illustrative to a degree, it hopefully offers some utility value also.

Summary

The following aspect acts on annotated classes and methods and times invocations, producing output to stdout (which can easily be changed to a logger). In the background it uses Spring AOP driven by JDK proxies or CGLIB proxies, depending on what the proxy-target-class attribute is set to in the Spring XML.
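As a rough outline of the kind of aspect described, here is a minimal sketch of my own (not the code from the post), assuming AspectJ-style @Around advice, a runtime-retained @Timed marker annotation and illustrative package and class names:

package name.dhruba.timing; // illustrative package name

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Marker annotation: apply at class or method level to switch timing on.
@Retention(RetentionPolicy.RUNTIME)
@Target({ ElementType.TYPE, ElementType.METHOD })
@interface Timed {
}

// The aspect itself: registered as a Spring bean alongside <aop:aspectj-autoproxy/>
// (with proxy-target-class="true" for CGLIB proxies) in the Spring XML.
@Aspect
public class TimingAspect {

    @Around("@annotation(name.dhruba.timing.Timed) || @within(name.dhruba.timing.Timed)")
    public Object time(ProceedingJoinPoint joinPoint) throws Throwable {
        long start = System.nanoTime();
        try {
            return joinPoint.proceed();
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1000000;
            // Replace with a logger if stdout is not desirable.
            System.out.println(joinPoint.getSignature() + " took " + elapsedMillis + " ms");
        }
    }

}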


JUnit3, JUnit4 and Spring testing tips and trivia

If you’re in a legacy environment (as I am for the moment) with only the old version of junit at your disposal the following may be useful.

JUnit3 Multiple Instantiation

import junit.framework.TestCase;

public class Junit3MultipleInstancesTest extends TestCase {

    public Junit3MultipleInstancesTest() {
        System.out.println("test instance created");
    }

    public void test1() {
        System.out.println("test1");
    }

    public void test2() {
        System.out.println("test2");
    }
    
}

What would you expect the above to print? Answer below.

test instance created
test instance created
test1
test2

JUnit4 Multiple Instantiation

import org.junit.Test;

public class Junit4MultipleInstancesTest {

    public Junit4MultipleInstancesTest() {
        System.out.println("test instance created");
    }

    @Test
    public void test1() {
        System.out.println("test1");
    }

    @Test
    public void test2() {
        System.out.println("test2");
    }

}

Output:

test instance created
test1
test instance created
test2

Would you want your test class to be instantiated on every test? Are you satisfied with the justification that it is done for complete test isolation between test methods? Does this make it impossible to maintain state between test methods without resorting to static context?

JUnit3 one time setup and one time teardown

Want to declare your one-time setup and one-time teardown in one place and reuse it for all your tests? You can.

Base class
import junit.framework.TestCase;

public class Junit3TestSuperClass extends TestCase {

    protected static void setUpBeforeClass() {
        System.out.println("set up before class");
    }

    protected static void tearDownAfterClass() {
        System.out.println("tear down after class");
    }

}
Child class
import junit.extensions.TestSetup;
import junit.framework.Test;
import junit.framework.TestSuite;

public class Junit3TestSubClass extends Junit3TestSuperClass {

    public static Test suite() {

        TestSuite suite = new TestSuite(Junit3TestSubClass.class);

        return new TestSetup(suite) {

            protected void setUp() {
                setUpBeforeClass();
            }

            protected void tearDown() {
                tearDownAfterClass();
            }
        };

    }

    public void test1() {
        System.out.println("test1");
    }

    public void test2() {
        System.out.println("test2");
    }

}

The subclass, when run, prints the following.

set up before class
test1
test2
tear down after class

Yes – the additional infrastructural suite() method in the subclass is nasty but then again JUnit3 makes no other provisions. This is made a hell of a lot nicer in JUnit4, which allows the @BeforeClass and @AfterClass annotations.

JUnit4 one time setup and one time teardown

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class Junit4Test {

    @BeforeClass
    public static void setUpBeforeClass() throws Exception {
        System.out.println("set up before class");
    }

    @AfterClass
    public static void tearDownAfterClass() throws Exception {
        System.out.println("tear down after class");
    }

    @Test
    public void test1() {
        System.out.println("test1");
    }

    @Test
    public void test2() {
        System.out.println("test2");
    }

}

Performance optimisation

When testing one might want to reduce overhead and run tests quickly. This may be because you have a lot of tests or because you are waiting on them repeatedly to verify that your code is working correctly as per the laws of test-driven development.

For unit tests, time, overhead and waiting are not normally a problem as they are pretty succinct, quick and free of other dependencies, i.e. they are isolated. However, integration tests need a whole other approach as they are end-to-end and often rely on environmental and infrastructural tools, and at times human intervention, to work satisfactorily.

Take the rather common use-case of a Spring integration test. Spring have gone out of their way to provide a fully featured test framework and numerous mock classes to make testing quick and easy. However, if certain considerations are overlooked your integration test suite can become significantly slower than it needs to be. Adhering to the following helps.

  • Use the Spring test framework and paradigms. Don’t reinvent the wheel. You will almost certainly end up with a more inferior solution. If there’s a quirk of your application which prevents you from using a Spring based testing architecture for your spring application sort it out. It’s often the least time consuming option and aids better long term maintainability.
  • Use the context cache (see the sketch after this list). Load the context once and run many integration tests. Spring 2.0.x and 2.5.x have native support for this out of the box. However, if you are loading the context manually you can still cache it using the one-time setup and teardown paradigm above. If your tests are modifying the context, or in Spring’s terminology dirtying the context, then Spring has support for reloading it after designated dirty actions.
  • Use test context files. In this way heavyweight beans can be replaced with lighter and simpler alternatives if necessary. Mocks can be introduced and a more custom aggregation of context files can be achieved. Test contexts are not a “bad thing” – in fact I would argue that they are a “good thing” and a necessity depending on the application. Ultimately tests are not secondary – they are part of your application – if done correctly then almost half of your application. You want them to be maintainable and efficient.
  • Use stubs and mocks for when you cannot rely on your environment or for better isolation. A stub is a dummy class which implements the same theoretical interface api as the original class but does nothing. A mock is like a dynamic stub which can be told to adapt to a particular class interface and can be told what input to receive and how to react to it. Mocks can make slow end-to-end integration tests run almost instantly unless of course you really do need end-to-end testing.
  • Use default lazy initialisation/loading of spring beans so that you only load the bean dependencies of the beans you require and are testing. If certain specific beans like scheduled jobs are not referred to by any other bean and need to be eagerly loaded at context startup then explicitly mark them as eagerly loaded. This is the same mentality as in hibernate.
  • Designate different types of tests in different areas. Layer them by responsibility. Data layer tests, business layer tests, service layer tests and controller tests all have different responsibilities and should be separated. The advantages are that you can run each responsibility separately but also it hugely improves structuring and maintainability. If using a build tool or continuous integration you can also ask these tools to run specific test groups at given time intervals or to skip certain problematic or slow test groups.
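To make the context cache point concrete, here is a minimal sketch of a Spring TestContext framework based integration test, assuming a hypothetical test-application-context.xml on the classpath; any number of test classes declaring the same locations share one cached context, and @DirtiesContext marks tests that force a reload.

import static org.junit.Assert.assertNotNull;

import org.junit.Test;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.AbstractJUnit4SpringContextTests;

// The context declared here is loaded once and cached by the test framework;
// every test class declaring the same locations reuses it rather than paying
// the container startup cost again. The context file name is hypothetical.
@ContextConfiguration(locations = { "classpath:test-application-context.xml" })
public class CachedContextIntegrationTest extends AbstractJUnit4SpringContextTests {

    @Test
    public void contextIsLoadedFromTheCache() {
        // applicationContext is provided by the superclass
        assertNotNull(applicationContext);
    }

    @Test
    @DirtiesContext // marks the context dirty so it is reloaded for subsequent tests
    public void testThatModifiesTheContext() {
        // e.g. replace or reconfigure a bean here (illustrative only)
    }

}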

Essential test attributes

Incidentally, InfoQ, had an excellent article on testing recently. Here’s what I extracted to be the key attributes of a good test from my point of view.

  • Automated
  • Environment independent (to extent possible)
  • Easy to run
  • Quick to run
  • Provides regression testing capability
  • Can be run repeatedly
  • Demonstrates use of the API through example
  • Order independent

Ultimately one has to be realistic. True TDD, i.e. tests before code, is difficult to achieve in a hard driven commercial environment and most people don’t expect it nor expect to be doing it. However, tests should at least be written after the task is done. After all, without that, how do we know anything works at any moment in time?

Recently I’ve overheard people question what benefits JUnit4 and the Spring 2.5 test context framework provide over the old ways of doing things. My short answer would be: huge benefits. If I get time I’ll do a summary post on this topic. For now here are JUnit4’s and Spring’s official answers.