jsr166e: Upcoming java.util.concurrent APIs for Java 8

Jsr166e is to Java 8 what jsr166y was to Java 7. Jsr166y introduced the fork/join framework and the Phaser to Java 7, both of which are worthy of blog posts of their own. The fork/join framework enables a fine grained inversion of concurrency whereby we can code our logic without really needing to think about, or implement, how that logic will perform on arbitrary hardware.

Now that Java 7 has been released, jsr166e has emerged as the repository of utilities intended for inclusion into Java 8 next year. Having followed the discussion on the concurrency-interest mailing list, I’ve become absolutely fascinated by the work going on in jsr166e, for the simple reason that it caters for use cases that have been directly relevant to my recent work. So, without further delay, here is an abridged guide to the current state of jsr166e.

Collections

  • ConcurrentHashMapV8: A candidate replacement for java.util.concurrent.ConcurrentHashMap with a lower memory footprint; the exact improvements over the old implementation I’ve yet to explore. Here’s the mail thread announcement and discussion.
  • ConcurrentHashMapV8.MappingFunction: A (well overdue) mechanism for atomically computing a value for a key that doesn’t already have one (sketched after this list). I’ve waited a long time for this and, in my opinion, it is the most basic requirement of a concurrent map: without it you always end up locking to create a new mapping.
  • LongAdderTable: A concurrent counting map where each key is associated with an efficient concurrent primitive counter (also sketched after this list). This provides significantly improved performance over AtomicLong under high contention because it stripes the count across multiple values. I’ve desperately needed this in my job and I am overjoyed that the expert group has finally written it. I’ve been exploring and coding up various implementations of such a class myself recently, but I’d rather have it provided by the JDK. Again, a very basic requirement and a class that’s well overdue.
  • ReadMostlyVector: Same as Vector but with reduced contention and better throughput for concurrent reads. I’m a little surprised by this one. Does anyone even use Vector any more? Why not just replace the underlying implementations of Hashtable and Vector with more performant ones? Is there a backward compatibility constraint restricting this?
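
To make the mapping function and the counting table concrete, here’s a brief usage sketch. The names here (the jsr166e package, computeIfAbsent, MappingFunction.map, increment, sum) are as I read them in the current jsr166e sources and may well change before anything lands in Java 8.

import java.util.ArrayList;
import java.util.List;

import jsr166e.ConcurrentHashMapV8;
import jsr166e.LongAdderTable;

public class Jsr166eSketches {

    // compute a value for an absent key atomically, without external locking
    private final ConcurrentHashMapV8<String, List<String>> multiValues =
            new ConcurrentHashMapV8<String, List<String>>();

    public List<String> valuesFor(String key) {
        return multiValues.computeIfAbsent(key,
                new ConcurrentHashMapV8.MappingFunction<String, List<String>>() {
                    @Override
                    public List<String> map(String k) {
                        return new ArrayList<String>(); // computed at most once per key
                    }
                });
    }

    // a striped concurrent counter per key: increments stay cheap under high
    // contention; sums aggregate across the stripes
    private final LongAdderTable<String> messageCounts = new LongAdderTable<String>();

    public void onMessage(String topic) {
        messageCounts.increment(topic);
    }

    public long messagesSeen(String topic) {
        return messageCounts.sum(topic);
    }

}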

Adders

The following adders are essentially high performance concurrent primitive counters that dynamically adapt to growing contention in order to reduce it. The key value add is achieved by striping writes across multiple values and aggregating across the stripes on read.

Again, high performance primitive counters are something I’ve desperately needed in my work lately. Imagine that you are implementing client server protocols. You may need message sequence numbers so that you can discard out of order or older messages. You might also need to correlate request and response ids, which requires id generation. For any such id generation I wanted to use primitive longs for efficiency and, as a result, needed a high performance primitive long counter. Now I have one!

Important: note one limitation of these counting APIs. There are no compound methods like incrementAndGet() or addAndGet(), which significantly reduces their utility. I can see why this is the case: although writes can stripe across values, a read must aggregate across all striped values and is therefore quite expensive. I need to think about how much this compromises the use of this API as an efficient id generator. A usage sketch follows the list below.

  • DoubleAdder: A high performance concurrent primitive double counter.
  • LongAdder: A high performance concurrent primitive long counter.
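
Here’s a quick sketch of how LongAdder fits the message counting use case above. Note, per the limitation just mentioned, that sum() is a separate, relatively expensive read and is not atomic with respect to concurrent increment() calls.

import jsr166e.LongAdder;

public class MessageStats {

    private final LongAdder received = new LongAdder();

    // called concurrently by many threads; writes stripe under contention
    public void onMessage() {
        received.increment();
    }

    // aggregates across all stripes; more expensive than the write and only
    // weakly consistent with concurrent increments
    public long receivedSoFar() {
        return received.sum();
    }

}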

MaxUpdaters

The following exhibit similar performance characteristics to the adders above but, instead of maintaining a count or sum, they maintain a maximum value. These also stripe values on writes and read across the stripes to compute the aggregate. A sketch follows the list.

  • DoubleMaxUpdater: A high performance primitive double maximum value maintainer.
  • LongMaxUpdater: A high performance primitive long maximum value maintainer.
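
As a sketch, tracking the largest message size seen so far might look as follows (update() and max() being the method names in the current snapshot).

import jsr166e.LongMaxUpdater;

public class MaxMessageSize {

    private final LongMaxUpdater maxSize = new LongMaxUpdater();

    // writers race to publish candidate maximums across the stripes
    public void onMessage(long sizeInBytes) {
        maxSize.update(sizeInBytes);
    }

    // reads compute the maximum across all stripes
    public long largestSeen() {
        return maxSize.max();
    }

}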

Synchronizers

  • SequenceLock: Finally, jsr166e adds an additional synchronisation utility. This is an interesting class that took me two or three readings of the javadoc example to understand. Essentially it offers a more accommodating conversation between you and the lock provider: not only can you choose not to lock and still retain consistent visibility, but you can also detect when other threads have been active simultaneously with your logic, allowing you to retry until your read of the state is fully consistent at that moment in time. I can see what value this adds and how to use it, but I need to think about real world use cases for this utility. My reading of the optimistic read pattern is sketched below.
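
This is adapted from the javadoc example as I understand it; awaitAvailability() and getSequence() are the method names in the current jsr166e snapshot.

import jsr166e.SequenceLock;

public class Point {

    private final SequenceLock lock = new SequenceLock();

    // volatile so optimistic readers see writes without locking
    private volatile double x, y;

    // writer: a conventional exclusive lock
    public void move(double deltaX, double deltaY) {
        lock.lock();
        try {
            x += deltaX;
            y += deltaY;
        } finally {
            lock.unlock();
        }
    }

    // optimistic reader: never blocks writers; retries if the sequence number
    // has moved on, i.e. if a writer was simultaneously active
    public double distanceFromOrigin() {
        double currentX, currentY;
        long seq;
        do {
            seq = lock.awaitAvailability();
            currentX = x;
            currentY = y;
        } while (lock.getSequence() != seq);
        return Math.sqrt(currentX * currentX + currentY * currentY);
    }

}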

What is still missing?

Sadly, despite the above, Java shows no signs of addressing a number of other real world use cases of mine.

  • Concurrent primitive key maps
  • Concurrent primitive value maps
  • Concurrent primitive key value maps
  • Externalised (inverted) striping utilities that allow you to hash an incoming key to a particular lock across a distribution of locks. This means you no longer have to lock entire collections, only the lock relevant to the input you are working with. This is absolutely fundamental and essential in my opinion. EhCache has already written one for its own use, but ideally the JDK should provide it as a building block.
  • There’s also been a lot of talk about core striping as opposed to lock striping, which is an interesting idea: instead of distributing contention across lock instances, it is distributed across representations (ids) of physical processor cores. Check the mailing list for details.

Summary

I’m very excited indeed by what’s going into jsr166e, not only because it directly addresses a number of my real world use cases but also because it gives an early peek at what’s to come in Java 8. The additional support for primitives is welcome as it eliminates reliance on ghastly autoboxing and the gc churn of primitive wrappers. I’ll certainly be using these utilities for my own purposes. Keep up the great work! However, I’d love to hear why the use cases under ‘What is still missing?’ have yet to see any activity in Java.

Concurrency pattern: Concurrent multimaps in Java

Preamble

Maintaining state is one of the most important ways in which we can increase the performance of our code. I don’t mean using a database or some goliath system somewhere; I mean local in-memory caches. Often they can be more than adequate to maintain state and provide a dramatic improvement in performance.

Before I start with the content of this post, let me state the obvious, or at least what should be obvious if it isn’t already: caching as a whole is an incredibly difficult and inexhaustible area of study and work, which is why dedicated distributed cache providers have sprung up all over the place and why companies normally resort to them in favour of in-memory caches.

Often, however, a combination of in-memory and distributed caches is in use, and this post focuses on one aspect of in-memory caching: concurrent multimaps in Java. It is the resource I wish I had had when tackling this problem numerous times in the past. The post focuses exclusively on copy on write multimap implementations, as that allows the read operation to be lock free, which can be a significant advantage depending on what you want to do on read.

Singular value caches

When an in-memory cache is desired one always resorts to some kind of map structure. If you’re storing singular key value pairs then creating a cache can be as easy as picking a map implementation. You still have the check-then-act operation of checking whether a value exists and, if so, returning it, otherwise populating and then returning it, which can result in some blocking. Nonetheless, these problems have already been solved a thousand times over.

For example, Google Guava’s MapMaker provides an excellent implementation of the memoization pattern for a cache, as follows. This is probably the most complex case of a simple singular key value pair cache.

package name.dhruba.kb.concurrency.mapmaker;

import java.util.concurrent.ConcurrentMap;

import com.google.common.base.Function;
import com.google.common.collect.MapMaker;

public class CacheWithExpensiveValues<K, V> {

    // computing map: on a miss the function computes the value exactly once;
    // other threads requesting the same key block until it is ready
    private final ConcurrentMap<K, V> cache = new MapMaker().makeComputingMap(new Function<K, V>() {
        @Override
        public V apply(K input) {
            return acquireExpensiveCacheValue();
        }
    });

    public V get(K key) { return cache.get(key); }
    private V acquireExpensiveCacheValue() { return null; }

}

This implementation, similar in concept to the memoization pattern put forward by Brian Goetz in Java Concurrency in Practice, guarantees that the value for a given key is acquired/resolved only once during the lifetime of the cache, which can be very useful if creating/computing the value is an expensive call. Threads that request an uncomputed value while it is being computed wait until the computation already in progress has finished.

This can be said to be strongly consistent in its guarantees. If you are willing to compromise on the guarantees the cache makes, making it weakly consistent, you may be able to achieve faster performance in some cases by relying solely on atomic CAS operations, as in the sketch below.
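
For illustration, here is a minimal weakly consistent variant using only putIfAbsent (the acquireExpensiveCacheValue() stub mirrors the example above). The value may be computed more than once under a race, but only one computation ever wins and no locks are held.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class WeaklyConsistentCache<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<K, V>();

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            // may run concurrently for the same key: weakly consistent
            value = acquireExpensiveCacheValue();
            V existing = cache.putIfAbsent(key, value);
            if (existing != null) {
                value = existing; // another thread won the race; discard ours
            }
        }
        return value;
    }

    private V acquireExpensiveCacheValue() { return null; }

}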

Multi-value caches

A standard singular key value pair cache is fairly straightforward these days. But what happens if you suddenly realise that you actually need multiple values per key? There’s a little bit more to it than first meets the eye. If you are well informed about what’s out there, your knee jerk reaction might be to reach for Google Guava’s multimaps and create something similar to the example below.

package name.dhruba.kb.concurrency.guava;

import java.util.List;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ListMultimap;
import com.google.common.collect.Multimaps;

public class MultiMapCacheExample<K, V> {

    private final ListMultimap<K, V> cache = Multimaps.synchronizedListMultimap(ArrayListMultimap
            .<K, V> create());

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        return cache.removeAll(k);
    }

    public void put(K k, V v) {
        cache.put(k, v);
    }

    public boolean remove(K k, V v) {
        return cache.remove(k, v);
    }

}

However, the astute programmer soon realises the inadequacies of such a solution. The synchronised wrapper pattern here is very similar to that used in the JDK: it synchronises every call in the interface, and it synchronises each method in its entirety, meaning that all paths through any given method must contend for and acquire a lock. To put it another way, no path of execution through any method is non-blocking.

As a result, this implementation is likely to perform very poorly under heavy concurrent load. There will be a lot of contention and the cache will only be able to serve one operation at a time. So where do we go from here? Googling didn’t bring much success on the subject of concurrent multimaps in Java when I looked for what was out there already, so I decided to explore this area from first principles. Below I present the process of iteratively developing an efficient concurrent multimap implementation over the course of a few implementations, eventually making it as non-blocking as possible.

It’s interesting to read why Google Guava has not and will not implement a concurrent multimap, though I’m not sure I agree. I think a couple of general purpose concurrent multimaps, or at the very least a copy on write multimap, would be of value to the public, as I’ve seen this pattern quite a lot over the years. Admittedly it wouldn’t be just one implementation; it would need to support a range of backing collections.

Concurrent multimaps

In the following example implementations I present only the most important mutative calls in the map interface, as they are the most challenging and the best calls for illustration. Bear in mind the following design considerations when reading through the implementations.

  • Mutative versus an immutable copy on write approach
  • Size of critical sections and thereby the degree of blocking
  • Strongly consistent or weakly consistent in mutual exclusion guarantees
  • When removing the last value for a key, should the multimap remove the key and its associated, now empty, collection?

Weakly consistent implementations are very common in the industry but are prone to interleaving. For example, if a put() is in progress and someone calls remove() on the key altogether, then the put() may put the key value association back in after the remove has been invoked, which may not be desirable at all. Or the put may add to a value collection that is no longer referenced because the key has been removed. Ideally these methods should be mutually exclusive, and the final implementation achieves this quality. Bear in mind, though, that for certain use cases weakly consistent guarantees are acceptable; it is for you to say what is and isn’t acceptable for your use case. The trace below makes the interleaving concrete.
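
One problematic schedule through a weakly consistent put()/remove() pair might look like this (a hypothetical trace, not code from any implementation below):

// Thread A: put(k, v)                      Thread B: remove(k)
//
// A: list = cache.get(k);                  (A sees the existing list for k)
//                                          B: cache.remove(k);  (k fully removed)
// A: list.add(v);                          (A adds to a list nobody references)
// A: cache.put(k, list);                   (A resurrects k after its removal)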

Fully blocking multimap

The fully blocking implementation is equivalent to the synchronised wrapper approach because it synchronises all methods in their entirety. This is without doubt the poorest performing implementation, though on the plus side it has minimal allocation, unlike the copy on write implementations that follow.

Advantages
  • Strongly consistent.
  • Doesn’t allocate any more than it needs to (unlike the copy on write pattern).
Disadvantages
  • Very poor performance.
  • Uses a HashMap, which isn’t thread safe by itself, so safety and visibility depend entirely on every access going through the synchronised methods.
  • All calls – reads/writes are blocking.
  • All paths through the blocking calls are blocking.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FullyBlockingMutativeArrayListMultiMap<K, V> {

    private final Map<K, List<V>> cache = new HashMap<K, List<V>>();

    public synchronized List<V> get(K k) {
        return cache.get(k);
    }

    public synchronized List<V> remove(K k) {
        return cache.remove(k);
    }

    public synchronized void put(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            list = new ArrayList<V>();
            cache.put(k, list);
        }
        list.add(v);
    }

    public synchronized boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        if (list.isEmpty()) {
            cache.remove(k);
            return false;
        }
        boolean removed = list.remove(v);
        if (removed && list.isEmpty()) {
            cache.remove(k);
        }
        return removed;
    }

}

Copy on write multimap using synchronisation

This is an initial implementation of the copy on write approach, but without using the JDK’s copy on write collections. It is strongly consistent but it still synchronises too much on writes.

Advantages
  • Strongly consistent.
  • Uses ConcurrentHashMap so reads can be non-blocking.
Disadvantages
  • The synchronisation lock blocks on the entire cache.
  • The blocking calls are entirely blocking so all paths through them will block.
  • ConcurrentHashMap itself blocks, although at a fine grained level using stripes.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CopyOnWriteArrayListMultiMap<K, V> {

    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public synchronized List<V> remove(K k) {
        return cache.remove(k);
    }

    public synchronized void put(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null || list.isEmpty()) {
            list = new ArrayList<V>();
        } else {
            list = new ArrayList<V>(list);
        }
        list.add(v);
        cache.put(k, list);
    }

    public synchronized boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        if (list.isEmpty()) {
            cache.remove(k);
            return false;
        }
        // copy first so that concurrent readers of the current list never
        // observe an in-place mutation
        List<V> copy = new ArrayList<V>(list);
        boolean removed = copy.remove(v);
        if (removed) {
            if (copy.isEmpty()) {
                cache.remove(k);
            } else {
                cache.put(k, copy);
            }
        }
        return removed;
    }

}

Copy on write multimap but using the JDK CopyOnWriteArrayList

Here we opt to use the copy on write array list from the JDK. There is no synchronisation in the class itself (only within the backing structures) but it is dangerously prone to interleaving and therefore weakly consistent. Personally I wouldn’t be happy with put() and remove() not being mutually exclusive and interleaving through each other; to me that would be unacceptable. Amazingly, I’ve seen this implementation all too often at work.

Advantages
  • Uses {@link ConcurrentHashMap} for thread safety and visibility.
  • Uses {@link CopyOnWriteArrayList} for list thread safety and visibility.
  • No blocking in the class itself; the backing JDK classes handle any blocking for us.
  • Blocking has been reduced to key level granularity instead of being at the cache level.
Disadvantages
  • Prone to interleaving. It is weakly consistent and does not guarantee mutually exclusive and atomic calls. The {@link #remove(K)} call can interleave through the lines of the {@link #put(K,V)} method, and key value pairs can potentially be added back in if a {@link #remove(K)} is called part way through a {@link #put(K,V)} call. To be strongly consistent, {@link #remove(K)} and {@link #put(K,V)} need to be mutually exclusive.
package name.dhruba.kb.concurrency.multimap;

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class JdkCopyOnWriteArrayListMultiMap<K, V> {

    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        return cache.remove(k);
    }

    public void put(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            list = new CopyOnWriteArrayList<V>();
            List<V> oldList = cache.putIfAbsent(k, list);
            if (oldList != null) {
                list = oldList;
            }
        }
        list.add(v);
    }

    public boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        if (list.isEmpty()) {
            cache.remove(k);
            return false;
        }
        boolean removed = list.remove(v);
        if (removed && list.isEmpty()) {
            cache.remove(k);
        }
        return removed;
    }

}

Partially blocking copy on write multimap

So, from the previous implementation, we return to a strongly consistent implementation, but this time we block only on certain paths through the put() and remove() methods, at the cost of a little additional allocation. However, the lock used is still a global one, which means that operations on different keys become sequential, which is obviously not desirable.

Advantages
  • Strongly consistent.
  • Use of {@link ConcurrentHashMap} for thread safety and visibility guarantees.
  • The {@link #get(Object)} and {@link #remove(Object)} calls don’t block at all in this class.
  • The {@link #put(Object, Object)} and {@link #remove(Object, Object)} methods do block but only for certain paths. There are paths through these methods which won’t block at all. The {@link #put(Object, Object)} method only blocks if the {@link ConcurrentHashMap#putIfAbsent(Object, Object)} fails and the {@link #remove(Object, Object)} only blocks if there is something there to remove.
Disadvantages
  • We allocate a list initially in the {@link #put(Object, Object)} which may not be needed.
  • {@link ConcurrentHashMap} still blocks although at a finer level using stripes.
  • The blocking synchronisation we are using still blocks the entire cache. What we really want is to block only the keys that hash to the value bucket we are currently working with. A more fine grained blocking strategy is called for, which we’ll see in the next implementation.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PartiallyBlockingCopyOnWriteArrayListMultiMap<K, V> {

    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        synchronized (cache) {
            return cache.remove(k);
        }
    }

    public void put(K k, V v) {
        List<V> list = Collections.singletonList(v);
        List<V> oldList = cache.putIfAbsent(k, list);
        if (oldList != null) {
            synchronized (cache) {
                list = cache.get(k);
                if (list == null || list.isEmpty()) {
                    list = new ArrayList<V>();
                } else {
                    list = new ArrayList<V>(list);
                }
                list.add(v);
                cache.put(k, list);
            }
        }
    }

    public boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        synchronized (cache) {
            list = cache.get(k);
            if (list == null) {
                return false;
            }
            if (list.isEmpty()) {
                cache.remove(k);
                return false;
            }
            // copy first: the cached list may be the immutable singleton from
            // put() and, either way, readers must never observe an in-place
            // mutation of the published list
            List<V> copy = new ArrayList<V>(list);
            boolean removed = copy.remove(v);
            if (removed) {
                if (copy.isEmpty()) {
                    cache.remove(k);
                } else {
                    cache.put(k, copy);
                }
            }
            return removed;
        }
    }

}

Striped lock copy on write multimap

The final implementation: strongly consistent, with a non-blocking backing structure, fine grained locking at key level, and locking only on the paths that need it. This example uses a striped lock provider. Its purpose is to take a key as input and provide a lock as output to synchronise on. It is consistent in that it always provides the same lock for the same key, guaranteeing the mutual exclusion that is necessary.

It takes the number of locks desired as a constructor input (2048 by default), which means we can decide how many locks we want to make available in the distribution. It then consistently hashes keys across that distribution of locks, and it provides a better key distribution than the non-concurrent hash maps do. The concept behind the striped lock provider and its implementation is a very interesting topic which will form a new post of its own in the future. Stay tuned! In the meantime, a minimal sketch of the idea follows.
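
The real IntrinsicStripedLockProvider is the subject of that future post; purely to make the example below self-contained, here is my minimal sketch of the idea. The spreading function and the power-of-two sizing are my assumptions, not necessarily the actual implementation.

package name.dhruba.kb.concurrency.striping;

public class IntrinsicStripedLockProvider {

    private final Object[] locks;

    public IntrinsicStripedLockProvider() {
        this(2048);
    }

    public IntrinsicStripedLockProvider(int numLocks) {
        // round up to a power of two so we can mask instead of mod
        int size = 1;
        while (size < numLocks) {
            size <<= 1;
        }
        locks = new Object[size];
        for (int i = 0; i < size; i++) {
            locks[i] = new Object();
        }
    }

    // same key, same lock, always: this is what guarantees mutual exclusion
    public Object getLockForKey(Object key) {
        int hash = spread(key.hashCode());
        return locks[hash & (locks.length - 1)];
    }

    // supplemental hash (as in the JDK's HashMap) to defend against poor
    // hashCode implementations clustering on a few stripes
    private static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

}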

Advantages
  • Strongly consistent. Implements correct mutual exclusion of calls.
  • Uses {@link NonBlockingHashMap} instead of {@link ConcurrentHashMap} so the backing cache member does not block at all. Far more efficient and scalable than {@link ConcurrentHashMap}.
  • The read calls are completely non-blocking even at the cache structure level.
  • The {@link #put(Object, Object)} and {@link #remove(Object, Object)} methods do block but only for certain paths. There are paths through these methods which won’t block at all. The {@link #put(Object, Object)} method only blocks if the {@link NonBlockingHashMap#putIfAbsent(Object, Object)} fails and the {@link #remove(Object, Object)} only blocks if there is something there to remove.
  • And, to save the best for last: there is no longer any blocking at the cache level. We now apply mutual exclusion only at the key level.

This implementation really has the best of all worlds, as long as the copy on write approach is acceptable to you.

Disadvantages
  • Fundamentally being a copy on write approach it does more allocation than a mutative approach.
package name.dhruba.kb.concurrency.multimap;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentMap;

import name.dhruba.kb.concurrency.striping.IntrinsicStripedLockProvider;

import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class StripedLockArrayListMultiMap<K, V> {

    private final IntrinsicStripedLockProvider stripedLockProvider = new IntrinsicStripedLockProvider();
    private final ConcurrentMap<K, List<V>> cache = new NonBlockingHashMap<K, List<V>>();

    public List<V> get(K k) {
        return cache.get(k);
    }

    public List<V> remove(K k) {
        Object lock = stripedLockProvider.getLockForKey(k);
        synchronized (lock) {
            return cache.remove(k);
        }
    }

    public void put(K k, V v) {
        List<V> list = Collections.singletonList(v);
        List<V> oldList = cache.putIfAbsent(k, list);
        if (oldList != null) {
            Object lock = stripedLockProvider.getLockForKey(k);
            synchronized (lock) {
                list = cache.get(k);
                if (list == null || list.isEmpty()) {
                    list = new ArrayList<V>();
                } else {
                    list = new ArrayList<V>(list);
                }
                list.add(v);
                cache.put(k, list);
            }
        }
    }

    public boolean remove(K k, V v) {
        List<V> list = cache.get(k);
        if (list == null) {
            return false;
        }
        Object lock = stripedLockProvider.getLockForKey(k);
        synchronized (lock) {
            list = cache.get(k);
            if (list == null) {
                return false;
            }
            if (list.isEmpty()) {
                cache.remove(k);
                return false;
            }
            // copy first: the cached list may be the immutable singleton from
            // put() and, either way, readers must never observe an in-place
            // mutation of the published list
            List<V> copy = new ArrayList<V>(list);
            boolean removed = copy.remove(v);
            if (removed) {
                if (copy.isEmpty()) {
                    cache.remove(k);
                } else {
                    cache.put(k, copy);
                }
            }
            return removed;
        }
    }

}

Conclusion

When designing concurrent structures it is important not to resort blindly to what’s out there; for custom concurrent data structures there will be no ready made solution. In those cases concurrency patterns such as this one are invaluable, and best practices are absolutely key to an efficient concurrent data structure: reduce critical sections, restrict blocking to the paths that need it, reduce the granularity of the locks being used, and select the right backing structures. If you have any feedback on how to do better than the above, or if I’ve made any mistakes, please do let me know. Enjoy and thanks for reading.

Links

FYI – it seems Joe Kearney has done an alternative implementation that does not rely on copy on write.

Update [05/07/2011]: Code updated with bug fixes for edge cases.
Update [17/04/2013]: After a long wait I finally fixed the bug that Charlie reported below. Thanks Charlie!