Doug Lea has announced that fork join has been updated and that these updates are available through jsr166y. If you’re not familiar with FJ give jsr166y (fork-join) and extra166y (parallel array) a try on Java 6 or above. They’re pretty cool and the next natural step up in Java concurrency.
Jsr166e is to Java8 as Jsr166y was to Java7. Jsr166y introduced the fork join framework and Phaser to Java7 which are worthy of blog posts of their own. The fork join framework will enable us to introduce fine grained inversion of concurrency whereby we can code logic without really needing to think about or implement how that logic will perform on arbitrary hardware.
Now that Java 7 has been released jsr166e has emerged as a repository of utilities that are intended for inclusion into Java 8 next year. Having followed the discussion on the concurrency mailing list I’ve become absolutely fascinated by the work going on in jsr166e for the simple reason that it is catering for use cases that have been directly relevant to my recent work. So without further delay here is an abridged guide to the current state of jsr166e.
- ConcurrentHashMapV8: A candidate replacement for java.util.concurrent.ConcurrentHashMap with lower memory footprint. The exact improvements of this over the old implementation I’m yet to explore. Here’s the mail thread announcement and discussion.
- ConcurrentHashMapV8.MappingFunction: A (well overdue) mechanism for automatically computing a value for a key that doesn’t already have one. I’ve waited a long time for this and in my opinion this is the most basic requirement of a concurrent map as without this you always end up locking to create a new mapping.
- LongAdderTable: A concurrent counting map where a key is associated with an efficient concurrent primitive counter. This provides significantly improved performance over AtomicLong under high contention as it utilises striping across multiple values. I’ve desperately needed this in my job and I am overjoyed that this has finally been written by the experts group. I’ve been exploring and coding up various implementations of my own of such a class recently but I’d rather have this provided by the JDK. Again a very basic requirement and a class that’s well overdue.
- ReadMostlyVector: Same as Vector but with reduced contention and better throughput for concurrent reads. I’m a little surprised about this one. Does anyone even use Vector anymore? Why not just replace the underlying implementations of Hashtable and Vector with more performant ones? Is there any backward compatibility constraint that’s restricting this?
The following adders are essentially high performance concurrent primitive counters that dynamically adapt to growing contention to reduce it. The key value add here is achieved by utilising striping across values on writes and acting across the stripes for read.
Again, high performance primitive counters, are something I’ve desperately needed in my work lately. Imagine if you are implementing client server protocols. You may need message sequence numbers to ensure you can discard out of order/older messages. You might also need request response id correlation for which id generation is necessary. For any such id generation I wanted to use primitive longs for efficiency and as a result needed a high performance primitive long counter and now I have one!
Important: It’s important to note one limitation of these counting APIs. There are no compound methods like incrementAndGet() or addAndGet() which significantly reduces the utility of such API. I can see why this is the case: although the writes can stripe across values the read must act across all striped values and as a result is quite expensive. I therefore need to think about how much this will compromise the use of this API for the use case of an efficient id generator.
- DoubleAdder: A high performance concurrent primitive double counter.
- LongAdder: A high performance concurrent primitive long counter.
The following exhibit similar performance characteristics to the adders above but instead of maintaining a count or sum they maintain a maximum value. These also use striped values for writes and reading across striped values to compute aggregate values.
- DoubleMaxUpdater: A high performance primitive double maximum value maintainer.
- LongMaxUpdater: A high performance primitive long maximum value maintainer.
- SequenceLock: Finally, jsr166e adds an additional synchronisation utility. This is an interesting class which took me two or three reviews of the javadoc example to understand its value add. Essentially it offers the ability to conduct a more accommodating conversation between you and the lock provider whereby you can not only choose not to lock and still retain consistent visibility but also fundamentally allow you to detect when other threads have been active simultaneously with your logic thereby allowing you to retry your behaviour until your read of any state is completely consistent at that moment in time. I can see what value this adds and how to use it but I need to think about real world use cases for this utility.
What is still missing?
Sadly, despite the above, Java shows no signs of addressing a number of other real world use cases of mine.
- Concurrent primitive key maps
- Concurrent primitive value maps
- Concurrent primitive key value maps
- Externalised (inverted) striping utilities that allow you to hash an incoming key to a particular lock across a distribution of locks. This means that you no longer have to lock entire collections but just the lock relevant to the input you are working with. This is absolutely fundamental and essential in my opinion and has already been written by EhCache for their own use but this should ideally be provided as a building block by the JDK.
- There’s also been a lot of talk about core-striping as opposed to lock striping which I suppose is an interesting need. In other words instead of the distribution of contention being across lock instances they are across representations (IDs) of physical processor cores. Check the mailing list for details.
I’m very excited indeed by the incorporations of jsr166e not only because they have directly addressed a number of my real world use cases but also because they give an early peek at what’s to come in Java 8. The additional support for primitives is welcome as they will eliminate reliance on the ghastly autoboxing and gc churn of primitive wrappers. I’ll certainly be using these utilities for my own purposes. Keep up the great work! However, I’d love to hear why the above use cases under ‘What’s missing’ still haven’t seen any activity in Java.
With the JDK 7 developer preview out and a final release fast approaching it’s important to not only to become aware of what the new version offers but also, in certain areas where existing programming paradigms have radically changed, to make a mental shift in the way we think and understand how we can leverage these new paradigms best to our advantage. One such area is that of finding and exploiting areas of latent parallelism using a coarse grained parallelism approach.
As I mentioned in my previous post about the JDK7 developer preview being released – we’ve been using jsr166y and extra166y at work for some time now and this post really stems from an impassioned discussion that took place on finding and exploiting areas of latent parallelism in code so here’s what I have to say on the matter (inspired obviously by Doug Lea, Brian Goetz and my esteemed colleagues). The traditional and very much outdated mindset has only understood threads and ever since java 5 the executor framework on top. However this mechanism is fundamentally limited in its design in the extent of parallelism it can offer.
Firstly threads are expensive not only in their creation and stack size allocation but also in terms of context switching between them. Deciding on how many threads to have is also always at best an educated guess. A particular service within a process may decide to use all available cores but if every service in the process does the same then you have a disproportionately large number of threads and I have worked with applications with more than 150-200 threads operating at a time. Secondly, the executor framework has helped considerably in taking away some of the decision making from the developer and absorbing that complexity but it still suffers from heavy contention from multiple threads on the internal queue of tasks that it holds again adversely impacting performance. Thirdly, threads and executor frameworks normally do not scale up or down based on the hardware that they’re running on and certainly do not scale based on load. Their performance is very much constant by way of their underlying design.
Enter the fork join framework and parallel arrays. This is not a paragraph about how to use these new features but, in my opinion, a far more important note on how to rid ourselves of a legacy mindset on parallelism and make room for a new one. The fork join framework and parallel arrays (which are backed by the fork join framework and fork join pool internally) should not be perceived only as threading tools. That’s very dangerous as it means that we are only likely to use them in those areas where we previously used threads. They can in fact help us find and exploit areas of latent parallelism.
What does that mean? In all applications there are areas of code that operate sequentially. This code may be thread confined or stack confined and we almost never reconsider the way they perform. With FJ/PA we can now start making these areas concurrent. How is this an improvement? Well FJ/PA offer the following key features which makes them an ideal fit for such a use case.
Firstly, they are fundamentally decoupled from the number of threads in the way they add value which is a good thing. They tend to perform well regardless of how many threads they are using. Secondly, instead of using a single work queue for all threads they use one work queue per thread. This means further decoupling between threads and the way tasks are stored. Thirdly, given multiple work queues and multiple threads, FJ/PA perform work stealing. Every queue is a double ended queue and when one thread has completed all its tasks it then starts to process the tasks from the tail of another queue and because it is dequeuing off the tail there is no contention on the head of the queue from which the owner of the queue is dequeuing. Not only that but the largest tasks are placed towards the end of queues so that when another thread does steal work off another queue it gets enough work to effectively reduce the interval at which it steals again thereby again reducing contention. And finally, and most importantly, given a piece of FJ/PA code it will not only scale up but effectively scale down based not only on the hardware it runs but also on the load of the incoming work. When you understand this new paradigm suddenly the legacy paradigm seems so primitive and fundamentally stunted.
So the next time you are browsing your code consider using jsr166y and extra166y to find and exploit latent areas of parallelism. Generally the rule of thumb should be that this approach works best for operations that are cpu intensive and the legacy paradigm is better for io or network bound operations for obvious reasons. If operations are io or network bound there is less contention and the limitations of the legacy paradigm are less exposed. Don’t forget that the two libraries above can be used in java 6 so there’s no need to wait for java 7!