Tag Archives: performance pattern

Performance Pattern: Multiplication is ~20x faster than Math.pow()

Last time, in the performance pattern series, I wrote about how a bitwise equivalent implementation of the modulo operator was a significantly faster alternative to the modulo operator (around 5x faster). This time, I look at a possible ~20x speedup simply using multiplication in place of the Math.pow() operator. As a reference implementation I use the cumulative distribution function from the Black Scholes algorithm (and the Greeks) because it uses the Math.pow() function quite a bit. First – let’s look at the pow based implementation.

Cumulative distribution function – Math.pow() based implementation

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.pow;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingPow {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = B1 * pow(t, 1);
        double t2 = B2 * pow(t, 2);
        double t3 = B3 * pow(t, 3);
        double t4 = B4 * pow(t, 4);
        double t5 = B5 * pow(t, 5);
        double b = t1 + t2 + t3 + t4 + t5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * pow(x, 2));
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}

As you can see there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

Cumulative distribution function – Multiplication based implementation

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingMult {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = t;
        double t2 = t * t;
        double t3 = t2 * t;
        double t4 = t3 * t;
        double t5 = t4 * t;
        double b1 = B1 * t1;
        double b2 = B2 * t2;
        double b3 = B3 * t3;
        double b4 = B4 * t4;
        double b5 = B5 * t5;
        double b = b1 + b2 + b3 + b4 + b5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * x * x);
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}

Benchmark – Math.pow() versus Multiplication

Benchmarks are generally speaking remarkably difficult to do correctly. Here I’ve done a best effort implementation. The first thing this test does is run both implementations for 10m iterations to forcibly enable JIT. It normally takes around 10k iterations for JIT to come on but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts ranging from 10k to 10m each time increasing iteration count by a factor of 10. For each iteration count it stores the results but does not print output until the very end to not involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

package name.dhruba.black.scholes;

import java.util.Random;

import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult;
import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow;

import org.junit.Test;

public class TestCumulativeDistributionCalculator {

    @Test
    public void test() {

        Random r = new Random();

        // 10m iteration warmup to ensure jit is underway
        for (int i = 0; i < Math.pow(10, 7); i++) {
            double d = r.nextDouble();
            CumulativeDistributionUsingPow.cumulativeDistribution(d);
            CumulativeDistributionUsingMult.cumulativeDistribution(d);
        }

        // execute both for a variety of iterations and store results
        long[][] times = new long[4][4];
        for (int i = 0; i < 4; i++) {
            int iterations = (int) Math.pow(10, i + 4);
            times[i] = testHelper(iterations, r);
        }

        // computations complete; print times
        for (int i = 0; i < times.length; i++) {
            System.out.format("Over %1$9s iterations pow took %2$5s ms and "
                    + "mult took %3$4s ms; mult was %4$3s faster.n",
                    times[i][0], times[i][1], times[i][2], times[i][3]);
        }

    }

    private long[] testHelper(int iterations, Random r) {

        // generate data for given iterations
        double[] data = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            data[i] = r.nextDouble();
        }

        // benchmark pow for given iterations
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingPow.cumulativeDistribution(data[i]);
        }
        t1 = System.currentTimeMillis() - t1;

        // benchmark multiplication for given iterations
        long t2 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingMult.cumulativeDistribution(data[i]);
        }
        t2 = System.currentTimeMillis() - t2;

        // save times taken
        return new long[] { iterations, t1, t2, t1 / t2 };

    }

}

Running times and speedup factors

Running the above benchmark test prints the following output.

Over     10000 iterations pow took     8 ms and mult took    2 ms; mult was   4 faster.
Over    100000 iterations pow took    82 ms and mult took    4 ms; mult was  20 faster.
Over   1000000 iterations pow took   794 ms and mult took   36 ms; mult was  22 faster.
Over  10000000 iterations pow took  9116 ms and mult took  422 ms; mult was  21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20 faster than equivalent Math.pow() operations and this, I believe, is universal. It will be the same for C++ and probably other languages too. And this is because Math.pow() deals with the generic case of raising a number to any given power. The lookup involved in computing the result (via a native call in Java) is what takes additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers I don’t want to hear any of it. If you’re writing high latency code such as webapps or data access code this approach of remaining oblivious to performance may be acceptable but low latency requires an opposite mindset. In low latency you have to preempt implementations with faster ones and you have to question everything you see. It’s a dark art and one that takes a lifetime to learn and here I try to share with you whatever I learn one tip at a time.

Performance pattern: Modulo and powers of two

The modulo operator is rare but does occur in certain vital use-cases in Java programming. I’ve been seeing it a lot in striping, segmenting, pipelining and circularity use cases lately. The normal and naive implementation is as below.

    public static int modulo(int x, int y) {
        return x % y;
    }

Recently I saw the following pipelining logic in quite a few places in the codebase on a fast runtime path. This essentially takes an incoming message and uses the following to resolve a queue to enqueue the message onto for later dequeuing by the next stage of the workflow.

int chosenPipeline = input.hashCode() % numberOfPipelines

It is little known, however, that on most hardware the division operation and as a result the modulo operation can actually be quite expensive. You’ll notice that for example modulo is never used in hash functions for example. Have you ever asked why? The reason is that there is a far quicker alternative: bitwise AND. This does involve making a small compromise however in the inputs to the original problem. If we are willing to always supply y in the above method as a power of two we can do the following instead.

    public static int moduloPowerOfTwo(int x, int powerOfTwoY) {
        return x & (powerOfTwoY - 1);
    }

This is dramatically quicker. To give you some stats which you should be asking for at this point see the table below.

Iterations Modulo (ms) PowerOfTwoModulo (ms)
10*4*32 5 1
10*5*32 24 4
10*6*32 237 54
10*7*32 2348 549
10*8*32 30829 5504
10*9*32 320755 54947

You might be thinking at this point – if I’m expecting a power of two I should validate the input to make sure it is so. Well, that’s one viewpoint. The other is, if you’re supplying y or if it is statically configured at startup then you can make sure it is a power of two without taking the performance hit of a runtime check. But if you really want to check here’s how to do so.

    public static boolean isPowerOfTwo(int i) {
        return (i & (i - 1)) == 0;
    }

So the next time you’re writing something that falls into one of the above use cases or any other for the modulo operator and your method needs to be fast at runtime for one reason or another consider the faster alternative. Certainly for price streaming (which is what I’m doing) latency matters! It would be interesting to check whether the java compiler actually makes this optimisation by substitution for you automatically. If so one can stick with the slower alternative for better readability.

The intelligent reader might say that in any such typical modulo use case the use of a bounded data structure and the resulting contention will far outweigh the costs of the modulo operation and the reader would be right in saying so but that’s another problem space entirely that I intend to explore separately. In short – there’s no need to be limited by a bounded data structure 🙂