Tag Archives: speed

Performance Pattern: Multiplication is ~20x faster than Math.pow()

Last time, in the performance pattern series, I wrote about how a bitwise equivalent implementation of the modulo operator was a significantly faster alternative to the modulo operator (around 5x faster). This time, I look at a possible ~20x speedup simply using multiplication in place of the Math.pow() operator. As a reference implementation I use the cumulative distribution function from the Black Scholes algorithm (and the Greeks) because it uses the Math.pow() function quite a bit. First – let’s look at the pow based implementation.

Cumulative distribution function – Math.pow() based implementation

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.pow;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingPow {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = B1 * pow(t, 1);
        double t2 = B2 * pow(t, 2);
        double t3 = B3 * pow(t, 3);
        double t4 = B4 * pow(t, 4);
        double t5 = B5 * pow(t, 5);
        double b = t1 + t2 + t3 + t4 + t5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * pow(x, 2));
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}

As you can see there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

Cumulative distribution function – Multiplication based implementation

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingMult {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = t;
        double t2 = t * t;
        double t3 = t2 * t;
        double t4 = t3 * t;
        double t5 = t4 * t;
        double b1 = B1 * t1;
        double b2 = B2 * t2;
        double b3 = B3 * t3;
        double b4 = B4 * t4;
        double b5 = B5 * t5;
        double b = b1 + b2 + b3 + b4 + b5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * x * x);
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}

Benchmark – Math.pow() versus Multiplication

Benchmarks are generally speaking remarkably difficult to do correctly. Here I’ve done a best effort implementation. The first thing this test does is run both implementations for 10m iterations to forcibly enable JIT. It normally takes around 10k iterations for JIT to come on but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts ranging from 10k to 10m each time increasing iteration count by a factor of 10. For each iteration count it stores the results but does not print output until the very end to not involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

package name.dhruba.black.scholes;

import java.util.Random;

import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult;
import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow;

import org.junit.Test;

public class TestCumulativeDistributionCalculator {

    @Test
    public void test() {

        Random r = new Random();

        // 10m iteration warmup to ensure jit is underway
        for (int i = 0; i < Math.pow(10, 7); i++) {
            double d = r.nextDouble();
            CumulativeDistributionUsingPow.cumulativeDistribution(d);
            CumulativeDistributionUsingMult.cumulativeDistribution(d);
        }

        // execute both for a variety of iterations and store results
        long[][] times = new long[4][4];
        for (int i = 0; i < 4; i++) {
            int iterations = (int) Math.pow(10, i + 4);
            times[i] = testHelper(iterations, r);
        }

        // computations complete; print times
        for (int i = 0; i < times.length; i++) {
            System.out.format("Over %1$9s iterations pow took %2$5s ms and "
                    + "mult took %3$4s ms; mult was %4$3s faster.n",
                    times[i][0], times[i][1], times[i][2], times[i][3]);
        }

    }

    private long[] testHelper(int iterations, Random r) {

        // generate data for given iterations
        double[] data = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            data[i] = r.nextDouble();
        }

        // benchmark pow for given iterations
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingPow.cumulativeDistribution(data[i]);
        }
        t1 = System.currentTimeMillis() - t1;

        // benchmark multiplication for given iterations
        long t2 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingMult.cumulativeDistribution(data[i]);
        }
        t2 = System.currentTimeMillis() - t2;

        // save times taken
        return new long[] { iterations, t1, t2, t1 / t2 };

    }

}

Running times and speedup factors

Running the above benchmark test prints the following output.

Over     10000 iterations pow took     8 ms and mult took    2 ms; mult was   4 faster.
Over    100000 iterations pow took    82 ms and mult took    4 ms; mult was  20 faster.
Over   1000000 iterations pow took   794 ms and mult took   36 ms; mult was  22 faster.
Over  10000000 iterations pow took  9116 ms and mult took  422 ms; mult was  21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20 faster than equivalent Math.pow() operations and this, I believe, is universal. It will be the same for C++ and probably other languages too. And this is because Math.pow() deals with the generic case of raising a number to any given power. The lookup involved in computing the result (via a native call in Java) is what takes additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers I don’t want to hear any of it. If you’re writing high latency code such as webapps or data access code this approach of remaining oblivious to performance may be acceptable but low latency requires an opposite mindset. In low latency you have to preempt implementations with faster ones and you have to question everything you see. It’s a dark art and one that takes a lifetime to learn and here I try to share with you whatever I learn one tip at a time.

Presentation: Development at the Speed and Scale of Google

Since I’ve never had the good fortune of being able to afford QCon (one day this will change) I appreciate the fact that InfoQ post QCon videos online for free albeit late. Recently I watched ‘Development at the Speed and Scale of Google‘.

Prior to watching this presentation I knew only what I had encountered in the wider industry and really could not have foreseen any of what I was about to watch. The tools that I use on a daily basis and the difficulties that impede me now both seem primitive and outdated in comparison to the progress that Google has made. The key point on the subject matter of this presentation is that it is not about development but what makes it possible to develop at the speed and scale of google: in this case – build and release engineering.

Highlights from the talk that I found worthy of note are listed below.

  • Working on build and engineering tools requires strong computer science skills and as such the best people.
  • We cannot improve what we cannot measure. Measure everything. This, in my opinion, is a fantastic quote. This stops a team going off on open ended endeavours that yield either intangible or no results.
  • Compute intensive IDE functions have been migrated to the cloud such as generating and searching indexes for cross referencing types across a large codebase.
  • The codebase required for building and running tests is generally larger than that which is worked upon but delivering the entire codebase to every developer either in source or in binary form would kill the network. Here – a fuse daemon detects when surrounding code is required using a fuse (user space) filesystem and retrieves it incrementally on demand.
  • For similar reasons to the above point – they’ve developed a virtual filesystem under Eclipse and contributed it back. The obvious benefit is that directly importing a large code base into Eclipse kills it whereas incremental loads perform.
  • They build off source and not binaries and maintain an extremely stable trunk from which they release. If you imagine that all code is in a single repository (in fact the largest Perforce repository in the world) then it really puts into perspective the achievement of using only trunk.
  • The designated owners for a given project who review code have at their fingertips all the intelligence metadata on the code to assist them in the reviewing process. If you think about it that makes a lot of sense. To review you need more than just the code to spend your time effectively. You may want the output of introspection, test runs etc.
  • Compilations are distributed and parallelised in the cloud and output is aggressively cached. It’s fascinating to hear a case study where this has actually been implemented. I’ve often considered remote compilations but never come across a concrete implementation until now.

The importance of build and release engineering is often underestimated. It is often portrayed and perceived as an area of work that’s second class in nature and rather unglamorous. However, as this talk attests, it is very much the contrary. It can massively boost developer and organisational productivity and efficiency and requires the best people. I’ll end with a quote from the presenter: “Every developer worth their salt has worked on a build system at least once”.