Last time, in the performance pattern series, I wrote about how a bitwise equivalent implementation of the modulo operator was a significantly faster alternative to the modulo operator (around 5x faster). This time, I look at a possible ~20x speedup simply using multiplication in place of the Math.pow() operator. As a reference implementation I use the cumulative distribution function from the Black Scholes algorithm (and the Greeks) because it uses the Math.pow() function quite a bit. First – let’s look at the pow based implementation.

### Cumulative distribution function – Math.pow() based implementation

package name.dhruba.black.scholes.utils; import static java.lang.Math.PI; import static java.lang.Math.abs; import static java.lang.Math.exp; import static java.lang.Math.pow; import static java.lang.Math.sqrt; public enum CumulativeDistributionUsingPow { _; private static final double P = 0.2316419; private static final double B1 = 0.319381530; private static final double B2 = -0.356563782; private static final double B3 = 1.781477937; private static final double B4 = -1.821255978; private static final double B5 = 1.330274429; public static double cumulativeDistribution(double x) { double t = 1 / (1 + P * abs(x)); double t1 = B1 * pow(t, 1); double t2 = B2 * pow(t, 2); double t3 = B3 * pow(t, 3); double t4 = B4 * pow(t, 4); double t5 = B5 * pow(t, 5); double b = t1 + t2 + t3 + t4 + t5; double cd = 1 - standardNormalDistribution(x) * b; return x < 0 ? 1 - cd : cd; } public static double standardNormalDistribution(double x) { double top = exp(-0.5 * pow(x, 2)); double bottom = sqrt(2 * PI); return top / bottom; } }

As you can see there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

### Cumulative distribution function – Multiplication based implementation

package name.dhruba.black.scholes.utils; import static java.lang.Math.PI; import static java.lang.Math.abs; import static java.lang.Math.exp; import static java.lang.Math.sqrt; public enum CumulativeDistributionUsingMult { _; private static final double P = 0.2316419; private static final double B1 = 0.319381530; private static final double B2 = -0.356563782; private static final double B3 = 1.781477937; private static final double B4 = -1.821255978; private static final double B5 = 1.330274429; public static double cumulativeDistribution(double x) { double t = 1 / (1 + P * abs(x)); double t1 = t; double t2 = t * t; double t3 = t2 * t; double t4 = t3 * t; double t5 = t4 * t; double b1 = B1 * t1; double b2 = B2 * t2; double b3 = B3 * t3; double b4 = B4 * t4; double b5 = B5 * t5; double b = b1 + b2 + b3 + b4 + b5; double cd = 1 - standardNormalDistribution(x) * b; return x < 0 ? 1 - cd : cd; } public static double standardNormalDistribution(double x) { double top = exp(-0.5 * x * x); double bottom = sqrt(2 * PI); return top / bottom; } }

### Benchmark – Math.pow() versus Multiplication

Benchmarks are generally speaking remarkably difficult to do correctly. Here I’ve done a best effort implementation. The first thing this test does is run both implementations for 10m iterations to forcibly enable JIT. It normally takes around 10k iterations for JIT to come on but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts ranging from 10k to 10m each time increasing iteration count by a factor of 10. For each iteration count it stores the results but does not print output until the very end to not involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

package name.dhruba.black.scholes; import java.util.Random; import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult; import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow; import org.junit.Test; public class TestCumulativeDistributionCalculator { @Test public void test() { Random r = new Random(); // 10m iteration warmup to ensure jit is underway for (int i = 0; i < Math.pow(10, 7); i++) { double d = r.nextDouble(); CumulativeDistributionUsingPow.cumulativeDistribution(d); CumulativeDistributionUsingMult.cumulativeDistribution(d); } // execute both for a variety of iterations and store results long[][] times = new long[4][4]; for (int i = 0; i < 4; i++) { int iterations = (int) Math.pow(10, i + 4); times[i] = testHelper(iterations, r); } // computations complete; print times for (int i = 0; i < times.length; i++) { System.out.format("Over %1$9s iterations pow took %2$5s ms and " + "mult took %3$4s ms; mult was %4$3s faster.n", times[i][0], times[i][1], times[i][2], times[i][3]); } } private long[] testHelper(int iterations, Random r) { // generate data for given iterations double[] data = new double[iterations]; for (int i = 0; i < iterations; i++) { data[i] = r.nextDouble(); } // benchmark pow for given iterations long t1 = System.currentTimeMillis(); for (int i = 0; i < iterations; i++) { CumulativeDistributionUsingPow.cumulativeDistribution(data[i]); } t1 = System.currentTimeMillis() - t1; // benchmark multiplication for given iterations long t2 = System.currentTimeMillis(); for (int i = 0; i < iterations; i++) { CumulativeDistributionUsingMult.cumulativeDistribution(data[i]); } t2 = System.currentTimeMillis() - t2; // save times taken return new long[] { iterations, t1, t2, t1 / t2 }; } }

### Running times and speedup factors

Running the above benchmark test prints the following output.

Over 10000 iterations pow took 8 ms and mult took 2 ms; mult was 4 faster. Over 100000 iterations pow took 82 ms and mult took 4 ms; mult was 20 faster. Over 1000000 iterations pow took 794 ms and mult took 36 ms; mult was 22 faster. Over 10000000 iterations pow took 9116 ms and mult took 422 ms; mult was 21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20 faster than equivalent Math.pow() operations and this, I believe, is universal. It will be the same for C++ and probably other languages too. And this is because Math.pow() deals with the generic case of raising a number to any given power. The lookup involved in computing the result (via a native call in Java) is what takes additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers I don’t want to hear any of it. If you’re writing high latency code such as webapps or data access code this approach of remaining oblivious to performance may be acceptable but low latency requires an opposite mindset. In low latency you have to preempt implementations with faster ones and you have to question everything you see. It’s a dark art and one that takes a lifetime to learn and here I try to share with you whatever I learn one tip at a time.

Pingback: The Black Scholes Algorithm: The Greeks in Java - Dhruba Bandopadhyay

Michael WilsonThe funny thing here is that the method is doing an approximation of the normal distribution using a fifth order polynomial, but when you call pow() you are forcing the CPU to interpolate and calculate another polynomial for each term in the original polynomial, plus unavoiable logic to handle negative numbers, infinities etc. Sadly I have yet to encounter a compiler that has a peephole optimisation for integral powers.

Kurt FickieI liked your article on Black Scholes which led me here. Your intuition is correct. What you might want to look at is Horner’s rule. Sedgewick’s book probably has a nice writeup, but any numerical analysis text will cover the same material which you discovered via thoughtful experimentation. Improved algorithms not only run faster, but often result in more accuracy because roundoff error is reduced.

Funny thing about computers is that just writing out the formulas and coding in a straightforward approach can miss the forest for the trees. The history of the Fast Fourier Transform was an example where the faster algorithm was achieved by the way a clever person would perform the the calculation by hand. Follows the same principle of Horner’s rule: nesting of calculations via recursion or loops.

FranciscoThis is not completely fair, since you are ‘buffering’ the values and . You can only do that in some scenarios. I think this would be a better comparison:

double b1 = B1 * t;

double b2 = B2 * t*t;

double b3 = B3 * t*t*t;

double b4 = B4 * t*t*t*t;

double b5 = B5 * t*t*t*t*t;

Besides, the one in the example can easily be looped, while both the one you wrote and this one cannot (;

Dhruba BandopadhyayPost authorGood point. Well spotted. Agreed. 🙂

Raúl G.I’m sure that Math.pow is slower than multiplication (at least for elevate to square), but I wonder if the results are the same. There are the same accuracy using one or other?

aikarAnother thing to consider. You’re using small numbers. Maybe Math.pow() can be quicker depending on input? It’s when you add in the loops that the system calls add up.

How about large fluctuating powers? Math.pow(93222, 11) for example.

DavidI realise this is a somewhat old post now, but I’ll leave this here for anyone who stumbles upon it like I did.

This would be an even faster method again, and is how I evaluate many of my polynomials:

double b = ( ( ( (B5*t + B4)*t + B3)*t + B2)*t + B1)*t;

I write in C++ and there there is the additional benefit of FMA instructions on newer CPU architectures which means the above code can be executed in 5 instructions. The downside here is that it would have to go through the CPU pipeline 5 times so the latency penalty is high if the CPU has no other out-of-order work to be doing. Without FMA it will take 5 multiplies and 4 adds, and have to run through the pipeline 9 times!

Rewriting again for the sake of more efficient pipelining, this only goes through the pipeline 3 times with FMA and 6 times without:

double t2 = t*t;

double b1 = B1*t;

double b2 = B3*t + B2;

double b3 = B5*t + B4;

double b = (b3*t2 + b2)*t2 + b1;

If you don’t have FMA instructions available (as I suspect you may not in Java) then a slight modification gives a method that only goes through the pipeline 5 times:

double t2 = t*t;

double b1 = B1*t;

double b2 = B3*t + B2;

double b3 = B5*t + B4;

double b = (b3*t2*t2) + (b2*t2 + b1);

For even better overall performance you can move that call to the normal distribution function to the top and getting it running sooner so that you can use it more efficiently later:

double t = 1 / (1 + P * abs(x));

double norm = standardNormalDistribution(x);

double t2 = t*t;

double b1 = B2*t + B1;

double b2 = B4*t + B3;

double cd = ((B5*t2 + b2)*t2 + b1) * (norm*t); // With FMA

double cd = ((B5*t2*t2 + b1) + b2*t2) * (norm*t); // Without FMA

return x < 0 ? cd : 1 – cd;

Of course, the bulk of the speed gains have already been made by removing those pow functions, but it would be interesting to benchmark the code and see if these improvements actually make any noticable difference. Might need to run it on a billion iterations to see it though. 😉