Last time, in the performance pattern series, I wrote about how a bitwise equivalent implementation of the modulo operator was a significantly faster alternative to the modulo operator (around 5x faster). This time, I look at a possible ~20x speedup simply using multiplication in place of the Math.pow() operator. As a reference implementation I use the cumulative distribution function from the Black Scholes algorithm (and the Greeks) because it uses the Math.pow() function quite a bit. First – let’s look at the pow based implementation.

### Cumulative distribution function – Math.pow() based implementation

package name.dhruba.black.scholes.utils; import static java.lang.Math.PI; import static java.lang.Math.abs; import static java.lang.Math.exp; import static java.lang.Math.pow; import static java.lang.Math.sqrt; public enum CumulativeDistributionUsingPow { _; private static final double P = 0.2316419; private static final double B1 = 0.319381530; private static final double B2 = -0.356563782; private static final double B3 = 1.781477937; private static final double B4 = -1.821255978; private static final double B5 = 1.330274429; public static double cumulativeDistribution(double x) { double t = 1 / (1 + P * abs(x)); double t1 = B1 * pow(t, 1); double t2 = B2 * pow(t, 2); double t3 = B3 * pow(t, 3); double t4 = B4 * pow(t, 4); double t5 = B5 * pow(t, 5); double b = t1 + t2 + t3 + t4 + t5; double cd = 1 - standardNormalDistribution(x) * b; return x < 0 ? 1 - cd : cd; } public static double standardNormalDistribution(double x) { double top = exp(-0.5 * pow(x, 2)); double bottom = sqrt(2 * PI); return top / bottom; } }

As you can see there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

### Cumulative distribution function – Multiplication based implementation

package name.dhruba.black.scholes.utils; import static java.lang.Math.PI; import static java.lang.Math.abs; import static java.lang.Math.exp; import static java.lang.Math.sqrt; public enum CumulativeDistributionUsingMult { _; private static final double P = 0.2316419; private static final double B1 = 0.319381530; private static final double B2 = -0.356563782; private static final double B3 = 1.781477937; private static final double B4 = -1.821255978; private static final double B5 = 1.330274429; public static double cumulativeDistribution(double x) { double t = 1 / (1 + P * abs(x)); double t1 = t; double t2 = t * t; double t3 = t2 * t; double t4 = t3 * t; double t5 = t4 * t; double b1 = B1 * t1; double b2 = B2 * t2; double b3 = B3 * t3; double b4 = B4 * t4; double b5 = B5 * t5; double b = b1 + b2 + b3 + b4 + b5; double cd = 1 - standardNormalDistribution(x) * b; return x < 0 ? 1 - cd : cd; } public static double standardNormalDistribution(double x) { double top = exp(-0.5 * x * x); double bottom = sqrt(2 * PI); return top / bottom; } }

### Benchmark – Math.pow() versus Multiplication

Benchmarks are generally speaking remarkably difficult to do correctly. Here I’ve done a best effort implementation. The first thing this test does is run both implementations for 10m iterations to forcibly enable JIT. It normally takes around 10k iterations for JIT to come on but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts ranging from 10k to 10m each time increasing iteration count by a factor of 10. For each iteration count it stores the results but does not print output until the very end to not involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

package name.dhruba.black.scholes; import java.util.Random; import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult; import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow; import org.junit.Test; public class TestCumulativeDistributionCalculator { @Test public void test() { Random r = new Random(); // 10m iteration warmup to ensure jit is underway for (int i = 0; i < Math.pow(10, 7); i++) { double d = r.nextDouble(); CumulativeDistributionUsingPow.cumulativeDistribution(d); CumulativeDistributionUsingMult.cumulativeDistribution(d); } // execute both for a variety of iterations and store results long[][] times = new long[4][4]; for (int i = 0; i < 4; i++) { int iterations = (int) Math.pow(10, i + 4); times[i] = testHelper(iterations, r); } // computations complete; print times for (int i = 0; i < times.length; i++) { System.out.format("Over %1$9s iterations pow took %2$5s ms and " + "mult took %3$4s ms; mult was %4$3s faster.n", times[i][0], times[i][1], times[i][2], times[i][3]); } } private long[] testHelper(int iterations, Random r) { // generate data for given iterations double[] data = new double[iterations]; for (int i = 0; i < iterations; i++) { data[i] = r.nextDouble(); } // benchmark pow for given iterations long t1 = System.currentTimeMillis(); for (int i = 0; i < iterations; i++) { CumulativeDistributionUsingPow.cumulativeDistribution(data[i]); } t1 = System.currentTimeMillis() - t1; // benchmark multiplication for given iterations long t2 = System.currentTimeMillis(); for (int i = 0; i < iterations; i++) { CumulativeDistributionUsingMult.cumulativeDistribution(data[i]); } t2 = System.currentTimeMillis() - t2; // save times taken return new long[] { iterations, t1, t2, t1 / t2 }; } }

### Running times and speedup factors

Running the above benchmark test prints the following output.

Over 10000 iterations pow took 8 ms and mult took 2 ms; mult was 4 faster. Over 100000 iterations pow took 82 ms and mult took 4 ms; mult was 20 faster. Over 1000000 iterations pow took 794 ms and mult took 36 ms; mult was 22 faster. Over 10000000 iterations pow took 9116 ms and mult took 422 ms; mult was 21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20 faster than equivalent Math.pow() operations and this, I believe, is universal. It will be the same for C++ and probably other languages too. And this is because Math.pow() deals with the generic case of raising a number to any given power. The lookup involved in computing the result (via a native call in Java) is what takes additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers I don’t want to hear any of it. If you’re writing high latency code such as webapps or data access code this approach of remaining oblivious to performance may be acceptable but low latency requires an opposite mindset. In low latency you have to preempt implementations with faster ones and you have to question everything you see. It’s a dark art and one that takes a lifetime to learn and here I try to share with you whatever I learn one tip at a time.