What’s faster in Numba @jit functions, NumPy or the math package?

Update 2016-01-16: Numba 0.23 released and tested – results added at the end of this post

A while back I was using Numba to accelerate some image processing and noticed a difference in speed depending on whether I used functions from NumPy or their equivalents from the standard Python math package inside the function I was accelerating. If memory serves, I was using the exp function for something and noticed that replacing numpy.exp with math.exp in the function I had decorated with @jit made a noticeable difference in running time. I didn’t investigate further at the time, but now, several versions of Numba and NumPy later, I wanted to find out what was causing this difference and which of the two is currently faster to use.

My test examines 26 functions from the math package that all take a single input and produce a single output and which have an analogous function in NumPy. Rather than writing 26 functions, each following the same template of looping through an array calling the function on each element of that array, I decided to make use of Python’s compile and exec functions. First, I defined a template string that defined what my function was going to look like:

code = '''
import {module:s}
from numba import jit
def f(x):
    sum = 0.0
    for idx in range(x.size):
        sum += {module:s}.{function:s}(x[idx])
    return sum
jit_f = jit(f)
array_fn = {module:s}.{function:s}
'''

The problem with using a template is that all the function names need to be identical between NumPy and the standard math package, otherwise an exception will be thrown. I got around that by defining the following look-up function that maps function names in the math package to function names in NumPy:

def map_to_numpy_name(fnname):
    npnames = {'acos':'arccos', 'asin':'arcsin', 'atan':'arctan',
               'acosh':'arccosh', 'asinh':'arcsinh', 'atanh':'arctanh'}
    if fnname in npnames:
        return npnames[fnname]
    else:
        return fnname

The module place-holder gets replaced with either ‘numpy’ or ‘math’, while the function place-holder gets replaced with one of the tested function names. This code, when executed, defines three functions: ‘f’, ‘jit_f’ and ‘array_fn’. ‘f’ is just the plain Python loop, so it will be the slowest version. ‘jit_f’ is the compiled function, courtesy of Numba. ‘array_fn’ is defined so that I have an easy way of calling the tested function from my timing code when it operates on whole arrays; that gets used to compare NumPy array operations to looping through the array manually. This is how the code is compiled and executed, passing it its own namespace in which to keep its variables and functions:

compiled_code = compile(code.format(module=module, function=fnname), '<string>', 'exec')
namespace = {}
exec(compiled_code, namespace)
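
As a quick sanity check, the generated functions can be exercised like this (just a sketch; the module/function pair and the array size are arbitrary, and for the ‘numpy’ module the math-style name is first passed through map_to_numpy_name):

import numpy as np

module = 'numpy'
fnname = map_to_numpy_name('sqrt')   # 'sqrt' maps to itself; 'acos' would become 'arccos'
compiled_code = compile(code.format(module=module, function=fnname), '<string>', 'exec')
namespace = {}
exec(compiled_code, namespace)

x = np.random.random(10)
print(namespace['f'](x))               # plain Python loop over the array
print(namespace['jit_f'](x))           # Numba-compiled loop, same sum
print(namespace['array_fn'](x).sum())  # whole-array numpy.sqrt, again the same sum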

Once that’s done, I can time one of the three functions using the following code:

import numpy as np
from time import clock  # replaced by time.perf_counter in newer Python versions

def time_individual_function(type_name, function_ptr, num_repeats, num_loops, verbose, test_data):
    durations = np.zeros((num_repeats,))
    for repetition in range(num_repeats):
        tstart = clock()
        for loop in range(num_loops):
            function_ptr(test_data)
        tend = clock()
        # average seconds per call to function_ptr for this repetition
        durations[repetition] = (tend - tstart) / num_loops
    return durations

running_times[('numba', fnname, module)] = time_individual_function('Numba JITted', namespace['jit_f'], num_repeats, num_loops, verbose, test_data)
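
That call covers the jitted variant; a fuller sketch of the setup and the other two variants might look like the following (the array size and repeat counts here are placeholders, not the values used for the results below, and the test data has to stay inside the domain of functions such as acosh and atanh):

import numpy as np

num_repeats, num_loops, verbose = 10, 100, False
test_data = np.random.random(1000)   # keep the values inside the tested function's domain
running_times = {}

running_times[('python', fnname, module)] = time_individual_function(
    'Pure Python', namespace['f'], num_repeats, num_loops, verbose, test_data)
running_times[('numba', fnname, module)] = time_individual_function(
    'Numba JITted', namespace['jit_f'], num_repeats, num_loops, verbose, test_data)
if module == 'numpy':
    # the whole-array operation only makes sense for the NumPy variant
    running_times[('numpy_array', fnname, module)] = time_individual_function(
        'NumPy array op', namespace['array_fn'], num_repeats, num_loops, verbose, test_data)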

The full code is available online here, with some additional debugging messages and a CLI interface. Let’s get to some results:

All times below are in seconds, lower is better. The two “Python” columns call the function in a plain Python loop, the two “Numba” columns call it inside the jitted loop, and the last column is the whole-array NumPy operation.

Function NumPy (Python) Math (Python) NumPy (Numba) Math (Numba) NumPy array operation
ceil 1.23475341088 0.217901148218 0.0083829793621 0.000920285178236 0.00982844277674
fabs 1.31707992495 0.224147452158 0.00166955347092 0.00087252532833 0.00334940337711
floor 1.22943912946 0.218479819887 0.00824501313321 0.000875317073172 0.00973097185741
isinf 3.13314332458 0.193701463415 0.00193104690431 0.00157070168856 0.00253241275797
isnan 3.16061943715 0.193019467167 0.000899212007505 0.000886003752345 0.000497831144463
trunc 1.23497512946 0.401735564728 0.0084893358349 0.00117097185741 0.00998174859287
exp 1.3437880075 0.22469382364 0.0156664315197 0.0156368930582 0.0173403377111
expm1 1.41952489306 0.229173613508 0.0332130281426 0.0177992945591 0.0345634521576
log 1.35790436023 0.259746851782 0.00917202251408 0.00911096435271 0.0103853208255
log1p 1.41073257786 0.221454078799 0.015283902439 0.0115659587242 0.0163751144465
log10 1.38020214634 0.222261703565 0.0093406979362 0.00929446904315 0.0105526754221
sqrt 1.29980181614 0.215333043152 0.0056800900563 0.00554746716698 0.00419920450281
acos 1.38762875797 0.230810866792 0.0171560825516 0.0171594446529 0.0178593921201
asin 1.38347082927 0.235335714822 0.0182445928705 0.0185464615385 0.0191342589118
atan 1.38371124953 0.22688558349 0.01454084803 0.0145301013133 0.0154805853659
cos 1.35799483677 0.232051091932 0.0200636998124 0.0204364727955 0.0214075797373
sin 1.40375426642 0.235093163227 0.0213015534709 0.0211516998124 0.0225604502814
tan 1.41166892308 0.240718168856 0.0274034971857 0.0276429268293 0.0287924953096
degrees 1.31739245028 0.182552345216 0.00166868292683 0.00608276172609 0.00329684052533
radians 1.31977080675 0.18133163227 0.00166409005628 0.00607432645402 0.00330575609755
acosh 1.38693283302 0.235513666041 0.0254836472795 0.0299879324578 0.0268675722326
asinh 1.43626140338 0.249595106942 0.0391198198874 0.0405993245779 0.040370521576
atanh 1.3980028818 0.248623459662 0.0226747917448 0.0363789868668 0.0238825966229
cosh 1.35819866417 0.230515302064 0.0213204652908 0.0212887954972 0.0228190318949
sinh 1.36441389869 0.23058641651 0.0210535384615 0.0210634446529 0.0223715722326
tanh 1.26429484428 0.220724592871 0.00989607504691 0.00988806003752 0.0114026866792

That long table isn’t much fun to look at, so here’s a bar plot instead, showing the speedup achieved by replacing each NumPy function with its equivalent from the math package:

[Bar plot: speedup from replacing each numpy.* function with the equivalent math.* function inside a Numba-jitted function]

You can see that in some cases replacing the NumPy function with its equivalent from the math package is a huge improvement (7x-9x faster), and that most of the time the math version is about as fast as NumPy. The exceptions are the degrees, radians and atanh functions, which are all faster in NumPy.

Taking a look at the assembler code generated for each of these tests shows that the math.ceil, math.floor and math.fabs functions get inlined, while functions such as acos, exp and log remain as function calls. For the math.* functions, the LLVM IR generated by Numba uses LLVM intrinsics, which the compiler target can then turn into function calls or replace with inline code. The NumPy functions never get inlined; they remain as calls to numba.npymath.* functions. Thank you to Stanley Seibert for explaining this to me on the Numba mailing list over here.
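
If you want to check this yourself, Numba’s dispatcher objects expose the generated LLVM IR and assembly. Here’s a minimal sketch; the exact output depends on your Numba, LLVM and CPU, and the comments describe what I would expect to see for math.ceil rather than a guaranteed result:

import math
import numpy as np
from numba import jit

@jit
def sum_ceil(x):
    s = 0.0
    for idx in range(x.size):
        s += math.ceil(x[idx])
    return s

sum_ceil(np.random.random(1000))        # trigger compilation for the float64[:] signature

for sig, llvm_ir in sum_ceil.inspect_llvm().items():
    print(sig)
    print(llvm_ir)                      # math.ceil should appear as an LLVM intrinsic
for sig, asm in sum_ceil.inspect_asm().items():
    print(asm)                          # ...which the backend can inline instead of emitting a call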

The degrees and radians functions are an anomaly. The math.degrees and math.radians versions get inlined as explained above, yet the inlined versions are much slower than just calling numpy.degrees or numpy.radians. Since the conversion is just a multiplication by a constant, this was definitely not expected.
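
If this bites you, one workaround (just a sketch, not the benchmark code from this post) is to write the conversion out yourself, so that Numba only sees a multiplication by a constant, which it can always inline:

import math
from numba import jit

DEG2RAD = math.pi / 180.0

@jit
def sum_radians(x):
    s = 0.0
    for idx in range(x.size):
        s += x[idx] * DEG2RAD   # same result as math.radians(x[idx])
    return s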

All of this has been reported in two issues on the Numba GitHub repository.

The Numba team has been really quick – there’s already progress on fixing the speed difference. When Numba 0.23 comes out, I’ll update this post.

My final take-home point is this: it’s worth testing whether frequently used functions (from NumPy, math or any other package) can be replaced with code that Numba can compile and hence inline. Perhaps you won’t see any benefit, but you could see quite a speedup. Importantly though, check that the substitution doesn’t affect the correctness of your algorithm. As I found when updating this post for Numba 0.23, the interfaces of the math.* and numpy.* functions can differ (trunc, for example), which can give you different results, so be careful.

Update 2016-01-16: Numba 0.23

Numba 0.23 has just come out and, as promised, I’ve re-run this test. The updated results are available here. Here’s the revised plot of speed differences between using math.* and numpy.* functions inside a Numba-jitted function.

[Bar plot: revised math.* vs numpy.* speedups with Numba 0.23]

Immediately you can see that a lot of the big speed differences have gone. Some remain for the transcendental functions, but the Numba developers will look into those cases. One speed difference that isn’t going away is that numpy.trunc is now faster than math.trunc inside jitted functions. Both functions get inlined now, resulting in the roundsd assembler instruction being used. The reason for the difference is that numpy.trunc returns a float while math.trunc returns an integer type. Numba respects this behaviour, so it introduces additional assembler code for a type cast if you use math.trunc. This is also why the Numba developers have to be a little careful: sometimes the NumPy definition of a function and the math package definition differ, and Numba has to respect those differences.
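
The interface difference is easy to see in plain Python, without Numba involved at all:

>>> import math
>>> import numpy as np
>>> math.trunc(3.7)
3
>>> np.trunc(3.7)
3.0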
