patternpythonMinor

Fastest possible Cython for Black-Scholes algorithm

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

algorithmpossiblecythonscholesforfastestblack

Problem

I started with a pure python implementation, and have been trying to get the performance as close to native C as possible using numpy, numexpr, and cython. Here is the the numpy version that I compile with cython:

import numpy as np
cimport cython
cimport numpy as np
from cython.parallel cimport prange, parallel
from scipy.special import erf as sp_erf
from libc.math cimport log, exp, sqrt, erf
from libc.stdlib cimport malloc, free

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython(int nopt,
                         np.ndarray[DTYPE_t, ndim=1] price,
                         np.ndarray[DTYPE_t, ndim=1] strike,
                         np.ndarray[DTYPE_t, ndim=1] t,
                         double rate,
                         double vol):

    cdef np.ndarray d1 = np.zeros(nopt, dtype=DTYPE)
    cdef np.ndarray d2 = np.zeros(nopt, dtype=DTYPE)
    cdef np.ndarray call = np.zeros(nopt, dtype=DTYPE)
    cdef np.ndarray put = np.zeros(nopt, dtype=DTYPE)

    d1 = (np.log(price / strike) + (rate + vol * vol / 2.) * t) / (vol * np.sqrt(t))
    d2 = (np.log(price / strike) + (rate - vol * vol / 2.) * t) / (vol * np.sqrt(t))

    cdef np.ndarray n_d1 = 0.5 + 0.5 * sp_erf(d1 / np.sqrt(2))
    cdef np.ndarray n_d2 = 0.5 + 0.5 * sp_erf(d2 / np.sqrt(2))
    cdef np.ndarray neg_d1 = np.negative(n_d1)
    cdef np.ndarray neg_d2 = np.negative(n_d1)
    cdef np.ndarray e_rt = np.exp(-rate * t)

    call = price * n_d1 - e_rt * strike * n_d2
    put = e_rt * strike * neg_d2 - price * neg_d1

    return call, put

This code takes about 56 seconds to compute 4,194,304 options. C code with MKL takes about 7 seconds. I wanted to try to take advantage of parallelism, but I read here that you have to remove all python objects from a block to run the code in parallel. So I tried rewriting the function like this:

```
@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parall

Solution

Well it's no surprise that the first function doesn't perform
differently than the Python version, as it will still call into Python
and to NumPy - nothing to be gained by doing that via Cython.

The setup.py is probably something closer to this:

from distutils.core import setup

from Cython.Build import cythonize

setup(ext_modules=cythonize("bs.pyx"),)

In test.py the generated .so file isn't necessarily named that way,
I'd rather try the import and catch the exception, that's a more fool
proof way:

try:
    import bs
except ImportError:
    print("Must run 'python setup.py build_ext --inplace' first.")
else:
    print("Cython")
    run(bs.black_scholes_cython, nparr=True)
    print("Parallel Cython")
    run(bs.black_scholes_cython_parallel, nparr=True)

Possible bug: n_d2/neg_d2 should probably also use d2 instead of

d1?

Memory should be free'd in any case, so the main loop should be

protected with try/finally.

I think it's more customary to use time.time instead of

time.clock.

Now, I got some better timings by manually decomposing the terms in the
parallel version. Next remove the d1/d2 arrays, as the values stored
in there aren't actually used again; that also gets rid of the memory
allocation.

Finally, if only one of those values is going to be returned, only
compute one of them; assuming that the if isn't much slower than the
double calculation it might be worth it to put a conditional there.

Assuming I didn't mess up the rewrite, I have now the following code,
which is a bit faster than the initial version:

@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parallel2(int nopt,
                                   double[:] price,
                                   double[:] strike,
                                   double[:] t,
                                   double rate,
                                   double vol,
                                   bint ret_call=1):

    cdef double[:] ret = np.zeros(nopt, dtype=DTYPE)

    cdef int i

    cdef DTYPE_t d1, d2, n_d1, n_d2, s, p, e_rt, v, x, y

    with nogil, parallel():
        for i in prange(nopt, schedule='guided'):
            s = strike[i]
            p = price[i]
            v = vol * sqrt(t[i])
            x = log(p / s) + rate * t[i]
            y = vol * vol / 2. * t[i]

            d1 = (x + y) / v
            d2 = (x - y) / v

            n_d1 = 0.5 + 0.5 * erf(d1 / sqrt(2))
            n_d2 = 0.5 + 0.5 * erf(d2 / sqrt(2))

            e_rt = exp(-rate * t[i])

            if ret_call:
                ret[i] = p * n_d1 - e_rt * s * n_d2
            else:
                ret[i] = e_rt * s * -n_d2 - p * -n_d1

    return ret

For building the extension, check that optimisations are on, i.e. -O2
or -O3 is passed as arguments during compilation; there might be other
interesting flags as well.

Looking at the code samples from the Intel page there's probably only so
much you can do without getting into vectorised operations.

Code Snippets

from distutils.core import setup

from Cython.Build import cythonize

setup(ext_modules=cythonize("bs.pyx"),)

try:
    import bs
except ImportError:
    print("Must run 'python setup.py build_ext --inplace' first.")
else:
    print("Cython")
    run(bs.black_scholes_cython, nparr=True)
    print("Parallel Cython")
    run(bs.black_scholes_cython_parallel, nparr=True)

@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parallel2(int nopt,
                                   double[:] price,
                                   double[:] strike,
                                   double[:] t,
                                   double rate,
                                   double vol,
                                   bint ret_call=1):

    cdef double[:] ret = np.zeros(nopt, dtype=DTYPE)

    cdef int i

    cdef DTYPE_t d1, d2, n_d1, n_d2, s, p, e_rt, v, x, y

    with nogil, parallel():
        for i in prange(nopt, schedule='guided'):
            s = strike[i]
            p = price[i]
            v = vol * sqrt(t[i])
            x = log(p / s) + rate * t[i]
            y = vol * vol / 2. * t[i]

            d1 = (x + y) / v
            d2 = (x - y) / v

            n_d1 = 0.5 + 0.5 * erf(d1 / sqrt(2))
            n_d2 = 0.5 + 0.5 * erf(d2 / sqrt(2))

            e_rt = exp(-rate * t[i])

            if ret_call:
                ret[i] = p * n_d1 - e_rt * s * n_d2
            else:
                ret[i] = e_rt * s * -n_d2 - p * -n_d1

    return ret

Context

StackExchange Code Review Q#108533, answer score: 3

Revisions (0)

No revisions yet.