patternpythonMinor
Fastest possible Cython for Black-Scholes algorithm
Viewed 0 times
algorithmpossiblecythonscholesforfastestblack
Problem
I started with a pure python implementation, and have been trying to get the performance as close to native C as possible using numpy, numexpr, and cython. Here is the the numpy version that I compile with cython:
This code takes about 56 seconds to compute 4,194,304 options. C code with MKL takes about 7 seconds. I wanted to try to take advantage of parallelism, but I read here that you have to remove all python objects from a block to run the code in parallel. So I tried rewriting the function like this:
```
@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parall
import numpy as np
cimport cython
cimport numpy as np
from cython.parallel cimport prange, parallel
from scipy.special import erf as sp_erf
from libc.math cimport log, exp, sqrt, erf
from libc.stdlib cimport malloc, free
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython(int nopt,
np.ndarray[DTYPE_t, ndim=1] price,
np.ndarray[DTYPE_t, ndim=1] strike,
np.ndarray[DTYPE_t, ndim=1] t,
double rate,
double vol):
cdef np.ndarray d1 = np.zeros(nopt, dtype=DTYPE)
cdef np.ndarray d2 = np.zeros(nopt, dtype=DTYPE)
cdef np.ndarray call = np.zeros(nopt, dtype=DTYPE)
cdef np.ndarray put = np.zeros(nopt, dtype=DTYPE)
d1 = (np.log(price / strike) + (rate + vol * vol / 2.) * t) / (vol * np.sqrt(t))
d2 = (np.log(price / strike) + (rate - vol * vol / 2.) * t) / (vol * np.sqrt(t))
cdef np.ndarray n_d1 = 0.5 + 0.5 * sp_erf(d1 / np.sqrt(2))
cdef np.ndarray n_d2 = 0.5 + 0.5 * sp_erf(d2 / np.sqrt(2))
cdef np.ndarray neg_d1 = np.negative(n_d1)
cdef np.ndarray neg_d2 = np.negative(n_d1)
cdef np.ndarray e_rt = np.exp(-rate * t)
call = price * n_d1 - e_rt * strike * n_d2
put = e_rt * strike * neg_d2 - price * neg_d1
return call, putThis code takes about 56 seconds to compute 4,194,304 options. C code with MKL takes about 7 seconds. I wanted to try to take advantage of parallelism, but I read here that you have to remove all python objects from a block to run the code in parallel. So I tried rewriting the function like this:
```
@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parall
Solution
Well it's no surprise that the first function doesn't perform
differently than the Python version, as it will still call into Python
and to NumPy - nothing to be gained by doing that via Cython.
The
In
I'd rather try the import and catch the exception, that's a more fool
proof way:
protected with
Now, I got some better timings by manually decomposing the terms in the
parallel version. Next remove the
in there aren't actually used again; that also gets rid of the memory
allocation.
Finally, if only one of those values is going to be returned, only
compute one of them; assuming that the
double calculation it might be worth it to put a conditional there.
Assuming I didn't mess up the rewrite, I have now the following code,
which is a bit faster than the initial version:
For building the extension, check that optimisations are on, i.e.
or
interesting flags as well.
Looking at the code samples from the Intel page there's probably only so
much you can do without getting into vectorised operations.
differently than the Python version, as it will still call into Python
and to NumPy - nothing to be gained by doing that via Cython.
The
setup.py is probably something closer to this:from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("bs.pyx"),)In
test.py the generated .so file isn't necessarily named that way,I'd rather try the import and catch the exception, that's a more fool
proof way:
try:
import bs
except ImportError:
print("Must run 'python setup.py build_ext --inplace' first.")
else:
print("Cython")
run(bs.black_scholes_cython, nparr=True)
print("Parallel Cython")
run(bs.black_scholes_cython_parallel, nparr=True)- Possible bug:
n_d2/neg_d2should probably also used2instead of
d1?- Memory should be
free'd in any case, so the main loop should be
protected with
try/finally.- I think it's more customary to use
time.timeinstead of
time.clock.Now, I got some better timings by manually decomposing the terms in the
parallel version. Next remove the
d1/d2 arrays, as the values storedin there aren't actually used again; that also gets rid of the memory
allocation.
Finally, if only one of those values is going to be returned, only
compute one of them; assuming that the
if isn't much slower than thedouble calculation it might be worth it to put a conditional there.
Assuming I didn't mess up the rewrite, I have now the following code,
which is a bit faster than the initial version:
@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parallel2(int nopt,
double[:] price,
double[:] strike,
double[:] t,
double rate,
double vol,
bint ret_call=1):
cdef double[:] ret = np.zeros(nopt, dtype=DTYPE)
cdef int i
cdef DTYPE_t d1, d2, n_d1, n_d2, s, p, e_rt, v, x, y
with nogil, parallel():
for i in prange(nopt, schedule='guided'):
s = strike[i]
p = price[i]
v = vol * sqrt(t[i])
x = log(p / s) + rate * t[i]
y = vol * vol / 2. * t[i]
d1 = (x + y) / v
d2 = (x - y) / v
n_d1 = 0.5 + 0.5 * erf(d1 / sqrt(2))
n_d2 = 0.5 + 0.5 * erf(d2 / sqrt(2))
e_rt = exp(-rate * t[i])
if ret_call:
ret[i] = p * n_d1 - e_rt * s * n_d2
else:
ret[i] = e_rt * s * -n_d2 - p * -n_d1
return retFor building the extension, check that optimisations are on, i.e.
-O2or
-O3 is passed as arguments during compilation; there might be otherinteresting flags as well.
Looking at the code samples from the Intel page there's probably only so
much you can do without getting into vectorised operations.
Code Snippets
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("bs.pyx"),)try:
import bs
except ImportError:
print("Must run 'python setup.py build_ext --inplace' first.")
else:
print("Cython")
run(bs.black_scholes_cython, nparr=True)
print("Parallel Cython")
run(bs.black_scholes_cython_parallel, nparr=True)@cython.boundscheck(False)
@cython.wraparound(False)
def black_scholes_cython_parallel2(int nopt,
double[:] price,
double[:] strike,
double[:] t,
double rate,
double vol,
bint ret_call=1):
cdef double[:] ret = np.zeros(nopt, dtype=DTYPE)
cdef int i
cdef DTYPE_t d1, d2, n_d1, n_d2, s, p, e_rt, v, x, y
with nogil, parallel():
for i in prange(nopt, schedule='guided'):
s = strike[i]
p = price[i]
v = vol * sqrt(t[i])
x = log(p / s) + rate * t[i]
y = vol * vol / 2. * t[i]
d1 = (x + y) / v
d2 = (x - y) / v
n_d1 = 0.5 + 0.5 * erf(d1 / sqrt(2))
n_d2 = 0.5 + 0.5 * erf(d2 / sqrt(2))
e_rt = exp(-rate * t[i])
if ret_call:
ret[i] = p * n_d1 - e_rt * s * n_d2
else:
ret[i] = e_rt * s * -n_d2 - p * -n_d1
return retContext
StackExchange Code Review Q#108533, answer score: 3
Revisions (0)
No revisions yet.