patternpythonMinor
Monte Carlo estimation of the Hypergeometric Function
Viewed 0 times
montethefunctionhypergeometriccarloestimation
Problem
I am trying to implement the algorithm described in the paper Statistical Test for the Comparison of Samples from Mutational Spectra (Adams & Skopek, 1986) DOI: 10.1016/0022-2836(87)90669-3:
$$p = \dfrac{\prod\limits_{i=1}^{N}(R_i!)\prod\limits_{j=1}^{M}(C_j!)}{T!\prod\limits_{i=1}^{N}\prod\limits_{j=1}^{M}(X_{ij}!)}$$
The math is easy to follow in this script although very large numbers will be generated. Would using the mpmath module be my best bet for accuracy? I notice that many of my
data.dat
hyperg.py
```
from __future__ import division # Division is now float not integer
import numpy as np
import pandas as pd
from math import factorial as fctr
from decimal import Decimal
def factorfunc(num):
"""Returns factorial of x"""
return Decimal(fctr(num))
def main():
"""Return the Monte Carlo Hypergeometric probability"""
frame = pd.read_csv(r'c:\dev\hyperg\data.dat', sep='\t', header=0)
colsum_1 = sum(frame['dsum'])
colsum_2 = sum(frame['csum'])
full_sum = colsum_1 + colsum_2
length = len(frame)
frame.index = frame.index + 1
frame2 = pd.DataFrame(frame.apply(np.sum, axis=1)).applymap(factorfunc)
frame3 = frame.applymap(factorfunc)
binner = 1
f
$$p = \dfrac{\prod\limits_{i=1}^{N}(R_i!)\prod\limits_{j=1}^{M}(C_j!)}{T!\prod\limits_{i=1}^{N}\prod\limits_{j=1}^{M}(X_{ij}!)}$$
The math is easy to follow in this script although very large numbers will be generated. Would using the mpmath module be my best bet for accuracy? I notice that many of my
long numbers have an excessive amount of 0s possibly due to rounding errors? I want to optimize this function to generate the p-value as accurately as possible. Speed is important but not the priority. I would prefer to implement the algorithm myself although I understand there may be hypergeometric probability functions in SciPy and PyMC. I have yet to find this specific algorithm implemented in a Python module.- How can I improve the precision of the very large numbers? Would working with log computations be ideal?
- Are there any module hypergeometric functions available for this specific algorithm flavor?
- Am I using the
decimalmodule appropriately?
data.dat
dsum csum
1 0
1 2
3 9
2 1
2 1
1 0
1 0
10 0
2 3
3 0
15 0
1 28
etc.hyperg.py
```
from __future__ import division # Division is now float not integer
import numpy as np
import pandas as pd
from math import factorial as fctr
from decimal import Decimal
def factorfunc(num):
"""Returns factorial of x"""
return Decimal(fctr(num))
def main():
"""Return the Monte Carlo Hypergeometric probability"""
frame = pd.read_csv(r'c:\dev\hyperg\data.dat', sep='\t', header=0)
colsum_1 = sum(frame['dsum'])
colsum_2 = sum(frame['csum'])
full_sum = colsum_1 + colsum_2
length = len(frame)
frame.index = frame.index + 1
frame2 = pd.DataFrame(frame.apply(np.sum, axis=1)).applymap(factorfunc)
frame3 = frame.applymap(factorfunc)
binner = 1
f
Solution
I am the Adams of Adams-Skopek. I want to point out a couple of things:
-
You get some truncation error when you sum or multiply a series of values in different orders. You need to estimate this and account for it.
-
There is a lower limit on sparsity of the data, you need to address this.
-
If you can get a copy of the Pagano and Halverson program (cited in Adams and Skopek), it is useful for comparing it's results with your results. It does the exact hypergeometric test very efficiently.
-
There is a PC version here. You might need to ask the authors for the code.
-
The computer department here can probably locate the original Fortran code for the original version of the Adams Skopek program and give you a copy (which I don't have).
-
You get some truncation error when you sum or multiply a series of values in different orders. You need to estimate this and account for it.
-
There is a lower limit on sparsity of the data, you need to address this.
-
If you can get a copy of the Pagano and Halverson program (cited in Adams and Skopek), it is useful for comparing it's results with your results. It does the exact hypergeometric test very efficiently.
-
There is a PC version here. You might need to ask the authors for the code.
-
The computer department here can probably locate the original Fortran code for the original version of the Adams Skopek program and give you a copy (which I don't have).
Context
StackExchange Code Review Q#61621, answer score: 6
Revisions (0)
No revisions yet.