HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Monte Carlo estimation of the Hypergeometric Function

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
montethefunctionhypergeometriccarloestimation

Problem

I am trying to implement the algorithm described in the paper Statistical Test for the Comparison of Samples from Mutational Spectra (Adams & Skopek, 1986) DOI: 10.1016/0022-2836(87)90669-3:

$$p = \dfrac{\prod\limits_{i=1}^{N}(R_i!)\prod\limits_{j=1}^{M}(C_j!)}{T!\prod\limits_{i=1}^{N}\prod\limits_{j=1}^{M}(X_{ij}!)}$$

The math is easy to follow in this script although very large numbers will be generated. Would using the mpmath module be my best bet for accuracy? I notice that many of my long numbers have an excessive amount of 0s possibly due to rounding errors? I want to optimize this function to generate the p-value as accurately as possible. Speed is important but not the priority. I would prefer to implement the algorithm myself although I understand there may be hypergeometric probability functions in SciPy and PyMC. I have yet to find this specific algorithm implemented in a Python module.

  • How can I improve the precision of the very large numbers? Would working with log computations be ideal?



  • Are there any module hypergeometric functions available for this specific algorithm flavor?



  • Am I using the decimal module appropriately?



data.dat

dsum    csum
1   0
1   2
3   9
2   1
2   1
1   0
1   0
10  0
2   3
3   0
15  0
1   28
etc.


hyperg.py

```
from __future__ import division # Division is now float not integer
import numpy as np
import pandas as pd
from math import factorial as fctr
from decimal import Decimal

def factorfunc(num):
"""Returns factorial of x"""
return Decimal(fctr(num))

def main():
"""Return the Monte Carlo Hypergeometric probability"""
frame = pd.read_csv(r'c:\dev\hyperg\data.dat', sep='\t', header=0)

colsum_1 = sum(frame['dsum'])
colsum_2 = sum(frame['csum'])
full_sum = colsum_1 + colsum_2
length = len(frame)

frame.index = frame.index + 1
frame2 = pd.DataFrame(frame.apply(np.sum, axis=1)).applymap(factorfunc)
frame3 = frame.applymap(factorfunc)

binner = 1
f

Solution

I am the Adams of Adams-Skopek. I want to point out a couple of things:

-
You get some truncation error when you sum or multiply a series of values in different orders. You need to estimate this and account for it.

-
There is a lower limit on sparsity of the data, you need to address this.

-
If you can get a copy of the Pagano and Halverson program (cited in Adams and Skopek), it is useful for comparing it's results with your results. It does the exact hypergeometric test very efficiently.

-
There is a PC version here. You might need to ask the authors for the code.

-
The computer department here can probably locate the original Fortran code for the original version of the Adams Skopek program and give you a copy (which I don't have).

Context

StackExchange Code Review Q#61621, answer score: 6

Revisions (0)

No revisions yet.