HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Unique nucleotide permutations, Python itertools product

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
uniqueproductnucleotidepythonitertoolspermutations

Problem

I'm searching for all the unique possible permutations of nucleotide given a defined length. By unique, I mean the reverse complement would not be counted.


ACGT

For example, permutations of length 1 would be:

  • A:T



  • C:G



  • G:C



  • T:A



But the last two wouldn't be included because they would already be seen.

Length 2 would be:

  • AA:TT



  • AC:GT



  • AG:CT



  • AT:AT



  • CA:TG



  • CC:GG



  • CG:CG



  • CT:AG



  • GA:TC



  • GC:GC



  • GG:CC



  • GT:AC



  • TA:TA



  • TC:GA



  • TG:CA



  • TT:AA



So, 8, 11, 12, 14, 15, 16 wouldn't be included for the same reason, and so on.

def data_count(n, trantab):
    nn = int(n)
    count = 0
    seen = set()

    with open('file.txt','w') as f:
        for number in range(1, nn+1):
            for p in itertools.product('ACGT', repeat=number):
                print(''.join(p)+':'+''.join(p)[::-1].translate(trantab))
                if ''.join(p) in seen:
                    pass
                else:
                    seen.add(''.join(p)[::-1].translate(trantab))
                    count += 1
            #print(x)
            f.write(str(number)+' = '+str(count)+'\n')
            count = 0


It does the job pretty well until 10 or 12, but beyond that it takes a lot of time to output results.

n is the length of the search and trantab is:

intab = 'ACGT'
outtab = 'TGCA'
trantab = str.maketrans(intab, outtab)


to get the reverse complement of the nucleotide.

Is there any way to improve this one?

Solution

To get the best performance depends on what you want.
If you want to only write to the file you should use maths.
If you want to generate them all, well that's almost the best you'll get.

And so if you want to do the latter I'd remove all of the former from the function.
And just create:

def data_count(data, dethp):
    translation = str.maketrans(data, data[::-1])
    for number in range(1, int(depth)+1):
        for product in itertools.product(data, repeat=number):
            product = ''.join(product)
            yield product, product[::-1].translate(translation)


The other bit is a little harder.
What you want to do is find a common factor.
For repeat=2, we get the bullet points in your question.
And so we should we should group them, there's:

  • Groups that add to seen.



  • Groups that aren't counted due to seen.



  • Groups that are the same both sides.



# group 1
('AA', 'TT')
('AC', 'GT')
('AG', 'CT')
('CA', 'TG')
('CC', 'GG')
('GA', 'TC')

# group 2
('TT', 'AA')
('GT', 'AC')
('CT', 'AG')
('TG', 'CA')
('GG', 'CC')
('TC', 'GA')

# group 3
('TA', 'TA')
('GC', 'GC')
('CG', 'CG')
('AT', 'AT')


We can then use the above function to find out more about these groups.
And so we can get the following table:

1 + 2 + 3   | 1 + 3     | 3
------------+-----------+-----
4           | 2         | 0
16          | 10        | 4
64          | 32        | 0
256         | 136       | 16
1024        | 512       | 0
4096        | 2080      | 64
16384       | 8192      | 0
65536       | 32896     | 256
262144      | 131072    | 0
1048576     | 524800    | 1024
4194304     | 2097152   | 0
16777216    | 8390656   | 4096


From this we can see that 1 + 3 is the same as (1 + 2 + 3 + 3) // 2.
We can also find a pattern for 1 + 2 + 3 and for 3.
The first number is \$4 ^ \text{num}\$. The second is almost \$2 ^ \text{num}\$.
And so we can use this to find 1 + 3:

def data_count(depth):
    for n in range(1, int(depth)+1):
        yield n, (4 ** n + (0 if n % 2 else 2 ** n)) // 2


But if you want both, then you should use both functions.

Code Snippets

def data_count(data, dethp):
    translation = str.maketrans(data, data[::-1])
    for number in range(1, int(depth)+1):
        for product in itertools.product(data, repeat=number):
            product = ''.join(product)
            yield product, product[::-1].translate(translation)
# group 1
('AA', 'TT')
('AC', 'GT')
('AG', 'CT')
('CA', 'TG')
('CC', 'GG')
('GA', 'TC')

# group 2
('TT', 'AA')
('GT', 'AC')
('CT', 'AG')
('TG', 'CA')
('GG', 'CC')
('TC', 'GA')

# group 3
('TA', 'TA')
('GC', 'GC')
('CG', 'CG')
('AT', 'AT')
1 + 2 + 3   | 1 + 3     | 3
------------+-----------+-----
4           | 2         | 0
16          | 10        | 4
64          | 32        | 0
256         | 136       | 16
1024        | 512       | 0
4096        | 2080      | 64
16384       | 8192      | 0
65536       | 32896     | 256
262144      | 131072    | 0
1048576     | 524800    | 1024
4194304     | 2097152   | 0
16777216    | 8390656   | 4096
def data_count(depth):
    for n in range(1, int(depth)+1):
        yield n, (4 ** n + (0 if n % 2 else 2 ** n)) // 2

Context

StackExchange Code Review Q#148041, answer score: 2

Revisions (0)

No revisions yet.