HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Genome string clump finding problem

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
problemgenomefindingstringclump

Problem

I am trying to solve a bioinformatics problems from a Stepic course.

The problem posed: find clumps of the same pattern within a longer genome.

Motivation: Identifying 3 occurrences of the same pattern within a window of length 500 within a genome allows biologists to find the "replication origin" of a bacterial genome. This replication origin is required for the bacteria to clone itself, and it is therefore significant.

Using Python, I have written a solution that works effectively for small, test data sets, but which is not efficient for real-life genomes. Below I describe the method used.

I've used two core functions.

patternOccurances

This function finds all the starting locations of a short pattern in a longer string.

def patternOccurances2(text="GATATATGCATATACTT", pattern="ATAT"):

  occurances = []
  position = -1

  position = config.genome.find(pattern, position+1)
  occurances.append(position)

  while position != -1:
    position = text.find(pattern, position+1)
    occurances.append(position)
  return occurances[:-1]


frequentPatterns

This second function finds all patterns of length k in a string that occur at least t times; or, if t is omitted, only the most frequent patterns of length k.

```
def frequentPatterns(text="", k=2,numberOfOccurances=None):

#Frequent Words Problem: Find the frequent k-mers in a string.
# Input: A string text, and an integer k, number of occurances t
# Output: All most frequent k-mers in Text.

words = {}
frequentPatterns = []

#create a dictionary of all patterns of length k in text with their respective frequencies
for i in range(len(text)-k + 1): #iterate over the valid length of the string
a = text[i:i+k]
if a in words:
words[a] = words[a] + 1
else:
words[a] = 1

if numberOfOccurances == None: #gives only the most frequent strings of length k

largestValue=0
for k, v in words.iteritems():
if v > largestValue:
frequentPatterns = []

Solution


  1. Introduction



This code has so many basic problems with documentation, layout, spelling, naming, and so on, that I didn't get around to looking at what it does. In a professional context we would say, "this code is not remotely ready for review" and send it back for you to fix!

It's worth reading the Python style guide (PEP8) and following the advice there. It's not compulsory, but it will make it easier for you to collaborate with other Python programmers and read each others' code. In particular, I'd use 4 spaces indentation, and lowercase_words_with_underscores for function and method names.

  1. The function patternOccurances2



-
Your main code calls patternOccurances but when you define the function it's called patternOccurances2.

-
There's no documentation for patternOccurances2. What does it do and how am I supposed to call it?

-
This function takes default arguments text="GATATATGCATATACTT", pattern="ATAT". These do not seem like suitable default arguments. Default arguments are normally used for common values, to save the caller having to specify them. If you want to provide test data, then write a test case separately. See below for one way to do it.

-
This function takes its arguments in the order (text, pattern) but it would be better for it to take the arguments the other way round, for consistency with functions in the standard library like re.search.

-
This function starts by calling:

position = config.genome.find(pattern, position+1)


What is config.genome and how is it relevant here? It looks as if it might be a mistake for text.

-
It seems inelegant to append a -1 to the list of occurrences, and then remove it again. It would be better to write the loop like this:

result = []
position = -1
while True:
    position = text.find(pattern, position + 1)
    if position == -1:
        return result
    result.append(position)


-
Instead of returning a list, it's nearly always better in Python to generate the results using the yield instruction, explained here. This means that you can process the results one by one without having to build up an intermediate list in memory. So I would write the function like this:

def pattern_occurrences(pattern, text):
    """Generate the indices of the (possibly overlapping) occurrences of
    pattern in text. For example:

    >>> list(pattern_occurrences('ATAT', 'GATATATGCATATACTT'))
    [1, 3, 9]

    """
    position = -1
    while True:
        position = text.find(pattern, position + 1)
        if position == -1:
            return
        yield position


Notice that the docstring includes an example test case. You can run this and check that the result is correct using the doctest module.

  1. The function frequentPatterns



-
Something's gone wrong with the indentation in this function. A copy-paste error?

-
The documentation says Input ... number of occurances t. But the actual argument is called numberOfOccurances, not t.

-
The documentation for frequentPatterns says Output: All most frequent k-mers in text. But actually, only the k-mers that appear at least t times are output.

-
There's no explanation in the documentation for the behaviour if you don't specify a value for t.

-
The documentation for frequentPatterns is in a comment. But that means that I can't read it from the interactive interpreter:

>>> help(frequentPatterns)
Help on function frequentPatterns in module __main__:

frequentPatterns(text='', k=2, numberOfOccurances=None)


If you put this documentation into a docstring, like this:

def frequent_patterns(text="", k=2, t=None):
    """Return a list of the most frequent k-mers in a string.

    Input: A string text, an integer k, and a number of occurrences t.
    Output: All k-mers that appear at least t times in text.

    If t is omitted, return a list containing the k-mer that appears
    the most often (or the k-mers, if there is a tie).

    """
    words = {}
    # ...


then you'd be able to read it from the interactive interpreter:

>>> help(frequent_patterns)
Help on function frequent_patterns in module __main__:

frequent_patterns(text='', k=2, t=None)
Return a list of the most frequent k-mers in a string.

Input: A string text, an integer k, and a number of occurrences t.
Output: All k-mers that appear at least t times in text.

If t is omitted, return a list containing the k-mer that appears
the most often (or the k-mers, if there is a tie).

Code Snippets

position = config.genome.find(pattern, position+1)
result = []
position = -1
while True:
    position = text.find(pattern, position + 1)
    if position == -1:
        return result
    result.append(position)
def pattern_occurrences(pattern, text):
    """Generate the indices of the (possibly overlapping) occurrences of
    pattern in text. For example:

    >>> list(pattern_occurrences('ATAT', 'GATATATGCATATACTT'))
    [1, 3, 9]

    """
    position = -1
    while True:
        position = text.find(pattern, position + 1)
        if position == -1:
            return
        yield position
def frequent_patterns(text="", k=2, t=None):
    """Return a list of the most frequent k-mers in a string.

    Input: A string text, an integer k, and a number of occurrences t.
    Output: All k-mers that appear at least t times in text.

    If t is omitted, return a list containing the k-mer that appears
    the most often (or the k-mers, if there is a tie).

    """
    words = {}
    # ...

Context

StackExchange Code Review Q#37932, answer score: 5

Revisions (0)

No revisions yet.