patternpythonMinor
Genome string clump finding problem
Viewed 0 times
problemgenomefindingstringclump
Problem
I am trying to solve a bioinformatics problems from a Stepic course.
The problem posed: find clumps of the same pattern within a longer genome.
Motivation: Identifying 3 occurrences of the same pattern within a window of length 500 within a genome allows biologists to find the "replication origin" of a bacterial genome. This replication origin is required for the bacteria to clone itself, and it is therefore significant.
Using Python, I have written a solution that works effectively for small, test data sets, but which is not efficient for real-life genomes. Below I describe the method used.
I've used two core functions.
patternOccurances
This function finds all the starting locations of a short pattern in a longer string.
frequentPatterns
This second function finds all patterns of length
```
def frequentPatterns(text="", k=2,numberOfOccurances=None):
#Frequent Words Problem: Find the frequent k-mers in a string.
# Input: A string text, and an integer k, number of occurances t
# Output: All most frequent k-mers in Text.
words = {}
frequentPatterns = []
#create a dictionary of all patterns of length k in text with their respective frequencies
for i in range(len(text)-k + 1): #iterate over the valid length of the string
a = text[i:i+k]
if a in words:
words[a] = words[a] + 1
else:
words[a] = 1
if numberOfOccurances == None: #gives only the most frequent strings of length k
largestValue=0
for k, v in words.iteritems():
if v > largestValue:
frequentPatterns = []
The problem posed: find clumps of the same pattern within a longer genome.
Motivation: Identifying 3 occurrences of the same pattern within a window of length 500 within a genome allows biologists to find the "replication origin" of a bacterial genome. This replication origin is required for the bacteria to clone itself, and it is therefore significant.
Using Python, I have written a solution that works effectively for small, test data sets, but which is not efficient for real-life genomes. Below I describe the method used.
I've used two core functions.
patternOccurances
This function finds all the starting locations of a short pattern in a longer string.
def patternOccurances2(text="GATATATGCATATACTT", pattern="ATAT"):
occurances = []
position = -1
position = config.genome.find(pattern, position+1)
occurances.append(position)
while position != -1:
position = text.find(pattern, position+1)
occurances.append(position)
return occurances[:-1]frequentPatterns
This second function finds all patterns of length
k in a string that occur at least t times; or, if t is omitted, only the most frequent patterns of length k.```
def frequentPatterns(text="", k=2,numberOfOccurances=None):
#Frequent Words Problem: Find the frequent k-mers in a string.
# Input: A string text, and an integer k, number of occurances t
# Output: All most frequent k-mers in Text.
words = {}
frequentPatterns = []
#create a dictionary of all patterns of length k in text with their respective frequencies
for i in range(len(text)-k + 1): #iterate over the valid length of the string
a = text[i:i+k]
if a in words:
words[a] = words[a] + 1
else:
words[a] = 1
if numberOfOccurances == None: #gives only the most frequent strings of length k
largestValue=0
for k, v in words.iteritems():
if v > largestValue:
frequentPatterns = []
Solution
- Introduction
This code has so many basic problems with documentation, layout, spelling, naming, and so on, that I didn't get around to looking at what it does. In a professional context we would say, "this code is not remotely ready for review" and send it back for you to fix!
It's worth reading the Python style guide (PEP8) and following the advice there. It's not compulsory, but it will make it easier for you to collaborate with other Python programmers and read each others' code. In particular, I'd use 4 spaces indentation, and
lowercase_words_with_underscores for function and method names.- The function patternOccurances2
-
Your main code calls
patternOccurances but when you define the function it's called patternOccurances2.-
There's no documentation for
patternOccurances2. What does it do and how am I supposed to call it?-
This function takes default arguments
text="GATATATGCATATACTT", pattern="ATAT". These do not seem like suitable default arguments. Default arguments are normally used for common values, to save the caller having to specify them. If you want to provide test data, then write a test case separately. See below for one way to do it.-
This function takes its arguments in the order
(text, pattern) but it would be better for it to take the arguments the other way round, for consistency with functions in the standard library like re.search.-
This function starts by calling:
position = config.genome.find(pattern, position+1)What is
config.genome and how is it relevant here? It looks as if it might be a mistake for text.-
It seems inelegant to append a
-1 to the list of occurrences, and then remove it again. It would be better to write the loop like this:result = []
position = -1
while True:
position = text.find(pattern, position + 1)
if position == -1:
return result
result.append(position)-
Instead of returning a list, it's nearly always better in Python to generate the results using the
yield instruction, explained here. This means that you can process the results one by one without having to build up an intermediate list in memory. So I would write the function like this:def pattern_occurrences(pattern, text):
"""Generate the indices of the (possibly overlapping) occurrences of
pattern in text. For example:
>>> list(pattern_occurrences('ATAT', 'GATATATGCATATACTT'))
[1, 3, 9]
"""
position = -1
while True:
position = text.find(pattern, position + 1)
if position == -1:
return
yield positionNotice that the docstring includes an example test case. You can run this and check that the result is correct using the
doctest module.- The function frequentPatterns
-
Something's gone wrong with the indentation in this function. A copy-paste error?
-
The documentation says
Input ... number of occurances t. But the actual argument is called numberOfOccurances, not t.-
The documentation for
frequentPatterns says Output: All most frequent k-mers in text. But actually, only the k-mers that appear at least t times are output.-
There's no explanation in the documentation for the behaviour if you don't specify a value for
t.-
The documentation for
frequentPatterns is in a comment. But that means that I can't read it from the interactive interpreter:>>> help(frequentPatterns)
Help on function frequentPatterns in module __main__:
frequentPatterns(text='', k=2, numberOfOccurances=None)
If you put this documentation into a docstring, like this:
def frequent_patterns(text="", k=2, t=None):
"""Return a list of the most frequent k-mers in a string.
Input: A string text, an integer k, and a number of occurrences t.
Output: All k-mers that appear at least t times in text.
If t is omitted, return a list containing the k-mer that appears
the most often (or the k-mers, if there is a tie).
"""
words = {}
# ...then you'd be able to read it from the interactive interpreter:
>>> help(frequent_patterns)
Help on function frequent_patterns in module __main__:
frequent_patterns(text='', k=2, t=None)
Return a list of the most frequent k-mers in a string.
Input: A string text, an integer k, and a number of occurrences t.
Output: All k-mers that appear at least t times in text.
If t is omitted, return a list containing the k-mer that appears
the most often (or the k-mers, if there is a tie).
Code Snippets
position = config.genome.find(pattern, position+1)result = []
position = -1
while True:
position = text.find(pattern, position + 1)
if position == -1:
return result
result.append(position)def pattern_occurrences(pattern, text):
"""Generate the indices of the (possibly overlapping) occurrences of
pattern in text. For example:
>>> list(pattern_occurrences('ATAT', 'GATATATGCATATACTT'))
[1, 3, 9]
"""
position = -1
while True:
position = text.find(pattern, position + 1)
if position == -1:
return
yield positiondef frequent_patterns(text="", k=2, t=None):
"""Return a list of the most frequent k-mers in a string.
Input: A string text, an integer k, and a number of occurrences t.
Output: All k-mers that appear at least t times in text.
If t is omitted, return a list containing the k-mer that appears
the most often (or the k-mers, if there is a tie).
"""
words = {}
# ...Context
StackExchange Code Review Q#37932, answer score: 5
Revisions (0)
No revisions yet.