Finding roughly matching genome sequences in Python dictionary
Problem
The purpose of my code here is to play a part in genome sequencing analysis, and while functional it takes days to run, so I am looking for any way I can improve its speed. The input is up to 500 million lines long (making code efficiency important) and contains sequencing reads and corresponding info. Each read takes up 4 lines within the input file and looks something like this:

```
@A001              <- header
AAAAACCCCCCCCCCCC  <- seq read (finalRead)
+
#################  <- quality (trimmed_quality)
```

The portion of my code that is very slow takes a dictionary as input, which contains all of the data found within the input sequencing file and is in the form shown below:

```
duplexDict[umi] = {'header': header, 'seq': finalRead, 'qual': trimmed_quality}
```

In the first part of the code I am looking for pairs of sequences by checking for similar keys (termed umi in the code). The goal is to find keys that, when converted to the complement sequence, differ by only a single letter. Then, for each key, if there is exactly one closely matching key, the associated dictionaries are retained. If there are no matches or more than one matching key, all of these keys should be ignored.

```
from Levenshtein import distance
from Bio.Seq import Seq

deDuplexDict = {}  # dict that will contain key pairs
finalList = []     # list to keep track of valid key pairs
for i in duplexDict:  # dict with sequencing file info
    tempList = []
    for j in duplexDict:
        complement = str(Seq(j).complement())  # the complementary sequence
        if distance(i, complement) <= 1:  # find similar umi/read seq pairs
            tempList.append(j)  # make a list of all similar pairs
    # only keep a complementary pair if there is exactly one matching key
    if len(tempList) == 1:
        j = tempList[0]
        if i not in finalList and j not in finalList:
            finalList.append(i)
            finalList.append(j)

# only retain those dict values that are true pairs
for key in finalList:
    deDuplexDict[key] = duplexDict[key]
```
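For reference, the pairing logic above can be reproduced on a toy dictionary without the Biopython and Levenshtein dependencies. The `complement` and `mismatches` helpers below are simple stand-ins, valid only for equal-length, ACGT-only keys; the toy `duplexDict` is made up for illustration:

```python
def complement(seq):
    # stand-in for str(Seq(seq).complement()) from Biopython
    return seq.translate(str.maketrans("ACGT", "TGCA"))

def mismatches(a, b):
    # number of differing positions; equals the Levenshtein distance
    # when both strings have the same length and no indels occur
    return sum(x != y for x, y in zip(a, b))

# toy stand-in for duplexDict: "TTGC" complements to "AACG", so those
# two keys form exactly one pair; "GGGG" matches nothing
duplexDict = {
    "AACG": {"header": "@A001"},
    "TTGC": {"header": "@A002"},
    "GGGG": {"header": "@A003"},
}

finalList = []
for i in duplexDict:
    tempList = [j for j in duplexDict if mismatches(i, complement(j)) <= 1]
    if len(tempList) == 1:
        j = tempList[0]
        if i not in finalList and j not in finalList:
            finalList.append(i)
            finalList.append(j)

deDuplexDict = {key: duplexDict[key] for key in finalList}
print(finalList)  # → ['AACG', 'TTGC']
```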
Solution
You need a faster search method
You are comparing every entry in `duplexDict` directly with every entry in `duplexDict`. This means the number of operations increases with the square of the number of entries in `duplexDict`. This stands out from the lines:

```
for i in duplexDict:
    ...
    for j in duplexDict:
```

More formally, your algorithm runs in \$\mathcal{O}(n^2)\$, where n is the number of entries in the input dictionary. So, for 500 million (5e8) reads of sequence data, you need to run about 250 thousand trillion (2.5e17) operations. This is why it takes days to run.
You will need to index your reads based on the sequences themselves. Find and implement an architecture, whether hash tables, binary trees, or something else, that allows fast searching of the input list of sequencing reads.
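The payoff of a hash-based index is easy to see with Python's built-in containers: membership tests on a `set` (or key lookups in a `dict`) take constant time on average, whereas `in` on a plain list scans every element. A minimal illustration with made-up reads:

```python
reads = ["ACGT", "TTGC", "GGCC", "AATT"]

# list membership: Python walks the list element by element, O(n)
assert "GGCC" in reads

# set membership: Python hashes the string once and jumps straight to
# its bucket, O(1) on average, regardless of how many reads are stored
read_index = set(reads)
assert "GGCC" in read_index
assert "AAAA" not in read_index
```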
Of course, a hashing method is built into Python's `dict`. There is no hard limit on the length of the key strings, and the number of entries you can put in the dict is limited only by available memory.

Using dictionary search
In your case, in order to use Python's built-in dictionary to make and search the hash table, you might first do something like this:
```
seq_dict = {}
for entry in duplexDict.values():
    seq_dict[entry['seq']] = {'info': entry['header']}  # add whatever info you need
                                                        # to find the original entry
                                                        # again in duplexDict
```

The resulting dictionary has the sequences themselves as the keys. Searching for a particular sequence takes \$\mathcal{O}(1)\$ operations.
(Note: I am assuming you need to be able to refer back to the original `duplexDict` once you have your hits. Then you don't need all of the extra information to accompany the sequencing reads in the new dictionary. If each entry in `duplexDict` has an identifier like `umi`, just put that alone as the value in the new dictionary.)

Generate mismatches and search
As you are only looking for sequences different by one base, you can just generate all possible mismatches. If you eventually want to include other types of sequence similarity, you will need to use more specialized sequence analysis tools (such as BLAST).
So, you will need a simple function to generate all possible one-base-pair mismatches from each input sequence. Assuming you are using only genomic bases A,T,C,G, for each sequence, there will be 3 x n possible one-base-pair mismatches, where n is the length of the sequencing reads.
```
def get_one_bp_mismatches(seq):
    mismatches = []
    bases = ['A', 'T', 'G', 'C']
    for e, base in enumerate(seq):
        for b in bases:
            if b != base:  # skip the original base, keeping only true mismatches
                mismatches.append(seq[:e] + b + seq[e+1:])
    return mismatches
```

Then, search the dictionary like this:
```
for seq_list, info in [(get_one_bp_mismatches(entry['seq']), entry['header'])
                       for entry in duplexDict.values()]:
    for seq in seq_list:
        if seq in seq_dict:
            finalList.append({'search_seq': info,
                              'found_seq': seq_dict[seq]['info']})
```

Your `finalList` will contain all matching pairs, identified by whatever information you use to look them up in the original `duplexDict`.

The whole search process will take on the order of \$\mathcal{O}(n)\$ operations, and should likely finish within minutes for 500 million sequencing reads. You can then use the last lines of your existing code to generate the output file.
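One detail worth noting: the original code compares each key against the *complement* of the other keys, so if you need to preserve that behavior, generate the candidate lookups from the complement of each key rather than from the key itself. A sketch, assuming equal-length, ACGT-only keys; the `complement` helper is a plain-string stand-in for Biopython's `Seq.complement()`, and `candidate_keys` is a hypothetical helper name:

```python
def complement(seq):
    # plain-string stand-in for str(Seq(seq).complement())
    return seq.translate(str.maketrans("ACGT", "TGCA"))

def get_one_bp_mismatches(seq):
    mismatches = []
    for e, base in enumerate(seq):
        for b in "ATGC":
            if b != base:
                mismatches.append(seq[:e] + b + seq[e+1:])
    return mismatches

def candidate_keys(umi):
    # the exact complement (distance 0) plus all of its
    # one-base variants (distance 1)
    comp = complement(umi)
    return [comp] + get_one_bp_mismatches(comp)

# each 4-base key yields 1 + 3*4 = 13 candidates to look up
print(len(candidate_keys("AAAA")))        # → 13
print("TTAT" in candidate_keys("AAAA"))   # → True: one base off "TTTT"
```

Each candidate can then be tested against `seq_dict` exactly as in the loop above, keeping the overall search linear in the number of reads.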
StackExchange Code Review Q#156490, answer score: 8