Finding roughly matching genome sequences in Python dictionary
Problem
The purpose of my code here is to play a part in genome sequencing analysis, and while functional it takes days to run, so I am looking for any way I can improve its speed. The input is up to 500 million lines long (making code efficiency important) and contains sequencing reads and corresponding info. Each read takes up 4 lines within the input file and looks something like this:

```
@A001              <- header
AAAAACCCCCCCCCCCC  <- seq read (finalRead)
+
#################  <- quality (trimmed_quality)
```

The portion of my code that is very slow takes a dictionary as input, which contains all of the data found within the input sequencing file and is in the form shown below:

```
duplexDict[umi] = {'header': header, 'seq': finalRead, 'qual': trimmed_quality}
```

In the first part of the code I am looking for pairs of sequences by checking for similar keys (termed umi in the code). The goal is to find keys that, when converted to the complement sequence, differ by only a single letter. Then, for each key, if there is exactly one closely matching key, the associated dictionaries are retained. If there are no matches or more than one matching key, all of these keys should be ignored.

```
from Levenshtein import distance
from Bio.Seq import Seq

deDuplexDict = {}  # dict that will contain key pairs
finalList = []     # list to keep track of valid key pairs
for i in duplexDict:  # dict with sequencing file info
    tempList = []
    for j in duplexDict:
        complement = str(Seq(j).complement())  # the complementary sequence
        if distance(i, complement) <= 1:  # find similar umi/read seq pairs
            tempList.append(j)  # make a list of all similar pairs
    # only keep a complementary pair if there is exactly one matching key
    if len(tempList) == 1:
        j = tempList[0]
        if i not in finalList and j not in finalList:
            finalList.append(i)
            finalList.append(j)

# only retain those dict values that are true pairs
for key in finalList:
    deDuplexDict[key] = duplexDict[key]
```
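For reference, the pairing logic above can be reproduced on a toy dictionary without the Biopython and Levenshtein dependencies. The `complement` and `mismatches` helpers below are simple stand-ins, valid only for equal-length, ACGT-only keys; the toy `duplexDict` is made up for illustration:

```python
def complement(seq):
    # stand-in for str(Seq(seq).complement()) from Biopython
    return seq.translate(str.maketrans("ACGT", "TGCA"))

def mismatches(a, b):
    # number of differing positions; equals the Levenshtein distance
    # when both strings have the same length and no indels occur
    return sum(x != y for x, y in zip(a, b))

# toy stand-in for duplexDict: "TTGC" complements to "AACG", so those
# two keys form exactly one pair; "GGGG" matches nothing
duplexDict = {
    "AACG": {"header": "@A001"},
    "TTGC": {"header": "@A002"},
    "GGGG": {"header": "@A003"},
}

finalList = []
for i in duplexDict:
    tempList = [j for j in duplexDict if mismatches(i, complement(j)) <= 1]
    if len(tempList) == 1:
        j = tempList[0]
        if i not in finalList and j not in finalList:
            finalList.append(i)
            finalList.append(j)

deDuplexDict = {key: duplexDict[key] for key in finalList}
print(finalList)  # → ['AACG', 'TTGC']
```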
Solution
You need a faster search method
You are comparing every entry in `duplexDict` directly with every entry in `duplexDict`. This means the number of operations increases with the square of the number of entries in `duplexDict`. This stands out from the lines:

```
for i in duplexDict:
    ...
    for j in duplexDict:
```

More formally, your algorithm runs in \$\mathcal{O}(n^2)\$, where n is the number of entries in the input dictionary. So, for 500 million (5e8) reads of sequence data, you need to run about 250 thousand trillion (2.5e17) operations. This is why it takes days to run.
You will need to index your reads based on the sequences themselves. Find and implement an architecture, whether hash tables, binary trees, or something else, that allows fast searching of the input list of sequencing reads.
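The payoff of a hash-based index is easy to see with Python's built-in containers: membership tests on a `set` (or key lookups in a `dict`) take constant time on average, whereas `in` on a plain list scans every element. A minimal illustration with made-up reads:

```python
reads = ["ACGT", "TTGC", "GGCC", "AATT"]

# list membership: Python walks the list element by element, O(n)
assert "GGCC" in reads

# set membership: Python hashes the string once and jumps straight to
# its bucket, O(1) on average, regardless of how many reads are stored
read_index = set(reads)
assert "GGCC" in read_index
assert "AAAA" not in read_index
```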
Of course, a hashing method is built into Python's `dict`. There is no hard limit on the length of the key strings, and the number of entries you can put in the dict is limited only by available memory.

Using dictionary search
In your case, in order to use Python's built-in dictionary to make and search the hash table, you might first do something like this:
```
seq_dict = {}
for entry in duplexDict.values():
    seq_dict[entry['seq']] = {'info': entry['header']}  # add whatever info you need
                                                        # to find the original entry
                                                        # again in duplexDict
```

The resulting dictionary has the sequences themselves as the keys. Searching for a particular sequence takes \$\mathcal{O}(1)\$ operations.
(Note: I am assuming you need to be able to refer back to the original `duplexDict` once you have your hits. Then you don't need all of the extra information to accompany the sequencing reads in the new dictionary. If each entry in `duplexDict` has an identifier like `umi`, just put that alone as the value in the new dictionary.)

Generate mismatches and search
As you are only looking for sequences different by one base, you can just generate all possible mismatches. If you eventually want to include other types of sequence similarity, you will need to use more specialized sequence analysis tools (such as BLAST).
So, you will need a simple function to generate all possible one-base-pair mismatches from each input sequence. Assuming you are using only genomic bases A,T,C,G, for each sequence, there will be 3 x n possible one-base-pair mismatches, where n is the length of the sequencing reads.
```
def get_one_bp_mismatches(seq):
    mismatches = []
    bases = ['A', 'T', 'G', 'C']
    for e, base in enumerate(seq):
        for b in bases:
            if b != base:  # skip the original base, keeping only true mismatches
                mismatches.append(seq[:e] + b + seq[e+1:])
    return mismatches
```

Then, search the dictionary like this:
```
for seq_list, info in [(get_one_bp_mismatches(entry['seq']), entry['header'])
                       for entry in duplexDict.values()]:
    for seq in seq_list:
        if seq in seq_dict:
            finalList.append({'search_seq': info,
                              'found_seq': seq_dict[seq]['info']})
```

Your `finalList` will contain all matching pairs, identified by whatever information you use to look them up in the original `duplexDict`.

The whole search process will take on the order of \$\mathcal{O}(n)\$ operations, and should likely finish within minutes for 500 million sequencing reads. You can then use the last lines of your existing code to generate the output file.
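One detail worth noting: the original code compares each key against the *complement* of the other keys, so if you need to preserve that behavior, generate the candidate lookups from the complement of each key rather than from the key itself. A sketch, assuming equal-length, ACGT-only keys; the `complement` helper is a plain-string stand-in for Biopython's `Seq.complement()`, and `candidate_keys` is a hypothetical helper name:

```python
def complement(seq):
    # plain-string stand-in for str(Seq(seq).complement())
    return seq.translate(str.maketrans("ACGT", "TGCA"))

def get_one_bp_mismatches(seq):
    mismatches = []
    for e, base in enumerate(seq):
        for b in "ATGC":
            if b != base:
                mismatches.append(seq[:e] + b + seq[e+1:])
    return mismatches

def candidate_keys(umi):
    # the exact complement (distance 0) plus all of its
    # one-base variants (distance 1)
    comp = complement(umi)
    return [comp] + get_one_bp_mismatches(comp)

# each 4-base key yields 1 + 3*4 = 13 candidates to look up
print(len(candidate_keys("AAAA")))        # → 13
print("TTAT" in candidate_keys("AAAA"))   # → True: one base off "TTTT"
```

Each candidate can then be tested against `seq_dict` exactly as in the loop above, keeping the overall search linear in the number of reads.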
StackExchange Code Review Q#156490, answer score: 8