HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Optimization for SQLite result set parsing

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
resultsqliteoptimizationparsingforset

Problem

I am retrieving information from an SQLite database that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.

What optimizations can be done?

I am either getting a “Killed” message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter, and it doesn’t have any problems. I am guessing the problem is with the dict generation.

Update for clarity:

I have 20 million rows in my result set from my SQLite database. Each row is of the form:

(2786972, 486255.0, 4125992.0, 'AACAGA', '2005’)


I now need to create a dict that is keyed with the fourth element ‘AACAGA’ of the row. The value that the dict will hold is the third element, but it has to hold the values for all the occurences in the result set.

So, in our case here, ‘AACAGA’ will hold a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read (‘AACAGA’) that is repeated at least three times in succession.

For me to calculate this, I need all the values in the third index as a list keyed by the genome read, in our case ‘AACAGA’. Once I have the list, I can subtract successive values in the list to see if there are three consecutive matches to the length of the read. This is what I aim to accomplish with the dictionary and lists as values.

```
#!/usr/bin/python3.3
import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
dkeys = large_dict.keys()
for k in dkeys:
yield(k, large_dict[k])

def create_result_generator():
conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
c = conn.cursor()
try:
conn.row_factory = sql.Row
sql_string = "select * from sequence_info where kmer_length > 2"
c.execute(sql_

Solution

I have missed the whole point of your program but these comments might be useful to you anyway :

dict_generator

I might be wrong but dict_generator(large_dict) looks like large_dict.iteritems() (Python 2) / large_dict.items() (Python 3).

create_result_generator

while True:
     result_set = c.fetchall()
     if not result_set:
         break
     else:
         return(row for row in result_set)


This loop does not seem very loopy to me. What about :

result_set = c.fetchall()
 if result_set:
     return(row for row in result_set)


Also, it seems to hilight a bigger issue about the fact that the connection might not be closed.

find_longest_tandem_repeat

In :

try:
         sequence_dict[row[3]].append(row[2])
     except KeyError:
         sequence_dict[row[3]] = []
         sequence_dict[row[3]].append(row[2])


seems to be equivalent to

sequence_dict.setdefault(row[3],[]).append(row[2])


Also, I don't know where row is coming from at this stage.

In :

for key,value in dict_generator(sequence_dict):
     sortList = sorted(value)
     for i in range (0, (len(sortList)-1)):
         if((sorList[i+1]-sorList[i]) == (sorList[i+2]-sorList[i+1])
             == (sorList[i+3]-sorList[i+2]) == (len(key))):
               tandem_repeat[key] = True
     sortList = []


Naming (sortList/sorList) is either wrong or very confusing.
There is no point in doing sortList = [].
You can probably break once you've done tandem_repeat[key] = True because new iterations won't do much anyway.

print(max(k for k in tandem_repeat.keys() if tandem_repeat[k] is True))


might as well be :

print(max(k for k,v in tandem_repeat.items() if v))

Code Snippets

while True:
     result_set = c.fetchall()
     if not result_set:
         break
     else:
         return(row for row in result_set)
result_set = c.fetchall()
 if result_set:
     return(row for row in result_set)
try:
         sequence_dict[row[3]].append(row[2])
     except KeyError:
         sequence_dict[row[3]] = []
         sequence_dict[row[3]].append(row[2])
sequence_dict.setdefault(row[3],[]).append(row[2])
for key,value in dict_generator(sequence_dict):
     sortList = sorted(value)
     for i in range (0, (len(sortList)-1)):
         if((sorList[i+1]-sorList[i]) == (sorList[i+2]-sorList[i+1])
             == (sorList[i+3]-sorList[i+2]) == (len(key))):
               tandem_repeat[key] = True
     sortList = []

Context

StackExchange Code Review Q#41551, answer score: 3

Revisions (0)

No revisions yet.