Iterate list to map entries in Python

Submitted by: @import:stackexchange-codereview

Problem

I have two files, namely:

File1:

CL1 AA  XX  YY  ZZ  SS \n
CL2 3_b AA


File2:

AA  string1
AA  string2
3_b string3


My expected output is:

CL1 AA  string1
CL1 AA  string2
CL2 3_b string3
CL2 AA  string1
CL2 AA  string2


For this I wrote the following code:

import numpy as np
print("Reading Files...")
header = open('File1', 'r')
cl = header.readlines()
infile = np.genfromtxt('File2', dtype='str', skip_header=1)
new_array = []

for j in range(len(infile)):
    for row in cl:
        element = row.split("\t")
        ele_size = len(element)
        for i in range(0, ele_size):
            if np.core.defchararray.equal(infile[j,0], element[i]):
                clust = element[0]
                match1 = infile[j,0]
                match2 = infile[j,1]
                combo = "\t".join([clust, match1, match2])
                new_array.append(combo)

np.savetxt('output.txt',new_array, fmt='%s', delimiter='\t')


This generates the output I want. But File2 has some 700000 lines and File1 has some 65000 clusters, so the iteration takes a very long time. Can anyone suggest a more efficient way to parse it?

Is it possible to keep first file as a list and the second file as a dictionary, and then iterate over key values?

Solution

Your code loops 700 000 times 65 000 times the number of elements in each cluster. That is about 45 billion passes through the outer two loops alone, and almost all of them find no match.

A better approach is to read the smaller file (the clusters) into memory and then read the larger file line by line. Because every lookup starts from a key in the larger file, it also makes sense to invert the mapping: instead of a dict with the cluster as key and its member keys as values, use each member key as the dict key and store the list of all clusters it belongs to. Each line of the larger file then needs only a single dict lookup.

This approach keeps the memory footprint low and should be reasonably efficient to work with. Here is some code to start you off. You might need to adjust it a little depending on whether you split on spaces or tabs, but with it I get the output you want.

from collections import defaultdict

def build_cluster_dict(filename):
    """Return a dict mapping each key to the list of clusters it appears in."""

    result = defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            elements = line.strip().split()
            if not elements:
                # Skip blank lines instead of failing on elements[0]
                continue

            # First field is the cluster name, the rest are its keys
            cluster = elements[0]
            for key in elements[1:]:
                result[key].append(cluster)

    return result

def build_output(cluster_dict, filename, output_filename):

    with open(filename) as infile, open(output_filename, 'w') as outfile:
        for line in infile:
            key, text = line.strip().split()

            if key in cluster_dict:
                for cluster in cluster_dict[key]:
                    outfile.write('{}\t{}\t{}\n'.format(cluster, key, text))

def main():

    cluster_dict = build_cluster_dict("cluster.txt")

    print(cluster_dict)

    build_output(cluster_dict, "file2.txt", "output.txt")

if __name__ == '__main__':
    main()


Note that I've left out numpy, as I don't see the need for it in this context. I've also used a single with statement with two context managers, so the input and output files are opened in the same context. I left the print(cluster_dict) call in just to show the intermediate dictionary it builds. For your test files this gave the following output (somewhat formatted):

defaultdict(<type 'list'>, 
            {'AA': ['CL1', 'CL2'], 
             'SS': ['CL1'], 
             'YY': ['CL1'], 
             'XX': ['CL1'],
             '3_b': ['CL2'],
             'ZZ': ['CL1']})
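
To check the approach against the sample data from the question without creating any files, the same logic can be run on in-memory text. This is only a minimal sketch: the names map_keys_to_clusters and join_entries are hypothetical and not part of the code above, and it assumes whitespace-separated fields with a single-token text column.

```python
from collections import defaultdict
import io


def map_keys_to_clusters(cluster_lines):
    """Map each key to the list of clusters that contain it."""
    result = defaultdict(list)
    for line in cluster_lines:
        elements = line.split()
        if not elements:  # skip blank lines
            continue
        cluster, keys = elements[0], elements[1:]
        for key in keys:
            result[key].append(cluster)
    return result


def join_entries(cluster_dict, entry_lines):
    """Yield 'cluster<TAB>key<TAB>text' for every entry whose key is known."""
    for line in entry_lines:
        key, text = line.split()
        # .get() avoids creating empty entries for unknown keys
        for cluster in cluster_dict.get(key, []):
            yield "{}\t{}\t{}".format(cluster, key, text)


# Sample data from the question, held in memory instead of on disk.
file1 = io.StringIO("CL1 AA XX YY ZZ SS\nCL2 3_b AA\n")
file2 = io.StringIO("AA string1\nAA string2\n3_b string3\n")

rows = list(join_entries(map_keys_to_clusters(file1), file2))
for row in sorted(rows):
    print(row)
```

The rows are produced in the order of the second file; sorting them gives the same order as the expected output listed in the question.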


Addendum: Locate erroneous input line

In the comments, the OP said there was a problem on the key, text line. To find the input line that causes it, these lines:

for line in infile:
    key, text = line.strip().split()


can be replaced with:

for line_number, line in enumerate(infile, start=1):
    try:
        key, text = line.strip().split()
    except ValueError:
        # Raised when the line does not split into exactly two fields
        print("Error on line {}: {}".format(line_number, line))
        # Option a) Use empty text
        # key = line.strip()
        # text = ""
        # Option b) Continue with next line
        continue


This code catches the error and, as it stands, prints a message with the offending line. I've set it to use option b), that is, to continue with the next line. If you would rather use an empty string for the text and still write the line to the output file, uncomment option a) and comment out option b).
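
One possible cause of that ValueError is a text column that itself contains whitespace, or a line with no text at all. This is an assumption, since the failing lines are not shown. Under it, a hedged sketch of a more tolerant parser could use str.split with a maximum of one split; the function name parse_entry is hypothetical:

```python
def parse_entry(line):
    """Split a line into (key, text); return None for blank lines.

    Assumes the key is the first whitespace-delimited token and that
    everything after it is the text, which may itself contain spaces.
    """
    # sep=None splits on runs of whitespace; maxsplit=1 stops after the key
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    key = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return key, text


print(parse_entry("AA  string1\n"))
print(parse_entry("AA  two words\n"))
print(parse_entry("3_b\n"))
print(parse_entry("\n"))
```

Whether a missing text should become an empty string or the line should be skipped depends on the data, so the choice mirrors options a) and b) above.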

Context

StackExchange Code Review Q#114786, answer score: 6