Iterate list to map entries in Python
Problem
I have two files, namely:
File1:
CL1 AA XX YY ZZ SS \n
CL2 3_b AA

File2:
AA string1
AA string2
3_b string3

My expected output is:
CL1 AA string1
CL1 AA string2
CL2 3_b string3
CL2 AA string1
CL2 AA string2

For this I wrote the following code:

import numpy as np

print("Reading Files...")
header = open('File1', 'r')
cl = header.readlines()
infile = np.genfromtxt('File2', dtype='str', skip_header=1)
new_array = []
for j in range(len(infile)):
    for row in cl:
        element = row.split("\t")
        ele_size = len(element)
        for i in range(0, ele_size):
            if np.core.defchararray.equal(infile[j,0], element[i]):
                clust = element[0]
                match1 = infile[j,0]
                match2 = infile[j,1]
                combo = "\t".join([clust, match1, match2])
                new_array.append(combo)
np.savetxt('output.txt',new_array, fmt='%s', delimiter='\t')

This generates the output I desire. But since the file has some 700000 lines in file2 and some 65000 clusters, it takes a huge time to iterate. Can anyone suggest an efficient way to parse it?
Is it possible to keep first file as a list and the second file as a dictionary, and then iterate over key values?
Solution
Your code loops 700,000 times 65,000 times the number of elements in each cluster. That is a lot of iterations, and most of them do no useful work.
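To put a rough number on that, the sketch below simply multiplies the sizes quoted in the question. The question does not say how many elements a cluster has, so `avg_elements_per_cluster` is a placeholder assumption, and the operation count for the dictionary approach is only an approximation.

```python
# Sizes quoted in the question.
file2_lines = 700_000
clusters = 65_000

# Assumption: the real average cluster size is unknown; 5 is a placeholder
# taken from the longest example row (CL1 has 5 members).
avg_elements_per_cluster = 5

# Nested loops: every File2 line is compared with every member of every cluster.
nested_comparisons = file2_lines * clusters * avg_elements_per_cluster

# Dictionary approach: one pass to index File1, then one lookup per File2 line.
dict_operations = clusters * avg_elements_per_cluster + file2_lines

print(f"nested loops: {nested_comparisons:,} comparisons")
print(f"dict lookup:  {dict_operations:,} operations")
```

Under that assumption the nested loops perform on the order of 10^11 comparisons, while the dictionary approach needs roughly a million operations.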
A better approach is to read the smaller file into memory and then read the larger file line by line. In addition, since you iterate over every row of the smaller file trying to match each key, it makes sense to switch from a dict with the cluster as key and its member keys as values to one that uses the member keys as dict keys and lists all the clusters each key belongs to. This keeps the memory footprint low and should still be efficient to work with. Here is some code to start you off. You might need to adjust the splitting (spaces versus tabs), but I get your wanted output using this.

from collections import defaultdict


def build_cluster_dict(filename):
    """Return a dict of keys with all the clusters each key exists in."""
    result = defaultdict(list)
    with open(filename) as infile:
        for line in infile:
            elements = line.strip().split()
            cluster = elements[0]
            for key in elements[1:]:
                result[key].append(cluster)
    return result


def build_output(cluster_dict, filename, output_filename):
    with open(filename) as infile, open(output_filename, 'w') as outfile:
        for line in infile:
            key, text = line.strip().split()
            if key in cluster_dict:
                for cluster in cluster_dict[key]:
                    outfile.write('{}\t{}\t{}\n'.format(cluster, key, text))


def main():
    cluster_dict = build_cluster_dict("cluster.txt")
    print(cluster_dict)
    build_output(cluster_dict, "file2.txt", "output.txt")


if __name__ == '__main__':
    main()

Note that I've left out the use of numpy, as I don't see the need for it in this context. I've also used a double
with statement to open both the input and output files at the same time in the same context. I left the print(cluster_dict) in there just to show the intermediate dict it generates. For your test files this gave the following output (somewhat formatted):

defaultdict(<type 'list'>,
            {'AA': ['CL1', 'CL2'],
             'SS': ['CL1'],
             'YY': ['CL1'],
             'XX': ['CL1'],
             '3_b': ['CL2'],
             'ZZ': ['CL1']})

Addendum: Locate erroneous input line
In the comments the OP said there was a problem in the key, text line. To detect this, these lines:

for line in infile:
    key, text = line.strip().split()

can be replaced with:

# start=1 so the reported number matches the line number shown in an editor
for line_number, line in enumerate(infile, start=1):
    try:
        key, text = line.strip().split()
    except ValueError:
        print("Error on line {}: {}".format(line_number, line))
        ## Option a) Use empty text
        #key = line.strip()
        #text = ""
        # Option b) Continue with next line
        continue

This code will catch the error situation, and as it stands it will display an error message with the offending line. I've set it to use option b), that is, to continue with the next line. If you want to use an empty string and still write the line to the output file, uncomment option a) and comment out option b).
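
If the failing lines turn out to be ones where the text itself contains whitespace, or where the text is missing altogether (an assumption; the comments do not say what the bad lines look like), one alternative is to split only once, so that everything after the key is kept as the text. The sketch below is self-contained and uses in-memory stand-ins for the two files; the helper names `split_key_text` and `join_clusters` are made up for this illustration and are not part of the code above.

```python
import io
from collections import defaultdict


def split_key_text(line):
    """Split a line into (key, text); text may be empty or contain spaces."""
    parts = line.strip().split(None, 1)  # split on the first whitespace run only
    if not parts:
        return None  # blank line
    key = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return key, text


def join_clusters(cluster_lines, data_lines):
    """Yield 'cluster<TAB>key<TAB>text' for every key that is in a cluster."""
    cluster_dict = defaultdict(list)
    for line in cluster_lines:
        elements = line.split()
        if elements:
            for key in elements[1:]:
                cluster_dict[key].append(elements[0])

    for line in data_lines:
        parsed = split_key_text(line)
        if parsed is None:
            continue
        key, text = parsed
        # .get() avoids inserting empty lists for unknown keys
        for cluster in cluster_dict.get(key, []):
            yield "{}\t{}\t{}".format(cluster, key, text)


# In-memory stand-ins for the two files, based on the sample data in the question.
file1 = io.StringIO("CL1 AA XX YY ZZ SS\nCL2 3_b AA\n")
file2 = io.StringIO("AA string1\nAA string with spaces\n3_b\n\n")

rows = list(join_clusters(file1, file2))
for row in rows:
    print(row)
```

As with the code above, rows come out in the order of the second file rather than grouped by cluster, so sort the output afterwards if the grouping matters.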
Context
StackExchange Code Review Q#114786, answer score: 6