Optimizing python code for big data sets
Problem
I'm trying to optimize a Python script that works on big data sets. It takes in a file with a list of keywords and scores, and a file loaded with data from the Twitter API. The program does a keyword match against the tweet text. At the end of the program I want to produce an average score for each term found in the `text` object of the JSON file, e.g.
sad 3
With sad being the keyword and 3 being the average score.
It's running way too slow, but I'm new to Python, coming from a PHP background, and I think I'm doing things the PHP way in Python.
How can I get this code to run faster?
```
import sys
import json
import re

def findRecord(key, records):
    for r in records:
        if r[0] == key:
            return r

def average_records(records):
    for r in records:
        if r[1] > 0:
            avg = r[1] / r[2]
            print r[0] + ' ' + str(avg)
        else:
            avg = r[3] / r[4]
            print r[0] + ' ' + str(avg)

def hw(sent_file, tweet_file):
    scores = {}
    sent_file = open(sent_file, 'r')
    for line in sent_file:
        term, score = line.split("\t")
        scores[term] = int(score)
    recored_affin = []
    #print scores.items()
    data = []
    with open(tweet_file, 'r') as f:
        for line in f:
            data.append(json.loads(line))
        #print data[4]['text']
        for tweet in data:
            total = 0
            if 'text' in tweet:
                for k, v in scores.iteritems():
                    #print tweet['text']
                    num_of_aff = len(re.findall(k, tweet['text']))
                    if num_of_aff > 0:
                        #print "Number is: " + str(num_of_aff)
                        #print "Word is: " + k
                        #print "Tweet is: " + tweet['text']
                        total += (v * num_of_aff)
                        #print "Score is: " + str(total)
                        #while count < len(recorded_affin):
                        foundRow = findRecord(k, recored_affin)
                        if foundRow != None:
                            index = recored_affin.index(foundRow)
                            quick_rec = recored_affin[index]
                            if v > 0:
                                new_value = quick_rec[1] + v
                                new_count = quick_rec[2] + 1
                                old_neg_value = 0
                                old_neg_count = 0
                                recored_affin.append([k, new_value, new_count, old_neg_value, old_neg_count])
                                recored_affin.remove(foundRow)
                            elif v < 0:
                                old_pos_value = 0
                                old_pos_count = 0
                                new_value = quick_rec[3] + v
                                new_count = quick_rec[4] + 1
                                recored_affin.append([k, old_pos_value, old_pos_count, new_value, new_count])
                                recored_affin.remove(foundRow)
                        else:
                            if v > 0:
                                recored_affin.append([k, v, 1, 0, 0])
                            elif v < 0:
                                recored_affin.append([k, 0, 0, v, 1])
            #print recored_affin
            ##print foundRow
            ##print total
    average_records(recored_affin)

def lines(fp):
    print str(len(fp.readlines()))

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    hw(sys.argv[1], sys.argv[2])
    #lines(sent_file)
    #lines(tweet_file)

if __name__ == '__main__':
    main()
```
Solution
Your code contains a mix of tabs and spaces. This caused your code to display incorrectly before I edited it. The most common convention in Python is to use only spaces. You should be able to configure your editor to insert spaces instead of tabs when you press the tab key.
```
import sys
import json
import re

def findRecord(key, records):
```

Python convention is to name functions `lowercase_with_underscores`.

```
    for r in records:
        if r[0] == key:
            return r
```

It is going to be inefficient to loop over records looking for things like this. Instead, use a dictionary and look them up by key.

```
def average_records(records):
    for r in records:
```

Rather than indexing into `r` all over the place, I suggest using:

```
for k, new_value, new_count, old_neg_value, old_neg_count in records:
```

Then you can access those names directly. It'll be easier to read and probably marginally faster.

```
        if r[1] > 0:
            avg = r[1] / r[2]
            print r[0] + ' ' + str(avg)
        else:
            avg = r[3] / r[4]
            print r[0] + ' ' + str(avg)
```

In this case, you can do `print r[0], avg` for the same result.

```
def hw(sent_file, tweet_file):
```

I have no idea what `hw` means.

```
    scores = {}
    sent_file = open(sent_file, 'r')
    for line in sent_file:
        term, score = line.split("\t")
        scores[term] = int(score)
```

This `scores` bit is a nicely self-contained section. I suggest making it a separate function.
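That refactoring might look something like the following sketch (written for Python 3; it assumes the sentiment file is tab-separated `term<TAB>score` lines, as in the original, and the function name `read_scores` is my own invention):

```python
def read_scores(lines):
    """Parse tab-separated "term<TAB>score" lines into a dict of term -> int."""
    scores = {}
    for line in lines:
        term, score = line.split("\t")
        scores[term] = int(score)  # int() tolerates the trailing newline
    return scores

# Works on any iterable of lines, so it is easy to test without a real file:
scores = read_scores(["sad\t-2\n", "happy\t3\n"])
print(scores["sad"])  # -2
```

Because it accepts any iterable of lines, the caller can pass an open file object inside a `with` block, which also keeps the file's lifetime obvious.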
```
    recored_affin = []
    #print scores.items()
```

Don't keep dead code, just remove it. If you think you might need it back, look into version control.

```
    data = []
    with open(tweet_file, 'r') as f:
        for line in f:
            data.append(json.loads(line))
        #print data[4]['text']
```

You're finished with the file now; you should really drop out of the `with` block.

```
        for tweet in data:
```

There isn't really much point in storing the json objects in a list just to process them. Just process them as you get them.
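A minimal sketch of that streaming shape (Python 3; `process_tweet` is a hypothetical placeholder for the scoring logic, and `io.StringIO` stands in for the real tweet file):

```python
import io
import json

def process_tweet(tweet):
    # Placeholder for the real per-tweet scoring logic.
    return tweet.get('text', '')

# A file-like object simulating a file with one JSON tweet per line.
f = io.StringIO('{"text": "so sad"}\n{"lang": "en"}\n')

texts = []
for line in f:                 # only one tweet is in memory at a time
    tweet = json.loads(line)
    texts.append(process_tweet(tweet))

print(texts)  # ['so sad', '']
```

The memory cost stays constant in the number of tweets, instead of growing with the size of the input file.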
```
            total = 0
            if 'text' in tweet:
                for k, v in scores.iteritems():
                    #print tweet['text']
                    num_of_aff = len(re.findall(k, tweet['text']))
                    if num_of_aff > 0:
                        #print "Number is: " + str(num_of_aff)
                        #print "Word is: " + k
                        #print "Tweet is: " + tweet['text']
                        total += (v * num_of_aff)
                        #print "Score is: " + str(total)
                        #while count < len(recorded_affin):
```

Don't leave commented code in there.

```
                        foundRow = findRecord(k, recored_affin)
                        if foundRow != None:
```

Use `is None` to check for `foundRow`.

```
                            index = recored_affin.index(foundRow)
```

That's going to be expensive; it scans through the whole list again.

```
                            quick_rec = recored_affin[index]
```

Isn't this just `foundRow` again?

```
                            if v > 0:
                                new_value = quick_rec[1] + v
                                new_count = quick_rec[2] + 1
                                old_neg_value = 0
                                old_neg_count = 0
                                recored_affin.append([k, new_value, new_count, old_neg_value, old_neg_count])
                                recored_affin.remove(foundRow)
```

Expensive, has to scan through the whole list again.

```
                            elif v < 0:
                                old_pos_value = 0
                                old_pos_count = 0
                                new_value = quick_rec[3] + v
                                new_count = quick_rec[4] + 1
                                recored_affin.append([k, old_pos_value, old_pos_count, new_value, new_count])
                                recored_affin.remove(foundRow)
```

You've got some duplication here; you should move the common logic out of the `if` blocks.

```
                        else:
                            if v > 0:
                                recored_affin.append([k, v, 1, 0, 0])
                            elif v < 0:
                                recored_affin.append([k, 0, 0, v, 1])
            #print recored_affin
            ##print foundRow
            ##print total
    average_records(recored_affin)

def lines(fp):
    print str(len(fp.readlines()))

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
```

Why do you open these files but never do anything with them?

```
    hw(sys.argv[1], sys.argv[2])
    #lines(sent_file)
    #lines(tweet_file)

if __name__ == '__main__':
    main()
```

Your speed issues are probably the result of using a list and constantly searching over the whole list instead of using a dictionary. Make `recored_affin` a dictionary, and your code should be simpler and faster.
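A sketch of that final suggestion (Python 3; the match data here is made up for illustration, and the original's separate positive/negative tallies are collapsed into one running total to keep the example short — the same idea applies with a richer value per key):

```python
from collections import defaultdict

# term -> [sum_of_scores, match_count]; defaultdict removes the
# "is this term already recorded?" branch entirely.
totals = defaultdict(lambda: [0, 0])

matches = [("sad", -2), ("sad", -2), ("happy", 3)]  # (term, score) per match
for term, score in matches:
    entry = totals[term]     # O(1) lookup; no findRecord/index/remove scans
    entry[0] += score
    entry[1] += 1

for term, (total, count) in sorted(totals.items()):
    print(term, total / count)   # prints: happy 3.0, then sad -2.0
```

Each update is a constant-time dictionary operation instead of three linear scans of the list, so the overall cost drops from quadratic to linear in the number of recorded terms.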
Context
StackExchange Code Review Q#26077, answer score: 5