Holding records using dictionary
Problem
Can you please help me with the following script?
At the moment the script is taking up to 20 mins to execute, depending on the amount of data being processed (each time the script is executed, it processes several thousand lines of data). I would like to make the script more responsive. Can you tell me how I can update the script for better performance?
See comments in each of the main parts of the script:
```
#!/opt/SP/mdp/home/SCRIPTS/tools/Python-2.6.2/bin/python
import glob
import re as regex
from datetime import datetime

# Here I am getting two sets of files which I am going to use to create two separate dictionaries
Cnfiles = glob.glob('/logs/split_logs/Cn*_generic_activity.log')
Prfiles = glob.glob('/logs/split_logs/Pr*_generic_activity.log')

# Output file
log = file('/logs/split_logs/processed_data.log', 'w')

# First dictionary, holds received records
Cn = {}
for logfile in Cnfiles:
    with open(logfile) as logfile:
        filecontent = logfile.xreadlines()
        for line in filecontent:
            if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
                line = line.replace('<', '')
                line = line.replace('>', '')
                line = line.replace('.', ' ')
                line = line.replace(r'|', ' ')
                line = line.strip()
                field = line.split(' ')
                opco = field[4]
                service = field[5]
                status = field[6]
                jarid = field[10]
                Cn.setdefault(opco, {}).setdefault(service, {}).setdefault(status, {})[jarid] = jarid

# Second dictionary, holds the various stages the records go through
Pr = {}
for logfile in Prfiles:
    with open(logfile) as logfile:
        filecontent = logfile.xreadlines()
        for line in filecontent:
            if 'status 7 to 13' in line or 'status 9 to 13' in line or 'status 7 to 14' in line or 'status 9 to 14' in line or 'status 5 to 504' in line or 'status 7 to 505' in line:
```
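To make the data structure concrete, here is a minimal sketch of what the `setdefault` chain in the first loop builds. The log line below is invented for illustration, since the real field layout isn't shown; only the field positions (4, 5, 6, 10) follow the script.

```python
# Hypothetical log line; the real format isn't shown in the question.
# After cleanup, field[4]=opco, field[5]=service, field[6]=status, field[10]=jarid.
line = "2011-08-01 12:00:00 host app OPCO1 SERV1 RECV x y z JAR123"
field = line.split(' ')

Cn = {}
Cn.setdefault(field[4], {}).setdefault(field[5], {}).setdefault(field[6], {})[field[10]] = field[10]

# The result is a four-level mapping: opco -> service -> status -> {jarid: jarid}
print(Cn)  # {'OPCO1': {'SERV1': {'RECV': {'JAR123': 'JAR123'}}}}
```

Each `setdefault` call inserts an empty dict only when the key is missing, so repeated lines for the same opco/service/status accumulate their jarids in the innermost dict.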
Solution

This:

```
jarid in Pr['13'].keys()
```

gets the keys of the dict as a list and then searches the list, which loses the speed advantage of a dict. Searching a list is O(n), whereas searching a dict is O(1). It's faster to do this:

```
jarid in Pr['13']
```

This:

```
if 'SERV1' in line and 'RECV' in line or 'SERV2' in line and 'RECV' in line or 'SERV3' in line and 'RECV' in line:
```

is equivalent to this:

```
if ('SERV1' in line or 'SERV2' in line or 'SERV3' in line) and 'RECV' in line:
```

This:

```
line = line.replace('<', '')
line = line.replace('>', '')
line = line.replace('.', ' ')
line = line.replace(r'|', ' ')
```

can be shortened to this:

```
line = regex.sub(',|]', '', line)
```

This:

```
line = line.replace('<', '')
line = line.replace('>', '')
line = line.replace('.', ' ')
```

can be shortened to this:

```
line = regex.sub('[<>.]', '', line)
```

Pre-compiling the regexes may give a little speed improvement, but not much, because the regexes are cached by the re module.
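Putting these suggestions together, the tightened inner loop might look like the sketch below. The sample lines are invented (the real log format isn't shown), and the pattern is pre-compiled to illustrate the closing remark; this is a sketch under those assumptions, not the answer's exact rewrite.

```python
import re

# Compile once, outside the loop. The re module also caches compiled
# patterns internally, so this is only a minor win.
cleanup = re.compile(r'[<>]')

# Hypothetical log lines for illustration; field positions follow the script.
lines = [
    "a b c d OPCO1 SERV1 RECV x y z JAR1",
    "a b c d OPCO1 SERV9 SENT x y z JAR2",  # no 'RECV': filtered out
]

Cn = {}
for line in lines:
    # Factored condition: 'RECV' is tested once instead of three times.
    if ('SERV1' in line or 'SERV2' in line or 'SERV3' in line) and 'RECV' in line:
        # One sub replaces the chained str.replace calls for '<' and '>'.
        line = cleanup.sub('', line).replace('.', ' ').replace('|', ' ').strip()
        field = line.split(' ')
        Cn.setdefault(field[4], {}).setdefault(field[5], {}).setdefault(field[6], {})[field[10]] = field[10]

# Membership tests go straight against the dict: O(1), no .keys() list.
print('JAR1' in Cn['OPCO1']['SERV1']['RECV'])  # True
```

The final line is the key point of the answer: testing `jarid in some_dict` uses the hash table directly, while `jarid in some_dict.keys()` (in Python 2) first materializes a list and then scans it linearly.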
Context
StackExchange Code Review Q#3454, answer score: 8