patternpythonMinor
Extracting emails from a file and writing them to another file
Viewed 0 times
filewritinganotheremailsextractingandfromthem
Problem
The code below works fine for me. Is there any way that I can improve this code more? Note: delimiter is single whitespace only for email address.
getmail.py
Input file scratchmail.txt:
Output file mailout.txt:
getmail.py
#!/usr/bin/python
import re
handleone = open("scratchmail.txt", "r")
f1 = handleone.readlines()
handletwo = open("mailout.txt", "a+")
match_list = [ ]
for ArrayItem in f1:
match_list = re.findall(r'[\w\.-]+@[\w\.-]+', ArrayItem)
if(len(match_list)>0):
handletwo.write(match_list[0]+"\r\n")Input file scratchmail.txt:
tickets.cgi:5141|5141Z4063442957942364067088508|1219588377|1219588377||PND||abc@AABBCC.com|Mjpqafml|JgALKGXCuasMph|
tickets.cgi:358236|358236Z24989798132452492828304439|1433071308|1433071308||PND||xyz.abc@example.com|Edison|CpwxPARuQaqPR|
tickets.cgi:86805|86805Z25422507290694218605033173|1232345784|1232345784||PND||pytest@test.com|Agfaodki|jPdSNVlbXi|
…Output file mailout.txt:
abc@AABBCC.com
xyz.abc@example.com
pytest@test.comSolution
Working with file handles
You should always close the file handles that you open.
You didn't close any of them.
In addition,
you should use
so that they will be automatically closed when leaving the scope.
Checking if a list is empty
There is a much simpler way to check if a list is empty, instead of this:
The Pythonic way to write is simply this:
Unnecessary initializations and data storage
There's no need to initialize
You reassign it anyway inside.
You don't need to store all the lines from the first file into a list.
You can process the input line by line and write output line by line.
That can save a lot of memory, as you only need to keep one line in memory at a time,
not the entire input file.
Poor naming
The file handles are very poorly named.
and it doesn't describe what it really is.
But
Minor optimization
Instead of re-evaluating regular expressions in every loop,
I use
However, this might not be necessary,
as it seems recent versions of Python have a caching mechanism.
Suggested implementation
With the above suggestions applied, the code becomes simpler and cleaner:
You should always close the file handles that you open.
You didn't close any of them.
In addition,
you should use
with open(...) as ... syntax when working with files,so that they will be automatically closed when leaving the scope.
Checking if a list is empty
There is a much simpler way to check if a list is empty, instead of this:
if(len(match_list)>0):The Pythonic way to write is simply this:
if match_list:Unnecessary initializations and data storage
There's no need to initialize
match_list before the loops.You reassign it anyway inside.
You don't need to store all the lines from the first file into a list.
You can process the input line by line and write output line by line.
That can save a lot of memory, as you only need to keep one line in memory at a time,
not the entire input file.
Poor naming
The file handles are very poorly named.
handleone and handletwo don't convey that one is for input the other is for output.ArrayItem doesn't follow the recommended naming convention of snake_case (see PEP8),and it doesn't describe what it really is.
But
f1 wins the trophy of worst name in the posted code.lines would have been better.Minor optimization
Instead of re-evaluating regular expressions in every loop,
I use
re.compile to compile them first, to make all uses efficient.However, this might not be necessary,
as it seems recent versions of Python have a caching mechanism.
Suggested implementation
With the above suggestions applied, the code becomes simpler and cleaner:
import re
re_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
with open("input.txt") as fh_in:
with open("mailout.txt", "a+") as fh_out:
for line in fh_in:
match_list = re_pattern.findall(line)
if match_list:
fh_out.write(match_list[0]+"\r\n")Code Snippets
if(len(match_list)>0):if match_list:import re
re_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
with open("input.txt") as fh_in:
with open("mailout.txt", "a+") as fh_out:
for line in fh_in:
match_list = re_pattern.findall(line)
if match_list:
fh_out.write(match_list[0]+"\r\n")Context
StackExchange Code Review Q#93398, answer score: 7
Revisions (0)
No revisions yet.