Searching for a string in a downloaded PDF
Problem
This code goes to the website containing the PDF, downloads the PDF, and converts it to text. Finally, it reads the whole file (over 5000 lines) into a list, line by line, and searches for my name in it.
It's a heck of a lot of code for such a simple task. Also, because it reads all 5000 lines one by one, it takes quite a long time to run.
import mechanize
import os

br = mechanize.Browser()
br.open("http://sttm.org/Bulletin/Bulletins2015/tabid/483/LinkClick.aspx?fileticket=WkjzL8SuMQQ%3d&tabid=483&portalid=0&mid=979")
print br.title()
pdfs = []
for i in br.links():
    if "Sunday," in i.text:
        pdfs.append("http://sttm.org/Bulletin/Buletins2015/" + i.url)
br.retrieve(pdfs[0], "file.pdf")
input = "file.pdf"
output = "out.txt"
os.system(("ps2ascii %s %s") %(input, output))
with open("out.txt") as f:
    list = f.readlines()
scheduled = False
for i in list:
    if "My Name Here" in i:
        scheduled = True;
if scheduled == True:
    print "Yes"
else:
    print "No"
Solution
First of all, standard library imports and third-party library imports should be kept separate:
from subprocess import PIPE, Popen
from urlparse import urljoin

import mechanize
It's better to put your code in a function so that you can re-use it with different URLs, search keywords, etc. Further explanation is in the comments:
def pdf_contains(url, file_url, search_key, keyword=''):
    br = mechanize.Browser()
    br.open(url)
    print br.title()

    # As you're interested only in the first link whose text contains
    # the keyword ('Sunday,' in this case), it's better to use next()
    # with a generator expression. `next()` returns the first item from
    # the generator if there is one; otherwise it returns the default, None.
    # Also, `urlparse.urljoin` comes in handy for joining URLs.
    pdf = next((urljoin(file_url, f.url) for f in br.links()
                if keyword in f.text), None)

    # Now, instead of downloading the file using .retrieve(), we
    # can simply get a file-like object using `mechanize.urlopen`,
    # which we can then pass to the subprocess's STDIN.
    if pdf is not None:
        data = mechanize.urlopen(pdf)

        # Instead of running `os.system` and storing the output
        # in a file, we can use `subprocess.Popen` and read the
        # output of the ps2ascii command from a PIPE.
        proc = Popen(['ps2ascii'], stdin=data, stdout=PIPE)

        # Now simply read the data line by line from the PIPE and
        # check for the search_key; if found, return immediately.
        for line in iter(proc.stdout.readline, ''):
            if search_key in line:
                return True
    return False


# The page URL is a single string; the base URL for the PDF links is
# passed separately as file_url.
url = ("http://sttm.org/Bulletin/Bulletins2015/tabid/483/Link"
       "Click.aspx?fileticket=WkjzL8SuMQQ%3d&tabid=483&portal"
       "id=0&mid=979")
scheduled = pdf_contains(url,
                         "http://sttm.org/Bulletin/Buletins2015/",
                         "My Name Here",
                         "Sunday,")
print scheduled
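The comments in the function rely on three small building blocks: `next()` with a default, `urljoin`, and the two-argument form of `iter()`. The sketch below tries them in isolation, without mechanize or ps2ascii. It is only an illustration under a few assumptions: it is written for Python 3 (where `urljoin` lives in `urllib.parse` and pipes yield bytes, so the sentinel is `b""` rather than `''`), the helper names `first_matching_link` and `stream_contains` are hypothetical, the `example.com` links are made up, and an `io.BytesIO` object stands in for the subprocess pipe.

```python
import io
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def first_matching_link(links, base_url, keyword):
    """Return the absolute URL of the first (text, href) pair whose text
    contains keyword, or None when nothing matches."""
    return next(
        (urljoin(base_url, href) for text, href in links if keyword in text),
        None,
    )


def stream_contains(stream, search_key):
    """Read a binary file-like object line by line; stop at the first hit."""
    # iter(callable, sentinel) keeps calling readline() until it returns
    # the sentinel, here the empty bytes object that signals end of stream.
    for line in iter(stream.readline, b""):
        if search_key in line:
            return True
    return False


# Hypothetical data standing in for br.links() and the ps2ascii output.
links = [("Home", "index.html"), ("Sunday, May 3", "may3.pdf")]
base = "http://example.com/bulletins/"
pdf_url = first_matching_link(links, base, "Sunday,")
missing = first_matching_link(links, base, "Monday,")

text = io.BytesIO(b"Lector: Someone Else\nUsher: My Name Here\nunread line\n")
found = stream_contains(text, b"My Name Here")
unread = text.read()  # whatever the early return left unconsumed
```

Because `stream_contains` returns as soon as it sees a match, the last line of the stand-in stream is never read by the loop, which is the same reason the function above can stop before the whole converted PDF has been scanned.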
Context
StackExchange Code Review Q#78692, answer score: 4