Searching for a string in a downloaded PDF


Problem

This code goes to the website containing the PDF, downloads the PDF, and converts it to text. Finally, it reads the whole text file (over 5000 lines) into a list, line by line, and searches for my name in it.

import mechanize
import os
br = mechanize.Browser()
br.open("http://sttm.org/Bulletin/Bulletins2015/tabid/483/LinkClick.aspx?fileticket=WkjzL8SuMQQ%3d&tabid=483&portalid=0&mid=979")
print br.title()
pdfs = []
for i in br.links():
    if "Sunday," in i.text:
        pdfs.append("http://sttm.org/Bulletin/Buletins2015/" + i.url)
br.retrieve(pdfs[0], "file.pdf")
input = "file.pdf"
output = "out.txt"
os.system(("ps2ascii %s %s") %(input, output))
with open("out.txt") as f:
    list = f.readlines()
scheduled = False
for i in list:
    if "My Name Here" in i:
        scheduled = True;
if scheduled == True:
    print "Yes"
else:
    print "No"


It's a heck of a lot of code for such a simple task. Also, because it reads in all 5000 lines and checks them one by one, it takes quite a long time to run.

Solution

First of all, standard library imports and third-party library imports should be kept in separate groups:

from subprocess import PIPE, Popen
from urlparse import urljoin

import mechanize


It's better to put your code in a function so that you can reuse it with different URLs, search keywords, etc. Further explanation is in the comments:

def pdf_contains(url, file_url, search_key, keyword=''):
    br = mechanize.Browser()
    br.open(url)
    print br.title()
    # As you're interested only in the first link whose text contains
    # the keyword ('Sunday,' in this case), it's better to use next()
    # with a generator expression. `next()` returns the first item from
    # the generator if there is one; otherwise it returns the default, None.
    # Also, `urlparse.urljoin` comes in handy for joining URLs.

    pdf = next((urljoin(file_url, f.url) for f in br.links()
               if keyword in f.text), None)

    # Now, instead of downloading the file using `br.retrieve`, we
    # can simply get a file-like object using `mechanize.urlopen`,
    # which we can then pass to the subprocess's STDIN.

    if pdf is not None:
        data = mechanize.urlopen(pdf)

        # Now, instead of running `os.system` and storing the output
        # in a file, we can use `subprocess.Popen` and read the
        # output of the ps2ascii command from a PIPE.

        proc = Popen(['ps2ascii'], stdin=data, stdout=PIPE)

        # Now simply read the data line by line from the PIPE and
        # check for the search_key; if found, return immediately.
        for line in iter(proc.stdout.readline, ''):
            if search_key in line:
                return True
    return False

url = ("http://sttm.org/Bulletin/Bulletins2015/tabid/483/Link"
       "Click.aspx?fileticket=WkjzL8SuMQQ%3d&tabid=483&portal"
       "id=0&mid=979")

scheduled = pdf_contains(url,
                         "http://sttm.org/Bulletin/Buletins2015/",
                         "My Name Here",
                         "Sunday,")

print scheduled
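
To see the `next()`-with-default idiom and `urljoin` in isolation, here is a minimal sketch that needs no network access. The `links` list, the `first_matching_url` helper and the `example.com` base URL are hypothetical stand-ins for what `br.links()` would return; they are not part of the original code.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the answer

# Hypothetical (text, url) pairs standing in for the link objects
# that br.links() would yield.
links = [
    ("Home", "index.html"),
    ("Sunday, March 1", "bulletin-0301.pdf"),
    ("Sunday, March 8", "bulletin-0308.pdf"),
]


def first_matching_url(base_url, links, keyword):
    """Return the absolute URL of the first link whose text contains
    keyword, or None if no link matches."""
    return next(
        (urljoin(base_url, url) for text, url in links if keyword in text),
        None,
    )


base = "http://example.com/bulletins/"
print(first_matching_url(base, links, "Sunday,"))
print(first_matching_url(base, links, "Monday,"))
```

The generator is consumed lazily, so no links after the first match are examined. Note that the trailing slash on the base URL matters: without it, `urljoin` replaces the last path segment instead of appending to it.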

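The early-exit scan over a pipe can also be tried without `ps2ascii` installed. The sketch below is only an illustration under stated assumptions: a child Python process that prints 5000 lines stands in for the converter, and `output_contains` is a hypothetical helper name. It is written for Python 3; on Python 2, iterating over `proc.stdout` directly uses a read-ahead buffer, which is why the answer uses `iter(proc.stdout.readline, '')` instead.

```python
import subprocess
import sys


def output_contains(cmd, search_key):
    """Run cmd and return True as soon as one line of its stdout
    contains search_key; return False if no line does."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            if search_key in line:
                return True
        return False
    finally:
        # On an early return the child may still be running:
        # stop it, close our end of the pipe, and reap the process.
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
        proc.wait()


# Stand-in for ps2ascii: a child process that prints 5000 lines.
cmd = [sys.executable, "-c",
       "for i in range(5000): print('line %d' % i)"]

print(output_contains(cmd, "line 42"))
print(output_contains(cmd, "My Name Here"))
```

Cleaning up in the `finally` block avoids leaving a child process behind when the search returns before the output has been fully read.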
Context

StackExchange Code Review Q#78692, answer score: 4
