Extract Pages from PDF based on search in Python

Submitted by: @import:stackexchange-codereview

Problem

Everything works fine except the run time.
It takes a long time on my file, which contains 1000 pages, 100 of which are pages of interest.

import re
from PyPDF2 import PdfFileReader, PdfFileWriter
import glob, os

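# Find the zero-based indices of pages whose text matches any search pattern.
def findText(f, slist):
    # Compile each pattern once instead of on every page; IGNORECASE replaces lower().
    patterns = [re.compile(s, re.IGNORECASE) for s in slist]
    pages = []
    with open(f, 'rb') as file:  # the file is closed when the block exits
        pdfDoc = PdfFileReader(file)
        for i in range(pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText()
            # any() stops at the first matching pattern, so each page is added once.
            if any(p.search(content) for p in patterns):
                pages.append(i)
    return pages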

# Write the selected pages of f to a new PDF at fOut.
def extractPage(f, fOut, pages):
    output = PdfFileWriter()
    # The source file must stay open until the output has been written.
    with open(f, 'rb') as file:
        pdfOne = PdfFileReader(file)
        for i in pages:
            output.addPage(pdfOne.getPage(i))
        with open(fOut, "wb") as outputStream:
            output.write(outputStream)

os.chdir(r"path\to\mydir")
for pdfFile in glob.glob("*.pdf"):
    print(pdfFile)
    outPdfFile = pdfFile.replace(".pdf","_searched_extracted.pdf")
    stringList = ["string1", "string2"]
    extractPage(pdfFile, outPdfFile, findText(pdfFile, stringList))


The updated code, after suggestions, is at:

https://gist.github.com/pra007/099f10b07be5b7126a36438c67ad7a1f

Solution

You could try profiling, but the code is simple enough that I think you're spending most of the time in PyPDF2 code. Two options:

  • You can preprocess your PDF files to store their text somewhere, which will make the search phase much faster, especially if you run multiple queries on the same PDF files

  • You can try another parser such as a Python 3 version of PDFMiner, or even a parser written in a faster language
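
The first option could look roughly like the sketch below. It is only an illustration under assumptions: the names load_or_build_cache and find_pages, the ".pages.json" cache suffix, and the extract_pages callable are all hypothetical and do not come from the question or from PyPDF2. The extract_pages argument stands for whatever parser you use to turn a PDF into a list of per-page strings, so the slow extraction runs once per file and later searches only read the cached text.

```python
import json
import os
import re


def load_or_build_cache(pdf_path, extract_pages, cache_path=None):
    """Return the per-page texts for pdf_path, reading a JSON cache if one exists.

    extract_pages is a hypothetical callable (pdf_path -> iterable of str) that
    wraps whichever PDF parser you use; it only runs when no cache file exists.
    """
    cache_path = cache_path or pdf_path + ".pages.json"  # assumed naming scheme
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)
    texts = list(extract_pages(pdf_path))  # the slow step, done once per file
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(texts, fh)
    return texts


def find_pages(texts, patterns):
    """Return zero-based indices of pages matching any of the regex patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [i for i, text in enumerate(texts)
            if any(c.search(text) for c in compiled)]
```

With the question's code, extract_pages might simply loop over getPage(i).extractText(), and the list returned by find_pages could be passed to extractPage unchanged. Note that this sketch never invalidates the cache, so a PDF that changes on disk would need its cache file deleted by hand.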

Context

StackExchange Code Review Q#140719, answer score: 4
