Extract Pages from PDF based on search in Python
Problem
Everything works fine except the run time.
It takes a long time on my file, which contains 1000 pages, about 100 of which are pages of interest.
Updated code after suggestions is at:
https://gist.github.com/pra007/099f10b07be5b7126a36438c67ad7a1f
import glob
import os
import re

from PyPDF2 import PdfFileReader, PdfFileWriter


# Find the pages whose text matches any of the search patterns.
def findText(f, slist):
    # Compile once, case-insensitively, instead of lowercasing on every page.
    patterns = [re.compile(s, re.IGNORECASE) for s in slist]
    pages = []
    with open(f, 'rb') as file:
        pdfDoc = PdfFileReader(file)
        for i in range(pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText()
            if any(p.search(content) for p in patterns):
                pages.append(i)
    return pages


# Extract the given pages into a new PDF.
def extractPage(f, fOut, pages):
    with open(f, 'rb') as file:
        output = PdfFileWriter()
        pdfOne = PdfFileReader(file)
        for i in pages:
            output.addPage(pdfOne.getPage(i))
        # Write while the input file is still open.
        with open(fOut, "wb") as outputStream:
            output.write(outputStream)


os.chdir(r"path\to\mydir")
stringList = ["string1", "string2"]
for pdfFile in glob.glob("*.pdf"):
    print(pdfFile)
    outPdfFile = pdfFile.replace(".pdf", "_searched_extracted.pdf")
    extractPage(pdfFile, outPdfFile, findText(pdfFile, stringList))
Solution
You could try profiling, but the code is simple enough that I think you're spending most of the time in PyPDF2 code. Two options:
- You can preprocess your PDF files and store their extracted text somewhere. This makes the search phase much faster, especially if you run multiple queries on the same PDF files.
- You can try another parser, such as a Python 3 version of PDFMiner (for example pdfminer.six), or even a parser written in a faster language.
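The first option could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names (`extract_page_texts`, `load_page_texts`, `find_pages`) and the `.pages.json` cache file next to each PDF are hypothetical, and the extractor reuses the same PyPDF2 calls as the question. The idea is to pay the text extraction cost once per file and run every later search against the cached text.

```python
import json
import os
import re


def extract_page_texts(pdf_path):
    """Return the text of every page, using the same PyPDF2 calls as the question."""
    # Imported lazily: PyPDF2 is only needed when the cache has to be (re)built.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as handle:
        reader = PdfFileReader(handle)
        return [reader.getPage(i).extractText()
                for i in range(reader.getNumPages())]


def load_page_texts(pdf_path, extractor=extract_page_texts):
    """Load per-page text from a JSON cache, rebuilding it if missing or stale."""
    cache_path = pdf_path + ".pages.json"  # hypothetical naming scheme
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(pdf_path)):
        with open(cache_path, encoding="utf-8") as handle:
            return json.load(handle)
    texts = extractor(pdf_path)
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(texts, handle)
    return texts


def find_pages(page_texts, patterns):
    """Return the indices of pages matching any of the regular expressions."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [i for i, text in enumerate(page_texts)
            if any(c.search(text) for c in compiled)]
```

With this in place, something like `find_pages(load_page_texts(pdfFile), stringList)` would replace the `findText` call. Only the first run on a given file goes through PyPDF2 text extraction, while writing the selected pages to the output PDF is unchanged.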
Context
StackExchange Code Review Q#140719, answer score: 4