PDF processing by PDFBox app jar
Problem
I want to convert a large PDF file into individual text files with PDFBox using Python. This code takes a lot of time to convert a PDF containing 1000 pages.
Is there anything I can do to improve it?
I cannot use any other library. My other code (Python, parsing the extracted text) works best with the output of PDFBox.
    import PyPDF2
    import sys
    import os
    import subprocess

    pdfFileObj = open('cpdf.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    tot = pdfReader.numPages
    for i in range(1, tot+1):
        i = str(i)
        ffi = "out\\" + i + "-extracted.txt"
        command = "java -jar" + " " + "pdfbox-app-2.0.2.jar" + " " + "ExtractText" + " " + "cpdf.pdf" + " " + ffi + " " + "-startPage" + " " + i + " " + "-endPage" + " " + i
        subprocess.check_output(command, shell=True)
        print('Completed ' + str(i))

On the other hand, when I do this, it takes 10 seconds:

    import subprocess

    command = "java -jar pdfbox-app-2.0.2.jar ExtractText cpdf.pdf out.txt -startPage 1 -endPage 3158"
    subprocess.check_output(command, shell=True)

Solution
- You should use `with..as` to open a file to make sure it is closed again, even if your program is interrupted:

        with open('cpdf.pdf', 'rb') as pdfFileObj:
            ...

  If your mode were only `r`, it could be omitted, as it is the default; however, since the file is binary, we need `'rb'` here.
- Building your command can be greatly simplified using `format`, leaving out the slow string additions:

        command = "java -jar pdfbox-app-2.0.2.jar ExtractText cpdf.pdf {0} -startPage {1} -endPage {1}".format(ffi, i)

- Pull the `command` variable out of the loop and make it a constant.
- Get rid of the intermediate variable `ffi` (which is also a very bad variable name).
- Choose better variable names, like `page`, instead of `i`.
- Follow PEP8 for variable names: use `lower_case`, not `camelCase`.
- Make the file name a variable (this allows passing it in via `sys.argv` or similar at some point).
- Use `format`'s ability to replace named placeholders, together with `locals()`, which returns a dictionary of the locally defined variables. Use dictionary unpacking (`**dict`) to connect the two.
- If you are using Python 2.x (which does not seem to be the case, judging from your `print()`), use `xrange` instead of `range`. `range` builds a list with 1000 elements before iterating, whereas `xrange` produces the values lazily.

Final code:

    import sys
    import PyPDF2
    import subprocess

    COMMAND = "java -jar pdfbox-app-2.0.2.jar ExtractText {file_name} out\\{page}-extracted.txt -startPage {page} -endPage {page}"

    file_name = sys.argv[1] if len(sys.argv) == 2 else "cpdf.pdf"

    with open(file_name, "rb") as pdf_file:
        pages = PyPDF2.PdfFileReader(pdf_file).numPages

    for page in range(1, pages + 1):
        subprocess.check_output(COMMAND.format(**locals()), shell=True)
        print('Completed {}'.format(page))

Alternatively, avoid the calls to `locals()` by passing the values explicitly:

    subprocess.check_output(COMMAND.format(page=page, file_name=file_name), shell=True)

If the code were longer I would put it into a function. In that case `COMMAND` should be defined inside that function, because local variable lookup is faster than global variable lookup.

With functions:
    import sys
    import PyPDF2
    import subprocess


    def run(file_name, page):
        command = "java -jar pdfbox-app-2.0.2.jar ExtractText {file_name} out\\{page}-extracted.txt -startPage {page} -endPage {page}"
        return subprocess.check_output(command.format(page=page, file_name=file_name), shell=True)


    def number_of_pages(file_name):
        with open(file_name, "rb") as pdf_file:
            return PyPDF2.PdfFileReader(pdf_file).numPages


    if __name__ == "__main__":
        file_name = sys.argv[1] if len(sys.argv) == 2 else "cpdf.pdf"
        pages = number_of_pages(file_name)
        for page in range(1, pages + 1):
            run(file_name, page)
            print('Completed {}'.format(page))

Regarding your timing: it is probably slow because every iteration starts the Java program again, so its initialization time is paid once per page instead of once per document. Maybe you can split the text obtained from the combined command somehow?
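If splitting the combined output turns out not to be practical, a different mitigation is to keep one Java call per page but run several of them at the same time, so the startup delays overlap instead of adding up. The following is only a sketch of that idea and not something I have timed. `build_command`, `extract_pages`, the `workers` default of 4 and the injectable `runner` parameter are names and choices made up for this illustration; the jar name and the `ExtractText` arguments are copied from the question. It also passes the command as a list, which makes `shell=True` unnecessary.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

JAR = "pdfbox-app-2.0.2.jar"  # jar name taken from the question


def build_command(file_name, page):
    """Return the argument list that extracts a single page.

    Same argument order as the command in the question. The "out"
    directory is assumed to exist already.
    """
    out_name = os.path.join("out", "{}-extracted.txt".format(page))
    return ["java", "-jar", JAR, "ExtractText", file_name, out_name,
            "-startPage", str(page), "-endPage", str(page)]


def extract_pages(file_name, pages, workers=4, runner=subprocess.check_call):
    """Run one ExtractText process per page, at most `workers` at a time.

    `runner` is called once with each argument list. It defaults to
    subprocess.check_call and is a parameter only so that the function
    can be tried out without Java installed.
    """
    commands = [build_command(file_name, page) for page in range(1, pages + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes the results, so an exception raised by any
        # run is re-raised here instead of being silently dropped.
        list(pool.map(runner, commands))
    return commands
```

With the `number_of_pages` function from above, the call would be `extract_pages(file_name, number_of_pages(file_name))`. Threads are enough here because each one just waits for its child process to finish. How much this helps depends on the number of cores and on how heavy each JVM start is, so the value of `workers` is something to experiment with.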
Context
StackExchange Code Review Q#137820, answer score: 7