PDF processing by PDFBox app jar

Submitted by: @import:stackexchange-codereview

Problem

I want to convert a large PDF file into individual text files, one per page, with PDFBox using Python. This code takes a lot of time to convert a PDF containing 1000 pages.

Is there anything I can do to improve it?

I cannot use any other library. My other code (Python, parsing the extracted text) works best with the output of PDFBox.

import PyPDF2
import sys
import os
import subprocess

pdfFileObj = open('cpdf.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
tot = pdfReader.numPages
for i in range(1, tot+1):
    i = str(i)
    ffi = "out\\" + i + "-extracted.txt"
    command = "java -jar" + " " + "pdfbox-app-2.0.2.jar" + " " + "ExtractText" + " " + "cpdf.pdf" + " " + ffi + " " + "-startPage" + " " + i + " " + "-endPage" + " " + i
    subprocess.check_output(command, shell=True)
    print('Completed ' + str(i))


On the other hand, when I do this, it takes 10 seconds:

import subprocess

command = "java -jar pdfbox-app-2.0.2.jar ExtractText cpdf.pdf  out.txt -startPage 1 -endPage 3158"

subprocess.check_output(command, shell=True)

Solution

You should use with ... as to open a file, so that it is closed again even if an exception interrupts your program:

with open('cpdf.pdf', 'rb') as pdfFileObj:
    ...


If your mode were only 'r', it could be omitted, as it is the default. Since the file is binary, however, we need 'rb' here.
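
A minimal sketch of both points, using a throwaway temporary file rather than the real PDF (the file contents and the simulated error are made up for illustration): the file is closed even though the body of the with block raises, and 'rb' yields bytes.

```python
import os
import tempfile

# Create a small throwaway binary file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
with open(path, "wb") as handle:
    handle.write(b"%PDF-1.4 demo bytes")

try:
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(4)  # bytes, because of the 'b' in the mode
        raise RuntimeError("simulated failure inside the with block")
except RuntimeError:
    pass

# The with statement closed the file even though the body raised.
print(pdf_file.closed, header)  # prints: True b'%PDF'
os.remove(path)
```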

Building your command can be greatly simplified using str.format, which replaces the long chain of string additions:

command = "java -jar pdfbox-app-2.0.2.jar ExtractText cpdf.pdf {0} -startPage {1} -endPage {1}".format(ffi, i)


Pull the command variable out of the loop and make it a constant.

Get rid of the intermediate variable ffi (which is also a very bad variable name).

Choose better variable names, like page, instead of i.

Follow PEP 8 for variable names: use lower_case, not camelCase.

Make the file name a variable (allows passing it in via sys.argv or similar at some point).

Use format's awesome features of replacing named place-holders and locals() to get a dictionary of locally defined variables. Use dictionary unpacking (**dict) to connect the two.

If you are using Python 2.x (which does not seem to be the case, judging from your print()), use xrange instead of range. In Python 2, range builds a list with 1000 elements before iterating, whereas xrange produces its values lazily, one at a time.
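
To illustrate the named place-holders and the locals() unpacking in isolation, here is a small sketch. The shortened TEMPLATE and the render function are hypothetical names used only for this demonstration; no Java process is started.

```python
TEMPLATE = "ExtractText {file_name} {page}-extracted.txt -startPage {page} -endPage {page}"


def render(file_name, page):
    unused = "ignored"  # extra local names are simply not referenced by the template
    # locals() returns a dict of this function's local names;
    # ** unpacks that dict into keyword arguments for format().
    return TEMPLATE.format(**locals())


# Passing the same values explicitly gives the same string.
explicit = TEMPLATE.format(file_name="cpdf.pdf", page=7)
print(render("cpdf.pdf", 7) == explicit)  # prints: True
```

Each named place-holder can appear several times in the template and is filled from the keyword argument of the same name. Keyword arguments that the template does not mention are ignored.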

Final code:

import sys
import PyPDF2
import subprocess

COMMAND = "java -jar pdfbox-app-2.0.2.jar ExtractText {file_name} out\\{page}-extracted.txt -startPage {page} -endPage {page}"

file_name = sys.argv[1] if len(sys.argv) == 2 else "cpdf.pdf"
with open(file_name, "rb") as pdf_file:
    pages = PyPDF2.PdfFileReader(pdf_file).numPages

for page in range(1, pages + 1):
    subprocess.check_output(COMMAND.format(**locals()), shell=True)
    print('Completed {}'.format(page))


Alternatively, pass the values explicitly, which saves the call to locals():

subprocess.check_output(COMMAND.format(page=page, file_name=file_name), shell=True)
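
Once the file name comes from sys.argv, it may contain spaces or characters the shell treats specially. One option, sketched here as a suggestion rather than as part of the reviewed code, is to pass the command as an argument list and drop shell=True. The helper names build_command and extract_page and their jar and out_dir parameters are hypothetical; the argument order simply mirrors the command used above.

```python
import os
import subprocess


def build_command(file_name, page, jar="pdfbox-app-2.0.2.jar", out_dir="out"):
    """Return the ExtractText invocation for one page as an argument list."""
    target = os.path.join(out_dir, "{}-extracted.txt".format(page))
    return ["java", "-jar", jar, "ExtractText", file_name, target,
            "-startPage", str(page), "-endPage", str(page)]


def extract_page(file_name, page):
    # No shell is involved, so spaces or shell metacharacters in
    # file_name reach Java as part of a single argument.
    return subprocess.check_output(build_command(file_name, page))
```

Because build_command only assembles the list, it can be checked without Java installed, for example with print(build_command("my file.pdf", 3)).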


If the code were longer, I would put it into a function. In that case COMMAND should be defined inside that function, because local variable lookup is faster than global variable lookup (a small effect here, since each iteration is dominated by launching Java).

With functions:

import sys
import PyPDF2
import subprocess

def run(file_name, page):
    command = "java -jar pdfbox-app-2.0.2.jar ExtractText {file_name} out\\{page}-extracted.txt -startPage {page} -endPage {page}"
    return subprocess.check_output(command.format(page=page, file_name=file_name), shell=True)

def number_of_pages(file_name):
    with open(file_name, "rb") as pdf_file:
        return PyPDF2.PdfFileReader(pdf_file).numPages

if __name__ == "__main__":
    file_name = sys.argv[1] if len(sys.argv) == 2 else "cpdf.pdf"
    pages = number_of_pages(file_name)

    for page in range(1, pages + 1):
        run(file_name, page)
        print('Completed {}'.format(page))


Regarding your timing: it is probably slow because you start the Java program once per page, so the JVM start-up and initialization time is paid, and the PDF is loaded again, on every iteration. Maybe you can run the combined command once and split the text it produces into pages somehow?
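
If the combined output turns out to contain a recognizable page delimiter, the splitting could look roughly like the sketch below. This is an assumption-heavy illustration: nothing above confirms that ExtractText writes a form feed (or any other marker) between pages, so the default delimiter is a placeholder to check against real output, and split_pages and write_pages are hypothetical helper names.

```python
import os


def split_pages(text, delimiter="\f"):
    """Split combined text into per-page strings on the delimiter.

    ASSUMPTION: the delimiter actually occurs between pages in the
    extracted text; verify this against real PDFBox output first.
    """
    pages = text.split(delimiter)
    # A trailing delimiter leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def write_pages(pages, out_dir="out"):
    """Write page N to out_dir/N-extracted.txt (numbered from 1); return the paths."""
    os.makedirs(out_dir, exist_ok=True)  # requires Python 3.2+
    paths = []
    for number, page_text in enumerate(pages, start=1):
        path = os.path.join(out_dir, "{}-extracted.txt".format(number))
        with open(path, "w", encoding="utf-8") as out_file:
            out_file.write(page_text)
        paths.append(path)
    return paths
```

With this approach Java would be started only once, for the single combined extraction, and the per-page files would be produced by Python afterwards.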

Context

StackExchange Code Review Q#137820, answer score: 7
