Reading the bytes of a PDF

Submitted by: @import:stackexchange-codereview

Problem

I'm new to Python and I want to speed up this function, since it takes a very long time, especially when the input file is several megabytes in size. Also, I couldn't figure out how to use Cython in the for loop. I use this function together with other functions to compare files byte by byte. Any recommendations?

# this function returns a file's bytes in a list
filename1 = 'doc1.pdf'
def byte_target(filename1):
    f = open(filename1, "rb")
    try:
        b = f.read(1)
        tlist = []
        while True:
            # get file bytes
            t = ' '.join(format(ord(x), 'b') for x in b)
            b = f.read(1)
            if not b:
                break
            #add this byte to the list
            tlist.append(t)

            #print b        

    finally:
        f.close()
    return tlist

Solution

It's not surprising that this is too slow:
you're reading data byte-by-byte.
For faster performance you would need to read larger buffers at a time.
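
As a rough illustration of reading larger buffers, here is a minimal sketch. It assumes Python 3 (the code in the question looks like Python 2, where iterating over the data yields one-character strings rather than integers), and the function name and the 64 KiB chunk size are made up for the example rather than taken from the question.

```python
def file_bytes_as_binary(path, chunk_size=64 * 1024):
    """Return every byte of the file as a binary string, e.g. 37 -> '100101'."""
    binary_strings = []
    with open(path, "rb") as f:
        while True:
            # One read call fetches up to chunk_size bytes instead of a single byte.
            chunk = f.read(chunk_size)
            if not chunk:
                break  # an empty bytes object means end of file
            # In Python 3, iterating over bytes yields integers in range(256).
            binary_strings.extend(format(byte, "b") for byte in chunk)
    return binary_strings
```

Unlike the loop in the question, which breaks before appending the value computed for the final byte, this version includes the last byte of the file.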

If you want to compare files by content, use the filecmp module from the standard library.
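
A small sketch of what that could look like, assuming all you need is a yes/no answer on whether two files are identical. The wrapper name is hypothetical; filecmp.cmp and its shallow parameter are part of the standard library.

```python
import filecmp


def files_have_same_content(path_a, path_b):
    # shallow=False makes filecmp compare the actual file contents
    # instead of trusting the os.stat() signatures (type, size, mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Called as files_have_same_content('doc1.pdf', 'doc2.pdf'), it returns True only when both files contain exactly the same bytes; 'doc2.pdf' is a placeholder name here.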

There are also some glaring problems with this code.
For example, instead of opening a file, doing the work in a try block, and closing the file handle manually, you should use the with statement, which closes the file automatically even if an exception is raised:

with open(filename1, "rb") as f:
    b = f.read(1)
    # ...


Finally, the function name and all variable names are very poor,
and don't help the readers understand their purpose and what you're trying to do.

Context

StackExchange Code Review Q#92676, answer score: 5
