JPEG extraction script
Problem
Here is a program that I've written to extract JPEGs from a file. It reads a file that contains the image data and separates it into individual images.
The script names the identified images with their SHA256 digest, and all of the photos that it finds will be dumped into the current directory.
import hashlib

inputfile = 'data.txt'
marker = chr(0xFF)+chr(0xD8)

# Input data
imagedump = file(inputfile, "rb").read()
imagedump = imagedump.split(marker)

count=0
for photo in imagedump:
    name = hashlib.sha256(photo).hexdigest()[0:16]+".jpg"
    file(name, "wb").write(marker+photo)
    count=count+1
print count
Here's how I test the script to see if it is working correctly:
- Type cd ~/images/
- Create the folder: mkdir test
- Dump some JPEGs into a single file in the directory: cat *.jpg > ./test/data.txt
- cd test and put the script into the current directory
- Run the script: python extract.py, and the JPEGs will be dumped into the current folder
How can I improve my script's performance? Are there any problems with its operation? The script does not know when an image ends; just when a new one starts. Could this cause problems?
Solution
There is an end-of-image marker, FF D9, but you can't scan for it blindly, because those bytes can also appear within a JPEG image. For example, if the JPEG contains a thumbnail, then FF D9 could mark the end of the thumbnail rather than of the whole image. In fact, the FF D8 start-of-image marker can also appear within a JPEG image for the same reason. Therefore, your technique is invalid.

To do a proper job, you must look for JPEG markers, most of which are followed by two bytes indicating the payload size, and advance the indicated number of bytes, until you hit the FF D9 marker. It might even be faster, since you can advance in chunks rather than scanning every byte sequentially.
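As a rough illustration of walking the markers, here is a minimal sketch in Python 3 (the script in the question is Python 2). The names find_jpeg_end, extract_jpegs, and dump_images are made up for this example. It relies on two details of the JPEG format: the two-byte big-endian segment length counts its own two bytes, and the compressed data following a start-of-scan segment (FF DA) has no length field, so the sketch skips it by looking for the next FF that is not followed by 00, a restart marker (D0 to D7), or another FF. It has not been hardened against malformed or truncated files beyond returning None for them.

```python
import hashlib
import struct

SOI = b"\xff\xd8"   # start-of-image marker
EOI = 0xD9          # end-of-image marker (second byte)
SOS = 0xDA          # start-of-scan marker (second byte)
# Markers that carry no length field: TEM, RST0..RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def _skip_scan_data(data, pos):
    """Return the index of the first real marker after entropy-coded data."""
    n = len(data)
    while True:
        pos = data.find(b"\xff", pos)
        if pos == -1 or pos + 1 >= n:
            return n                      # ran off the end: truncated image
        nxt = data[pos + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos += 2                      # stuffed byte or restart marker
        elif nxt == 0xFF:
            pos += 1                      # fill byte
        else:
            return pos                    # a genuine marker starts here


def find_jpeg_end(data, start):
    """Return the index just past the EOI of the JPEG at `start`, or None."""
    if data[start:start + 2] != SOI:
        return None
    pos, n = start + 2, len(data)
    while pos + 1 < n:
        if data[pos] != 0xFF:
            return None                   # expected a marker here
        marker = data[pos + 1]
        if marker == 0xFF:                # fill byte before a marker
            pos += 1
            continue
        pos += 2
        if marker == EOI:
            return pos
        if marker in STANDALONE:
            continue
        if pos + 2 > n:
            return None
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 2:
            return None                   # length includes its own 2 bytes
        pos += length                     # jump over the whole segment
        if marker == SOS:
            pos = _skip_scan_data(data, pos)
    return None


def extract_jpegs(data):
    """Return a list of byte strings, one per complete JPEG found in `data`."""
    images = []
    pos = data.find(SOI)
    while pos != -1:
        end = find_jpeg_end(data, pos)
        if end is None:
            pos = data.find(SOI, pos + 2)  # false start, try the next FF D8
        else:
            images.append(data[pos:end])
            pos = data.find(SOI, end)
    return images


def dump_images(input_path):
    """Write each extracted image to the current directory, named by digest."""
    with open(input_path, "rb") as f:
        images = extract_jpegs(f.read())
    for image in images:
        name = hashlib.sha256(image).hexdigest()[:16] + ".jpg"
        with open(name, "wb") as out:
            out.write(image)
    return len(images)
```

Calling dump_images('data.txt') would mirror the original script's behaviour. Because thumbnails live inside APPn segments that are skipped as a block, an embedded FF D8 or FF D9 no longer splits an image in two.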
Context
StackExchange Code Review Q#32981, answer score: 4