Image duplicate finder
Problem
I have a bunch of exact-duplicate pictures that I've acquired over the years. I'd like to create a list of all of them so I can eventually delete some. My idea was simple: dump the hash and location of every image file under a path into MongoDB for later analysis. This is what I came up with:
```
import com.david.mongodocs.ImageEntry;
import com.mongodb.MongoClient;
import org.apache.commons.codec.digest.DigestUtils;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.Morphia;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MD5Deduplicator {

    private static Datastore datastore;

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        Morphia morphia = new Morphia();
        morphia.mapPackage("com.david.mongodocs");
        datastore = morphia.createDatastore(new MongoClient(), "md5Deduplicator");
        datastore.ensureIndexes();
        logDuplicates(Paths.get(args[0]));
        System.out.println("Completed scan in " + (System.nanoTime() - startTime) + " nanosecs");
    }

    private static void logDuplicates(Path path) throws IOException {
        Files.walk(path).parallel()
                .filter(Files::isReadable)
                .filter(Files::isRegularFile)
                .forEach(filePath -> {
                    try {
                        String contentType = Files.probeContentType(filePath);
                        if (contentType != null && contentType.startsWith("image")) {
                            FileInputStream fis = new FileInputStream(filePath.toFile());
                            String md5 = DigestUtils.md5Hex(fis);
                            fis.close();
                            ImageEntry imageEntry = new ImageEntry(filePath.toAbsolutePath().toString(), md5);
                            datastore.save(imageEntry);
                        }
                    } catch (IOException e) {
                        // The source is cut off after the if-block; this handler and the
                        // closing braces below are a minimal assumed completion.
                        e.printStackTrace();
                    }
                });
    }
}
```
Solution
The question and the subject are interesting. I recently needed to do some similar processing, and this question pushed me to get on with it.
Review
First of all, some remarks about the original implementation.
The method logDuplicates(Path), as far as I can see, does not log duplicates. It saves to MongoDB an ImageEntry object for every image file found under the path. So, to trace duplicates, it is still necessary to run a couple of queries in MongoDB.
There is some room for improvement in this method. Calling datastore.save(imageEntry) for each item separately looks rather suspicious. Saving all available items in one batch should be quicker. Indeed, Datastore.save is overloaded with save(Iterable) and save(T... entities). A slightly improved version of the method would look like this:
```
private void digestImages(Path path) throws IOException {
    // APPROX_IMAGES_COUNT is assumed to be a constant defined elsewhere in the class.
    // The list is synchronized because the parallel stream adds to it from several threads.
    List<ImageEntry> images = Collections.synchronizedList(new ArrayList<>(APPROX_IMAGES_COUNT));
    Files.walk(path)
            .parallel()
            .filter(Files::isReadable)
            .filter(Files::isRegularFile)
            .forEach(filePath -> {
                if (isImage(filePath)) {
                    ImageEntry img = digestAndBuildImageEntry(filePath);
                    if (img != null) {
                        images.add(img);
                    } else {
                        System.out.println(String.format("Failed to digest image: %1$s", filePath));
                    }
                }
            });
    // Save everything in one batch instead of one call per image.
    datastore.save(images);
}

private boolean isImage(Path path) {
    try {
        String contentType = Files.probeContentType(path);
        return contentType != null && contentType.startsWith("image");
    } catch (IOException ex) {
        ex.printStackTrace();
        return false;
    }
}

private ImageEntry digestAndBuildImageEntry(Path filePath) {
    // try-with-resources closes both streams even if hashing fails.
    try (InputStream is = Files.newInputStream(filePath);
         BufferedInputStream buffered = new BufferedInputStream(is)) {
        String hash = DigestUtils.md5Hex(buffered);
        return new ImageEntry(filePath.toAbsolutePath().toString(), hash);
    } catch (IOException ex) {
        ex.printStackTrace();
        return null;
    }
}
```
Tests
For the tests I used a folder with about 900 JPG image files spread over numerous sub-folders. Introducing try-with-resources and batch saving seemed to improve the overall performance by about 10% (see results below).
Based on this SO post, which helped me recall the different ways and APIs to produce checksums and hashes, I ran a few tests to compare the performance of the original implementation, the reviewed one, and variants using Guava. The averages of 5 runs on my old i7 gave the following results:
```
impl                 avg time, ms       %
-----------------------------------------
original                     6583   100.0
reviewed                     5873    89.2
guava/sha1                   8267   125.6
guava/md5                    5865    89.1
guava/murmur3-128            3819    58.0
guava/murmur3-32             2689    40.9
guava/adler32                2432    36.9
-----------------------------------------
```
For the tests with Guava, I used the following overload of digestAndBuildImageEntry:
```
// HashFunction is com.google.common.hash.HashFunction.
private ImageEntry digestAndBuildImageEntry(Path filePath, HashFunction hashFunc) {
    try {
        // Newer Guava versions deprecate Files.hash in favour of
        // Files.asByteSource(file).hash(hashFunc).
        String hash = com.google.common.io.Files.hash(filePath.toFile(), hashFunc).toString();
        return new ImageEntry(filePath.toAbsolutePath().toString(), hash);
    } catch (IOException ex) {
        ex.printStackTrace();
        return null;
    }
}
```
So, at least for my test case and without entering into theoretical discussions, Guava's adler32 hash runs about 2.7 times faster than the original implementation (36.9% of its time).
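Since neither version actually reports duplicates, here is a minimal sketch of that last step, done in memory rather than through MongoDB queries. It is not part of the reviewed code: the class name DuplicateGroups and its methods are hypothetical, it uses the JDK's java.util.zip.Adler32 instead of Guava, and it skips the image filter for brevity. Adler-32 is only a 32-bit checksum and different files can share a value, so files with the same checksum are compared byte by byte before being reported.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.Adler32;

// Hypothetical helper, not part of the reviewed code.
final class DuplicateGroups {

    // Streams the file through Adler-32 and returns the checksum value.
    static long checksum(Path file) {
        Adler32 adler = new Adler32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                adler.update(buffer, 0, read);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return adler.getValue();
    }

    // Byte-for-byte comparison. Reads both files fully, so it suits modest file sizes only.
    static boolean sameContent(Path a, Path b) {
        try {
            return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Returns every group of two or more regular files under root with identical content.
    static List<List<Path>> findDuplicates(Path root) throws IOException {
        Map<Long, List<Path>> byChecksum;
        try (Stream<Path> paths = Files.walk(root)) {
            byChecksum = paths.filter(Files::isRegularFile)
                    .collect(Collectors.groupingBy(DuplicateGroups::checksum));
        }
        List<List<Path>> result = new ArrayList<>();
        for (List<Path> candidates : byChecksum.values()) {
            // Split each checksum bucket into groups of truly identical files.
            List<List<Path>> confirmed = new ArrayList<>();
            for (Path candidate : candidates) {
                List<Path> home = null;
                for (List<Path> group : confirmed) {
                    if (sameContent(group.get(0), candidate)) {
                        home = group;
                        break;
                    }
                }
                if (home == null) {
                    home = new ArrayList<>();
                    confirmed.add(home);
                }
                home.add(candidate);
            }
            for (List<Path> group : confirmed) {
                if (group.size() > 1) {
                    result.add(group);
                }
            }
        }
        return result;
    }
}
```

Calling DuplicateGroups.findDuplicates(Paths.get(args[0])) and printing each returned group would give the list of candidates for deletion. The byte comparison only runs inside a checksum bucket, so the cheap hash does most of the filtering and the expensive check is limited to likely duplicates.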
Context
StackExchange Code Review Q#110314, answer score: 2