Image duplicate finder

Submitted by: @import:stackexchange-codereview

Problem

I have a bunch of exact-duplicate pictures that I've acquired over the years. I'd like to create a list of all of them so I can eventually delete some. My idea was simple: dump the hash and location of every image file under a path into MongoDB for later analysis. This is what I came up with:

```java
import com.david.mongodocs.ImageEntry;
import com.mongodb.MongoClient;
import org.apache.commons.codec.digest.DigestUtils;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.Morphia;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MD5Deduplicator {
    private static Datastore datastore;

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        Morphia morphia = new Morphia();
        morphia.mapPackage("com.david.mongodocs");
        datastore = morphia.createDatastore(new MongoClient(), "md5Deduplicator");
        datastore.ensureIndexes();
        logDuplicates(Paths.get(args[0]));
        System.out.println("Completed scan in " + (System.nanoTime() - startTime) + " nanosecs");
    }

    private static void logDuplicates(Path path) throws IOException {
        Files.walk(path).parallel()
                .filter(Files::isReadable)
                .filter(Files::isRegularFile)
                .forEach(filePath -> {
                    try {
                        String contentType = Files.probeContentType(filePath);
                        if (contentType != null && contentType.startsWith("image")) {
                            FileInputStream fis = new FileInputStream(filePath.toFile());
                            String md5 = DigestUtils.md5Hex(fis);
                            fis.close();
                            ImageEntry imageEntry = new ImageEntry(filePath.toAbsolutePath().toString(), md5);
                            datastore.save(imageEntry);
                        }
                        // ... (the rest of the listing is cut off in the source)
```

Solution

The question and its subject are interesting. I recently needed to do some similar processing, and this question pushed me to get on with it.

Review

First of all, some remarks about the original implementation.

The method logDuplicates(Path), as far as I can see, does not log duplicates. It saves to MongoDB an ImageEntry object for every image file found under the path. So, to find the actual duplicates, you still have to run a couple of queries against MongoDB.
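To make that remaining step concrete, here is a minimal in-memory sketch of the grouping such a query would have to perform. It is not a MongoDB query and not the poster's code: the class `DuplicateGrouper` and its nested `Entry` type are hypothetical stand-ins for ImageEntry, assumed here to carry just a path and a hex digest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; illustrates the grouping only, without any database. */
final class DuplicateGrouper {

    /** Assumed stand-in for ImageEntry: a file path plus the hex digest of its content. */
    static final class Entry {
        final String path;
        final String hash;

        Entry(String path, String hash) {
            this.path = path;
            this.hash = hash;
        }
    }

    /** Returns hash -> paths, keeping only hashes shared by at least two files. */
    static Map<String, List<String>> findDuplicates(List<Entry> entries) {
        Map<String, List<String>> byHash = new TreeMap<>();
        for (Entry e : entries) {
            byHash.computeIfAbsent(e.hash, k -> new ArrayList<>()).add(e.path);
        }
        // A hash seen only once is not a duplicate, so drop those groups.
        byHash.values().removeIf(paths -> paths.size() < 2);
        return byHash;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("/pics/a.jpg", "aaa"),
                new Entry("/pics/b.jpg", "bbb"),
                new Entry("/backup/a-copy.jpg", "aaa"));
        findDuplicates(entries)
                .forEach((hash, paths) -> System.out.println(hash + " -> " + paths));
    }
}
```

With the entries stored in MongoDB, the same grouping would presumably be expressed as an aggregation on the hash field rather than done in application memory.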

There is some room for improvement in this method. Calling datastore.save(imageEntry) for each item separately looks suspicious: saving all available items in one batch should be quicker. Indeed, Datastore.save is overloaded with save(Iterable) and save(T... entities). A slightly improved version of the method could look like this:

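```java
// Needs java.io.*, java.nio.file.*, java.util.* and java.util.stream.Stream.
private void digestImages(Path path) throws IOException {
  // forEach on a parallel stream runs on several threads, so a plain ArrayList is not safe here.
  List<ImageEntry> images = Collections.synchronizedList(new ArrayList<>(APPROX_IMAGES_COUNT));
  // Files.walk keeps directories open until the stream is closed.
  try (Stream<Path> paths = Files.walk(path)) {
    paths.parallel()
         .filter(Files::isReadable)
         .filter(Files::isRegularFile)
         .forEach(filePath -> {
           if (isImage(filePath)) {
             ImageEntry img = digestAndBuildImageEntry(filePath);
             if (img != null) {
               images.add(img);
             } else {
               System.out.println(String.format("Failed to digest image: %1$s", filePath));
             }
           }
         });
  }
  // One batch write instead of one save() call per image.
  datastore.save(images);
}

// Treats a file as an image when its probed MIME type starts with "image".
private boolean isImage(Path path) {
  try {
    String contentType = Files.probeContentType(path);
    return contentType != null && contentType.startsWith("image");
  } catch (IOException ex) {
    ex.printStackTrace();
    return false;
  }
}

// Returns null when the file cannot be read; the caller reports the failure.
private ImageEntry digestAndBuildImageEntry(Path filePath) {
  try (InputStream is = Files.newInputStream(filePath);
       BufferedInputStream buffered = new BufferedInputStream(is)) {
    String hash = DigestUtils.md5Hex(buffered);
    return new ImageEntry(filePath.toAbsolutePath().toString(), hash);
  } catch (IOException ex) {
    ex.printStackTrace();
    return null;
  }
}
```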
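When a parallel stream has to fill a list, an alternative to appending to a shared list inside forEach is to let the stream gather the results itself with collect, which handles the per-thread accumulation. The sketch below illustrates only that pattern: `ParallelCollectDemo` and its `digest` method are hypothetical stand-ins for the real hashing step, and a null return simulates a file that could not be digested.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical demo class; digest() stands in for the real hashing step. */
final class ParallelCollectDemo {

    /** Pretends to digest item i; returns null for every tenth item to simulate a failure. */
    static String digest(int i) {
        return i % 10 == 0 ? null : "hash-" + i;
    }

    /** Digests items 1..count in parallel and gathers the successful results. */
    static List<String> digestAll(int count) {
        return IntStream.rangeClosed(1, count)
                .parallel()
                .mapToObj(ParallelCollectDemo::digest)
                .filter(Objects::nonNull)       // drop the items that failed
                .collect(Collectors.toList());  // safe accumulation across threads
    }

    public static void main(String[] args) {
        System.out.println(digestAll(1000).size());
    }
}
```

The resulting list can then be passed to a batch save in a single call, as in digestImages above.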


Tests

For the tests I used a folder with about 900 JPG image files spread over numerous sub-folders. Introducing try-with-resources and batch saving seemed to improve overall performance by about 10% (see the results below).

Prompted by an SO post that helped me recall the different ways and APIs to produce checksums and hashes, I ran a few tests to compare the original implementation with the reviewed one and with variants using Guava. The averages of 5 runs on my old i7 were as follows:

```text
impl                avg time (ms)   % of original
-------------------------------------------------
original                 6583           100.0
reviewed                 5873            89.2
guava/sha1               8267           125.6
guava/md5                5865            89.1
guava/murmur3-128        3819            58.0
guava/murmur3-32         2689            40.9
guava/adler32            2432            36.9
-------------------------------------------------
```
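For comparison, both a cryptographic digest and a plain checksum can also be produced with the JDK alone. The sketch below was not part of the measurements above; the class `JdkDigests` and its method names are assumptions made for illustration. It streams the input through java.security.MessageDigest for MD5 and java.util.zip.Adler32 for the checksum.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;

/** Hypothetical helper showing JDK-only hashing of a stream. */
final class JdkDigests {

    /** Reads the stream to its end and returns the MD5 digest as lowercase hex. */
    static String md5Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (NoSuchAlgorithmException ex) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(ex);
        }
    }

    /** Reads the stream to its end and returns its Adler-32 checksum. */
    static long adler32(InputStream in) {
        try {
            Adler32 checksum = new Adler32();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                checksum.update(buf, 0, n);
            }
            return checksum.getValue();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(md5Hex(new ByteArrayInputStream(data)));
        System.out.println(Long.toHexString(adler32(new ByteArrayInputStream(data))));
    }
}
```

For files, the same methods could be fed a stream from Files.newInputStream(path) inside a try-with-resources block.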


For the tests with Guava, I used the following overload of digestAndBuildImageEntry:

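```java
// Needs com.google.common.hash.HashFunction from Guava.
private ImageEntry digestAndBuildImageEntry(Path filePath, HashFunction hashFunc) {
  try {
    // Newer Guava releases deprecate Files.hash in favour of Files.asByteSource(file).hash(hashFunc).
    String hash = com.google.common.io.Files.hash(filePath.toFile(), hashFunc).toString();
    return new ImageEntry(filePath.toAbsolutePath().toString(), hash);
  } catch (IOException ex) {
    ex.printStackTrace();
    return null;
  }
}
```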


So, at least for my test case, and without going into theoretical discussions, Guava's adler32 hash is about 2.7 times faster than the original implementation (36.9% of its run time).
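One practical point when choosing a fast function: adler32 and murmur3-32 produce only 32 bits, so two different files can end up with the same value. A possible safeguard, sketched below, is to treat an equal hash as a candidate only and to confirm it with a byte-wise comparison before deleting anything. The class `ContentCheck` and its method `sameContent` are hypothetical names, not part of the reviewed code.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that confirms a suspected duplicate by comparing bytes. */
final class ContentCheck {

    /** Returns true only if both files have the same length and identical bytes. */
    static boolean sameContent(Path a, Path b) {
        try {
            // Cheap early exit: files of different sizes cannot be identical.
            if (Files.size(a) != Files.size(b)) {
                return false;
            }
            try (InputStream x = new BufferedInputStream(Files.newInputStream(a));
                 InputStream y = new BufferedInputStream(Files.newInputStream(b))) {
                int bx;
                while ((bx = x.read()) != -1) {
                    if (bx != y.read()) {
                        return false;
                    }
                }
                // Both streams must end at the same position.
                return y.read() == -1;
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Since only files that already share a hash need this check, the extra reads should stay limited to the actual duplicate candidates.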

Context

StackExchange Code Review Q#110314, answer score: 2
