HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Finding duplicates in given list of files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filesfindinglistgivenduplicates

Problem

I implemented a simple tool to find duplicate files, by applying the following strategy:

  • Take a list of files and return the sets of duplicates (set of sets)



  • Implement a custom Comparator to compare files by content



  • Implement another comparator that calls the first one, and when two files are equal, mark them as duplicates, using a "duplicate tracker" helper class



  • Pass the list of files to Collections.sort, using the custom comparator with the duplicate tracker. The goal of sort algorithms is to reduce the number of comparisons, which seems to suit well for my purpose too.



The main method calling the other pieces:

public class DuplicateFileFinder {

    /**
     * Find duplicates in given list of files, ignoring I/O errors
     *
     * @param files the list of files to check
     * @return sets of duplicate files
     */
    public Set> findDuplicates(List files) {
        DuplicateTracker tracker = new DuplicateTracker<>();
        Comparator comparator = new FileContentComparator();

        Collections.sort(files, (file1, file2) -> {
            int cmp = comparator.compare(file1, file2);
            if (cmp == 0) {
                tracker.add(file1, file2);
            }
            return cmp;
        });
        return tracker.getDuplicates();
    }
}


The Comparator to compare files by content, ignoring I/O errors:

```
/**
* A comparator to compare two files by content.
* In case of I/O errors, fall back to default comparator of File.
*/
public class FileContentComparator implements Comparator {

private static final int DEFAULT_BUFSIZE = 1024 * 1024;
private final int bufSize;

public FileContentComparator() {
bufSize = DEFAULT_BUFSIZE;
}

public FileContentComparator(int bufSize) {
this.bufSize = bufSize;
}

@Override
public int compare(File file1, File file2) {
int cmpFileSize = Long.compare(file1.length(), file2.length());
if (cmpFileSize != 0) {
r

Solution

For real use, you normally want to avoid directly comparing file contents as much as possible.

To do that, you do testing in at least two tiers, and possibly three. First you compare the file sizes. If they're different, the files are different and you're done. Most file systems store file sizes directly, so you can retrieve it very quickly, and since it's only one number the comparison is very fast.

From there you have an (optional) second comparison. This is especially applicable if (for example) the two sets of files are physically remote from each other, so it's fairly cheap to read both files (locally) but much more expensive to transmit them for comparison. In this case, you can do a hash on each file, and compare the hashes.

Then, if the previous test(s) matched, you can do a full comparison of the two files. Or maybe you don't bother--if you use serious cryptographic hash (e.g., SHA-512) for the intermediate comparison, it may not be worth bothering to compare the full contents--the chances of finding a collision with SHA-512 are so minuscule you may not consider it worth the bother. On the other hand, if you used something like a 16-bit (or even 32-bit) CRC as your "hash", you nearly need more to confirm an apparent duplicate.

Context

StackExchange Code Review Q#121707, answer score: 7

Revisions (0)

No revisions yet.