Finding duplicates in given list of files
Problem
I implemented a simple tool to find duplicate files by applying the following strategy:
- Take a list of files and return the sets of duplicates (a set of sets)
- Implement a custom `Comparator` to compare files by content
- Implement another comparator that calls the first one and, when two files are equal, marks them as duplicates, using a "duplicate tracker" helper class
- Pass the list of files to `Collections.sort`, using the custom comparator with the duplicate tracker. The goal of sort algorithms is to reduce the number of comparisons, which seems to suit my purpose well.
The main method calling the other pieces:
```
import java.io.File;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class DuplicateFileFinder {
    /**
     * Find duplicates in the given list of files, ignoring I/O errors.
     *
     * @param files the list of files to check
     * @return sets of duplicate files
     */
    public Set<Set<File>> findDuplicates(List<File> files) {
        DuplicateTracker<File> tracker = new DuplicateTracker<>();
        Comparator<File> comparator = new FileContentComparator();
        Collections.sort(files, (file1, file2) -> {
            int cmp = comparator.compare(file1, file2);
            if (cmp == 0) {
                tracker.add(file1, file2);
            }
            return cmp;
        });
        return tracker.getDuplicates();
    }
}
```
The `Comparator` to compare files by content, ignoring I/O errors:
```
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Comparator;

/**
 * A comparator to compare two files by content.
 * In case of I/O errors, fall back to the default comparator of File.
 */
public class FileContentComparator implements Comparator<File> {
    private static final int DEFAULT_BUFSIZE = 1024 * 1024;
    private final int bufSize;

    public FileContentComparator() {
        bufSize = DEFAULT_BUFSIZE;
    }

    public FileContentComparator(int bufSize) {
        this.bufSize = bufSize;
    }

    @Override
    public int compare(File file1, File file2) {
        int cmpFileSize = Long.compare(file1.length(), file2.length());
        if (cmpFileSize != 0) {
            return cmpFileSize;
        }
        // Same size: compare contents buffer by buffer.
        try (InputStream in1 = new FileInputStream(file1);
             InputStream in2 = new FileInputStream(file2)) {
            byte[] buf1 = new byte[bufSize];
            byte[] buf2 = new byte[bufSize];
            int read1;
            while ((read1 = in1.readNBytes(buf1, 0, bufSize)) > 0) {
                int read2 = in2.readNBytes(buf2, 0, bufSize);
                int cmp = Arrays.compare(buf1, 0, read1, buf2, 0, read2);
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        } catch (IOException e) {
            // Fall back to the default comparator of File on I/O errors.
            return file1.compareTo(file2);
        }
    }
}
```
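The `DuplicateTracker` helper class is referenced above but its source is not included in this excerpt. A minimal sketch of what such a tracker might look like (everything beyond the `DuplicateTracker` name and its `add`/`getDuplicates` methods is my assumption): it merges every pair reported as equal into a shared set, like a simple union-find over files.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the duplicate-tracker helper: records pairs of
 * equal elements and merges them into shared groups.
 */
public class DuplicateTracker<T> {
    // Maps each seen element to the group (set of duplicates) it belongs to.
    private final Map<T, Set<T>> groups = new HashMap<>();

    /** Record that a and b are duplicates, merging their groups if needed. */
    public void add(T a, T b) {
        Set<T> groupA = groups.get(a);
        Set<T> groupB = groups.get(b);
        if (groupA == null && groupB == null) {
            Set<T> group = new HashSet<>(Arrays.asList(a, b));
            groups.put(a, group);
            groups.put(b, group);
        } else if (groupA == null) {
            groupB.add(a);
            groups.put(a, groupB);
        } else if (groupB == null) {
            groupA.add(b);
            groups.put(b, groupA);
        } else if (groupA != groupB) {
            // Both already tracked in different groups: merge them.
            groupA.addAll(groupB);
            for (T t : groupB) {
                groups.put(t, groupA);
            }
        }
    }

    /** @return the distinct sets of duplicates collected so far */
    public Set<Set<T>> getDuplicates() {
        // Copying into a HashSet de-duplicates the shared group references.
        return new HashSet<>(groups.values());
    }
}
```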
Solution
For real use, you normally want to avoid directly comparing file contents as much as possible.
To do that, you do testing in at least two tiers, and possibly three. First you compare the file sizes. If they're different, the files are different and you're done. Most file systems store file sizes directly, so you can retrieve it very quickly, and since it's only one number the comparison is very fast.
From there you have an (optional) second comparison. This is especially applicable if (for example) the two sets of files are physically remote from each other, so it's fairly cheap to read both files (locally) but much more expensive to transmit them for comparison. In this case, you can do a hash on each file, and compare the hashes.
Then, if the previous test(s) matched, you can do a full comparison of the two files. Or maybe you don't bother: if you use a serious cryptographic hash (e.g., SHA-512) for the intermediate comparison, it may not be worth comparing the full contents, since the chances of finding a collision with SHA-512 are so minuscule that you may not consider it worth the bother. On the other hand, if you used something like a 16-bit (or even 32-bit) CRC as your "hash", you almost certainly need more to confirm an apparent duplicate.
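The tiered approach can be sketched as follows. This is a minimal illustration, not the answer's own code: the class name `TieredDuplicateFinder` is mine, SHA-256 stands in for the hash tier, and whole files are read at once for brevity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Two-tier duplicate finder: group by size first, hash only same-size files. */
public class TieredDuplicateFinder {

    public static Set<Set<Path>> findDuplicates(List<Path> files) throws IOException {
        // Tier 1: group by file size, a cheap metadata lookup with no content read.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path file : files) {
            bySize.computeIfAbsent(Files.size(file), k -> new ArrayList<>()).add(file);
        }

        // Tier 2: within each same-size group, group by content hash.
        Set<Set<Path>> duplicates = new HashSet<>();
        for (List<Path> group : bySize.values()) {
            if (group.size() < 2) {
                continue; // a unique size cannot have duplicates
            }
            Map<String, Set<Path>> byHash = new HashMap<>();
            for (Path file : group) {
                byHash.computeIfAbsent(hash(file), k -> new HashSet<>()).add(file);
            }
            for (Set<Path> sameHash : byHash.values()) {
                if (sameHash.size() > 1) {
                    duplicates.add(sameHash);
                }
            }
        }
        return duplicates;
    }

    // Reads the whole file at once for brevity; stream in chunks for large files.
    private static String hash(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(md.digest(Files.readAllBytes(file)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that grouping by size and hash avoids the comparator side-effect approach entirely: each file's content is read at most once, and files with unique sizes are never opened at all.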
Context
StackExchange Code Review Q#121707, answer score: 7