HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Optimization of file scanner

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
optimizationfilescanner

Problem

Program must to find duplicate files (by content) in a given directory(and subdirectories).

I collect all data in Map> map where Key is the size of file and Value is the List of paths to files with the same size.

public static void main(String[] args) {

     long start = System.currentTimeMillis();

     new FileScanner("/").searchFiles();

     System.out.println(System.currentTimeMillis() - start);
}


I tested program on root directory (Linux) / where total count of files is: 281091. The time is of scanning: 3131064 milliseconds. In my opinion it's may be more faster.

/boot/grub/biosdisk.mod
/usr/lib/grub/i386-pc/biosdisk.mod

/usr/lib/ruby/vendor_ruby/1.8/rubygems/command_manager.rb
/usr/lib/ruby/1.9.1/rubygems/command_manager.rb

/usr/share/pixmaps/openjdk-7.xpm
/usr/share/app-install/icons/_usr_share_icons_sun-java5.xpm
...


Also the program make log/files.log files where outputting paths of files the same contents separated groups on one blank line.

```
import org.apache.log4j.Logger;

import java.io.*;
import java.util.*;
import java.nio.file.Files;

public class FileScanner {

private String path, canonPath;

private static final int BUFFER_SIZE_SMALL = 1024; // 1024 byte
private static final int BUFFER_SIZE_MEDIUM = 1048576; // 1 mb
private static final int BUFFER_SIZE_BIG = 10485760; // 10 mb

private static Logger log = Logger.getLogger(FileScanner.class);

/*
* Data structure where keys is a size of file and
* value is list of canonical path to files the same size
*/
private Map> mapFiles;

/*
* Constructor using the specified path
*/
public FileScanner(String path) {
this.path = path;
mapFiles = new HashMap<>();
}

/*
* Constructor with the specified initial capacity
*/
public FileScanner(String path, int capacity) {
this.path = path;
mapFiles = new HashMap<>(capacity);
}

/*
* Ge

Solution

Another approach: Currently if you have three big files (with the same size) the algorithm will compare A with B and A with C. It reads A twice which could be avoided. Read each file once, store a hash value (MD5, SHA1, etc.) of the content and compare the hash only.

About the current implementation: If I'm right it contains a bug. In the following code paths.remove(path2) modifies the list during iteration:

for (ArrayList paths : mapFiles.values()) {
        if (paths.size() == 1) continue;
        for (int i = 0; i < paths.size(); i++) {
            String path1 = paths.get(i);
            boolean isFound = false;
            for (int j = 0; j < paths.size(); j++) {
                String path2 = paths.get(j);
                if (compareFiles(path1, path2)) {
                    log.info(path2);
                    paths.remove(path2);
                    isFound = true;
                }
            }
            if (isFound) log.info(path1 + "\n");
            paths.remove(path1);
        }
    }


I've created a directory with four files with the same content. It looks to me that the code does not read one of them. (I've put a log statement before compareFiles call.)

Some other notes about the code:

-
Instead of

private final Map> mapFiles;


use a Multimap. Guava has great implementations. (Doc, Javadoc)

-
canonPath should be a local variable inside toCanonicalPath since no other method uses it.
Furthermore, currently in case of an error the method returns the path of the previous file. It looks like a bugs.

-
Comments like this almos unnecessary:

/*
 * Get canonical path from File
 */
private String toCanonicalPath(final File file) { ... }


It says nothing more that the code already does, it's rather noise. (Clean Code by Robert C. Martin: Chapter 4: Comments, Noise Comments)

-
In the FileFilter subclass you could use guard clauses to flatten the arrow code.

-
I'd put the variable declarations to separate lines. From Code Complete, 2nd Edition, p759:


With statements on their own lines, the code reads from top to bottom, instead
of top to bottom and left to right. When you’re looking for a specific line of code,
your eye should be able to follow the left margin of the code. It shouldn’t have to
dip into each and every line just because a single line might contain two statements.

Code Snippets

for (ArrayList<String> paths : mapFiles.values()) {
        if (paths.size() == 1) continue;
        for (int i = 0; i < paths.size(); i++) {
            String path1 = paths.get(i);
            boolean isFound = false;
            for (int j = 0; j < paths.size(); j++) {
                String path2 = paths.get(j);
                if (compareFiles(path1, path2)) {
                    log.info(path2);
                    paths.remove(path2);
                    isFound = true;
                }
            }
            if (isFound) log.info(path1 + "\n");
            paths.remove(path1);
        }
    }
private final Map<Long, ArrayList<String>> mapFiles;
/*
 * Get canonical path from File
 */
private String toCanonicalPath(final File file) { ... }

Context

StackExchange Code Review Q#42587, answer score: 7

Revisions (0)

No revisions yet.