patternjavaMinor
Optimization of file scanner
Viewed 0 times
optimizationfilescanner
Problem
Program must to find duplicate files (by content) in a given directory(and subdirectories).
I collect all data in
I tested program on root directory (Linux)
Also the program make
```
import org.apache.log4j.Logger;
import java.io.*;
import java.util.*;
import java.nio.file.Files;
public class FileScanner {
private String path, canonPath;
private static final int BUFFER_SIZE_SMALL = 1024; // 1024 byte
private static final int BUFFER_SIZE_MEDIUM = 1048576; // 1 mb
private static final int BUFFER_SIZE_BIG = 10485760; // 10 mb
private static Logger log = Logger.getLogger(FileScanner.class);
/*
* Data structure where keys is a size of file and
* value is list of canonical path to files the same size
*/
private Map> mapFiles;
/*
* Constructor using the specified path
*/
public FileScanner(String path) {
this.path = path;
mapFiles = new HashMap<>();
}
/*
* Constructor with the specified initial capacity
*/
public FileScanner(String path, int capacity) {
this.path = path;
mapFiles = new HashMap<>(capacity);
}
/*
* Ge
I collect all data in
Map> map where Key is the size of file and Value is the List of paths to files with the same size.public static void main(String[] args) {
long start = System.currentTimeMillis();
new FileScanner("/").searchFiles();
System.out.println(System.currentTimeMillis() - start);
}I tested program on root directory (Linux)
/ where total count of files is: 281091. The time is of scanning: 3131064 milliseconds. In my opinion it's may be more faster./boot/grub/biosdisk.mod
/usr/lib/grub/i386-pc/biosdisk.mod
/usr/lib/ruby/vendor_ruby/1.8/rubygems/command_manager.rb
/usr/lib/ruby/1.9.1/rubygems/command_manager.rb
/usr/share/pixmaps/openjdk-7.xpm
/usr/share/app-install/icons/_usr_share_icons_sun-java5.xpm
...Also the program make
log/files.log files where outputting paths of files the same contents separated groups on one blank line.```
import org.apache.log4j.Logger;
import java.io.*;
import java.util.*;
import java.nio.file.Files;
public class FileScanner {
private String path, canonPath;
private static final int BUFFER_SIZE_SMALL = 1024; // 1024 byte
private static final int BUFFER_SIZE_MEDIUM = 1048576; // 1 mb
private static final int BUFFER_SIZE_BIG = 10485760; // 10 mb
private static Logger log = Logger.getLogger(FileScanner.class);
/*
* Data structure where keys is a size of file and
* value is list of canonical path to files the same size
*/
private Map> mapFiles;
/*
* Constructor using the specified path
*/
public FileScanner(String path) {
this.path = path;
mapFiles = new HashMap<>();
}
/*
* Constructor with the specified initial capacity
*/
public FileScanner(String path, int capacity) {
this.path = path;
mapFiles = new HashMap<>(capacity);
}
/*
* Ge
Solution
Another approach: Currently if you have three big files (with the same size) the algorithm will compare A with B and A with C. It reads A twice which could be avoided. Read each file once, store a hash value (MD5, SHA1, etc.) of the content and compare the hash only.
About the current implementation: If I'm right it contains a bug. In the following code
I've created a directory with four files with the same content. It looks to me that the code does not read one of them. (I've put a log statement before
Some other notes about the code:
-
Instead of
use a
-
Furthermore, currently in case of an error the method returns the path of the previous file. It looks like a bugs.
-
Comments like this almos unnecessary:
It says nothing more that the code already does, it's rather noise. (Clean Code by Robert C. Martin: Chapter 4: Comments, Noise Comments)
-
In the
-
I'd put the variable declarations to separate lines. From Code Complete, 2nd Edition, p759:
With statements on their own lines, the code reads from top to bottom, instead
of top to bottom and left to right. When you’re looking for a specific line of code,
your eye should be able to follow the left margin of the code. It shouldn’t have to
dip into each and every line just because a single line might contain two statements.
About the current implementation: If I'm right it contains a bug. In the following code
paths.remove(path2) modifies the list during iteration:for (ArrayList paths : mapFiles.values()) {
if (paths.size() == 1) continue;
for (int i = 0; i < paths.size(); i++) {
String path1 = paths.get(i);
boolean isFound = false;
for (int j = 0; j < paths.size(); j++) {
String path2 = paths.get(j);
if (compareFiles(path1, path2)) {
log.info(path2);
paths.remove(path2);
isFound = true;
}
}
if (isFound) log.info(path1 + "\n");
paths.remove(path1);
}
}I've created a directory with four files with the same content. It looks to me that the code does not read one of them. (I've put a log statement before
compareFiles call.)Some other notes about the code:
-
Instead of
private final Map> mapFiles;use a
Multimap. Guava has great implementations. (Doc, Javadoc)-
canonPath should be a local variable inside toCanonicalPath since no other method uses it.Furthermore, currently in case of an error the method returns the path of the previous file. It looks like a bugs.
-
Comments like this almos unnecessary:
/*
* Get canonical path from File
*/
private String toCanonicalPath(final File file) { ... }It says nothing more that the code already does, it's rather noise. (Clean Code by Robert C. Martin: Chapter 4: Comments, Noise Comments)
-
In the
FileFilter subclass you could use guard clauses to flatten the arrow code.-
I'd put the variable declarations to separate lines. From Code Complete, 2nd Edition, p759:
With statements on their own lines, the code reads from top to bottom, instead
of top to bottom and left to right. When you’re looking for a specific line of code,
your eye should be able to follow the left margin of the code. It shouldn’t have to
dip into each and every line just because a single line might contain two statements.
Code Snippets
for (ArrayList<String> paths : mapFiles.values()) {
if (paths.size() == 1) continue;
for (int i = 0; i < paths.size(); i++) {
String path1 = paths.get(i);
boolean isFound = false;
for (int j = 0; j < paths.size(); j++) {
String path2 = paths.get(j);
if (compareFiles(path1, path2)) {
log.info(path2);
paths.remove(path2);
isFound = true;
}
}
if (isFound) log.info(path1 + "\n");
paths.remove(path1);
}
}private final Map<Long, ArrayList<String>> mapFiles;/*
* Get canonical path from File
*/
private String toCanonicalPath(final File file) { ... }Context
StackExchange Code Review Q#42587, answer score: 7
Revisions (0)
No revisions yet.