Checking for differences between two (large) files
Problem
I want to write a relatively simple program that can back up files from my computer to a remote location and encrypt them in the process, while also computing a diff (well, not really a diff: I'm content with seeing whether anything changed at all, not so much what has changed) between the local and the remote files, to see which ones have changed and need to be updated.
I am aware that there are perfectly good programs out there that do this (rsync, or others based on duplicity). I'm not trying to reinvent the wheel; it's just supposed to be a learning experience for myself.
My question is regarding the diff part of the project. I have made some assumptions and written some sample code to test them out, but I would like to know if you see anything I might have missed, if the assumptions are just plain wrong, or if there's something that could go wrong in a particular constellation.
Assumption 1: If files are not of equal length, they cannot be the same (i.e. some modification must have taken place).
Assumption 2: If two files are the same (i.e. no modification has taken place), any byte subset of these two files will have the same hash.
Assumption 3: If a byte subset of the two files is found which does not produce the same hash, the two files are not the same (i.e. have been modified).
The code is written in Java, and the hashing algorithm used is BLAKE-512, using the Java implementation by Marc Greim.
_File1 and _File2 are two files > 1.5GB of type java.io.File.
```
public boolean compareStream() throws IOException {
    int i = 0;
    int step = 4096;
    boolean equal = false;
    FileInputStream fi1 = new FileInputStream(_File1);
    FileInputStream fi2 = new FileInputStream(_File2);
    byte[] fi1Content = new byte[step];
    byte[] fi2Content = new byte[step];
    if (_File1.length() == _File2.length()) { //Assumption 1
        while (i * step < _File1.length()) {
            fi1.read(fi1Content, 0, step); //Assumption 2
            fi2.read(fi2Content, 0, step);
            // (the rest of the method is truncated in the source; per the
            // stated assumptions it hashes both buffers, compares the digests
            // — Assumption 3 — and advances i)
        }
    }
    return equal;
}
```
Solution
There are some basic issues here, as well as some algorithmic complexities, and then some advanced suggestions.
Basic issues relate to Java code conventions, etc.
Basics
Use try-with-resources. You have code which may fail, and leave open files lying around to be garbage collected. Consider the following code:
```
try (FileInputStream fi1 = new FileInputStream(_File1);
     FileInputStream fi2 = new FileInputStream(_File2)) {
    // do stuff with the files - they will be auto-closed.
}
```
The next thing is: why open the files if they are different lengths?
```
FileInputStream fi1 = new FileInputStream(_File1);
FileInputStream fi2 = new FileInputStream(_File2);
byte[] fi1Content = new byte[step];
byte[] fi2Content = new byte[step];
if (_File1.length() == _File2.length()) { //Assumption 1
```
The code above should be:
```
if (_File1.length() == _File2.length()) { //Assumption 1
    FileInputStream fi1 = new FileInputStream(_File1);
    FileInputStream fi2 = new FileInputStream(_File2);
    byte[] fi1Content = new byte[step];
    byte[] fi2Content = new byte[step];
```
Use the power of the force, ... I mean parameters, Luke... I mean Daniel.
Your method should take the two files as parameters, not as class-level fields. As it stands, your code is not "reentrant", and it should be. Your method is:
```
public boolean compareStream() ....
```
but it should be
```
public boolean compareStream(File filea, File fileb) ....
```
Algorithm
Since you are comparing two files byte-by-byte, the hashing will make no difference. If the two files were on different machines, and you have a slow network between them, and if you could run the hashing algorithm remotely, then it probably makes sense to hash the two files on each side, and then just compare the small, and easy to transfer, hash result. Something like SHA-256.
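If you did take the remote-hashing route, a whole-file digest per side is easy to compute with the JDK's built-in MessageDigest (no external BLAKE library needed). A minimal sketch — the class name FileHash, the 4 MB chunk size, and the temp-file demo are my own choices:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHash {
    // Hash a file in 4 MB chunks and return the digest as a hex string.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[4 * 1024 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // feed only the bytes actually read
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("hash-a", ".bin");
        Path b = Files.createTempFile("hash-b", ".bin");
        Files.write(a, "same content".getBytes());
        Files.write(b, "same content".getBytes());
        System.out.println(sha256Hex(a).equals(sha256Hex(b))); // prints "true"
        Files.delete(a);
        Files.delete(b);
    }
}
```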
So, there's no need to hash, just do byte-by-byte comparisons.
For large files like yours, why have such a small step size? Use something much larger like 4MB, not 4KB. It will make it much faster.
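A direct byte-by-byte comparison with a large buffer could look like this sketch (the names are mine; readNBytes is used because a plain read() may return fewer bytes than requested, which the original code does not account for):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StreamCompare {
    // Compare two files chunk by chunk with a 4 MB buffer; no hashing needed.
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // Assumption 1: different length => different files
        }
        byte[] bufA = new byte[4 * 1024 * 1024];
        byte[] bufB = new byte[bufA.length];
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int nA;
            while ((nA = inA.readNBytes(bufA, 0, bufA.length)) > 0) {
                int nB = inB.readNBytes(bufB, 0, bufB.length);
                if (nA != nB || !Arrays.equals(bufA, 0, nA, bufB, 0, nB)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("cmp-a", ".bin");
        Path b = Files.createTempFile("cmp-b", ".bin");
        Files.write(a, new byte[]{1, 2, 3});
        Files.write(b, new byte[]{1, 2, 3});
        System.out.println(sameContent(a, b)); // prints "true"
        Files.delete(a);
        Files.delete(b);
    }
}
```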
Alternatives
File IO is always slower than you want. Java has the NIO framework for higher-performance IO using Channels and Buffers. This would be a great time to learn how to use them, because, a 4MB Memory-Mapped IO operation on the two files will likely give you the best performance.
See the MemoryMapped IO JavaDoc
I ran up a test using NIO, and produced the following code:
```
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;

public static final boolean compareFiles(final Path filea, final Path fileb) throws IOException {
    if (Files.size(filea) != Files.size(fileb)) {
        return false;
    }
    final long size = Files.size(filea);
    final int mapspan = 4 * 1024 * 1024;
    try (FileChannel chana = (FileChannel) Files.newByteChannel(filea);
         FileChannel chanb = (FileChannel) Files.newByteChannel(fileb)) {
        for (long position = 0; position < size; position += mapspan) {
            MappedByteBuffer mba = mapChannel(chana, position, size, mapspan);
            MappedByteBuffer mbb = mapChannel(chanb, position, size, mapspan);
            if (mba.compareTo(mbb) != 0) {
                return false;
            }
        }
    }
    return true;
}

private static MappedByteBuffer mapChannel(FileChannel channel, long position, long size, int mapspan) throws IOException {
    final long end = Math.min(size, position + mapspan);
    final long maplen = end - position;
    return channel.map(MapMode.READ_ONLY, position, maplen);
}
```
Note, the guts could be rewritten more concisely too:
```
if (!mapChannel(chana, position, size, mapspan)
        .equals(mapChannel(chanb, position, size, mapspan))) {
    return false;
}
```
On my laptop, this is comparing 1.5GB files in under 2 seconds. Obviously, your mileage may vary, and my laptop is an unknown beast.... but things that may play into the equation:
- I have 16GB mem
- it's a 4 year old laptop
- it has an SSD
- there is file-system encryption
- it runs linux.
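To see the memory-mapped approach run end-to-end on small inputs, here is a self-contained variant of the compareFiles idea above (the class name MappedCompare, FileChannel.open, and the temp-file demo are my own choices):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedCompare {
    // Compare two files by mapping 4 MB windows and letting the buffers
    // compare their contents directly.
    static boolean sameContent(Path a, Path b) throws IOException {
        long size = Files.size(a);
        if (size != Files.size(b)) {
            return false;
        }
        int span = 4 * 1024 * 1024;
        try (FileChannel chA = FileChannel.open(a);   // defaults to READ
             FileChannel chB = FileChannel.open(b)) {
            for (long pos = 0; pos < size; pos += span) {
                long len = Math.min(span, size - pos);
                MappedByteBuffer mA = chA.map(MapMode.READ_ONLY, pos, len);
                MappedByteBuffer mB = chB.map(MapMode.READ_ONLY, pos, len);
                if (!mA.equals(mB)) { // compares all remaining bytes
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("map-a", ".bin");
        Path b = Files.createTempFile("map-b", ".bin");
        byte[] data = new byte[1 << 20]; // 1 MB of zeros
        Files.write(a, data);
        Files.write(b, data);
        System.out.println(sameContent(a, b)); // prints "true"
        data[12345] = 1;                       // flip one byte
        Files.write(b, data);
        System.out.println(sameContent(a, b)); // prints "false"
        Files.delete(a);
        Files.delete(b);
    }
}
```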
Context
StackExchange Code Review Q#90147, answer score: 13