HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaModerate

Checking for differences between two (large) files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
checkingdifferencestwobetweenlargeforfiles

Problem

I want to write a relatively simple program, that can backup files from my computer to a remote location and encrypt them in the process, while also computing a diff (well not really...I'm content with seeing if anything changed at all, not so much what has changed) between the local and the remote files to see which ones have changed and are necessary to update.

I am aware that there are perfectly good programs out there to do this (rsync, or others based on duplicity). I'm not trying to reinvent the wheel, it's just supposed to be a learning experience for myself

My question is regarding to the diff part of the project. I have made some assumptions and wrote some sample code to test them out, but I would like to know if you see anything I might have missed, if the assumptions are just plain wrong, or if there's something that could go wrong in a particular constelation.

Assumption 1: If files are not of equal length, they can not be the same (ie. some modification must have taken place)

Assumption 2: If two files are the same (ie. no modification has taken place) any byte sub-set of these two files will have the same hash

Assumption 3: If a byte sub-set of two files is found which does not result in the same hash, the two files are not the same (ie. have been modified)

The code is written in Java and the hashing algorithm used is BLAKE-512 using the java implementation from Marc Greim.

_File1 and _File2 are 2 files > 1.5GB of type java.io.File

```
public boolean compareStream() throws IOException {
int i = 0;
int step = 4096;
boolean equal = false;

FileInputStream fi1 = new FileInputStream(_File1);
FileInputStream fi2 = new FileInputStream(_File2);

byte[] fi1Content = new byte[step];
byte[] fi2Content = new byte[step];

if(_File1.length() == _File2.length()) { //Assumption 1
while(i*step < _File1.length()) {

fi1.read(fi1Content, 0, step); //Assumption 2
fi2.read(fi2Content, 0

Solution

There are some basic issues here, as well as some algorithmic complexities, and then some advanced suggestions.

Basic issues relate to Java code conventions, etc.

Basics

Use try-with-resources. You have code which may fail, and leave open files lying around to be garbage collected. Consider the following code:

try (FileInputStream fi1 = new FileInputStream(_File1);      
    FileInputStream fi2 = new FileInputStream(_File2);) {

    // do stuff with the files - they will be auto-closed.

}


The next thing, is why open the files if they are different lengths?

FileInputStream fi1 = new FileInputStream(_File1);      
FileInputStream fi2 = new FileInputStream(_File2);

byte[] fi1Content = new byte[step];
byte[] fi2Content = new byte[step];

if(_File1.length() == _File2.length()) { //Assumption 1


The code above, should be:

if(_File1.length() == _File2.length()) { //Assumption 1
    FileInputStream fi1 = new FileInputStream(_File1);      
    FileInputStream fi2 = new FileInputStream(_File2);

    byte[] fi1Content = new byte[step];
    byte[] fi2Content = new byte[step];


Use the power of the force, ... I mean parameters, Luke... I mean Daniel.

Your method should take the two files as parameters, not as class-level fields. As it stands, your code is not "reentrant", and it should be. Your method is:

public boolean compareStream() ....


but it should be

public boolean compareStream(File filea, File fileb) ....


Algorithm

Since you are comparing two files byte-by-byte, the hashing will make no difference. If the two files were on different machines, and you have a slow network between them, and if you could run the hashing algorithm remotely, then it probably makes sense to hash the two files on each side, and then just compare the small, and easy to transfer, hash result. Something like SHA-256.

So, there's no need to hash, just do byte-by-byte comparisons.

For large files like yours, why have such a small step size? Use something much larger like 4MB, not 4KB. It will make it much faster.

Alternatives

File IO is always slower than you want. Java has the NIO framework for higher-performance IO using Channels and Buffers. This would be a great time to learn how to use them, because, a 4MB Memory-Mapped IO operation on the two files will likely give you the best performance.

See the MemoryMapped IO JavaDoc

I ran up a test using NIO, and produced the following code:

public static final boolean compareFiles(final Path filea, final Path fileb) throws IOException {
    if (Files.size(filea) != Files.size(fileb)) {
        return false;
    }

    final long size = Files.size(filea);
    final int mapspan = 4 * 1024 * 1024;

    try (FileChannel chana = (FileChannel)Files.newByteChannel(filea);
            FileChannel chanb = (FileChannel)Files.newByteChannel(fileb)) {

        for (long position = 0; position < size; position += mapspan) {
            MappedByteBuffer mba = mapChannel(chana, position, size, mapspan);
            MappedByteBuffer mbb = mapChannel(chanb, position, size, mapspan);

            if (mba.compareTo(mbb) != 0) {
                return false;
            }

        }

    }
    return true;
}

private static MappedByteBuffer mapChannel(FileChannel channel, long position, long size, int mapspan) throws IOException {
    final long end = Math.min(size, position + mapspan);
    final long maplen = (int)(end - position);
    return channel.map(MapMode.READ_ONLY, position, maplen);
}


Note, the guts could be rewritten more concisely too:

if (!mapChannel(chana, position, size, mapspan)
                   .equals(mapChannel(chanb, position, size, mapspan))) {
                return false;
            }


On my laptop, this is comparing 1.5GB files in under 2 seconds. Obviously, your milage may vary, and my laptop is an unknown beast.... but things that may play in to the equation:

  • I have 16GB mem



  • it's a 4 year old laptop



  • it has an SSD



  • there is file-system encryption



  • it runs linux.

Code Snippets

try (FileInputStream fi1 = new FileInputStream(_File1);      
    FileInputStream fi2 = new FileInputStream(_File2);) {

    // do stuff with the files - they will be auto-closed.

}
FileInputStream fi1 = new FileInputStream(_File1);      
FileInputStream fi2 = new FileInputStream(_File2);

byte[] fi1Content = new byte[step];
byte[] fi2Content = new byte[step];

if(_File1.length() == _File2.length()) { //Assumption 1
if(_File1.length() == _File2.length()) { //Assumption 1
    FileInputStream fi1 = new FileInputStream(_File1);      
    FileInputStream fi2 = new FileInputStream(_File2);

    byte[] fi1Content = new byte[step];
    byte[] fi2Content = new byte[step];
public boolean compareStream() ....
public boolean compareStream(File filea, File fileb) ....

Context

StackExchange Code Review Q#90147, answer score: 13

Revisions (0)

No revisions yet.