
Organizing and visually duplicating images

Submitted by: @import:stackexchange-codereview

Problem

This algorithm is supposed to organize images and separate out visual duplicates. I compare two images at a time: one is the image at the first index of the directory, and the other is each remaining image in the directory in turn. If the first-index image has duplicates, all of them are moved to a new folder named after the original image.

It works, but it takes approximately 5 minutes and 29 seconds to process 1090 images.

For example, if 2.png had 5 visually duplicated images, they would be moved to the folder:


C:\Users\user\Downloads\CaptchaCollection\Large\Duplicates\2

What this code is supposed to do:

- Copy all the images from another directory to:

  C:\Users\user\Downloads\CaptchaCollection\Large\

  This is because I want to back up my raw images in case I lose them.
- Count the images in the directory.
- Select the first index and compare it with all other images in the directory.
- If they are equal, move the duplicate to a new folder.
- When this is done, move the original to the Sorted folder.
- Select a new first index and repeat until everything is finished. At the end of the inner loop, the number of files in the directory is recalculated.
- Move the last file into the Sorted folder.
- Check for any empty folders.
- Recursively delete them.
- (Not involved in sorting) Log pretty much everything to a text file.

```
private bool compare(Bitmap bmp1, Bitmap bmp2)
{
    bool equals = true; // sets it to true first

    Rectangle rect = new Rectangle(0, 0, bmp1.Width, bmp1.Height);
    BitmapData bmpData1 = bmp1.LockBits(rect, ImageLockMode.ReadOnly, bmp1.PixelFormat);
    BitmapData bmpData2 = bmp2.LockBits(rect, ImageLockMode.ReadOnly, bmp2.PixelFormat);

    unsafe
    {
        byte* ptr1 = (byte*)bmpData1.Scan0.ToPointer();
        byte* ptr2 = (byte*)bmpData2.Scan0.ToPointer();
        int width = rect.Width *
        // (listing is truncated here in the source)
```

Solution

The algorithm as described is O(n^2) in the worst case. Assuming that there are n images and no two images are equal, the first one gets compared against n-1 others, the second one against n-2, and so on, so you have 1/2 n (n - 1) comparisons in total.
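
As a worked figure for the 1090 images mentioned in the question, and assuming the worst case where no two images are equal, the comparison count comes out as follows:

```latex
\[
(n-1) + (n-2) + \dots + 1 \;=\; \sum_{k=1}^{n-1} k \;=\; \frac{n(n-1)}{2},
\qquad
n = 1090 \;\Rightarrow\; \frac{1090 \cdot 1089}{2} = 593\,505 .
\]
```

Each of those comparisons reads two full bitmaps, whereas a hash-based approach reads each of the 1090 images once.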

A way around this is to calculate a checksum with a low enough collision probability for each image. SHA256 comes to mind, but there are plenty of others in the .NET Framework that would fit the bill.

Then the algorithm becomes (pseudo LINQ):

var equivalentImages = allImages.Select(i => Tuple.Create(HashImage(i), FullPath(i)))
                                .GroupBy(t => t.Item1);


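This way you only need to process the data of each image once and then group them by the hash. Note that the grouping key must compare by value: a byte[] compares by reference in .NET, so convert the hash to a string (or supply an IEqualityComparer) before grouping.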
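
To make the grouping idea concrete without touching the file system, here is a minimal sketch that groups in-memory byte arrays by their SHA-256 hash. The class name HashGrouping, the method names, and the name-to-bytes input shape are assumptions made for illustration; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class HashGrouping
{
    // Hex-encode a hash so it can be used as a value-comparable grouping key.
    public static string ToHex(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
    }

    // Group item names by the SHA-256 of their content.
    // 'items' pairs a name (e.g. a file name) with its raw bytes.
    public static Dictionary<string, List<string>> GroupByContent(
        IEnumerable<KeyValuePair<string, byte[]>> items)
    {
        using (var sha256 = SHA256.Create())
        {
            // ToDictionary forces evaluation while sha256 is still alive.
            return items
                .Select(kv => new { Key = ToHex(sha256.ComputeHash(kv.Value)), Name = kv.Key })
                .GroupBy(x => x.Key, x => x.Name)
                .ToDictionary(g => g.Key, g => g.ToList());
        }
    }
}
```

Every item is hashed exactly once, and items with identical bytes end up in the same list, in their original order.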

A few remarks to your code:

- Instead of disposing bmp1 and bmp2 manually, you should use a using block. This means they will get disposed even if any of the code in between throws an exception.
- Similarly, streams are IDisposable, so again you should wrap them in a using block.
- It is not clear that the writer is used for logging purposes until you read the code. I would abstract it behind an ILogger interface which I would pass in. It makes it much clearer that the messages are used for logging and removes the responsibility of managing the log file from the function (Single Responsibility Principle). In fact, you should have a look at log4net, which provides a lot more flexibility in terms of logging targets and filtering. Should your project grow, this will make things a bit easier.
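
The last two remarks could be sketched roughly as follows. The ILogger interface here is a hypothetical one-method abstraction (it is not log4net's API), and TextWriterLogger, ImageSorter, and Example are illustrative names that do not appear in the question's code.

```csharp
using System;
using System.IO;

// Hypothetical minimal logging abstraction (not log4net's API).
public interface ILogger
{
    void Log(string message);
}

// Writes each message as one line to any TextWriter (file, console, StringWriter).
public sealed class TextWriterLogger : ILogger
{
    private readonly TextWriter _writer;

    public TextWriterLogger(TextWriter writer)
    {
        if (writer == null) throw new ArgumentNullException("writer");
        _writer = writer;
    }

    public void Log(string message)
    {
        _writer.WriteLine(message);
    }
}

public static class ImageSorter
{
    // The sorting code only knows about ILogger; it neither opens nor closes the log.
    public static void Sort(string[] fileNames, ILogger logger)
    {
        foreach (var fileName in fileNames)
        {
            logger.Log("Processing " + fileName);
        }
    }
}

public static class Example
{
    // The caller owns the log file; 'using' closes it even if Sort throws.
    public static void Run(string logPath, string[] fileNames)
    {
        using (var writer = new StreamWriter(logPath))
        {
            ImageSorter.Sort(fileNames, new TextWriterLogger(writer));
        }
    }
}
```

Because the sorter depends only on the interface, swapping the file for the console or for a real logging library should not require touching the sorting code.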

Update

A very simple example use (loosely based on the MSDN example for copying the bitmap data):

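// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
//           using System.Runtime.InteropServices; using System.Security.Cryptography;
public byte[] HashImage(string fileName)
{
    using (var image = new Bitmap(fileName))
    using (var sha256 = SHA256.Create())
    {
        var rect = new Rectangle(0, 0, image.Width, image.Height);
        var data = image.LockBits(rect, ImageLockMode.ReadOnly, image.PixelFormat);
        try
        {
            // Copy the locked pixel buffer (including any stride padding) into managed memory.
            // Like the MSDN example, this assumes a top-down bitmap (positive stride).
            var totalBytes = Math.Abs(data.Stride) * data.Height;
            var rawData = new byte[totalBytes];
            Marshal.Copy(data.Scan0, rawData, 0, totalBytes);

            return sha256.ComputeHash(rawData);
        }
        finally
        {
            // Always release the lock, even if copying or hashing throws.
            image.UnlockBits(data);
        }
    }
}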


Update 2: To complete the example (based on your pastebin; note that I changed the method above to accept a file name rather than a Bitmap directly):

private Tuple<int, string> GetIndexedImage(string fileName)
{
    var baseFileName = Path.GetFileNameWithoutExtension(fileName);
    int index;
    if (int.TryParse(baseFileName, out index))
    {
        return Tuple.Create(index, fileName);
    }
    return null;
}

private void button1_Click(object sender, EventArgs e)
{
    string original = @"C:\Users\user\Documents\CaptchaCollection\";

    var equivalentImages = Directory.GetFiles(original)
                                    .Select(f => GetIndexedImage(f)) // build tuples (index, fileName) or null if parsing failed
                                    .Where(t => t != null)           // ignore all invalid ones
                                    .OrderBy(t => t.Item1)           // order by index
                                    .Select(t => Tuple.Create(HashToString(HashImage(t.Item2)), t.Item2)) // create new tuple (hash as hex string, fileName)
                                    .GroupBy(t => t.Item1);          // group by hash string (a byte[] key would compare by reference)

    // print groups
    foreach (var group in equivalentImages)
    {
        Console.WriteLine("All images with hash: {0}", group.Key);
        foreach (var t in group)
        {
            Console.WriteLine("\t{0}", t.Item2);
        }
    }
}

private string HashToString(byte[] hash)
{
    var builder = new StringBuilder();
    foreach (var b in hash)
    {
        builder.AppendFormat("{0:x2}", b);
    }
    return builder.ToString();
}


Note:

  • Unless the hashes collide (very unlikely) or there is a subtle bug I've missed (possible), all images within each group should have the same content, and you can do your file moving.

  • You probably don't want to hard-code the directory, but rather pass it in from outside.
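
One way the file moving might look once the groups exist is sketched below, with both target directories passed in rather than hard-coded. The class DuplicateMover and its parameters are hypothetical. The layout (first file of a group goes to the Sorted folder, the rest go to a Duplicates subfolder named after the first file) is an assumption based on the question's description.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class DuplicateMover
{
    // For each group of equal images: move the first file to 'sortedDir' and the
    // remaining files to 'duplicatesDir\<name of first file without extension>'.
    // Returns the number of duplicate files moved.
    public static int MoveGroups(
        IEnumerable<IEnumerable<string>> groups, string sortedDir, string duplicatesDir)
    {
        Directory.CreateDirectory(sortedDir);
        var moved = 0;

        foreach (var group in groups)
        {
            var files = group.ToList();
            if (files.Count == 0) continue;

            var original = files[0];
            File.Move(original, Path.Combine(sortedDir, Path.GetFileName(original)));

            if (files.Count == 1) continue;

            // Create a duplicates folder only when there is something to put in it,
            // so no empty folders need cleaning up afterwards.
            var target = Path.Combine(duplicatesDir, Path.GetFileNameWithoutExtension(original));
            Directory.CreateDirectory(target);
            foreach (var duplicate in files.Skip(1))
            {
                File.Move(duplicate, Path.Combine(target, Path.GetFileName(duplicate)));
                moved++;
            }
        }
        return moved;
    }
}
```

With the grouping from button1_Click, the file names could be projected out with equivalentImages.Select(g => g.Select(t => t.Item2)) before calling MoveGroups. Note that File.Move throws if the destination file already exists, so a real implementation would need to decide how to handle name clashes.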


Context

StackExchange Code Review Q#40122, answer score: 12
