HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Duplicate files finder in C++

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
duplicatefilesfinder

Problem

I've came across this question https://stackoverflow.com/questions/11760420/what-is-an-optimal-way-to-find-duplicate-files-in-c and there are two answers for it both the answers convey that it is bit difficult and other suggested C++ is not suited for this purpose so as a student I am interested to test my C++ skill on this tough work!

The Idea I have implemented the code is below:

-
A scanner* which will get the list of files recursively in a directory/drive including sub folders

-
Once a file is got the location is got into a temporary variable

-
STL map is made use of here where key is the size of the file and the value is the location

-
Since STL map does not allow duplicates once a duplicate key (files with same sizes) if found dupe keys then both the file location is added into the STL list.

-
So at this point we got the list of files with same sizes.

-
Using iterators each file location is accessed and hashed using MD5 algorithm and that is added into another map where hash is a key and value is the location

-
Likewise when duplicate keys (same hashes) => exactly duplicate files
is found it is added into another list

-
So we got complete list of the locations of duplicate files!

Talk is cheap here is the code I have implemented the same!

Tangerine.cpp

/ SWAMI KARUPPASWAMI THUNNAI
// Tangerine.cpp : Defines the entry point for the console application.
//

#include
#include
#include
#include
#include"md5.h"
#include"scan.h"
using namespace std;

string Process::generate_digest(char* location)
{
    string md5;
    md5 = CALL_MD5_Function(location);
    cout > array;
    Scan scanner;
    scanner.ListDirectoryContents(array);
    scanner.hasher();
    scanner.display();
    int stay;
    cin >> stay;
    return 0;
}


scan.h

```
#pragma once

#include
#include
#include
#include
using namespace std;

// Processing the files

class Process
{
private:
map fileduplicates;
list list_of_duplicates;
maphashes;
mapduplicates;

Solution

I'm not looking at weather your code could run faster, or use less memory, I'm looking at making you code readable and maintainable. Even my code, which is always perfect in every respect :), will have to be modified at some point and the poor individual that has to update it needs every help they can get. If the code is 'clean' then spotting improvements to it is much easier.

A plus point that I noticed is the order you have placed the include files in. Placing them with system files first all the way down to the local ones mean that if you get an error because of duplicate type or function it is reported in the local code, not the 'system' files. You can change the prototype for printf() in stdio.h so it doesn't clash with printf() in MyFile.h, but its not a recommended course of action.

Another plus point is the code is fairly easy to read, you aren't afraid of the space bar.

OK so the negative points, as you said "using std" is my first point, but what about comments personally I write them before I code, as a way of designing the code. Its also nice to explain what functions and classes are supposed to do and why.

Tangerine.cpp - Process::generate_digest

  • location isn't being changed so why not make it a const reference?



  • You don't check to see if location is NULL or invalid, is this done in the CALL_MD5_Function?



  • Assign the value to md5 on the same line as you declare it, even better initialise md5 with the correct value.



  • You are streaming without checking if it is empty or massively large is that a safe thing to do?



  • Why is this function in this file and not in Process.cpp



Tangerine.cpp - main

-
This is purely a style issue, either declare the variables at the top (C Style) or as near to use as you can (C++ Style), so move the declaration of array down one line.

-
You MUST validate the value in array. It could be anything, even NULL.

-
It might be worth telling the user to press a key to exit. Otherwise the app appears hung.

Scan.h

  • A header file should only contain one class definition. It makes it easier to find the code.



  • You are mixing wchar and char.



Scan.cpp

  • Why not use MAXPATH (?) for the size of sPath?



  • Recursion : OK but what happens when you encounter symbolic links (search for Junction.exe)



  • In process file and size could be constants.



  • In process() why return 0 why not make the function return void?



  • In file_size_calculator you are calculating the size as a double and then returning it as an int.



  • In file_duplication_detector the parameters could be constant.



  • In file_duplication_detector the iterator could be constant.



  • In hasher I think you iterators could all be const_iterators, but I would get rid of itr1 and itr2 they are confusing.



  • In display I think you iterators could all be const_iterators, but I would get rid of itr1 and itr2 they are confusing.

Context

StackExchange Code Review Q#146467, answer score: 2

Revisions (0)

No revisions yet.