HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Identify files within folder structure that have common file sizes

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fileidentifywithinsizesfilesfolderstructurethatcommonhave

Problem

The code takes in two arguments, a path to search and an output file name. It searches the supplied path (including sub directories) and writes out all files that have the same size as any other files in the path. This will form part of a multi-step process to identify duplicate files within the file system.

Example output, with 3 files 29 bytes long, 3 files 113 bytes long and 2 files 114 bytes long:

29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff
29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff2
29 c:\Program Files\Git\mingw64\libexec\git-core\mergetools\gvimdiff3
113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-7
113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-8
113 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT-9
114 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT+4
114 c:\Program Files\Git\mingw64\lib\tcl8.6\tzdata\Etc\GMT+5


`#include "stdafx.h"
#include
#include
#include
#include
#include

using namespace std::experimental::filesystem;

std::map> build_filesize_map(const std::string &pathToSearch);
void write_possible_duplicate_files(const std::string &outputFileName, const std::map> &sizeMap);

int main(int argc, char *argv[])
{
if (argc != 3) {
std::cout ";
return EXIT_FAILURE;
}

std::string pathToSearch(argv[1]);
std::string outputFileName(argv[2]);

auto sizeMap = build_filesize_map(pathToSearch);

// Write out a list of all files that share their size with other files
write_possible_duplicate_files(outputFileName, sizeMap);
}

// Search directory, any files with size > 0 are considered. They are added to a set
// within the map, which is keyed on the file size

std::map> build_filesize_map(const std::string &pathToSearch) {
auto sizeMap = std::map>();

for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
auto currentPath = next->path();
auto currentF

Solution

Clean nice code, reasonable data structure. Good job!

Don't worry too much about returning the std::map as even historically RVO would help you.

See: https://en.wikipedia.org/wiki/Return_value_optimization

iterator naming

Since you asked about naming convention I have to admit I wouldn't name this iterator next. More common name might be it but something more explicit like dir_entry would be probably even better.
It especially caught my eyes that "current path" is "next path" which is kind of strange.

for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
    auto currentPath = next->path();


const

My opinion is that there's never enough const. I would define all variables not intended to be changed later as const.

const auto currentPath = next->path();
const auto currentFileSize = is_regular_file(currentPath) ? file_size(currentPath) : 0;

    if (0 != currentFileSize) {
        const auto existingEntry = sizeMap.find(currentFileSize);


std::map::operator[]

Most of the time I am frustrated by map operator[] behavior. You on the other hand are lucky as it might simplify your code.

auto existingEntry = sizeMap.find(currentFileSize);
        if (existingEntry == sizeMap.end()) {
            std::set fileSet;
            fileSet.emplace(currentPath);
            sizeMap.emplace(currentFileSize, fileSet);
        }
        else {
            existingEntry->second.emplace(currentPath);
        }


could be done as

sizeMap[currentFileSize].emplace(currentPath)


Reason is that in case key given as argument to std::map::operator[] is not found new pair is inserted as {key, default value of value_type} which in your case means empty set.

See: http://en.cppreference.com/w/cpp/container/map/operator_at

iterating over map keys > 1

Since you are using an ordered map you might take advantage of that with std::map::lower_bound() and skip the first part with count = 1.

See: http://en.cppreference.com/w/cpp/container/map/lower_bound

for (auto sizeIterator = sizeMap.begin(); sizeIterator != sizeMap.end(); ++sizeIterator) {
    if (sizeIterator->second.size() == 1) continue;

    for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {


could be done like this

for (auto sizeIterator = sizeMap.lower_bound(1); sizeIterator != sizeMap.end(); ++sizeIterator) {
    for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {

Code Snippets

for (recursive_directory_iterator next(pathToSearch), end; next != end; ++next) {
    auto currentPath = next->path();
const auto currentPath = next->path();
const auto currentFileSize = is_regular_file(currentPath) ? file_size(currentPath) : 0;

    if (0 != currentFileSize) {
        const auto existingEntry = sizeMap.find(currentFileSize);
auto existingEntry = sizeMap.find(currentFileSize);
        if (existingEntry == sizeMap.end()) {
            std::set<path> fileSet;
            fileSet.emplace(currentPath);
            sizeMap.emplace(currentFileSize, fileSet);
        }
        else {
            existingEntry->second.emplace(currentPath);
        }
sizeMap[currentFileSize].emplace(currentPath)
for (auto sizeIterator = sizeMap.begin(); sizeIterator != sizeMap.end(); ++sizeIterator) {
    if (sizeIterator->second.size() == 1) continue;

    for (auto pathIteror = sizeIterator->second.begin(); pathIteror != sizeIterator->second.end(); ++pathIteror) {

Context

StackExchange Code Review Q#132927, answer score: 8

Revisions (0)

No revisions yet.