snippetcppMinor
In-place sorting of a binary file using heapsort
Viewed 0 times
sortingfileplaceheapsortbinaryusing
Problem
I have been working on sorting a file in-place with heapsort. I'm very glad it is working. But the time it takes is just too much. So I was wondering on how I could improve it.
Here are my requirements:
Before I show you my current code I want to add a few notes:
The reason I chose Heapsort over Quicksort is the fact that I want to have a progress bar. I ask you not to debate this requirement.
The files may be a lot larger than the available RAM. That's why you can't load the file entierly into RAM.
The first 8B of each entry are pretty much random and evenly distributed.
My current aproach constits of loading the first index entries into RAM until I reach the limit. And writing that data into the file after the sorting finishes.
Should you find any other issues like bad code style feel free to correct me too.
I use CSI escape sequences to display the progress bar. If you are not familiar with it just leave the code as it is. It is working as intended anyway and I just included it for the sake of completeness.
sortidx.h
```
#ifndef SORTIDX_H
#define SORTIDX_H
// Includes
#include
#include
#include
#include
#include
#include
#include
#include "util.h"
// Functions
// Returns the parent index.
constexpr size_t
Here are my requirements:
- The file consists of 14B index entries. One index entry consists of 8B of the actual index and 6B of offset. I want to sort these entries after the first 8B. The remaining 6B will need to be dragged along but will not be required for sorting.
- The sorting has to occur in-place. No additional files may be created
- I need to able to tell the progress. I display a progress bar of the sorting while it runs.
- Some amount of RAM may be used to speed up sorting. The amount will be passed as a parameter
- You can parallize the sorting. But if you do there will be a limit of threads you can create which would also be passed in a parameter if you need it.
- The sorting does not need to be stable.
- You may use a different sorting algorithm
Before I show you my current code I want to add a few notes:
The reason I chose Heapsort over Quicksort is the fact that I want to have a progress bar. I ask you not to debate this requirement.
The files may be a lot larger than the available RAM. That's why you can't load the file entierly into RAM.
The first 8B of each entry are pretty much random and evenly distributed.
My current aproach constits of loading the first index entries into RAM until I reach the limit. And writing that data into the file after the sorting finishes.
Should you find any other issues like bad code style feel free to correct me too.
I use CSI escape sequences to display the progress bar. If you are not familiar with it just leave the code as it is. It is working as intended anyway and I just included it for the sake of completeness.
sortidx.h
```
#ifndef SORTIDX_H
#define SORTIDX_H
// Includes
#include
#include
#include
#include
#include
#include
#include
#include "util.h"
// Functions
// Returns the parent index.
constexpr size_t
Solution
Overall
As I mention in the comments. There is nothing that makes this code inherently C++ like. It basically looks like a C program that has been translated in C++; a few utility classes like
So in modern C++ we use the concept of smart pointers to show ownership and its transfer of dynamically allocated objects. There are utility functions for correctly alloca
As I mention in the comments. There is nothing that makes this code inherently C++ like. It basically looks like a C program that has been translated in C++; a few utility classes like
std::fstream,std::string and a few other are used incorrectly so you may as well have used the C counterparts and the program looks the same.
- You do not take advantage of RAII (the biggest feature of C++).
- Your use of new/delete is very C like and you don't take into account ownership semantics (one of the greatest failings in C).
C++ very rarely uses RAW pointers and you always try and specify interfaces so that ownership is defined and thus lifespans controlled very tightly.
- Your lack of using classes bogs your code down into using
global mutable state a very bad pattern. That I am betting will make your threaded code hard to get correct.
Overall if you sent this in as the answer to an interview question for using C++ I would fail you and not give you the job. For a C job I suppose its OK.
Tools (Algorithms/Data structures/Iterators)
The C++ standard library contains a huge set of tools for you to use. You should learn them and use them rather than re-writting the wheel. std::sort() is a good example but the standard already has function to build and maintain a heap you don't need to write that yourself either.
Correctness
My initial thoughts were incorrect. The code works as expected.
Though it seems like some of the functions are not used (which threw my off). But with the test data provided I managed to run and validate the code as correct.
As it stands I am not convinced this code can actually sort a file larger than the cache. With a few fixes it could sort each section individually (each section being the size of the cache). But there is no code here that merges all the sorted sections together into a unified whole.
Given the two competing limitations I think this would be very hard to implement (though technically possible)
- The files may be a lot larger than the available RAM.
- The sorting has to occur in-place. No additional files may be created.
1) You either need to fit it all into memory to do the merge. 2) Use temporary files for each sorted section then merge into the destination file. 3) Do a complex dance of shuffling fields around the destination file as you do a merge sort into the destination file (this options does not seem attractive in the slightest).
Code Review
Using declarations.
Never do this.
using namespace std;
See every other code review. But you can find a detailed explanation here: Why is “using namespace std” considered bad practice?
Comments.
Useless comments are worse than no comments. The trouble with comments is that they need to be maintained (hence the concept of self documenting code). Your code should (function/variable names) should be written so that they are self documenting (which they are). If the name explains what it does there is little need to write a comment that says the same thing (as the comment needs to be maintained). If comments and code fall out of sync what does the maintainer do? Fix the comment to mach the code or fix the code to match the comment?
Good Comment:
// Sorts an idx file. Using chachSize bytes of RAM to speed it up.
void sortIDX( std::string idxFile, size_t cacheSize, bool quiet );
I get information from this comment about how the parameters are used. Though better parameter names may have achieved the same result.
Bad Comment:
// Writes the cache to the file
void writeFromCache();
If the comment is just echoing the function name then why have it?
Global Mutable State
streampos fileSize;
size_t numDataSets;
size_t limit;
atomic pos;
fstream* file;
size_t arraySize = 0;
IndexEntry* cacheArray;
Bunch of file local variables! These variables maintain the state for the following set of functions. So only one thread can use these functions at a time.
Also this is basic encapsulation. By using a class you are saying that all these work together as a whole. A subsequent maintainer is not going to come along and re-use the variables in the wrong way (as the class links them all into a single state).
By just making this a class. Putting all those variables as members of the class and the following functions as methods you get a way to localize changes. Now independent threads can use the same functions without trading on the toes of other threads.
New/Delete
In modern C++ you rarely see new and the use of delete` is even rarer. This is because RAW pointers do not convey ownership semantics. The owner of an object is the person responsible for deleting the object. Without ownership semantics it is unclear who should delete an object and this leads to a lot leaks and in bad cases double deletes.So in modern C++ we use the concept of smart pointers to show ownership and its transfer of dynamically allocated objects. There are utility functions for correctly alloca
Code Snippets
using namespace std;// Sorts an idx file. Using chachSize bytes of RAM to speed it up.
void sortIDX( std::string idxFile, size_t cacheSize, bool quiet );// Writes the cache to the file
void writeFromCache();streampos fileSize;
size_t numDataSets;
size_t limit;
atomic<size_t> pos;
fstream* file;
size_t arraySize = 0;
IndexEntry* cacheArray;IndexEntry* cacheArray;Context
StackExchange Code Review Q#142040, answer score: 4
Revisions (0)
No revisions yet.