snippetcMinor
Sort a binary file without loading it into memory or using a temporary file
Viewed 0 times
withoutfiletemporaryintobinaryloadingmemoryusingsort
Problem
I had a silly assignment:
You are given a binary file FLOATS.DAT of unknown length, type
data has been written in it. Write a program which sorts the data in
the file (the output is a sorted file FLOATS.DAT). You need to sort it
without loading the file into memory, or by using a temporary file.
Use the Shell sort algorithm.
Obviously by using a temp. file it would be easy, even more if you could load it into memory.
Here is my solution:
Can this be improved in any way?
Code snippet to generate the file with numbers 2.0, 3.0, 10.0, 4.0, 1.0, 7.0, 9.0, 5.0, 6.0, 8.0
```
int main()
{
FILE *dat;
dat = fopen("FLOATS.DAT", "wb");
You are given a binary file FLOATS.DAT of unknown length, type
floatdata has been written in it. Write a program which sorts the data in
the file (the output is a sorted file FLOATS.DAT). You need to sort it
without loading the file into memory, or by using a temporary file.
Use the Shell sort algorithm.
Obviously by using a temp. file it would be easy, even more if you could load it into memory.
Here is my solution:
#include
// sizeof(float) = 4 on my machine
int main(int argc, char **argv)
{
size_t sajz = sizeof(float); // for convenience
int n; // total number of floats in the file
FILE *dat;
dat = fopen("FLOATS.DAT", "r+b");
if (dat == NULL)
{
printf("Error while opening.\n");
return 1;
}
// find the number of floats in the file
fseek(dat, 0, SEEK_END);
n = ftell(dat)/sajz;
rewind(dat);
float tjmh, tx;
int i,j,h;
// standard shell sort algorithm, the hard part is when I have to
// read a position I first have to jump to it, then read it,
// same when writing to a location
for (h = n/2; h>0; h/=2)
{
for (i=h; i=h && tx < tjmh; j -= h)
{
fseek(dat, j*sajz, SEEK_SET);
fwrite(&tjmh, sajz, 1, dat);
rewind(dat);
// this was tricky to figure out
// I need the value of `j` after it has been
// subtracted from h
fseek(dat, (j - 2*h)*sajz, SEEK_SET);
fread(&tjmh, sajz, 1, dat);
rewind(dat);
}
fseek(dat, j*sajz, SEEK_SET);
fwrite(&tx, sajz, 1, dat);
rewind(dat);
}
}
fclose(dat);
return 0;
}Can this be improved in any way?
Code snippet to generate the file with numbers 2.0, 3.0, 10.0, 4.0, 1.0, 7.0, 9.0, 5.0, 6.0, 8.0
```
int main()
{
FILE *dat;
dat = fopen("FLOATS.DAT", "wb");
Solution
Don't Repeat Yourself (DRY)
You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the
I made
Now you can rewrite the original section (and the other four) like
If you're worried about the overhead, you could create a macro instead
which you'd use as
And the compiler will turn it into your original version.
Note that a good compiler will probably do that with the function as well.
Better names
The only way that I know what
Similar problem with
Consider changing
Reusability
Rather than writing this directly into
You could generalize this by passing in a comparison function.
However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.
Robustness
You say
And later
Consider flipping the order around and saying
Then if you change the data type of
Similarly,
Consider rewriting this as
with
Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.
fseek(dat, i*sajz, SEEK_SET);
fread(&tx, sajz, 1, dat);
rewind(dat);You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the
fwrite version). Make a function:void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
fseek(source, position * size, SEEK_SET);
fread(datum, size, 1, source);
rewind(source);
}I made
datum a void * because that's what fread takes. This way you don't have to write a new function if you change the datum type. Now you can rewrite the original section (and the other four) like
read_at_position(dat, i, sajz, &tx);If you're worried about the overhead, you could create a macro instead
#define READ_AT_POSITION(source, position, size, datum) \
do {\
fseek(source, (position) * (size), SEEK_SET);\
fread(datum, size, 1, source);\
rewind(source);\
} while (0)which you'd use as
READ_AT_POSITION(dat, i, sajz, &tx);And the compiler will turn it into your original version.
Note that a good compiler will probably do that with the function as well.
Better names
The only way that I know what
sajz and dat do is from looking at their declarations. Good names should balance being easy to write (e.g. size) with being descriptive (size_of_each_datum or size_of_float). Perhaps datum_size instead of sajz. Similar problem with
tjmh and tx. Neither name tells me anything about what they do. Consider changing
h to gap_size. Yes, h is a standard name in a shellsort (as per Wikipedia), but will most people remember that if they see it? Reusability
Rather than writing this directly into
main consider making a function. E.g. void sort_floats_in_file(FILE *source, size_t datum_size);You could generalize this by passing in a comparison function.
void sort_in_file(FILE *source, size_t datum_size, int (*compar)(const void *, const void*));However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.
Robustness
You say
size_t sajz = sizeof(float); // for convenienceAnd later
float tjmh, tx;Consider flipping the order around and saying
float tjmh, tx;
size_t sajz = sizeof tjmh; // for convenienceThen if you change the data type of
tjmh and tx, sajz will change to match automatically. As is, it would be easy to forget and only change in the one place. Similarly,
float a;
if (dat)
{
a = 2.00;
fwrite(&a, sizeof(float), 1, dat);Consider rewriting this as
if (dat)
{
test_write(2.00, dat);with
void test_write(float a, FILE *dat)
{
fwrite(&a, sizeof a, 1, dat);
}Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.
Code Snippets
fseek(dat, i*sajz, SEEK_SET);
fread(&tx, sajz, 1, dat);
rewind(dat);void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
fseek(source, position * size, SEEK_SET);
fread(datum, size, 1, source);
rewind(source);
}read_at_position(dat, i, sajz, &tx);#define READ_AT_POSITION(source, position, size, datum) \
do {\
fseek(source, (position) * (size), SEEK_SET);\
fread(datum, size, 1, source);\
rewind(source);\
} while (0)READ_AT_POSITION(dat, i, sajz, &tx);Context
StackExchange Code Review Q#135241, answer score: 8
Revisions (0)
No revisions yet.