HiveBrain v1.2.0
Get Started
← Back to all entries
snippetcMinor

Sort a binary file without loading it into memory or using a temporary file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withoutfiletemporaryintobinaryloadingmemoryusingsort

Problem

I had a silly assignment:


You are given a binary file FLOATS.DAT of unknown length, type float
data has been written in it. Write a program which sorts the data in
the file (the output is a sorted file FLOATS.DAT). You need to sort it
without loading the file into memory, or by using a temporary file.
Use the Shell sort algorithm.

Obviously by using a temp. file it would be easy, even more if you could load it into memory.

Here is my solution:

#include 

// sizeof(float) = 4 on my machine

int main(int argc, char **argv)
{
    size_t sajz = sizeof(float); // for convenience 
    int n; // total number of floats in the file

    FILE *dat;

    dat = fopen("FLOATS.DAT", "r+b");

    if (dat == NULL)
    {
        printf("Error while opening.\n");
        return 1;
    }

    // find the number of floats in the file
    fseek(dat, 0, SEEK_END);
    n = ftell(dat)/sajz;
    rewind(dat);

    float tjmh, tx;
    int i,j,h;

    // standard shell sort algorithm, the hard part is when I have to
    // read a position I first have to jump to it, then read it,
    // same when writing to a location
    for (h = n/2; h>0; h/=2)
    {
        for (i=h; i=h && tx < tjmh; j -= h)
            {

                fseek(dat, j*sajz, SEEK_SET);
                fwrite(&tjmh, sajz, 1, dat);
                rewind(dat);

                // this was tricky to figure out
                // I need the value of `j` after it has been
                // subtracted from h
                fseek(dat, (j - 2*h)*sajz, SEEK_SET);
                fread(&tjmh, sajz, 1, dat);
                rewind(dat);
            }

            fseek(dat, j*sajz, SEEK_SET);
            fwrite(&tx, sajz, 1, dat);
            rewind(dat);

        }
    }

    fclose(dat);

    return 0;
}


Can this be improved in any way?

Code snippet to generate the file with numbers 2.0, 3.0, 10.0, 4.0, 1.0, 7.0, 9.0, 5.0, 6.0, 8.0

```
int main()
{
FILE *dat;

dat = fopen("FLOATS.DAT", "wb");

Solution

Don't Repeat Yourself (DRY)

fseek(dat, i*sajz, SEEK_SET);
            fread(&tx, sajz, 1, dat);
            rewind(dat);


You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the fwrite version). Make a function:

void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
    fseek(source, position * size, SEEK_SET);
    fread(datum, size, 1, source);
    rewind(source);
}


I made datum a void * because that's what fread takes. This way you don't have to write a new function if you change the datum type.

Now you can rewrite the original section (and the other four) like

read_at_position(dat, i, sajz, &tx);


If you're worried about the overhead, you could create a macro instead

#define READ_AT_POSITION(source, position, size, datum) \
    do {\
        fseek(source, (position) * (size), SEEK_SET);\
        fread(datum, size, 1, source);\
        rewind(source);\
    } while (0)


which you'd use as

READ_AT_POSITION(dat, i, sajz, &tx);


And the compiler will turn it into your original version.

Note that a good compiler will probably do that with the function as well.

Better names

The only way that I know what sajz and dat do is from looking at their declarations. Good names should balance being easy to write (e.g. size) with being descriptive (size_of_each_datum or size_of_float). Perhaps datum_size instead of sajz.

Similar problem with tjmh and tx. Neither name tells me anything about what they do.

Consider changing h to gap_size. Yes, h is a standard name in a shellsort (as per Wikipedia), but will most people remember that if they see it?

Reusability

Rather than writing this directly into main consider making a function. E.g.

void sort_floats_in_file(FILE *source, size_t datum_size);


You could generalize this by passing in a comparison function.

void sort_in_file(FILE *source, size_t datum_size, int (*compar)(const void *, const void*));


However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.

Robustness

You say

size_t sajz = sizeof(float); // for convenience


And later

float tjmh, tx;


Consider flipping the order around and saying

float tjmh, tx;
    size_t sajz = sizeof tjmh; // for convenience


Then if you change the data type of tjmh and tx, sajz will change to match automatically. As is, it would be easy to forget and only change in the one place.

Similarly,

float a;
    if (dat)
    {
        a = 2.00;
        fwrite(&a, sizeof(float), 1, dat);


Consider rewriting this as

if (dat)
    {
        test_write(2.00, dat);


with

void test_write(float a, FILE *dat)
{
    fwrite(&a, sizeof a, 1, dat);
}


Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.

Code Snippets

fseek(dat, i*sajz, SEEK_SET);
            fread(&tx, sajz, 1, dat);
            rewind(dat);
void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
    fseek(source, position * size, SEEK_SET);
    fread(datum, size, 1, source);
    rewind(source);
}
read_at_position(dat, i, sajz, &tx);
#define READ_AT_POSITION(source, position, size, datum) \
    do {\
        fseek(source, (position) * (size), SEEK_SET);\
        fread(datum, size, 1, source);\
        rewind(source);\
    } while (0)
READ_AT_POSITION(dat, i, sajz, &tx);

Context

StackExchange Code Review Q#135241, answer score: 8

Revisions (0)

No revisions yet.