patterncMinor
Library to get entropy of files
Viewed 0 times
entropygetfileslibrary
Problem
This is a simple library I just finished writing. The main function is
caos.c:
caos.h:
```
#define ASCII 256
#define INIT 0
#define EREAD -1
#define READ "r"
#define
entropy(), which returns the entropy of a file. I'd really appreciate suggestions to improve it.caos.c:
#include
#include
#include
#include
#include "caos.h"
static unsigned long long int wc(const char * file_to_size);
static int get_times(double table_of_chars[ASCII], const char * file);
static int calc_frequencies(double table_of_chars[ASCII], const char * file);
double entropy(const char * file)
{
int iter;
double entropy = (double)INIT;
double table_of_chars[ASCII] = {INIT};
if (get_times(table_of_chars, file) == EREAD)
{
return EREAD;
}
if (calc_frequencies(table_of_chars, file) == EREAD)
{
return EREAD;
}
for (iter = INIT; iter < ASCII; iter++)
{
if (table_of_chars[iter] != INIT)
{
entropy -= (table_of_chars[iter] * log2(table_of_chars[iter]));
}
}
return fabs(entropy);
}
static int get_times(double table_of_chars[ASCII], const char * file)
{
int getch;
if (access(file, F_OK) != EREAD)
{
FILE * fget = fopen(file, READ);
while ((getch = getc(fget)) != EOF)
{
table_of_chars[getch]++;
}
fclose(fget);
return OK;
}
return EREAD;
}
static int calc_frequencies(double table_of_chars[ASCII], const char * file)
{
int iter;
unsigned long long int file_size = wc(file);
if (file_size != (unsigned)EREAD)
{
for (iter = INIT; iter < ASCII; iter++)
{
if (table_of_chars[iter] != INIT)
table_of_chars[iter] = (double)(table_of_chars[iter] / file_size);
}
return OK;
}
return EREAD;
}
static unsigned long long int wc(const char * file)
{
if (access(file, F_OK) != EREAD)
{
struct stat st;
if (stat(file, &st) == OK)
{
unsigned long long int bytes = st.st_size;
return bytes;
}
}
return EREAD;
}caos.h:
```
#define ASCII 256
#define INIT 0
#define EREAD -1
#define READ "r"
#define
Solution
At first glance, I thought, WOW, Nice library!
Decent variable and function names, decent indentation. Some application of
good software engineering principle such as the Single Responsibility Principle and the KISS principle.
On further inspection I found a few issues:
Error handling Versus Error Reporting
The code does handle errors and return a status of the function, but there
is no error reporting. Standard C library functions such as
variable errno. By including errno.h you have access to the error
codes and can report what is causing the error to the user/programmer
using the entropy library using perror() or stderror(). There is also
a discussion of this on stackoverflow.com. I know that if I was using
this library I would like to know whether a file is not found, or I
don't have read permission on the file.
A common practice when returning status of a function
such as
is to include stdlib.h and return either EXIT_SUCCESS or EXIT_FAILURE
instead of using
Direct Addressing Versus Indirect Addressing
One fairly common optimization in C is to use pointers (Direct Addressing)
instead of indexes (Indirect Addressing) when accessing the contents
of arrays. Direct Addressing is faster than Indirect Addressing. In some
cases using the highest level of optimization when compiling will convert
indirect addressing to direct addressing.
In most of the for loops in the library the code is using indirect
addressing:
Performance may be improved by:
An enhanced optimization would be:
Improve Type Safety
The library as written may insert floating point errors in the following
code:
It is somewhat rare to see the increment a float or double value in C.
I would suggest that table_of_chars be a table of integers rather than
doubles. A second array of doubles could be returned by
Variable and Function Names
In some languages, such as Pascal the name of the function is
assigned the return value. This is not the case in C, I would
differentiate between a local variable name and the function
name such as using entropy_val as the return value for the
entropy() function. Some C compilers may have problems with
the current implementation.
The function
On the Unix and Linux operating systems, wc stands for word count, and
returns the file size in lines, words and characters, a more appropriate
name for the function
Variable and Function Names
In some languages, such as Pascal the name of the function is
assigned the return value. This is not the case in C, I would
differentiate between a local variable name and the function
name such as using entropy_val as the return value for the
entropy() function. Some C compilers may have problems with
the current implementation.
The function
On the Unix and Linux operating systems, wc stands for word count, and
returns the file size in lines, words and characters, a more appropriate
name for the function
DRY Code, or the Don't Repeat Yourself Principle
Another software engineering principle is the Don't Repeat Yourself Principle.
Most times when you have repetitive code create you should create a function
that contains just that code so that the code is reusable.
Two functions in the code
this is repetitive code.
In addition to repeating, system calls such as
quite often programs or libraries will be swapped out during these
calls. In the execution of this library it is unlikely that the status of
advisable to call these functions only once.
Portability
Back in the old days ASCII was the primary ch
Decent variable and function names, decent indentation. Some application of
good software engineering principle such as the Single Responsibility Principle and the KISS principle.
On further inspection I found a few issues:
Error handling Versus Error Reporting
The code does handle errors and return a status of the function, but there
is no error reporting. Standard C library functions such as
access(), fopen() and stat() return the actual error code in the globalvariable errno. By including errno.h you have access to the error
codes and can report what is causing the error to the user/programmer
using the entropy library using perror() or stderror(). There is also
a discussion of this on stackoverflow.com. I know that if I was using
this library I would like to know whether a file is not found, or I
don't have read permission on the file.
A common practice when returning status of a function
such as
entropy(), get_times() or calc_frequencies()is to include stdlib.h and return either EXIT_SUCCESS or EXIT_FAILURE
instead of using
OK or EREAD.Direct Addressing Versus Indirect Addressing
One fairly common optimization in C is to use pointers (Direct Addressing)
instead of indexes (Indirect Addressing) when accessing the contents
of arrays. Direct Addressing is faster than Indirect Addressing. In some
cases using the highest level of optimization when compiling will convert
indirect addressing to direct addressing.
In most of the for loops in the library the code is using indirect
addressing:
for (iter = INIT; iter < ASCII; iter++)
{
if (table_of_chars[iter] != INIT)
table_of_chars[iter] = (double)(table_of_chars[iter] / file_size);
}Performance may be improved by:
double *toc_p = &table_of_chars[0];
for (iter = INIT; iter < ASCII; iter++, toc_p++)
{
if (*toc_p != INIT)
*toc_p = (double)(*toc_p / file_size);
}An enhanced optimization would be:
static int calc_frequencies(double table_of_chars[ASCII], const char * file)
{
unsigned long long int file_size = wc(file);
if (file_size != (unsigned)EREAD)
{
register int iter;
register double *toc_p = &table_of_chars[0];
for (iter = INIT; iter < ASCII; iter++, toc_p++)
{
if (*toc_p != INIT)
*toc_p = (double)(*toc_p / file_size);
}
return OK;
}
return EREAD;
}Improve Type Safety
The library as written may insert floating point errors in the following
code:
while ((getch = getc(fget)) != EOF)
{
table_of_chars[getch]++;
}It is somewhat rare to see the increment a float or double value in C.
I would suggest that table_of_chars be a table of integers rather than
doubles. A second array of doubles could be returned by
calc_frequencies().Variable and Function Names
In some languages, such as Pascal the name of the function is
assigned the return value. This is not the case in C, I would
differentiate between a local variable name and the function
name such as using entropy_val as the return value for the
entropy() function. Some C compilers may have problems with
the current implementation.
The function
get_times() might be better named ascount_all_character_instances()On the Unix and Linux operating systems, wc stands for word count, and
returns the file size in lines, words and characters, a more appropriate
name for the function
wc() might be get_file_size_in_bytes().Variable and Function Names
In some languages, such as Pascal the name of the function is
assigned the return value. This is not the case in C, I would
differentiate between a local variable name and the function
name such as using entropy_val as the return value for the
entropy() function. Some C compilers may have problems with
the current implementation.
The function
get_times() might be better named ascount_all_character_instances()On the Unix and Linux operating systems, wc stands for word count, and
returns the file size in lines, words and characters, a more appropriate
name for the function
wc() might be get_file_size_in_bytes().DRY Code, or the Don't Repeat Yourself Principle
Another software engineering principle is the Don't Repeat Yourself Principle.
Most times when you have repetitive code create you should create a function
that contains just that code so that the code is reusable.
Two functions in the code
get_times() and wc() contain if (access(file, F_OK) != EREAD)
{
}this is repetitive code.
In addition to repeating, system calls such as
access(), fopen(), stat() and fclose() are expensive time wise,quite often programs or libraries will be swapped out during these
calls. In the execution of this library it is unlikely that the status of
access() will change between calls. For this library it would beadvisable to call these functions only once.
Portability
Back in the old days ASCII was the primary ch
Code Snippets
for (iter = INIT; iter < ASCII; iter++)
{
if (table_of_chars[iter] != INIT)
table_of_chars[iter] = (double)(table_of_chars[iter] / file_size);
}double *toc_p = &table_of_chars[0];
for (iter = INIT; iter < ASCII; iter++, toc_p++)
{
if (*toc_p != INIT)
*toc_p = (double)(*toc_p / file_size);
}static int calc_frequencies(double table_of_chars[ASCII], const char * file)
{
unsigned long long int file_size = wc(file);
if (file_size != (unsigned)EREAD)
{
register int iter;
register double *toc_p = &table_of_chars[0];
for (iter = INIT; iter < ASCII; iter++, toc_p++)
{
if (*toc_p != INIT)
*toc_p = (double)(*toc_p / file_size);
}
return OK;
}
return EREAD;
}while ((getch = getc(fget)) != EOF)
{
table_of_chars[getch]++;
}if (access(file, F_OK) != EREAD)
{
}Context
StackExchange Code Review Q#142343, answer score: 2
Revisions (0)
No revisions yet.