HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Library to get entropy of files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
entropygetfileslibrary

Problem

This is a simple library I just finished writing. The main function is entropy(), which returns the entropy of a file. I'd really appreciate suggestions to improve it.

caos.c:

#include 
#include 
#include 
#include 
#include "caos.h"

static unsigned long long int wc(const char * file_to_size);
static int get_times(double table_of_chars[ASCII], const char * file);
static int calc_frequencies(double table_of_chars[ASCII], const char * file);

double entropy(const char * file)
{
  int iter;
  double entropy = (double)INIT;
  double table_of_chars[ASCII] = {INIT};
  if (get_times(table_of_chars, file) == EREAD)
  {
    return EREAD;
  }

  if (calc_frequencies(table_of_chars, file) == EREAD)
  {
    return EREAD;
  }

  for (iter = INIT; iter < ASCII; iter++)
  {
    if (table_of_chars[iter] != INIT)
    {
      entropy -= (table_of_chars[iter] * log2(table_of_chars[iter]));
    }
  }

  return fabs(entropy);
}

static int get_times(double table_of_chars[ASCII], const char * file)
{
  int getch;
  if (access(file, F_OK) != EREAD)
  {
    FILE * fget = fopen(file, READ);
    while ((getch = getc(fget)) != EOF)
    {
      table_of_chars[getch]++;
    }
    fclose(fget);
    return OK;
  }

  return EREAD;
}

static int calc_frequencies(double table_of_chars[ASCII], const char * file)
{
  int iter;
  unsigned long long int file_size = wc(file);
  if (file_size != (unsigned)EREAD)
 {
    for (iter = INIT; iter < ASCII; iter++)
    {
      if (table_of_chars[iter] != INIT)
      table_of_chars[iter] = (double)(table_of_chars[iter] / file_size);
    }

   return OK;
 }

 return EREAD;
}

static unsigned long long int wc(const char * file)
{
  if (access(file, F_OK) != EREAD)
  {
    struct stat st;
    if (stat(file, &st) == OK)
    {
      unsigned long long int bytes = st.st_size;
      return bytes;
    }
  }

  return EREAD;
}


caos.h:

```
#define ASCII 256
#define INIT 0

#define EREAD -1
#define READ "r"
#define

Solution

At first glance, I thought, WOW, Nice library!
Decent variable and function names, decent indentation. Some application of
good software engineering principle such as the Single Responsibility Principle and the KISS principle.

On further inspection I found a few issues:

Error handling Versus Error Reporting

The code does handle errors and return a status of the function, but there
is no error reporting. Standard C library functions such as access(),
fopen() and stat() return the actual error code in the global
variable errno. By including errno.h you have access to the error
codes and can report what is causing the error to the user/programmer
using the entropy library using perror() or stderror(). There is also
a discussion of this on stackoverflow.com. I know that if I was using
this library I would like to know whether a file is not found, or I
don't have read permission on the file.

A common practice when returning status of a function
such as entropy(), get_times() or calc_frequencies()
is to include stdlib.h and return either EXIT_SUCCESS or EXIT_FAILURE
instead of using OK or EREAD.

Direct Addressing Versus Indirect Addressing

One fairly common optimization in C is to use pointers (Direct Addressing)
instead of indexes (Indirect Addressing) when accessing the contents
of arrays. Direct Addressing is faster than Indirect Addressing. In some
cases using the highest level of optimization when compiling will convert
indirect addressing to direct addressing.

In most of the for loops in the library the code is using indirect
addressing:

for (iter = INIT; iter < ASCII; iter++)
    {
      if (table_of_chars[iter] != INIT)
      table_of_chars[iter] = (double)(table_of_chars[iter] / file_size);
    }


Performance may be improved by:

double *toc_p = &table_of_chars[0];
    for (iter = INIT; iter < ASCII; iter++, toc_p++)
    {
      if (*toc_p != INIT)
      *toc_p = (double)(*toc_p / file_size);
    }


An enhanced optimization would be:

static int calc_frequencies(double table_of_chars[ASCII], const char * file)
{
  unsigned long long int file_size = wc(file);
  if (file_size != (unsigned)EREAD)
  {
    register int iter;
    register double *toc_p = &table_of_chars[0];
    for (iter = INIT; iter < ASCII; iter++, toc_p++)
    {
      if (*toc_p != INIT)
      *toc_p = (double)(*toc_p / file_size);
    }

   return OK;
  }

  return EREAD;
}


Improve Type Safety

The library as written may insert floating point errors in the following
code:

while ((getch = getc(fget)) != EOF)
    {
      table_of_chars[getch]++;
    }


It is somewhat rare to see the increment a float or double value in C.

I would suggest that table_of_chars be a table of integers rather than
doubles. A second array of doubles could be returned by calc_frequencies().

Variable and Function Names

In some languages, such as Pascal the name of the function is
assigned the return value. This is not the case in C, I would
differentiate between a local variable name and the function
name such as using entropy_val as the return value for the
entropy() function. Some C compilers may have problems with
the current implementation.

The function get_times() might be better named as
count_all_character_instances()

On the Unix and Linux operating systems, wc stands for word count, and
returns the file size in lines, words and characters, a more appropriate
name for the function wc() might be get_file_size_in_bytes().

Variable and Function Names

In some languages, such as Pascal the name of the function is
assigned the return value. This is not the case in C, I would
differentiate between a local variable name and the function
name such as using entropy_val as the return value for the
entropy() function. Some C compilers may have problems with
the current implementation.

The function get_times() might be better named as
count_all_character_instances()

On the Unix and Linux operating systems, wc stands for word count, and
returns the file size in lines, words and characters, a more appropriate
name for the function wc() might be get_file_size_in_bytes().

DRY Code, or the Don't Repeat Yourself Principle

Another software engineering principle is the Don't Repeat Yourself Principle.
Most times when you have repetitive code create you should create a function
that contains just that code so that the code is reusable.

Two functions in the code get_times() and wc() contain

if (access(file, F_OK) != EREAD)
  {
  }


this is repetitive code.

In addition to repeating, system calls such as
access(), fopen(), stat() and fclose() are expensive time wise,
quite often programs or libraries will be swapped out during these
calls. In the execution of this library it is unlikely that the status of
access() will change between calls. For this library it would be
advisable to call these functions only once.

Portability

Back in the old days ASCII was the primary ch

Code Snippets

for (iter = INIT; iter < ASCII; iter++)
    {
      if (table_of_chars[iter] != INIT)
      table_of_chars[iter] = (double)(table_of_chars[iter] / file_size);
    }
double *toc_p = &table_of_chars[0];
    for (iter = INIT; iter < ASCII; iter++, toc_p++)
    {
      if (*toc_p != INIT)
      *toc_p = (double)(*toc_p / file_size);
    }
static int calc_frequencies(double table_of_chars[ASCII], const char * file)
{
  unsigned long long int file_size = wc(file);
  if (file_size != (unsigned)EREAD)
  {
    register int iter;
    register double *toc_p = &table_of_chars[0];
    for (iter = INIT; iter < ASCII; iter++, toc_p++)
    {
      if (*toc_p != INIT)
      *toc_p = (double)(*toc_p / file_size);
    }

   return OK;
  }

  return EREAD;
}
while ((getch = getc(fget)) != EOF)
    {
      table_of_chars[getch]++;
    }
if (access(file, F_OK) != EREAD)
  {
  }

Context

StackExchange Code Review Q#142343, answer score: 2

Revisions (0)

No revisions yet.