HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

String compaction algorithm

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
algorithmcompactionstring

Problem

I've been working on improving my C coding. I wanted to write two string algorithms:

  • to trim white space from the beginning and end of a string



  • to compact a string



I have a few problems with this code that I was hoping to get some advice on:

  • The casts from const char to char . Is there a clean and idiomatic way of cleaning those up?



  • The string compaction algorithm requires extra storage and I couldn't think of a clean way to do the compaction in place. Is there a clean way of doing it in place?



I'm looking for a review on algorithmic correctness, code style, design, and any other improvements you can think of. My main concern is performance!

```
#include
#include
#include
#include
#include
#include

static const char TERMINATOR = '\0';

/* skips space in @param[str] depending on the provided direction
@param[in] -- str -- the string
@param[in] -- direction -- the direction to skip space in
1 -- forward direction
0 -- back direction
@return -- pointer to first non-space character in the string if
@param[direction] and @param[str] are valid or the original
string otherwise
*/
const char skip_space(const char str, int direction)
{
if (direction == 1) {
while (isspace(*str)) { ++str; }
} else if (direction == 0) {
while (isspace(*str)) { --str; }
}
return str;
}

const char ltrim_white(const char str, size_t len)
{
char end = (char )&str[len - 1];
end = (char *)skip_space(end, 0);
*++end = TERMINATOR;
return skip_space(str, 1);
}

const char trim_white(const char str)
{
return ltrim_white(str, strlen(str));
}

char compact_with_mem(char str)
{
str = (char *)skip_space(str, 1);

size_t len = strlen(str);
if (len == 0) {
char *c = malloc(1);
if (c) {
*c = TERMINATOR;
}
return c;
}

char back = (char )skip_space(&str[len - 1], 0);
*++back = T

Solution

The functionality you have implemented in your “compact” is generally known as white-space normalization.

Get rid of redundant functions

You have a lot of functions. While breaking complicated logic up into manageable bits is a good thing, there is also a point where it becomes too much and hurts maintainability rather than helping it.

For example, the only use of ltrim_white is in trim_white which consists of a single line calling that function. You could have equally well put all the code into trim_white directly. ltrim_white also is a misnomer because unlike its name suggests, it trims white-space from both sides, not just from the left.

const char *skip_space(const char *str, int direction)
{
    if (direction == 1) {
        while (isspace(*str)) { ++str; }
    } else if (direction == 0) {
        while (isspace(*str)) { --str; }
    }
    return str;
}


This function doesn't really buy you anything. Making a function call

s = (char *) skip_space(s, 1);


isn't really simpler than writing

while (isspace(*s)) ++s;


directly. And the latter form saves you the headache figuring out what that 1 as second parameter means. As a bonus, it saves a branch and avoids the const trouble. If you really want a dedicated function, provide skip_space_forward and skip_space_backward instead of introducing the second parameter.

Make auxiliary functions static

Those auxiliary functions that you determine are in fact useful and should be kept, should still be private to your implementation and not pollute the global name-space. Therefore, you should declare them static. As a nice bonus, this might also make your code run faster (for free).

Put the standard library to good use

The ` standard library header provides highly optimized functions for dealing with memory. Use them. In particular, memmove and strlen will come in handy.

Be
const-correct

As you have already realized yourself, the adding and removing of
const has become a total mess. You should only add const where you can, not at all cost. It is far better to have an internal pointer being not const even if you don't need to modify the pointed-to object than it is to repetitively cast const away. Once you start casting, const doesn't buy you anything in terms of code correctness.

If you have a function like
skip_space, that takes a pointer to a buffer and returns a pointer into that buffer without modifying the buffer, it is good to keep the buffer const. Unfortunately, when you call skip_space with a non-const pointer, what you get back is always const and you cannot use the pointer to write through it even though you know it would be safe. This can be solved without a cast while still keeping the buffer skip_space operates on const, though: use the returned pointer to compute an index and use that one!

char buffer[] = "the quick brown fox";         // mutable buffer
const char *const p = skip_space(buffer, 1);   // pointer to const
const ptrdiff_t i = p - buffer;                // just an integer
buffer[i] = 'X';                               // perfectly fine now


If you have a decent compiler, it should be able to optimize the additional arithmetic away while still being able to verify the
const-correctness of your code.

Don't over-generalize

NUL terminated character strings are so prevalent throughout C that you can safely assume the convention is here to stay. By spelling out
'\0', you make your code actually more readable than by using a constant like TERMINATOR.

If you really want generality, make the caller pass in a pair of pointers. (If you have ever used the C++
standard library, this will be very familiar to you.) Then your function also works with arrays that are not terminated at all. You can provide a convenience function that only accepts a single pointer (that has to point to a terminated array in this case, of course) and uses rawmemchr (if you have it) or strlen to find the end. Then it calls the generic two-argument version.

Actually, a more useful generalization would probably be to accept a function pointer to use in place of
isspace. Such indirect calls can have a performance penalty, though.

Consider pointer arithmetic

Instead of
&p[n - i] you can just write p + n - i. I find this much more readable.

Modify pointers with care

Your
trim_white function trims space from the beginning by simply adding an offset to the pointer. While this is super-fast, there is a big downside. If I call your function with a pointer that I obtained through malloc, I still have to keep the original pointer along with the pointer to the trimmed buffer. I need the former to pass it to free` when I no longer need the string. I'm sure I will confuse them all the time.

Instead of advancing the start pointer, you could actually move the bytes of the string to the beginning. While this might seem costly at first, note that it doesn't actua

Code Snippets

const char *skip_space(const char *str, int direction)
{
    if (direction == 1) {
        while (isspace(*str)) { ++str; }
    } else if (direction == 0) {
        while (isspace(*str)) { --str; }
    }
    return str;
}
s = (char *) skip_space(s, 1);
while (isspace(*s)) ++s;
char buffer[] = "the quick brown fox";         // mutable buffer
const char *const p = skip_space(buffer, 1);   // pointer to const
const ptrdiff_t i = p - buffer;                // just an integer
buffer[i] = 'X';                               // perfectly fine now
char *
trim_space(const char *const src, char *const dst)
{
  size_t first = 0;
  size_t last = strlen(src);
  size_t length = last - first;;
  if (length > 0)
    {
      while (isspace(src[first]))
        ++first;
      while ((last > first) && isspace(src[last - 1]))
        --last;
      length = last - first;
      memmove(dst, src + first, length);
    }
  dst[length] = '\0';
  return dst + length;
}

Context

StackExchange Code Review Q#118282, answer score: 3

Revisions (0)

No revisions yet.