HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Count words in a C string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
countwordsstring

Problem

I have a string that must be splitted into an array of strings so using strtok makes perfect sense but before that I want to allocate the array of strings and thus I count how many words in it.

static inline int countWords(char *str)
{
        int count = 0;
        char *buf = str;

        if(isgraph(*buf) > 0)//checking for the first word
                count++;
        while(*buf != '\0')
        {
                if(isblank(*buf) > 0){//reached a blank
                        while(*buf == ' ' || *buf == '\t')//keep ignoring them
                                buf++;
                        if(*buf == '\0')//return if string is over
                                return count;
                        count++;//no more blank and string didn't end=new word
                }
                buf++;
        }
        return count;
}


Am I doing this right? I've tested all weird cases that can come to my mind and it seems to work; but is there a faster way to do this? I am mainly concerned about the performance of the above code.

Solution

Code comments:

  • The parameter to the function should be const - "const correctness".



  • Don't use the variable name "buf" for a variable that is not a memory buffer of some sort.



  • You have 3 different ways of checking for spaces in your code, you need to check for them in a consistent manner. Either use !isgraph which is the same as checking for ' '. Or isblank, which is the same as checking for ' ' and '\t'. Or possibly isspace, which checks for any form of white space character. I'll use isblank in my suggestion, since I don't know the nature of the string input.



  • Your code is a bit hard to follow, with several nested loops and if statements.



When I spontaneously rewrite your code to improve readability, I end up with this:

#include 

static int countWords (const char* str)
{
  int count = 0;

  while (*str != '\0')
  {
    while (*str != '\0' && isblank(*str)) // remove all spaces between words
    {
      str++;
    }
    if(*str != '\0')
    {
      count++;
    }

    while (*str != '\0' && !isblank(*str)) // loop past the found word
    {
      str++;
    }
  }

  return count;
}


This code is very easy to understand. However, you asked for optimization and my code doesn't look all that effective; it is possibly slower than the original. Particularly, we have multiple checks for '\0' all over the place.

Note that any truly meaningful optimization requires that 1) we have actually encountered and measured a real performance problem, 2) we have noticed that the compiler is doing a poor job at optimizing that particular problematic code, and 3) we have fairly good knowledge of the target CPU and hardware. It a bad idea to manually optimize code if those 3 above conditions are not met.

Still, I'll attempt a manual optimization of this, since it was requested. It may or may not be more effective. Branch prediction may be a more serious concern than the number of comparisons executed, for example. For what it is worth, here you go:

#include 
#include 

static inline int countWords (const char* str)
{
  int count = 0;
  bool look_for_data = true;

  while (*str != '\0')
  {
    if( (look_for_data ^ isblank(*str)) != 0) // is it a character of interest?
    {
      count += (int)look_for_data; // increase count if we are looking for data
      look_for_data = !look_for_data;
    }
    str++;
  }

  return count;
}


With this optimization we have managed to reduce the number of instructions and the number of branches both. But as you can tell, the code now turned quite obscure.

Explanation:
The fundament of this code is that one lap in the loop corresponds to one character in the input string. The bool variable keeps track of whether the code is currently looking for spaces or non-spaces.

We are only interested in the places in the string where we go from looking for spaces to looking for data, or vice versa. When that happens, we should start looking for data if we were looking for spaces, and the other way around. Also, in the case where we go from spaces to data, we should increase the word counter.

The boolean logic truth table is:

look_for_data  isblank  of interest  comment
false          false    false        we are at symbols and found another symbol
false          true     true         we are at symbols and found a space
true           true     false        we are at spaces and found another space
true           false    true         we are at spaces and found a symbol


We can see that this happens to be the truth table for logical XOR.

Code Snippets

#include <ctype.h>

static int countWords (const char* str)
{
  int count = 0;

  while (*str != '\0')
  {
    while (*str != '\0' && isblank(*str)) // remove all spaces between words
    {
      str++;
    }
    if(*str != '\0')
    {
      count++;
    }

    while (*str != '\0' && !isblank(*str)) // loop past the found word
    {
      str++;
    }
  }

  return count;
}
#include <ctype.h>
#include <stdbool.h>

static inline int countWords (const char* str)
{
  int count = 0;
  bool look_for_data = true;

  while (*str != '\0')
  {
    if( (look_for_data ^ isblank(*str)) != 0) // is it a character of interest?
    {
      count += (int)look_for_data; // increase count if we are looking for data
      look_for_data = !look_for_data;
    }
    str++;
  }

  return count;
}
look_for_data  isblank  of interest  comment
false          false    false        we are at symbols and found another symbol
false          true     true         we are at symbols and found a space
true           true     false        we are at spaces and found another space
true           false    true         we are at spaces and found a symbol

Context

StackExchange Code Review Q#19787, answer score: 8

Revisions (0)

No revisions yet.