HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Splitting a string into tokens in C

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
intostringsplittingtokens

Problem

I am trying to improve my skill in C, and this time by getting away from strtok. The following code splits the input string into a token_list, an array of strings.

#include 
#include 
#include 

#define MAX_LINE_LEN 256
#define NULL_TERM '\0'

/* A function that will print the content of the token_list */
int printcharlist(char **tok_list)
{
    char **pptr = tok_list;

    while (*pptr) {
        printf ("++ %s\n", *pptr);
        pptr++;
    }

    return 0;
}

/* returns the string between 2 pointers (not forgetting to append NULL_TERM */
char *getStr(char *start, char *end)
{
    int length=end-start;
    int i=0;
    char *result = (char *)malloc (sizeof (char) * (length+1));

    while (i < length) {
    result[i] = *start;
    start++;
    i++;
    }

    result[i] = NULL_TERM;

    return result;
}

int main (int argc, char ** argv)
{
    char *input = "something;in;the;way;she;moves";
    char reject = ';';

    char *start = input;
    char *end = input;

    char *token_list[MAX_LINE_LEN] = {NULL};
    int i=0;

    /* Is this predicate okay? I read that certain compiler might raise warnings here */
    while (end=strchr(start, (int)reject)) {

        token_list[i]=getStr(start, end);

        i++;
        end++;
        start=end;
    }

    /* It feels awkward to add one more token after the while loop. Is it standard? Should/can I avoid this? */ 
    token_list[i]=start;
    printcharlist(token_list);

    return EXIT_SUCCESS;
}


I am interested in any kind of feedback I can get. But if I absolutely must ask questions it would be:

  • Would something make you react badly if someone sent you this code in a patch/pull-request? If yes, what?



  • Consider this code is going to production. What are the edge cases that I miss and how can I solve them?

Solution

Module Design

The trouble with your design is that you need to know how it works before you can use it. This is not true for strtok (OK to a slight extent it is true for strtok).

When designing a software system the user should not need to know how it operates, but rather the interface to the system. Doing it this way has two distict advantages.

  • User does not need to learn how it works thus making it easier for them to use.



This means it is easier to distribute in binary form if you want.

  • User is less likely to break it by making assumptions that do not hold (as they have not read the code to understand it).



  • You can change how it works without affecting the user code.



Thus I would have provided this functionality with a couple of functions and some opaque data types that the user never needs to know or understand.

String Copying

while (i < length) {
result[i] = *start;
start++;
i++;
}


This is done in the std lib look up strncpy().

Also this can be written much more succinctly as:

for(loop = 0; loop < length; ++loop)
{
   result[loop] = start[loop];
}


Code Review

I don't think you need to define this.

#define NULL_TERM '\0'


Putting the literal '\0' into your code is very readable and people know what it means. The macro NULL_TERM may seem self explanatory but I would go look it up.

The function printcharlist assumes the list is NULL terminated.

int printcharlist(char **tok_list)


This is asking for people to do it incorrectly (and forget the NULL terminator). You even mention yourself that this is akward:

/* It feels awkward to add one more token after the while loop. Is it standard? Should/can I avoid this? */


I would change the interface to pass the number of values in the list:

int printcharlist(char **tok_list, size_t count)


Then you make people explicitly fill in the length and they can not accidentally pass you an array that will cause problems.

Alternative design

One advantage of strtok() of your function is that it does not parse the whole string. It only fetches the next token. Thus in a situation where you only need the first couple of tokens it is much more efficient.

I would change your algorithm so that it behaves in a similar manor. Retrieve one string (and store it in your data structure). Next call retrieves the next string etc (this would be a lot easier to change it you had made a modular design).

Questions


Would make you react badly if someone sent you this code in a patch/pull-request? If yes, what?

No. But I would want to see the test cases the validate it.


Consider this code is going to production. What are the edge cases that I miss and how can I solve them?

I would set up some unit tests.

  • ""



  • ";"



  • ";;"



  • ";;;"



  • "Some Text;"



  • "Some Text;;"



  • ";;SomeText;;"



  • "Some\nText"



  • "Some\nText;"



  • "Some\nText;;"



  • "Some\n;Text"



  • "Some\n;;Text"



  • "Some;\n;Text"



  • "Some;\n;;Text"



  • "Some;;\n;Text"



  • "Some;;\n;;Text"

Code Snippets

while (i < length) {
result[i] = *start;
start++;
i++;
}
for(loop = 0; loop < length; ++loop)
{
   result[loop] = start[loop];
}
#define NULL_TERM '\0'
int printcharlist(char **tok_list)
/* It feels awkward to add one more token after the while loop. Is it standard? Should/can I avoid this? */

Context

StackExchange Code Review Q#11981, answer score: 7

Revisions (0)

No revisions yet.