patterncMinor
Implementation of tokenizer object and finite state machine in C
Viewed 0 times
finitestatetokenizermachineandimplementationobject
Problem
I was given an assignment to create an implementation of a tokenizer object (by object, I mean a struct with associated functions that operate on instances of the struct). This tokenizer retrieves tokens from an input string given the following rules:
-
The input string cannot be changed
-
Tokens can only contain digits 0-9, hex digits, E and e, X and x, +, -, and periods(.).
-
Tokens are delimited by white space
-
If a character that is not a delimiter or a valid token character is encountered in the input string, a message containing its ASCII and hexadecimal representation is printed.
The program then uses the tokenizer object to retrieve tokens from an input string and classify them as zero, integer, float, octal, hexadecimal, or malformed. The classification is done using a finite state machine.
This is my first C program so I'd like pointers on where I could improve.
(And to ease any worries about cheating, this assignment was due September 29 and has already been turned in)
Code:
```
#include
#include
#include
typedef enum {
STATE_A, STATE_B, STATE_C,
STATE_D, STATE_E, STATE_F,
STATE_G, STATE_H, STATE_I,
STATE_J, STATE_ENTRY,
STATE_DECIMAL, STATE_FLOAT, STATE_OCTAL, STATE_HEXADECIMAL, STATE_MALFORMED, STATE_ZERO
} state_t;
static char *classifications[] = {"Decimal", "Float", "Octal", "Hexadecimal", "Malformed", "Zero"};
static state_t A(char ); static state_t B(char ); static state_t C(char *);
static state_t D(char ); static state_t E(char ); static state_t F(char *);
static state_t G(char ); static state_t H(char ); static state_t I(char *);
static state_t J(char ); static state_t entryState(char );
typedef state_t (state_func)(char *);
static state_func *STATE_FUNCTIONS[] = {A, B, C, D, E, F, G, H, I, J, entryState};
struct TokenizerT_ {
char *current;
};
typedef struct TokenizerT_ TokenizerT;
int isoctal(int c) {
return c >= '0' && c current &&
-
The input string cannot be changed
-
Tokens can only contain digits 0-9, hex digits, E and e, X and x, +, -, and periods(.).
-
Tokens are delimited by white space
-
If a character that is not a delimiter or a valid token character is encountered in the input string, a message containing its ASCII and hexadecimal representation is printed.
The program then uses the tokenizer object to retrieve tokens from an input string and classify them as zero, integer, float, octal, hexadecimal, or malformed. The classification is done using a finite state machine.
This is my first C program so I'd like pointers on where I could improve.
(And to ease any worries about cheating, this assignment was due September 29 and has already been turned in)
Code:
```
#include
#include
#include
typedef enum {
STATE_A, STATE_B, STATE_C,
STATE_D, STATE_E, STATE_F,
STATE_G, STATE_H, STATE_I,
STATE_J, STATE_ENTRY,
STATE_DECIMAL, STATE_FLOAT, STATE_OCTAL, STATE_HEXADECIMAL, STATE_MALFORMED, STATE_ZERO
} state_t;
static char *classifications[] = {"Decimal", "Float", "Octal", "Hexadecimal", "Malformed", "Zero"};
static state_t A(char ); static state_t B(char ); static state_t C(char *);
static state_t D(char ); static state_t E(char ); static state_t F(char *);
static state_t G(char ); static state_t H(char ); static state_t I(char *);
static state_t J(char ); static state_t entryState(char );
typedef state_t (state_func)(char *);
static state_func *STATE_FUNCTIONS[] = {A, B, C, D, E, F, G, H, I, J, entryState};
struct TokenizerT_ {
char *current;
};
typedef struct TokenizerT_ TokenizerT;
int isoctal(int c) {
return c >= '0' && c current &&
Solution
Don't Repeat Yourself (DRY)
So if
This flips the logic. If
I changed from
Don't allocate memory constantly
Since every
This might be correct. At least we are no longer guaranteed to be wrong.
Then later you have
Instead consider
This only attempts to allocate memory if it needs to do so.
This could be
This will shrink the array to fit if necessary.
Anything less than
Another alternative would be to scan the token string twice. The first time, you find out how long the token is. The second time, you allocate memory (once) and copy the token over.
TokenizerT *tokenizer = malloc(sizeof(TokenizerT));
if (!tokenizer) {
return NULL;
}
tokenizer->current = ts;
return tokenizer;So if
tokenizer is NULL, you return NULL. If not, you return tokenizer. Consider TokenizerT *tokenizer = malloc(sizeof *tokenizer);
if (tokenizer) {
tokenizer->current = ts;
}
return tokenizer;This flips the logic. If
tokenizer is not NULL, we can set members to values. And we return tokenizer whether it is NULL or not. This gives us the same results with fewer statements. I changed from
sizeof(TokenizerT) to sizeof *tokenizer so that even if you change the type, it will still have the right size. Don't allocate memory constantly
char *token = malloc(1);Since every
token string is at least one character plus a null, this will never be right. size_t capacity = CHUNK_SIZE;
char *token = malloc(capacity);This might be correct. At least we are no longer guaranteed to be wrong.
Then later you have
size++;
token = realloc(token, size);
if (!token) {
failedToAllocateMemory();
}Instead consider
size++;
if (size >= capacity) {
capacity += CHUNK_SIZE;
token = realloc(token, capacity);
if (!token) {
failedToAllocateMemory();
}
}This only attempts to allocate memory if it needs to do so.
token = realloc(token, size + 1);This could be
if (size + 1 != capacity) {
token = realloc(token, size + 1);
}This will shrink the array to fit if necessary.
Anything less than
CHUNK_SIZE, this version does two allocations. The original always did two, so this is never worse. At CHUNK_SIZE, this actually only does one allocation, because it doesn't need to shrink the array afterwards. And of course, the original does size + 2 allocations. Another alternative would be to scan the token string twice. The first time, you find out how long the token is. The second time, you allocate memory (once) and copy the token over.
Code Snippets
TokenizerT *tokenizer = malloc(sizeof(TokenizerT));
if (!tokenizer) {
return NULL;
}
tokenizer->current = ts;
return tokenizer;TokenizerT *tokenizer = malloc(sizeof *tokenizer);
if (tokenizer) {
tokenizer->current = ts;
}
return tokenizer;char *token = malloc(1);size_t capacity = CHUNK_SIZE;
char *token = malloc(capacity);size++;
token = realloc(token, size);
if (!token) {
failedToAllocateMemory();
}Context
StackExchange Code Review Q#143065, answer score: 6
Revisions (0)
No revisions yet.