String escaper in C

Submitted by: @import:stackexchange-codereview
escaper string stackoverflow

Problem

As part of my stupidly large project Concaten, I'm writing a string escaper. It does most of the work of turning string tokens (like "hello world" or "buncha\n\n\n\n\n\nnewlines") into the string data that they represent. It's part of the token-to-object module, hence the tto_ prefix.

The code that calls the function below is intentionally not in the question, since it relies very heavily on some other components that deserve their own questions, but it's available here -- though keep in mind that my GitHub is subject to change.

A few important things, first:

  • ERROR is defined with typedef unsigned long long ERROR; everything that ends with _ERROR or _FAIL is of that type.

  • val_len is intentionally one more than the result strlen would give you; it's the amount of memory needed, not the result of strlen. This is a convention I have across my codebase, and I'm not going to change it.
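
To make the val_len convention concrete, here is a minimal sketch. The helper name copy_with_val_len is hypothetical and does not appear in the project; it only shows that val_len is the allocation size, terminating '\0' included.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the original code: copies `text` and
 * reports val_len, i.e. the number of bytes allocated (strlen + 1). */
char *copy_with_val_len(const char *text, size_t *val_len) {
    *val_len = strlen(text) + 1;       /* includes the terminating '\0' */
    char *val = malloc(*val_len);
    if (val != NULL) {
        memcpy(val, text, *val_len);   /* copies the '\0' as well */
    }
    return val;                        /* NULL if malloc failed */
}
```

For "hello world", strlen gives 11 and val_len comes out as 12.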



Code

```
#include
#include

const ERROR NO_ERROR = 0;
const ERROR TTO_WORDS_VALUELESS_FAIL = 9000;
const ERROR TTO_UNKNOWN_TYPE_FAIL = 9001;
const ERROR TTO_NOT_IMPLEMENTED_FAIL = 9002;
const ERROR TTO_CREATE_OBJ_FAIL = 9003;
const ERROR TTO_STRING_ESCAPE_FAIL = 9004;
const ERROR TTO_ESCAPE_END_FAIL = 9005;
const ERROR TTO_MALLOC_FAIL = 9006;
const ERROR TTO_ESCAPE_BAD_HEX_FAIL = 9007;
const ERROR TTO_INVALID_NUM_FAIL = 9008;

char tto_hexchar_to_val(const char c) {
    switch (c) {
        case '0': return 0;
        case '1': return 1;
        case '2': return 2;
        case '3': return 3;
        case '4': return 4;
        case '5': return 5;
        case '6': return 6;
        case '7': return 7;
        case '8': return 8;
        case '9': return 9;
        case 'a': case 'A': return 10;
        case 'b': case 'B': return 11;
        case 'c': case 'C': return 12;
        case 'd': case 'D': return 13;
        case 'e': case 'E': return 14;
        case 'f': case 'F': return 15;
        default: return 16;
    }
}

ERROR tto_escape_hex(const char **pos
```

Solution

I will only address the issue of performance.

- You might want to use strchr to find occurrences of \ and just memcpy everything in between. Both functions are very likely optimized with SSE or AVX in your C library.

- If you can (I suspect you cannot), don't allocate memory for each string separately. If you do, don't reallocate; it's probably not worth the overhead.

- To kill two birds with one stone, you can allocate an array in which you save the positions of each \ in the string. Then you allocate exactly as much memory as needed, and do the memcpy and the parsing of escape sequences.
  EDIT: To deal with escape sequences of variable length, you can parse the escape sequences as you scan the string for \s and store the results in another array, along with their positions. Then memcpy the runs of plain text and copy the parsed characters in individually.

- Preferably put the most common branch first, e.g. if (*pos != '\\'), although the branch predictor will probably alleviate the negative effects of the current ordering. You can also take a look at GCC's __builtin_expect and the likely / unlikely macros that are commonly built on it.

- In tto_escape_hex, save *pos into a local variable instead of using it directly. As written, you dereference a pointer on every access, which would be slow if your compiler didn't optimize it away. An extra variable is worth it, and if you have optimizations turned on (or maybe even if you don't), the compiler will probably keep the value in a register anyway.

- If you are serious about this, you can take inspiration from some reference-grade compilers such as GCC (although that one might be a bit too heavyweight).
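
The strchr / memcpy suggestion and the "allocate once, never reallocate" advice can be sketched roughly as follows. This is not the Concaten code: unescape, decode_escape and hex_val are hypothetical names, and the supported escapes (\n, \t, \\, \" and \xHH) are an assumption. Instead of recording backslash positions to get an exact size, the sketch uses a simpler variant: every escape is at least two characters and produces one byte, so strlen(src) + 1 is a safe upper bound for a single allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Decodes one hex digit; returns -1 if c is not a hex digit. */
static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes the escape starting at s (s[0] is the backslash). Stores the
 * resulting byte in *out and returns the number of input characters
 * consumed, or 0 if the escape is malformed or unknown. */
static size_t decode_escape(const char *s, char *out) {
    switch (s[1]) {
    case 'n':  *out = '\n'; return 2;
    case 't':  *out = '\t'; return 2;
    case '\\': *out = '\\'; return 2;
    case '"':  *out = '"';  return 2;
    case 'x': {
        int hi = hex_val(s[2]);
        if (hi < 0) return 0;          /* also stops at an early '\0' */
        int lo = hex_val(s[3]);
        if (lo < 0) return 0;
        *out = (char)(hi * 16 + lo);
        return 4;
    }
    default:   return 0;               /* includes a trailing lone backslash */
    }
}

/* Returns a newly allocated unescaped copy of src, or NULL on a bad
 * escape or allocation failure. *val_len receives the number of bytes
 * used, terminating '\0' included. */
char *unescape(const char *src, size_t *val_len) {
    char *dst = malloc(strlen(src) + 1);   /* upper bound, no realloc */
    if (dst == NULL) return NULL;

    char *out = dst;
    const char *p = src;
    const char *bs;
    while ((bs = strchr(p, '\\')) != NULL) {
        size_t run = (size_t)(bs - p);
        memcpy(out, p, run);               /* bulk-copy the plain text */
        out += run;
        size_t used = decode_escape(bs, out);
        if (used == 0) { free(dst); return NULL; }
        out += 1;
        p = bs + used;
    }
    size_t tail = strlen(p);
    memcpy(out, p, tail + 1);              /* rest of the text plus '\0' */
    *val_len = (size_t)(out - dst) + tail + 1;
    return dst;
}
```

Whether this beats a byte-by-byte loop depends on how long the runs between backslashes are, so it is worth measuring on real input before committing to it.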
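
The two micro-optimizations (common branch first, and reading *pos once into a local) might look like the sketch below. The function count_output_bytes is hypothetical and assumes, for brevity, that every escape is exactly two characters long. The likely / unlikely macros are defined here on top of GCC's __builtin_expect, with a no-op fallback for other compilers.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical helper: counts the bytes the unescaped string needs,
 * terminating '\0' included, treating every escape as two characters.
 * Stops early at a dangling backslash and leaves *pos pointing at it,
 * so the caller can detect the error by checking **pos. */
size_t count_output_bytes(const char **pos) {
    const char *p = *pos;          /* read through the double pointer once */
    size_t n = 0;
    while (*p != '\0') {
        if (likely(*p != '\\')) {  /* plain characters are the common case */
            p += 1;
        } else {
            if (p[1] == '\0') break;
            p += 2;
        }
        n += 1;
    }
    *pos = p;                      /* write the cursor back once */
    return n + 1;
}
```

With optimizations enabled, a modern compiler may well generate the same code without the hints, so treat both changes as something to benchmark.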

Context

StackExchange Code Review Q#159295, answer score: 4
