String escaper in C
Problem
As part of my stupidly large project Concaten, I'm writing a string escaper. It's the majority of the part that turns string tokens (like `"hello world"` or `"buncha\n\n\n\n\n\nnewlines"`) into the string data that they represent. It's part of the token-to-object module, hence the prefix.
The code that calls the function below is intentionally not in the question, since it relies very heavily on some other components that deserve their own questions, but it's available here, though keep in mind that my GitHub is subject to change.
A few important things, first:
- `ERROR` is defined with `typedef unsigned long long ERROR;`; everything that ends with `_ERROR` or `_FAIL` is of that type.
- `val_len` is intentionally one more than the result `strlen` would give you; it's the amount of memory needed, not the result of `strlen`. This is a convention I have across my codebase, and I'm not going to change it.
Code
```
#include
#include
const ERROR NO_ERROR = 0;
const ERROR TTO_WORDS_VALUELESS_FAIL = 9000;
const ERROR TTO_UNKNOWN_TYPE_FAIL = 9001;
const ERROR TTO_NOT_IMPLEMENTED_FAIL = 9002;
const ERROR TTO_CREATE_OBJ_FAIL = 9003;
const ERROR TTO_STRING_ESCAPE_FAIL = 9004;
const ERROR TTO_ESCAPE_END_FAIL = 9005;
const ERROR TTO_MALLOC_FAIL = 9006;
const ERROR TTO_ESCAPE_BAD_HEX_FAIL = 9007;
const ERROR TTO_INVALID_NUM_FAIL = 9008;
char tto_hexchar_to_val(const char c) {
    switch (c) {
        case '0': return 0;
        case '1': return 1;
        case '2': return 2;
        case '3': return 3;
        case '4': return 4;
        case '5': return 5;
        case '6': return 6;
        case '7': return 7;
        case '8': return 8;
        case '9': return 9;
        case 'a': case 'A': return 10;
        case 'b': case 'B': return 11;
        case 'c': case 'C': return 12;
        case 'd': case 'D': return 13;
        case 'e': case 'E': return 14;
        case 'f': case 'F': return 15;
        default: return 16;
    }
}
ERROR tto_escape_hex(const char **pos
```
Solution
I will only address the issue of performance.
- You might want to use `strchr` to find occurrences of `\` and just `memcpy` everything in between. It's highly probable that these are optimized with SSE or AVX.
- If you can (I suspect you cannot), don't allocate memory for each string separately, and if you do, don't reallocate; it's probably not worth the overhead.
- To kill two birds with one stone, you can allocate an array where you save the positions of `\` in the string. Then you allocate exactly as much memory as needed, and do the `memcpy` and parsing of escape sequences.
  EDIT: To deal with escape sequences of variable length, you can parse and store the escape sequences as you scan the string for `\`s. Store them in another array, along with their positions, and then `memcpy` the plain text and copy the parsed characters individually.
- Preferably put the most common branch first, e.g. `if (*pos != '\\')`, although the branch predictor will probably alleviate the negative effects of doing it the way you do it now. You can take a look at the GCC builtin `__builtin_expect` and the `likely` / `unlikely` macros that are commonly built on top of it.
- In function `tto_escape_hex`, save `*pos` into a variable instead of using it directly. The way you do it now, you dereference a pointer on every access. That, if your compiler didn't optimize it, would be slow. Allocating an extra variable is worth it, and if you have optimizations turned on (or maybe even if you don't), the compiler probably stores the value in a register anyway.
- If you are serious about this, you can take inspiration from some reference-grade compilers such as GCC (although that one might be a bit too heavyweight).
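The `strchr` / `memcpy` idea combined with a single exact allocation could look roughly like the sketch below. This is not the asker's code: the name `unescape_sketch`, the helpers `hex_val` and `decode_escape`, the supported escapes (`\n`, `\t`, `\\`, `\"` and two-digit `\xHH`) and the NULL-on-error convention are all assumptions made for illustration. It also rescans the string in a second pass instead of storing the backslash positions in an array, which is a simplification of the suggestion above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: value of a hex digit, or -1 if c is not one. */
static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes the escape starting at s (s[0] is the backslash). Stores the
 * resulting byte in *out and returns the number of source characters
 * consumed, or 0 if the escape is invalid. */
static size_t decode_escape(const char *s, char *out) {
    switch (s[1]) {
    case 'n':  *out = '\n'; return 2;
    case 't':  *out = '\t'; return 2;
    case '\\': *out = '\\'; return 2;
    case '"':  *out = '"';  return 2;
    case 'x': {
        int hi = hex_val(s[2]);
        int lo = hi < 0 ? -1 : hex_val(s[3]);  /* never reads past the NUL */
        if (lo < 0) return 0;
        *out = (char)(hi * 16 + lo);
        return 4;
    }
    default: return 0;  /* also covers a backslash at the end of input */
    }
}

/* Returns a malloc'd, NUL-terminated copy of src with escapes resolved, or
 * NULL on a bad escape or allocation failure. *val_len receives the number
 * of bytes needed including the terminator (the question's convention). */
char *unescape_sketch(const char *src, size_t *val_len) {
    /* Pass 1: validate and count, so one exact allocation suffices. */
    size_t need = 1;
    const char *p = src, *bs;
    char tmp;
    while ((bs = strchr(p, '\\')) != NULL) {
        size_t used = decode_escape(bs, &tmp);
        if (used == 0) return NULL;
        need += (size_t)(bs - p) + 1;  /* plain run plus one decoded byte */
        p = bs + used;
    }
    need += strlen(p);

    char *out = malloc(need);
    if (out == NULL) return NULL;

    /* Pass 2: memcpy each plain run, then write the decoded byte. */
    char *w = out;
    p = src;
    while ((bs = strchr(p, '\\')) != NULL) {
        memcpy(w, p, (size_t)(bs - p));
        w += bs - p;
        p = bs + decode_escape(bs, w);
        w++;
    }
    strcpy(w, p);  /* trailing plain run plus terminator */
    *val_len = need;
    return out;
}
```

Whether the second scan is cheaper than keeping an array of positions depends on how many escapes a typical string contains, so it is worth measuring both on real input.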
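The points about branch ordering and about caching `*pos` can be illustrated with a small sketch. The `LIKELY` / `UNLIKELY` macros are defined here on top of `__builtin_expect`, with a no-op fallback for other compilers. `parse_hex_byte` and `count_backslashes` are hypothetical stand-ins: the body of `tto_escape_hex` is not shown in the question, so only the `const char **pos` parameter is taken from it, and the return convention is an assumption.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical: reads two hex digits at *pos, stores the byte in *out and
 * advances *pos past them. Returns 0 on success, -1 on a bad digit. */
int parse_hex_byte(const char **pos, unsigned char *out) {
    const char *p = *pos;  /* dereference once, then work on the local */
    unsigned value = 0;
    for (int i = 0; i < 2; i++) {
        char c = p[i];
        unsigned d;
        if (c >= '0' && c <= '9')      d = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (unsigned)(c - 'a') + 10u;
        else if (c >= 'A' && c <= 'F') d = (unsigned)(c - 'A') + 10u;
        else return -1;    /* *pos is left unchanged on error */
        value = value * 16u + d;
    }
    *out = (unsigned char)value;
    *pos = p + 2;          /* write back once at the end */
    return 0;
}

/* Counts escape introducers, with the common "plain character" case
 * hinted as the likely branch. */
size_t count_backslashes(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (LIKELY(*s != '\\')) continue;
        n++;
    }
    return n;
}
```

The hint only affects how the compiler lays out the branches; as noted above, the hardware branch predictor may already make the difference negligible, so treat it as something to benchmark rather than assume.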
Context
StackExchange Code Review Q#159295, answer score: 4