HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Wildcard string search with capture

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
wildcardsearchwithcapturestring

Problem

My goal was to create a function as performant as possible with results being:

  • True/False as to whether wild matches str.



  • Optional capture of certain patterns.



My purposes required:

  • Case sensitive match.



  • Pattern must match entire string to succeed.



  • Assumption that * and ? do not exist in str.



  • * matches zero or more characters.



  • ? matches exactly one character.



  • Lazy in this case meaning that going from left to right, would match as little as possible while still ensuring a match if such were possible.



It's been a challenge as many times I almost had it but then found an edge case that didn't work. This is basically globbing, but I need such to work for strings rather than files and I also hoped to make something much faster than using a regex which IMHO would be overkill for my needs.

I have tested the following to work on many patterns but I can't be certain yet if there isn't a permutation where it fails. I would be very interested if anyone can point out a situation where it fails the terms set above. I'm also interested to hear if there are any ways the code can be improved (made faster).

Hacks like (unsigned)(*wild-40) (hello:5)

  • hello *(d) -> (d:1)



  • (hello) (*)(d) -> (hello:5) (worl:4) (d:1)



  • (hello) ()r(*)? -> (hello:5) (wo:2) (l:1)



  • (hello) ()r()? -> (hello:5) (wo:2) (0)`



All the results are exactly what I was hoping given the parameters. Can it be done faster?

Solution

I see a number of things that may help you improve your code.

Use a switch instead of long if ...else chain

The pattern matching logic is much easier to see if a swtich statement is used instead of the long if...else chain. The default case can then be used solely for the non-pattern characters and a relatively short if...else chain would go there. On my machine, this also makes the code slightly faster.

Use longer, more meaningful names

Names like n and N are not very descriptive. Instead, you might use capturegroup and groupnum.

Use const where practical

The strings should not be altered by the search and the captured array c is simply a list of pointers into the string. For that reason, I would advise changing the signature of the function to this:

int globSearch(const char *pattern, const char *str, const char **capture)


Consider checking for NULL pointers

If speed is the primary consideration and if the caller checks, you don't need this, but I'd normally recommend having the called function check for NULL pointers before attempting to dereference them.

Differentiate between '(' and ')'

Right now, your program produces the same output for this pattern:

char *m = "(hello) (*)(?*?)?(?)?(*)(*)";


as this pattern:

char *m = ")hello) )*))?*?)?)?)?)*))*)";


Maybe that's OK, but it also produces the same for this pattern:

char *m = "(hello) (*((?*?)?)?)?(*)(*)";


That's a problem because it appears to those familiar with regular expressions that this is pattern contains nested capture patterns, which is not how the code actually interprets the string.

Consider changing the interface

Instead of an array of pointers, it might be more useful to have pointers and counts. Otherwise, any time the values are returned, you'll have to do things like this:

int z = c[x + 1] - c[x];
memcpy(tmp, c[x], z);
tmp[z] = 0;
printf("%s:%d\n", tmp, z);


Pass the size of the array of pointers

Any time you pass an array to a function, you should either use a sentinel value (such as the terminating '\0' character to mark the end of a string) or pass a size with the pointer. Otherwise, the called routine has no way to assure that it won't overrun the end of the array.

Avoid buffer overflow vulnerabilities

I realized it's just demo code, but any time you use memcpy it's important to make sure that the destination buffer has enough space for the number of bytes being copied. That's the case here with these particular fixed strings, but it's better generally to dynamically check that at runtime, especially when the length parameter is not a constant.

Eliminate the return 0; in main

For many years, the C standard has specified that the compiler must automatically generate the equivalent of return 0; at the end of main so there is no reason to specfically include it manually.

Code Snippets

int globSearch(const char *pattern, const char *str, const char **capture)
char *m = "(hello) (*)(?*?)?(?)?(*)(*)";
char *m = ")hello) )*))?*?)?)?)?)*))*)";
char *m = "(hello) (*((?*?)?)?)?(*)(*)";
int z = c[x + 1] - c[x];
memcpy(tmp, c[x], z);
tmp[z] = 0;
printf("%s:%d\n", tmp, z);

Context

StackExchange Code Review Q#93329, answer score: 3

Revisions (0)

No revisions yet.