Simplify regular expression? (Converting Unicode fractions to TeX)

Submitted by: @import:stackexchange-codereview
expression regular unicode simplify tex converting fractions

Problem

Background

I'm converting Unicode text to TeX for typesetting. In the input, I'm allowing simple fractions like ½ and ⅔ using single Unicode characters and complex fractions like ¹²³/₄₅₆ using superscripted and subscripted numerals. First I convert the simple fractions (½ becomes \frac{1}{2}) and superscripted and subscripted numerals (¹²³ becomes $^{123}$ and ₄₅₆ becomes $_{456}$) using a character lookup table, then I make a second pass to collapse runs of numerals and combine numerator and denominator into a fraction (so ¹²³/₄₅₆ then becomes \frac{123}{456}). Finally, I make a third pass to insert a ⅙-em thin space between an integer and a fraction (so that, for example, 2¼ displays as 2 ¼).
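The first pass is not shown in the question, so here is only a minimal sketch of what a character lookup table could look like. The table contents, the helper name `first_pass`, and the per-character output format (`$^1$`, `$_4$`, inferred from the patterns in the code further down) are assumptions, not the author's actual code.

```perl
use strict;
use warnings;

# Hypothetical lookup table (not the author's): Unicode character => TeX.
my %tex = (
    "\x{00BC}" => '\frac{1}{4}',    # VULGAR FRACTION ONE QUARTER
    "\x{00BD}" => '\frac{1}{2}',    # VULGAR FRACTION ONE HALF
    "\x{2154}" => '\frac{2}{3}',    # VULGAR FRACTION TWO THIRDS
);

# Superscript digits are scattered across Unicode; subscript digits are contiguous.
my @sup_chars = map { chr($_) } 0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074 .. 0x2079;
my @sub_chars = map { chr($_) } 0x2080 .. 0x2089;
for my $d (0 .. 9) {
    $tex{ $sup_chars[$d] } = '$^' . $d . '$';
    $tex{ $sub_chars[$d] } = '$_' . $d . '$';
}

# Replace every character that has a table entry; leave all others untouched.
sub first_pass {
    my ($text) = @_;
    return join '', map { exists $tex{$_} ? $tex{$_} : $_ } split //, $text;
}
```

With this table, each superscripted or subscripted digit becomes its own little math segment, which is why the second pass has to collapse runs of them.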

Question(s)

Code works great, but I'm wondering how to simplify the regular expressions. There are three transformations.

  • In the first transformation, is there a way to avoid the alternation operator (|) and simply match on either \^ or \_? I don't see a way to do it using ([\^\_]) and \1.

  • Also in the first transformation, is there a way to avoid the nested substitutions?

  • Is there some completely other solution to this that would work even better? (By better I don't mean faster but easier to understand.)

Here are the guts:

```
# First collapse runs of superscripted and/or subscripted numerals.
$text =~ s{
(
(?: \$ \^ [0-9] \$ ){2,} # Match superscripted numerals
| (?: \$ \_ [0-9] \$ ){2,} # Match subscripted numerals
)
}{
my $x = $1;
$x =~ s{\$}{}g; # Remove '$'s
$x =~ s{(?<!^)[\^\_]}{}g; # Remove redundant '^'s and '_'s
$x =~ s{^(.)(.*)$}{\$${1}{$2}\$}; # Wrap in '{}' and replace '$'.
$x
}xeg;

# Now combine complete fractions.
$text =~ s{
\$ \^ \{? ([0-9]+) \}? \$ # $1 = numerator
(?: / | \x{2044} ) # Slash or Fraction Slash
\$ \_ \{? ([0-9]+) \}? \$ # $2 = denominator
}{
"\\frac{$1}{$2}"
}xg;
```

Solution

(Disclaimer: I don't know Perl, just regular expressions)

First question


is there a way to avoid the alternation operator (|) and simply match on either \^ or \_?

(?: \$ \^ [0-9] \$ ){2,}         # Match superscripted numerals
  | (?: \$ \_ [0-9] \$ ){2,}         # Match subscripted numerals


Try doing this:

(?: \$ [_^] [0-9] \$ ){2,}         # Match (super|sub)scripted numerals


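(Edit: the regex above actually wouldn't work for this case, because it also matches runs that mix superscripts and subscripts, such as $^1$$_2$. See the comments on the original answer for another suggestion.)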

The item [_^] is a character class, the same kind of construct as [0-9]. Inside a character class most metacharacters lose their special meaning, so [\^\_] and [_^] match the same two characters and the escapes are optional. The one thing to watch is position: a ^ placed first, as in [^_], negates the class, so either escape it or put it after another character. Beyond that it is up to the developer (my preference is to escape only when necessary).

Here's some more information on character classes: http://www.regular-expressions.info/charclass.html
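For the ([\^\_]) and \1 idea from the question, here is a minimal sketch of one way it could work: capture the marker of the first numeral and require the same marker, via a backreference, on every following numeral. The helper names `wrap_run` and `collapse_runs` are made up for this illustration, and the sketch folds the nested substitutions into a small function instead of reproducing them.

```perl
use strict;
use warnings;

# Build "$^{123}$" or "$_{456}$" from the marker and the matched run.
sub wrap_run {
    my ($kind, $run) = @_;
    (my $digits = $run) =~ tr/0-9//cd;    # delete everything except the digits
    return '$' . $kind . '{' . $digits . '}$';
}

sub collapse_runs {
    my ($text) = @_;
    $text =~ s{
        (                             # group 1: the whole run
            \$ ([_^]) [0-9] \$        # first numeral; group 2 is the marker
            (?: \$ \2 [0-9] \$ )+     # further numerals with the same marker
        )
    }{ wrap_run($2, $1) }xeg;
    return $text;
}
```

Because the backreference pins the marker, a mixed run such as `$^2$$_1$` is left alone, while `$^1$$^2$$^3$/$_4$$_5$$_6$` becomes `$^{123}$/$_{456}$`.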

Second question


in the first transformation, is there a way to avoid the nested substitutions?

I assume you mean this:

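$x =~ s{\$}{}g;                      # Remove '$'s
$x =~ s{(?<!^)[\^\_]}{}g;            # Remove redundant '^'s and '_'s
$x =~ s{^(.)(.*)$}{\$${1}{$2}\$};    # Wrap in '{}' and replace '$'.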

With regular expressions, I've found that the most compact ones aren't necessarily the easiest ones to read. The way you have it split up here and documented seems helpful if it doesn't cause a performance hit.
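That said, you could probably combine the first two lines. One catch: in a single pass the '$'s are still in the string while the pattern scans it, so the ^ or _ to keep is the one right after the leading '$', not the first character of the string. The lookbehind has to account for that:

$x =~ s{\$|(?<!^\$)[_^]}{}g;         # Remove '$'s and redundant ^ and _ chars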
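A quick way to convince yourself that a combined substitution behaves like the two separate ones is to run both on the same input. The sketch below does that; the function names are invented for the comparison, and the one-pass pattern uses the lookbehind (?<!^\$) rather than (?<!^), since with (?<!^) the first marker sits at position 1 of the unmodified string and would be stripped as well.

```perl
use strict;
use warnings;

# The two separate substitutions from the question.
sub strip_two_steps {
    my ($x) = @_;
    $x =~ s{\$}{}g;              # remove the dollar signs
    $x =~ s{(?<!^)[\^\_]}{}g;    # remove every marker except the first character
    return $x;
}

# One-pass variant: a marker survives only directly after the leading dollar sign.
sub strip_one_pass {
    my ($x) = @_;
    $x =~ s{\$|(?<!^\$)[_^]}{}g;
    return $x;
}

# Print both results side by side for a superscript and a subscript run.
for my $run ('$^1$$^2$$^3$', '$_4$$_5$$_6$') {
    printf "%-14s %-6s %-6s\n", $run, strip_two_steps($run), strip_one_pass($run);
}
```

Both functions return `^123` for the first run and `_456` for the second, which is the form the final wrapping substitution expects.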

Third question


Is there some completely other solution to this that would work even better?

I can only think of one way to do part of this differently...
In the beginning, it looks like you can safely (maybe?) hunt down and kill ALL $$^ instances, e.g.:


This DeLorean DMC-12 runs on $^2$$^3$$^9$Pu and uses 1.21$\times$10$^9$ W.
This DeLorean DMC-12 runs on $^239$Pu and uses 1.21$\times$10$^9$ W.

Then you could search for instances of (?<=\$\^)([0-9]{2,})(?=\$) and wrap the matched digits with the {} characters:


This DeLorean DMC-12 runs on $^{239}$Pu and uses 1.21$\times$10$^9$ W.

You'll have to make the best judgement as to whether or not $$^ occurs validly another way, however, in which case this wouldn't help.

You could also write it as a humongous sed command if you're looking to spend a lot of time.
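The two steps of this alternative can be sketched as follows. This is only an illustration under the same caveat: it handles superscripts only, the function name `merge_superscripts` is made up, and it assumes $$^ never occurs for any other reason in the text.

```perl
use strict;
use warnings;

sub merge_superscripts {
    my ($text) = @_;
    # Step 1: drop every "$$^" so adjacent superscripts run together.
    $text =~ s/\$\$\^//g;
    # Step 2: put braces around any run of two or more digits that sits
    # between "$^" and the closing "$".
    $text =~ s/(?<=\$\^)([0-9]{2,})(?=\$)/{$1}/g;
    return $text;
}

print merge_superscripts(
    'This DeLorean DMC-12 runs on $^2$$^3$$^9$Pu and uses 1.21$\times$10$^9$ W.'
), "\n";
```

On the DeLorean line this yields `$^{239}$Pu` and leaves `$^9$` as it is. An input where a different math segment is immediately followed by a superscript, for example `$\times$$^2$`, would be damaged by step 1, which is exactly the case to check for before using this approach.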


Code Snippets

(?: \$ \^ [0-9] \$ ){2,}         # Match superscripted numerals
  | (?: \$ \_ [0-9] \$ ){2,}         # Match subscripted numerals
(?: \$ [_^] [0-9] \$ ){2,}         # Match (super|sub)scripted numerals
$x =~ s{\$}{}g;                      # Remove '$'s
$x =~ s{(?<!^)[\^\_]}{}g;            # Remove redundant '^'s and '_'s
$x =~ s{^(.)(.*)$}{\$${1}{$2}\$};    # Wrap in '{}' and replace '$'.
$x =~ s{\$|(?<!^\$)[_^]}{}g;         # Remove '$'s and redundant ^ and _ chars

Context

StackExchange Code Review Q#8510, answer score: 4
