HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Canonicalizing a large set of addresses using many regex substitutions

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
substitutionscanonicalizinglargeaddressesusingmanyregexset

Problem

I have a script that is standardizing a large amount of data in the database. The standardization involves applying over 500 regular expressions to the data.

Here is some quick pseudocode:

Load the records from the database
for each record
   for each regular expression ##predefined list of 500 regular expressions 
      apply regular expression to record
    end;
    write record to file
end;


Here is the relevant block of code in Perl:

while(my @row = $queryHandle->fetchrow_array())
{
    $accountKey = @row[0];
    $addressLine1 = @row[1];
    $addressLine2 = @row[2];

    #first remove all special characters leaving only numbers and alphabets
    $addressLine1 =~ s/[^A-Za-z0-9 ]//g;
    $addressLine2 =~ s/[^A-Za-z0-9 ]//g;

    for my $regexRef (@regexesList)
    {
        #now standardize the addresses
        $addressLine1 =~ s/$regexRef->{pattern}/$regexRef->{output}/ig;
        $addressLine2 =~ s/$regexRef->{pattern}/$regexRef->{output}/ig;
    }

    my $standardizedAddress = $addressLine1 . $addressLine2;
    $standardizedAddress =~ s/\s+//g; #remove all white space
    #Et_Run_Log("the normalized address is: $normalizedAddress");

    print $dataFileHandle "${standardizedAddress}\n";
    $rowCount++;
}


An excerpt of regexList:

my %regexHash;
$regexHash{pattern} = "\bstr\b";
$regexHash{output} = "street";

#add regex hash to list 
push(@regexesList, {%regexHash});

$regexHash{pattern} = "\brd\b";
$regexHash{output} = "road";
push(@regexesList, {%regexHash});


This is working code but the performance is abysmal. Currently the script has been running for 2.5hours and has written out 3.13 million records to the output file with about 7million to go.

Is this the best it can get? Is there another faster, or less slower way? Maybe writing each row to a file first and then run each regular expression on the whole file?

I would like to know if there is better way to implement this before I try the above mentioned alternative.

Solution

Nested loops can be problematic, but you can build regex which can be applied once instead of running foreach

my %replace = (
  str => { re => '\bstr\b', output => "street" },
  rd  => { re => '\brd\b', output => "road" },
);
my ($reCombined) = map qr/($_)/, join "|", map $_->{re}, values %replace;


and later inside while loop,

# my $string = "Foo str bar rd foo";
$string =~ s/$reCombined/$replace{$1}{output}/g;

Code Snippets

my %replace = (
  str => { re => '\bstr\b', output => "street" },
  rd  => { re => '\brd\b', output => "road" },
);
my ($reCombined) = map qr/($_)/, join "|", map $_->{re}, values %replace;
# my $string = "Foo str bar rd foo";
$string =~ s/$reCombined/$replace{$1}{output}/g;

Context

StackExchange Code Review Q#56059, answer score: 2

Revisions (0)

No revisions yet.