patternMinor
Canonicalizing a large set of addresses using many regex substitutions
Viewed 0 times
substitutionscanonicalizinglargeaddressesusingmanyregexset
Problem
I have a script that is standardizing a large amount of data in the database. The standardization involves applying over 500 regular expressions to the data.
Here is some quick pseudocode:
Here is the relevant block of code in Perl:
An excerpt of
This is working code but the performance is abysmal. Currently the script has been running for 2.5hours and has written out 3.13 million records to the output file with about 7million to go.
Is this the best it can get? Is there another faster, or less slower way? Maybe writing each row to a file first and then run each regular expression on the whole file?
I would like to know if there is better way to implement this before I try the above mentioned alternative.
Here is some quick pseudocode:
Load the records from the database
for each record
for each regular expression ##predefined list of 500 regular expressions
apply regular expression to record
end;
write record to file
end;Here is the relevant block of code in Perl:
while(my @row = $queryHandle->fetchrow_array())
{
$accountKey = @row[0];
$addressLine1 = @row[1];
$addressLine2 = @row[2];
#first remove all special characters leaving only numbers and alphabets
$addressLine1 =~ s/[^A-Za-z0-9 ]//g;
$addressLine2 =~ s/[^A-Za-z0-9 ]//g;
for my $regexRef (@regexesList)
{
#now standardize the addresses
$addressLine1 =~ s/$regexRef->{pattern}/$regexRef->{output}/ig;
$addressLine2 =~ s/$regexRef->{pattern}/$regexRef->{output}/ig;
}
my $standardizedAddress = $addressLine1 . $addressLine2;
$standardizedAddress =~ s/\s+//g; #remove all white space
#Et_Run_Log("the normalized address is: $normalizedAddress");
print $dataFileHandle "${standardizedAddress}\n";
$rowCount++;
}An excerpt of
regexList:my %regexHash;
$regexHash{pattern} = "\bstr\b";
$regexHash{output} = "street";
#add regex hash to list
push(@regexesList, {%regexHash});
$regexHash{pattern} = "\brd\b";
$regexHash{output} = "road";
push(@regexesList, {%regexHash});This is working code but the performance is abysmal. Currently the script has been running for 2.5hours and has written out 3.13 million records to the output file with about 7million to go.
Is this the best it can get? Is there another faster, or less slower way? Maybe writing each row to a file first and then run each regular expression on the whole file?
I would like to know if there is better way to implement this before I try the above mentioned alternative.
Solution
Nested loops can be problematic, but you can build regex which can be applied once instead of running
and later inside while loop,
foreachmy %replace = (
str => { re => '\bstr\b', output => "street" },
rd => { re => '\brd\b', output => "road" },
);
my ($reCombined) = map qr/($_)/, join "|", map $_->{re}, values %replace;and later inside while loop,
# my $string = "Foo str bar rd foo";
$string =~ s/$reCombined/$replace{$1}{output}/g;Code Snippets
my %replace = (
str => { re => '\bstr\b', output => "street" },
rd => { re => '\brd\b', output => "road" },
);
my ($reCombined) = map qr/($_)/, join "|", map $_->{re}, values %replace;# my $string = "Foo str bar rd foo";
$string =~ s/$reCombined/$replace{$1}{output}/g;Context
StackExchange Code Review Q#56059, answer score: 2
Revisions (0)
No revisions yet.