Perl script to match case law references
Problem
I am very new to Perl and decided to work on a simple script that could solve a problem I encounter in my day to day work. The purpose of the code that follows is to search over a body of text and extract English case law publication references. There are quite a few flavours of reference style, so I have several expressions looking for different reference structures.
I'm very much at the "hooray, I've managed to make it work" stage on the Perl learning curve, but know enough to recognise that this code is pretty hideous.
There are two main areas I'm aiming to improve:
- Getting the source text from a file, rather than plonking it directly into the code.
- Getting the output to write to an output.txt file.
One example of obvious poor practice is that I have not used `strict` or `warnings`: the interpreter didn't like various aspects of the code. I'd be really grateful for some feedback on this very modest first attempt to use Perl in an applied way. It is pretty hideous and I'm sure there are endless ways in which this could be achieved more elegantly.
The Code
```
#!/usr/bin/perl
# Paste text to review below
$search_text = <<EOD;
In Salomon v A Salomon and Co Ltd [1897] AC 22, the House of Lords held that these principles applied as much to a company that was wholly owned and controlled by one man as to any other company. In Macaura v Northern Assurance Co Ltd [1925] AC 619, the House of Lords held that the sole owner and controller of a company did not even have an insurable interest in property of the company, although economically he was liable to suffer by its destruction. Lord Buckmaster, at pp 626-627 said:
EOD
print "\n-----------------------------\n";
print "Case References Found in Text";
print "\n-----------------------------\n";
# Find NCits
print "\n* NCits...\n\n";
while ($search_text =~ m/((\(|\[)\d{4}(\)|\]))(\s+((EWHC)|(EWHC\s+Admin)|(CAT)|(EWCA)|(EWCA\s+Civ)|(EWCA\s+Crim)|(EWCOP)|(EWFC)|(EWFC\s+B)|(EWPCC)|(UKHL)|(UKIAT)|(UKP
```
Solution
Adding `use warnings;` at the top of the script doesn't emit any warnings. Adding `use strict;` makes the variable `$search_text` undeclared, but that is easy to fix by adding `my` before its first use.
The easiest way to read the input from a file is to specify the filename as a parameter to the script and read from it with the diamond operator `<>`, looping over the input lines:
```
while (my $search_text = <>) {
```
Also, there's a lot of repetition in your code. To deduplicate it, you can store the regular expressions in a hash keyed by the type of the reference. It shortens your script to
```
#!/usr/bin/perl
use warnings;
use strict;

# Regular expressions keyed by the type of reference they match.
my %reference = (
    NCits => [ qr/((\(|\[)\d{4}(\)|\]))(\s+((EWHC)|(EWHC\s+Admin)|(CAT)|(EWCA)|(EWCA\s+Civ)|(EWCA\s+Crim)|(EWCOP)|(EWFC)|(EWFC\s+B)|(EWPCC)|(UKHL)|(UKIAT)|(UKPC)|(UKSC)|(CSOH)|(CSIH)|(NICA)|(IESC)|(IECCA)|(IECA)|(IEHC)|(UKUT))\s+\d+)/ ],
    WLR => [ qr/((\(|\[)\d{4}(\)|\]))(\s+\d+\s+((WLR))\s+\d+)/ ],
    'Appeal Cases' => [ qr/((\(|\[)\d{4}(\)|\]))(\s+\d+\s+((AC))\s+\d+)/,
                        qr/((\(|\[)\d{4}(\)|\]))(\s+((AC))\s+\d+)/ ],
    "Queen's Bench Cases" => [ qr/((\(|\[)\d{4}(\)|\]))(\s+\d+\s+((QB))\s+\d+)/,
                               qr/((\(|\[)\d{4}(\)|\]))(\s+((QB))\s+\d+)/ ],
    'External references' => [ qr/((\(|\[)\d{4}(\)|\])){0,1}(\s*\d+\s+((TLR)|(TLR\s+\(Pt\s+1\))|(TLR\s+\(Pt\s+2\))|(LGR)|(Cr\s*App\s*R)|(Cr\s*App\s*R\s*\(S\))|(Ll\s*L\s*Rep)|(LlLR)|(TC)|(FLR)|(BCLC))\s+\d+)/ ],
    "NI Queen's Bench Cases" => [ qr/((\(|\[)\d{4}(\)|\]))(\s+((NIQB))\s+\d+)/ ],
    NZLR => [ qr/((\(|\[)\d{4}(\)|\]))(\s+\d+\s+((NZLR))\s+\d+)/,
              qr/((\(|\[)\d{4}(\)|\]))(\s+\d+\s+((All ER))\s+\d+)/ ],
    ECR => [ qr/((\(|\[)\d{4}(\)|\]))(\s+((ECR))\s+\d+)/ ],
    'Older volume' => [ qr/(\d+\s+((App\s*Cas)|(Ch\s*D)|(CPD)|(Ex\s*D)|(P.D.)|(Q.B.D.))\s+\d+)/ ],
);

print "\n-----------------------------\n";
print "Case References Found in Text";
print "\n-----------------------------\n";

# Read the input line by line and print every match of every pattern.
while (my $search_text = <>) {
    for my $type (keys %reference) {
        print "\n***** $type *****\n";
        for my $regex ( @{ $reference{$type} } ) {
            while ($search_text =~ /$regex/ig) {
                print "$1$4\n";
            }
        }
    }
}
```
In a regex, `(\(|\[)` can be more readably written as `([[(])`. For grouping without capturing, you can use `(?:` instead of `(`. For example, the NCits regex can be written as
```
/((?:[[(])\d{4}(?:[])]))(\s+(?:EWHC(?:\s+Admin)?|CAT|EWCA(?:\s+(?:Civ|Crim))?|EW(?:COP|PCC)|EWFC(?:\s+B)?|UK(?:HL|IAT|[PS]C|UT)|CS(?:[OI]H)|NICA|IE(?:SC|CCA|CA|HC))\s+\d+)/
```
which changes the variables to output to `$1$2`.
To output to a file, just redirect the output:
```
script.perl input-file > output-file
```
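If you would rather have the script open the output file itself than rely on shell redirection, a minimal sketch might look like the following. It is not part of the answer's script: the two simplified patterns, the sub names `find_references` and `write_report`, the sample text and the `output.txt` filename are all assumptions chosen for illustration. Unlike the script above, it collects matches for the whole input first and prints each heading once, in sorted order, and it matches case-sensitively.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, simplified pattern table: each regex captures the
# whole reference in $1, so no other capture groups are needed.
my %reference = (
    'Appeal Cases' => qr/([\[(]\d{4}[\])](?:\s+\d+)?\s+AC\s+\d+)/,
    'WLR'          => qr/([\[(]\d{4}[\])]\s+\d+\s+WLR\s+\d+)/,
);

# Return a hash ref mapping each reference type to its list of matches.
# Types without any match do not appear in the result.
sub find_references {
    my ($text) = @_;
    my %found;
    for my $type (sort keys %reference) {
        my $regex = $reference{$type};
        while ($text =~ /$regex/g) {
            push @{ $found{$type} }, $1;
        }
    }
    return \%found;
}

# Write one heading per type, followed by its matches, to a filehandle.
sub write_report {
    my ($fh, $found) = @_;
    for my $type (sort keys %$found) {
        print {$fh} "***** $type *****\n";
        print {$fh} "$_\n" for @{ $found->{$type} };
    }
    return;
}

# Sample text (shortened from the question) used when no file is given.
my $sample = 'In Salomon v A Salomon and Co Ltd [1897] AC 22, the House '
           . 'of Lords held ... In Macaura v Northern Assurance Co Ltd '
           . '[1925] AC 619, the House of Lords held ...';

# Slurp the first file named on the command line, if there is one.
my $text = @ARGV ? do { local $/; <> } : $sample;

my $found = find_references($text);

# Three-argument open with a lexical filehandle; die on failure.
open my $out, '>', 'output.txt' or die "Cannot open output.txt: $!";
write_report($out, $found);
close $out or die "Cannot close output.txt: $!";
```

Run without arguments, this writes an `Appeal Cases` heading followed by `[1897] AC 22` and `[1925] AC 619` to `output.txt`. Shell redirection remains the simpler choice when the output destination should stay flexible.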
Context
StackExchange Code Review Q#146864, answer score: 7