Extracting citations of court documents in multiple languages
Problem
I've got 20,000+ court documents that I want to pull specific data points out of: date, document number, verdict. I am using Python and regular expressions to do this.
The verdicts are in three languages (German, French and Italian), and some of them have slightly different formatting. I am trying to develop functions for the various data points that take this formatting and the different languages into account.
I'm finding my functions very clumsy. Does anybody have a more Pythonic way to write these functions?
```
def gericht(doc):
    Gericht = re.findall(
        r"Beschwerde gegen [a-z]+ [A-Z][a-züöä]+ ([^\n\n]*)", doc)
    Gericht1 = re.findall(
        r"Beschwerde nach [A-Za-z]. [0-9]+ [a-z]+. [A-Z]+ [a-z]+ [a-z]+[A-Za-z]+ [a-z]+ [0-9]+. [A-Za-z]+ [0-9]+ ([^\n\n]*)", doc)
    Gericht2 = re.findall(
        r"Revisionsgesuch gegen das Urteil ([^\n\n]*)", doc)
    Gericht3 = re.findall(
        r"Urteil des ([^\n\n]*)", doc)
    Gericht_it = re.findall(
        r"ricorso contro la sentenza emanata il [0-9]+ [a-z]+ [0-9]+ [a-z]+ ([^\n\n]*)", doc)
    Gericht_fr = re.findall(
        r"recours contre l'arrêt ([^\n\n]*)", doc)
    Gericht_fr_1 = re.findall(
        r"recours contre le jugement ([^\n\n]*)", doc)
    Gericht_fr_2 = re.findall(
        r"demande de révision de l'arrêt ([^\n\n]*)", doc)
    try:
        if Gericht != None:
            return Gericht[0]
    except:
        None
    try:
        if Gericht1 != None:
            return Gericht1[0]
    except:
        None
    try:
        if Gericht2 != None:
            return Gericht2[0]
    except:
        None
    try:
        if Gericht3 != None:
            return Gericht3[0]
    except:
        None
    try:
        if Gericht_it != None:
            return Gericht_it[0]
    except:
        None
    try:
        if Gericht_fr != None:
            Gericht_fr = Gericht_fr[0].replace('de la ', '').replace('du ', '')
            return Gericht_fr
    except:
        None
    try:
        if Gericht_fr_1 != None:
            Gericht_fr_1 = Gericht_fr_1[0].rep
```
Solution
Always use 4 spaces for your indentation; you use 3, 4, and 5. Being even one space out can break your code, so it really does matter.
You're doing roughly the same thing for all your different regexes, so you should loop through the regexes and keep the common code in the loop.
However, I'd change your common code so that it doesn't replace `de la ` or `du `, and so that it doesn't need the try-except: you can check whether `gericht` is truthy, so it isn't indexed if it's empty (`re.findall` returns a list, never `None`).
Take:
```
import re

REGEXES = [
    r"Beschwerde gegen [a-z]+ [A-Z][a-züöä]+ ([^\n\n]*)",
    r"Beschwerde nach [A-Za-z]. [0-9]+ [a-z]+. [A-Z]+ [a-z]+ [a-z]+[A-Za-z]+ [a-z]+ [0-9]+. [A-Za-z]+ [0-9]+ ([^\n\n]*)",
    r"Revisionsgesuch gegen das Urteil ([^\n\n]*)",
    r"Urteil des ([^\n\n]*)",
    r"ricorso contro la sentenza emanata il [0-9]+ [a-z]+ [0-9]+ [a-z]+ ([^\n\n]*)",
    r"recours contre l'arrêt ([^\n\n]*)",
    r"recours contre le jugement ([^\n\n]*)",
    r"demande de révision de l'arrêt ([^\n\n]*)",
]

def gericht(doc):
    # Try the patterns in priority order and return the first captured court text.
    for regex in REGEXES:
        gericht = re.findall(regex, doc)
        if gericht:
            return gericht[0]
```
However, this needlessly finds every match for the regex in the document, which is not what you want. You just want the first.
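Since only the first match is needed, one option is `re.search`, which returns a single match object (or `None`) instead of building a list. The following is a minimal sketch rather than a drop-in replacement: the function name `first_court`, the shortened `PATTERNS` list, and the sample text are all made up for illustration. It also writes `[^\n]` instead of `[^\n\n]`, which is the same character class.

```python
import re

# Shortened, illustrative subset of the patterns; list order encodes priority.
PATTERNS = [
    re.compile(r"Revisionsgesuch gegen das Urteil ([^\n]*)"),
    re.compile(r"Urteil des ([^\n]*)"),
    re.compile(r"recours contre l'arrêt ([^\n]*)"),
]


def first_court(doc):
    """Return the captured text of the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(doc)  # stops scanning at the first match
        if match:
            return match.group(1)  # the capture group, not the whole match
    return None


# Invented sample text, not taken from a real verdict.
sample = "Gegenstand\nrecours contre l'arrêt du Tribunal cantonal\n"
print(first_court(sample))  # → du Tribunal cantonal
```

Precompiling the patterns is optional here; it mostly keeps the loop body short when the same list is applied to many documents.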
To get only the first match without editing the regexes, you could use `itertools.chain` together with `re.finditer`.
This can get you:
```
import re
from itertools import chain

def gericht(doc):
    # Lazily walk the matches of each regex in turn and stop at the very first one.
    gericht = next(chain.from_iterable(re.finditer(r, doc) for r in REGEXES), None)
    if gericht is not None:
        # group(1) is the captured court text; group(0) would be the whole match.
        return gericht.group(1)
```
You may be able to merge all the regexes together with `|`, but then you won't prioritize the first regex over the others. That may work with:
```
def gericht(doc):
    gericht = next(re.finditer('|'.join(REGEXES), doc), None)
    if gericht is not None:
        # Each alternative has its own capture group; lastindex is the one that matched.
        return gericht.group(gericht.lastindex)
```
But I'm not too good with regexes.
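The prioritization caveat can be seen on a small example. With the patterns joined by `|`, the match that starts earliest in the text wins, whereas the loop returns the first pattern in the list that matches anywhere. The sketch below uses only two of the patterns, an invented sample string, and the hypothetical helper names `by_priority` and `by_position`.

```python
import re

# Two of the patterns; the list order is the intended priority.
REGEXES = [
    r"Revisionsgesuch gegen das Urteil ([^\n]*)",
    r"Urteil des ([^\n]*)",
]

# Invented sample: the lower-priority pattern matches earlier in the text.
doc = "Urteil des Bundesgerichts\nRevisionsgesuch gegen das Urteil vom Obergericht\n"


def by_priority(doc):
    """First pattern in the list that matches anywhere in the document."""
    for regex in REGEXES:
        match = re.search(regex, doc)
        if match:
            return match.group(1)
    return None


def by_position(doc):
    """Joined alternation: whichever alternative matches earliest in the text."""
    match = re.search("|".join(REGEXES), doc)
    if match:
        # Only the group of the alternative that matched is set.
        return match.group(match.lastindex)
    return None


print(by_priority(doc))  # → vom Obergericht
print(by_position(doc))  # → Bundesgerichts
```

If the order of the patterns matters for your documents, the loop is the safer choice; the joined pattern only gives the same result when at most one pattern can match a given document.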
Context
StackExchange Code Review Q#151579, answer score: 8