Extracting citations of court documents in multiple languages

Submitted by: @import:stackexchange-codereview

Problem

I've got 20'000+ court documents from which I want to extract specific data points: date, document number, and verdict. I am using Python and regular expressions to do this.

The verdicts are in three languages (German, French, and Italian), and some of them are formatted slightly differently. I am trying to write functions for the various data points that take these formatting differences and the different languages into account.

I find my functions very clumsy. Does anybody have a more Pythonic way to write them?

```
def gericht(doc):
Gericht = re.findall(
r"Beschwerde gegen [a-z]+ [A-Z][a-züöä]+ ([^\n\n]*)", doc)
Gericht1 = re.findall(
r"Beschwerde nach [A-Za-z]. [0-9]+ [a-z]+. [A-Z]+ [a-z]+ [a-z]+[A-Za-z]+ [a-z]+ [0-9]+. [A-Za-z]+ [0-9]+ ([^\n\n]*)", doc)
Gericht2 = re.findall(
r"Revisionsgesuch gegen das Urteil ([^\n\n]*)", doc)
Gericht3 = re.findall(
r"Urteil des ([^\n\n]*)", doc)
Gericht_it = re.findall(
r"ricorso contro la sentenza emanata il [0-9]+ [a-z]+ [0-9]+ [a-z]+ ([^\n\n]*)", doc)
Gericht_fr = re.findall(
r"recours contre l'arrêt ([^\n\n]*)", doc)
Gericht_fr_1 = re.findall(
r"recours contre le jugement ([^\n\n]*)", doc)
Gericht_fr_2 = re.findall(
r"demande de révision de l'arrêt ([^\n\n]*)", doc)

try:
if Gericht != None:
return Gericht[0]
except:
None

try:
if Gericht1 != None:
return Gericht1[0]
except:
None

try:
if Gericht2 != None:
return Gericht2[0]
except:
None

try:
if Gericht3 != None:
return Gericht3[0]
except:
None

try:
if Gericht_it != None:
return Gericht_it[0]
except:
None

try:
if Gericht_fr != None:
Gericht_fr = Gericht_fr[0].replace('de la ', '').replace('du ', '')
return Gericht_fr
except:
None

try:
if Gericht_fr_1 != None:
Gericht_fr_1 = Gericht_fr_1[0].rep
```

Solution

Always use 4 spaces per indentation level; your code mixes 3, 4, and 5. In Python, indentation is part of the syntax, so being one space out can break your code. It really does matter.
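To see why a single stray space matters, here is a small sketch. The two source strings and the helper name compiles are made up for illustration; the only difference between them is that the last line of BAD is indented by 3 spaces instead of 4.

```python
# Two versions of the same function. BAD dedents to 3 spaces, which matches
# neither enclosing indentation level (0 or 4 spaces).
GOOD = "def f(x):\n    if x:\n        return 1\n    return 0\n"
BAD = "def f(x):\n    if x:\n        return 1\n   return 0\n"


def compiles(source):
    """Return True if Python accepts the source, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(GOOD))  # True
print(compiles(BAD))   # False
```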

You're doing roughly the same thing for each of your regexes, so you should loop through them and keep the common code in the loop body.
However, I'd change the common code so that it doesn't replace de la or du, and so that it doesn't need the try-except: you can check whether gericht is truthy,
so it isn't indexed if it's None or an empty list. (re.findall always returns a list, never None, so the != None checks in your code always pass.)
Take:

```
import re

# Patterns are tried in this order; each one captures the court in its only group.
REGEXES = [
    r"Beschwerde gegen [a-z]+ [A-Z][a-züöä]+ ([^\n\n]*)",
    r"Beschwerde nach [A-Za-z]. [0-9]+ [a-z]+. [A-Z]+ [a-z]+ [a-z]+[A-Za-z]+ [a-z]+ [0-9]+. [A-Za-z]+ [0-9]+ ([^\n\n]*)",
    r"Revisionsgesuch gegen das Urteil ([^\n\n]*)",
    r"Urteil des ([^\n\n]*)",
    r"ricorso contro la sentenza emanata il [0-9]+ [a-z]+ [0-9]+ [a-z]+ ([^\n\n]*)",
    r"recours contre l'arrêt ([^\n\n]*)",
    r"recours contre le jugement ([^\n\n]*)",
    r"demande de révision de l'arrêt ([^\n\n]*)",
]

def gericht(doc):
    for regex in REGEXES:
        gericht = re.findall(regex, doc)
        if gericht:  # an empty list means this pattern did not match
            return gericht[0]
```
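A possible variant of this loop, assuming re.search is acceptable here: it stops at the first match of each pattern instead of collecting all of them. The names COURT_PATTERNS and first_court and the sample text are made up for illustration, and only three of the patterns are shown. The character class is written as [^\n], which matches the same characters as [^\n\n].

```python
import re

# Hypothetical subset of the patterns, precompiled and kept in priority order.
COURT_PATTERNS = [
    re.compile(r"Revisionsgesuch gegen das Urteil ([^\n]*)"),
    re.compile(r"Urteil des ([^\n]*)"),
    re.compile(r"recours contre l'arrêt ([^\n]*)"),
]


def first_court(doc):
    """Return the captured text of the first pattern that matches, else None."""
    for pattern in COURT_PATTERNS:
        match = pattern.search(doc)  # stops scanning at the first hit
        if match:
            return match.group(1)
    return None


sample = "Sachverhalt\nUrteil des Obergerichts des Kantons Bern\nErwägungen"
print(first_court(sample))              # Obergerichts des Kantons Bern
print(first_court("no citation here"))  # None
```

Returning None explicitly when nothing matches makes the "no court found" case visible to the caller.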


However, this needlessly finds every match of each regex in the document, when you only want the first one.
To do this without editing the regexes, you could use re.finditer together with itertools.chain.
This gets you:

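```
import re
from itertools import chain

def gericht(doc):
    # Lazily chain the matches of each regex, in list order, and take the first one.
    gericht = next(chain.from_iterable(re.finditer(r, doc) for r in REGEXES), None)
    if gericht is not None:
        return gericht.group(1)  # the captured court, matching what re.findall returned
```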
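One detail worth checking when moving from re.findall to match objects: with a single capture group, re.findall returns the captured text, whereas group(0) on a match object is the whole matched phrase and group(1) is the captured part. A small illustration, using one of the patterns and a made-up sample string:

```python
import re

pattern = r"Urteil des ([^\n]*)"
text = "Urteil des Obergerichts des Kantons Bern\nweiterer Text"

# First match object for the pattern, or None if there is no match.
match = next(re.finditer(pattern, text), None)

found = re.findall(pattern, text)  # list of captured groups
whole = match.group(0)             # entire matched phrase
captured = match.group(1)          # only the parenthesised part

print(found)     # ['Obergerichts des Kantons Bern']
print(whole)     # Urteil des Obergerichts des Kantons Bern
print(captured)  # Obergerichts des Kantons Bern
```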


You may be able to merge all the regexes into one with |, but then the first regex is no longer prioritized over the others: whichever one matches earliest in the document wins. That might look like:

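```
def gericht(doc):
    gericht = next(re.finditer('|'.join(REGEXES), doc), None)
    if gericht is not None:
        # Only the alternative that matched fills its group; lastindex points at it.
        return gericht.group(gericht.lastindex)
```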
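To illustrate the caveat about priority, here is a small sketch with two of the patterns and a made-up text; the function names by_priority and by_position are hypothetical. The combined pattern returns whichever alternative matches earliest in the text, while the loop honours list order. Since each alternative has exactly one capture group, match.lastindex identifies the group that took part in the match.

```python
import re

# Two illustrative patterns in priority order (not the full list).
PATTERNS = [
    r"Revisionsgesuch gegen das Urteil ([^\n]*)",
    r"Urteil des ([^\n]*)",
]

text = "Urteil des Kantonsgerichts\nRevisionsgesuch gegen das Urteil vom Mai"


def by_priority(doc):
    """Try each pattern in list order; the first pattern that matches wins."""
    for pattern in PATTERNS:
        match = re.search(pattern, doc)
        if match:
            return match.group(1)
    return None


def by_position(doc):
    """Join the patterns with |; the earliest match in the text wins."""
    match = re.search("|".join(PATTERNS), doc)
    if match is None:
        return None
    return match.group(match.lastindex)


print(by_priority(text))  # vom Mai
print(by_position(text))  # Kantonsgerichts
```

Whether the difference matters depends on whether a document can contain more than one of these phrases.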


But I'm not too good with regexes.

Context

StackExchange Code Review Q#151579, answer score: 8
