
Python CGI front-end for web service to perform machine translation


Problem

I am trying to optimize this Python script, which processes web requests for machine translation. The translation executable it calls is quite fast, and so are the Perl scripts it invokes.

The largest performance boost came from removing unnecessary imports. I would like this code reviewed so I can optimize it further. I also welcome advice on a Pythonic way of measuring performance; my code is littered with timing and print statements that I removed for this post.
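As an aside on measuring performance: one lightweight pattern (a sketch of my own, not from the original post) is a small context-manager timer built on `timeit.default_timer`, which collects timings in one dictionary instead of scattering `print` and timing calls through the code:

```python
import timeit
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    # Record wall-clock time for the enclosed block under `label`.
    start = timeit.default_timer()
    try:
        yield
    finally:
        results[label] = timeit.default_timer() - start


# Usage: wrap each pipeline stage, then inspect `results` in one place.
results = {}
with timed("tokenize", results):
    tokens = "some source text".split()
print(results["tokenize"])
```

For more detailed breakdowns, `python -m cProfile script.py` gives per-function timings with no instrumentation in the code at all.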

```
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
import sys
import cgi
import subprocess
import string
import xmlrpclib

reload(sys)
sys.setdefaultencoding('utf8')

isTestPerformance = len(sys.argv) == 4

# Parameters
if isTestPerformance:
    source = sys.argv[1]
    target = sys.argv[2]
    sourceText = sys.argv[3]
else:
    # These headers tell the browser that the output is plain text.
    print "Access-Control-Allow-Origin: *"
    print "Content-Type: text/plain;charset=utf-8"
    print

    form = cgi.FieldStorage()
    sourceText = form.getvalue("sourceText").decode('utf8')
    source = form.getvalue("source").lower()
    target = form.getvalue("target").lower()

# Decode the CGI-encoded source text.
# NOTE: Custom encoding of semicolon (;), (?), (&), (#), etc. is only done here
# because CGI cannot handle them. Do not use this (decode) if you are not using
# CGI, or use some other decoding that matches the encoding used by the caller
# of this code.
sourceText = sourceText.replace("__QUESTION_MARK__", "?")
sourceText = sourceText.replace("__SEMICOLON__", ";")
sourceText = sourceText.replace("__AMPERSAND__", "&")
sourceText = sourceText.replace("__NUMBER__", "#")
# sourceText = sourceText.replace("__NEWLINE__", "\n")

# Tokenize the source text
if source == "zh":
    # Chinese has to do word alignment
    # options are slim: write the text to a file,
    # use NLTK Stanford NLP (python > java) to segment the Chinese
```

Solution

I don't really know what Moses or RPC2 are, so I haven't tested your code or changed it too much.
However, there are still some things you can do to make your code much easier to understand and maintain.

You want to use a lot more functions, and I would use a couple of classes. There are two main classes I'd make: `Moses` and `Translater`. The former should be your interface to Moses, while `Translater` should let you translate from one language to another with relative ease. Doing this will let you re-use the code easily.

This lets you remove most of your comments, since the function and variable names now tell us the same thing. It also means that if you need to explain something in greater detail, you can do so in a docstring.

I'd also change your massive if block to a dictionary. This lets you use the (source, target) pair as the key and look up the port.
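As a sketch of that lookup (port numbers taken from the code below; the `port_for` helper and its error handling are my addition):

```python
PORTS = {
    ('en', 'zh'): '3001',
    ('zh', 'en'): '4001',
    # ... one entry per supported language pair
}


def port_for(source, target):
    # dict.get returns None for unsupported pairs instead of raising KeyError,
    # so we can report the problem in the service's own terms.
    port = PORTS.get((source, target))
    if port is None:
        raise ValueError("unsupported language pair: %s-%s" % (source, target))
    return port
```

Compared with a chain of `if source == ... and target == ...` tests, adding a language pair becomes a one-line data change rather than a code change.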

The performance problems you have are almost certainly due to spawning three Perl subprocesses per request and talking to another server over XML-RPC. You're unlikely to be able to fix this unless you port the Perl scripts to Python or C, or there's a native Python interface without that overhead.
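One way to amortize the process-startup cost, if the Perl scripts could be adapted to read one line per request in a loop, would be to keep a single long-lived worker process and stream lines to it instead of spawning a new process each time. This is a sketch under that assumption; the stand-in child below is a trivial Python echo worker, not the real Moses tokenizer:

```python
import subprocess
import sys


class LineWorker(object):
    """Keep one child process alive and exchange one line per request."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def ask(self, line):
        # One round-trip: write a line, read a line. No fork/exec per call.
        self.proc.stdin.write(line.encode("utf-8") + b"\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().decode("utf-8").rstrip("\n")


# Stand-in worker: a Python child that upper-cases each line it receives.
child = [sys.executable, "-u", "-c",
         "import sys\n"
         "while True:\n"
         "    l = sys.stdin.readline()\n"
         "    if not l: break\n"
         "    sys.stdout.write(l.upper())\n"
         "    sys.stdout.flush()"]
worker = LineWorker(child)
print(worker.ask("hello"))   # prints "HELLO"
```

The same `LineWorker` could wrap `["perl", tokenizer_path, "-l", lang]` if the tokenizer is run in a line-buffered mode, turning three process launches per request into three pipe round-trips.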

You don't need a lot of your imports, so I'd remove them. I'd also recommend against most of what you're using from `sys`: the docs say `sys.setdefaultencoding` isn't intended for use outside the interpreter's own startup, so you're setting yourself up for horrific bugs.

Finally, I'd recommend using a better web framework, such as Flask or Django. That would give you a simpler interface and make `isTestPerformance` and the odd bare prints unnecessary. It should also let you remove the custom 'CGI encoding' from the file, since these frameworks have solid implementations of percent-encoding; better yet, they let clients POST to the server, which your program doesn't seem to support very well.
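A minimal Flask front-end along those lines might look like this. It's a sketch only: the `translate` function here is a stand-in for the Moses/XML-RPC pipeline, and the route name is my own invention:

```python
from flask import Flask, request

app = Flask(__name__)


def translate(source, target, text):
    # Stand-in for the real pipeline (tokenize -> XML-RPC -> post-process).
    return text


@app.route("/translate", methods=["POST"])
def translate_route():
    # Flask percent-decodes form fields, so no custom __SEMICOLON__-style
    # escaping is needed on either side.
    form = request.form
    return translate(form["source"].lower(), form["target"].lower(),
                     form["sourceText"])


# To serve it: app.run(), or a WSGI server such as gunicorn in production.
```

The CORS and Content-Type headers the CGI script prints by hand would be set through Flask's response objects (or an extension such as Flask-CORS) instead.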

In all I changed your code to:

```
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import os.path
import subprocess
import xmlrpclib


PORTS = {
    ('en', 'zh'): '3001',
    ('en', 'de'): '3002',
    ('en', 'es'): '3003',
    ('en', 'fr'): '3004',
    ('en', 'it'): '3005',
    ('en', 'nl'): '3006',
    ('en', 'pl'): '3007',
    ('en', 'pt'): '3008',
    ('en', 'ro'): '3009',
    ('en', 'ru'): '3010',
    ('en', 'sl'): '3011',
    ('en', 'hr'): '3012',
    ('en', 'tr'): '3013',
    ('en', 'ar'): '3014',
    ('en', 'fa'): '3015',

    ('zh', 'en'): '4001',
    ('de', 'en'): '4002',
    ('es', 'en'): '4003',
    ('fr', 'en'): '4004',
    ('it', 'en'): '4005',
    ('nl', 'en'): '4006',
    ('pl', 'en'): '4007',
    ('pt', 'en'): '4008',
    ('ro', 'en'): '4009',
    ('ru', 'en'): '4010',
    ('sl', 'en'): '4011',
    ('hr', 'en'): '4012',
    ('tr', 'en'): '4013',
    ('ar', 'en'): '4014',
    ('fa', 'en'): '4015',
}


class Moses(object):
    def __init__(self, path):
        self.tokenizer_path = path

    def _call(self, path, lang, text):
        pipe = subprocess.Popen(["perl", path, "-l", lang],
                                stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        return pipe.communicate(text)[0]

    def tokenize(self, lang, text):
        return self._call(
            os.path.join(self.tokenizer_path, "scripts/tokenizer/tokenizer.perl"),
            lang, text)

    def detokenize(self, lang, text):
        return self._call(
            os.path.join(self.tokenizer_path, "scripts/tokenizer/detokenizer.perl"),
            lang, text)

    def normalize_punctuation(self, lang, text):
        return self._call(
            os.path.join(self.tokenizer_path,
                         "scripts/tokenizer/normalize-punctuation.perl"),
            lang, text)


class Translater(object):
    def __init__(self, path):
        self.moses = Moses(path)

    def _tokenize(self, source, target, text):
        if source == "zh":
            # Chinese needs word segmentation rather than Moses tokenization.
            # Solution found (kinda): mini-segmenter
            # https://github.com/alvations/mini-segmenter
            import miniseg.minisegmenter as mini
            return mini.segmenter(text)
        else:
            return self.moses.tokenize(source, text)

    def _translate(self, source, target, text):
        port = PORTS[(source, target)]
        proxy = xmlrpclib.ServerProxy("http://localhost:" + port + "/RPC2")
        params = {"text": text, "align": "false", "report-all-factors": "false"}
        result = proxy.translate(params)
        return result['text'].encode('utf-8')

    def _post_process(self, target, text):
        if target == "zh":
            # Chinese - get rid of the spaces (word segmentation)
            text = text.replace(" ", "")
        # Post-process the translation output (regardless of language)
        text = text.replace("__UNK__,", ",")
        text = text.replace("__UNK__", " ")
        # text = text.replace(" _ _ NEWLINE _ _ ", "\n")
        text = text.replace("  ", " ")
        return text
```


Context

StackExchange Code Review Q#163505, answer score: 2
