HiveBrain v1.2.0

Preprocessing text input to a machine-learning algorithm

Submitted by: @import:stackexchange-codereview
Tags: preprocessing, text, input, machine-learning, algorithm

Problem

I have written the following function to preprocess some text data as input to a machine learning algorithm. It lowercases, tokenises, removes stop words, and lemmatises, returning a string of space-separated tokens. However, this code runs extremely slowly. What can I do to optimise it?

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocessText(text, lemmatizer, lemma, ps):
    '''
    Lowercases, tokenises, removes stop words and lemmatises using WordNet.
    Returns a string of space-separated tokens.
    '''
    words = text.lower()
    words = re.sub("[^a-zA-Z]", " ", words)
    words = word_tokenize(words)
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    text = ""
    if lemmatizer == True:
        pos_translate = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
        meaningful_words = [lemma.lemmatize(w, pos=pos_translate[pos[0]] if pos[0] in pos_translate else 'n')
                            for w, pos in nltk.pos_tag(meaningful_words)]
        for each in meaningful_words:
            if len(each) > 1:
                text = text + " " + each
        return text
    else:
        words_again = []
        for each in meaningful_words:
            words_again.append(ps.stem(each))
        text = ""
        for each in words_again:
            if len(each) > 1:
                text = text + " " + each
        return text
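Two of the biggest costs in the function above are easy to remove regardless of which NLP library you use: `set(stopwords.words("english"))` is rebuilt on every call, and the output string is built by repeated concatenation, which is quadratic in the number of tokens. A minimal sketch of those two fixes, using a toy stopword set and plain `str.split` in place of NLTK so it stands alone (the names `preprocess_fast`, `STOPS`, and `NON_ALPHA` are illustrative, not from the original):

```python
import re

# Hoisted to module level so they are built once, not on every call.
# Toy stopword set for illustration; in real use, build this once from
# nltk.corpus.stopwords.words("english").
STOPS = {"the", "a", "an", "is", "are", "and", "of", "to"}
NON_ALPHA = re.compile(r"[^a-zA-Z]+")

def preprocess_fast(text):
    """Lowercase, strip non-letters, drop stop words and 1-char tokens."""
    words = NON_ALPHA.sub(" ", text.lower()).split()
    kept = [w for w in words if w not in STOPS and len(w) > 1]
    # " ".join builds the result in one pass instead of repeated
    # string concatenation, which is O(n^2) in the number of tokens.
    return " ".join(kept)
```

For example, `preprocess_fast("The cats are running!")` returns `"cats running"`. The same hoisting applies to the NLTK version: pass the precomputed stopword set into `preprocessText` alongside the lemmatizer and stemmer.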

Solution

Given that you are already using Python, I would highly recommend using spaCy (base text parsing and tagging) and Textacy (higher-level text processing built on top of spaCy). Together they can do everything you want, and more, in one function call:

http://textacy.readthedocs.io/en/latest/api_reference.html#textacy.preprocess.preprocess_text

For your further travels in text-based machine learning, there is also a wealth of additional features, particularly with spaCy 2.0 and its universe.
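As a rough sketch of the spaCy route (an assumption on my part, not the answer's exact code): `spacy.blank("en")` gives you fast tokenisation and stop-word flags with no trained model download, which already covers most of the function above. Lemmatisation additionally needs a trained pipeline such as `en_core_web_sm` (installed via `python -m spacy download en_core_web_sm`), in which case you would keep `t.lemma_` instead of `t.text`.

```python
import spacy

# Blank English pipeline: tokeniser plus language data (including stop
# words), no trained model required. For lemmas, swap in
# spacy.load("en_core_web_sm") and emit t.lemma_ instead of t.text.
nlp = spacy.blank("en")

def preprocess_spacy(text):
    """Keep alphabetic, non-stop tokens from the lowercased text."""
    doc = nlp(text.lower())
    return " ".join(t.text for t in doc if t.is_alpha and not t.is_stop)
```

For example, `preprocess_spacy("The cats are running!")` returns `"cats running"`. For large corpora, feeding texts through `nlp.pipe(...)` in batches is considerably faster than calling `nlp` one document at a time.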

Context

StackExchange Code Review Q#154088, answer score: 3

Revisions (0)

No revisions yet.