Preprocessing steps to follow while cleaning and extracting text data from tweets

Submitted by: @import:stackexchange-codereview

Problem

I have a dataset of around 200,000 tweets on which I am running a classification task. The dataset has two columns: the class label and the tweet text. In the preprocessing step I pass the dataset through the following cleaning steps:

import re
from datetime import datetime

import pandas as pd
from nltk.corpus import stopwords

def preprocess(raw_text):

    # keep only words
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split 
    words = letters_only_text.lower().split()

    # remove stopwords
    stopword_set = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stopword_set]

    # join the cleaned words in a list
    cleaned_word_list = " ".join(meaningful_words)

    return cleaned_word_list

def process_data(dataset):
    tweets_df = pd.read_csv(dataset,delimiter='|',header=None)

    num_tweets = tweets_df.shape[0]
    print("Total tweets: " + str(num_tweets))

    cleaned_tweets = []
    print("Beginning processing of tweets at: " + str(datetime.now()))

    for i in range(num_tweets):
        cleaned_tweet = preprocess(tweets_df.iloc[i][1])
        cleaned_tweets.append(cleaned_tweet)
        if(i % 10000 == 0):
            print(str(i) + " tweets processed")

    print("Finished processing of tweets at: " + str(datetime.now()))
    return cleaned_tweets

cleaned_data = process_data("tweets.csv")


And here is the relevant output:

Total tweets: 216041
Beginning processing of tweets at: 2017-05-16 13:45:47.183113
Finished processing of tweets at: 2017-05-16 13:47:01.436338


It takes a little over a minute (about 74 seconds, going by the timestamps above) to process the tweets. Although that is a relatively short time for the current dataset, I would like to improve it further, especially for when I use a much bigger dataset.
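
For reference, the two timestamps in the output above can be converted into an elapsed time and a rough throughput figure. This is only a small sketch; the values are copied from that output, and the variable names are chosen for illustration.

```python
from datetime import datetime

# Timestamps and tweet count copied from the program output shown above.
fmt = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2017-05-16 13:45:47.183113", fmt)
end = datetime.strptime("2017-05-16 13:47:01.436338", fmt)
num_tweets = 216041

# Elapsed wall-clock time in seconds, and tweets processed per second.
elapsed = (end - start).total_seconds()
throughput = num_tweets / elapsed

print(f"Elapsed: {elapsed:.2f} s")
print(f"Throughput: {throughput:.0f} tweets/s")
```

By this calculation the run took about 74 seconds, or roughly 2,900 tweets per second, which gives a baseline to compare any faster version against.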

Can the steps/code in the preprocess(raw_text) function be improved to achieve faster execution?

Solution

Copying my answer from SO:

You can use pandas' vectorized string methods for this processing. They replace the explicit for loop with pandas operations over the whole column, which should give you some speedup.

# column you are working on
df_ = tweets_df[1]

stopword_set = set(stopwords.words("english"))

# keep only letters and whitespace; this has to run on the raw strings,
# before they are split into lists of words
regex_pat = re.compile(r'[^a-zA-Z\s]')
df_ = df_.str.replace(regex_pat, '', regex=True)

# convert to lower case and split (note the second .str accessor)
df_ = df_.str.lower().str.split()

# remove stopwords
df_ = df_.apply(lambda x: [item for item in x if item not in stopword_set])

# join the cleaned words back into one string per tweet
df_ = df_.str.join(" ")


Also I've changed your regex to [^a-zA-Z\s] so that it does not match the space character.
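
To see what the regex change means in practice, here is a small sketch using only the standard re module. The sample tweet is made up for illustration. It compares the question's approach (replace every non-letter with a space) with the answer's (delete every character that is neither a letter nor whitespace).

```python
import re

# Made-up sample tweet, for illustration only.
tweet = "Don't miss it: great-deals today!!"

# Question's pattern: every non-letter, whitespace included, becomes a space.
question_style = re.sub(r"[^a-zA-Z]", " ", tweet).lower().split()

# Answer's pattern: non-letters are deleted, whitespace is left untouched.
answer_style = re.sub(r"[^a-zA-Z\s]", "", tweet).lower().split()

print(question_style)  # ['don', 't', 'miss', 'it', 'great', 'deals', 'today']
print(answer_style)    # ['dont', 'miss', 'it', 'greatdeals', 'today']
```

The two approaches give the same tokens only when punctuation sits next to whitespace. Where punctuation joins two words, deleting it fuses them. If the original splitting behaviour matters for the classifier, using " " as the replacement string with the new pattern should keep it.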

Context

StackExchange Code Review Q#163446, answer score: 6
