Preprocessing steps to follow while cleaning and extracting text data from tweets
Problem
I have a dataset of around 200,000 tweets on which I am running a classification task. The dataset has two columns: the class label and the tweet text. In the preprocessing step I pass the dataset through the following cleaning step:
import re
from datetime import datetime
from nltk.corpus import stopwords
import pandas as pd


def preprocess(raw_text):
    # keep only letters
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)
    # convert to lower case and split
    words = letters_only_text.lower().split()
    # remove stopwords
    stopword_set = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stopword_set]
    # join the cleaned words back into a single string
    cleaned_word_list = " ".join(meaningful_words)
    return cleaned_word_list


def process_data(dataset):
    tweets_df = pd.read_csv(dataset, delimiter='|', header=None)
    num_tweets = tweets_df.shape[0]
    print("Total tweets: " + str(num_tweets))
    cleaned_tweets = []
    print("Beginning processing of tweets at: " + str(datetime.now()))
    for i in range(num_tweets):
        cleaned_tweet = preprocess(tweets_df.iloc[i][1])
        cleaned_tweets.append(cleaned_tweet)
        if(i % 10000 == 0):
            print(str(i) + " tweets processed")
    print("Finished processing of tweets at: " + str(datetime.now()))
    return cleaned_tweets


cleaned_data = process_data("tweets.csv")
And here is the relevant output:
Total tweets: 216041
Beginning processing of tweets at: 2017-05-16 13:45:47.183113
Finished processing of tweets at: 2017-05-16 13:47:01.436338
It takes between one and two minutes to process the tweets (about 74 seconds in the run above). Although that is a relatively short time for the current dataset, I would like to improve it further, especially for when I use a much bigger dataset.
Can the steps/code in the preprocess(raw_text) function be improved in order to achieve faster execution?
Solution
Copying my answer from SO:
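You can use pandas vectorized string methods to do your processing. This also replaces the explicit for loop with more efficient pandas operations, which should give you some speedup.
# assumes re, stopwords and tweets_df from the question
# column you are working on
df_ = tweets_df[1]
stopword_set = set(stopwords.words("english"))
# keep only letters and whitespace (done first, while the values are still strings)
# regex=True is required on recent pandas versions, where literal matching is the default
regex_pat = re.compile(r'[^a-zA-Z\s]')
df_ = df_.str.replace(regex_pat, '', regex=True)
# convert to lower case and split
df_ = df_.str.lower().str.split()
# remove stopwords
df_ = df_.apply(lambda x: [item for item in x if item not in stopword_set])
# join the cleaned words back into a single string
df_ = df_.str.join(" ")
Also, I've changed your regex to [^a-zA-Z\s] so that it does not match whitespace. Since matches are now replaced with an empty string rather than a space, matching the spaces would glue adjacent words together.
Code Snippets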
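As a rough, self-contained illustration of the vectorized approach described in the answer, the sketch below runs the same steps on two made-up tweets. The function name clean_tweets, the sample data, and the small hard-coded STOPWORD_SET are assumptions for illustration only; the stopword set stands in for set(stopwords.words("english")) so the example runs without the NLTK corpus.

```python
import re

import pandas as pd

# Hypothetical stand-in for set(stopwords.words("english")), kept tiny so the
# sketch runs without downloading the NLTK corpus. Built once, not per tweet.
STOPWORD_SET = {"a", "an", "and", "i", "is", "it", "the", "this", "to"}

# Compiled once; matches everything except ASCII letters and whitespace.
NON_LETTERS = re.compile(r"[^a-zA-Z\s]")


def clean_tweets(tweets: pd.Series) -> pd.Series:
    """Strip non-letters, lowercase, drop stopwords, and rejoin each tweet."""
    # Remove non-letter characters while the values are still strings.
    cleaned = tweets.str.replace(NON_LETTERS, "", regex=True)
    # Lowercase and split on whitespace, giving one list of words per tweet.
    tokens = cleaned.str.lower().str.split()
    # Filter stopwords; this step still runs a Python-level function per row.
    tokens = tokens.apply(lambda words: [w for w in words if w not in STOPWORD_SET])
    # Join the remaining words back into a single space-separated string.
    return tokens.str.join(" ")


# Made-up sample data for illustration.
sample = pd.Series(["This is GREAT!!! #happy", "I went to the store, and it rained :("])
result = clean_tweets(sample)
print(result.tolist())
```

Note that the stopword set and the compiled pattern are created once at module level, whereas the preprocess function in the question rebuilds the stopword set on every call. Hoisting that setup out of the per-tweet path is likely part of any speedup, independent of the pandas string methods.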
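# assumes re, stopwords and tweets_df from the question
# column you are working on
df_ = tweets_df[1]
stopword_set = set(stopwords.words("english"))
# keep only letters and whitespace (done first, while the values are still strings)
# regex=True is required on recent pandas versions, where literal matching is the default
regex_pat = re.compile(r'[^a-zA-Z\s]')
df_ = df_.str.replace(regex_pat, '', regex=True)
# convert to lower case and split
df_ = df_.str.lower().str.split()
# remove stopwords
df_ = df_.apply(lambda x: [item for item in x if item not in stopword_set])
# join the cleaned words back into a single string
df_ = df_.str.join(" ")
Context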
StackExchange Code Review Q#163446, answer score: 6