More efficient way to reduce a Twitter status to a string of just words

Submitted by: @import:stackexchange-codereview

Problem

I am making an application that fetches tweets and stores them in a database. One column will hold the complete text of the tweet, and another will hold only the words of the tweet (I need these later to calculate which words were used most).

Currently I do this with six different .replaceAll() calls, some of which may run more than once. For example, I use a for loop to remove every hashtag with replaceAll().

The problem is that I will be processing thousands of tweets every few minutes, and I suspect this approach is not very efficient.

My requirements, in this order (also written as comments in the code below):

  • Delete all usernames mentioned
  • Delete all RT (retweet) flags
  • Delete all hashtags mentioned
  • Replace all line breaks with spaces
  • Replace all double spaces with single spaces
  • Delete all special characters except spaces
Here is a short and compilable example:

```
public class StringTest {

    public static void main(String args[]) {

        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n"
                + "\n"
                + "RT!!\n"
                + "\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards"
                + " htt…";

        String[] hashtags = {"#FanArmy", "#LittleMonsters", "#iHeartAwards"};
        System.out.println("Before: " + text + "\n");

        // Delete all usernames mentioned (may run multiple times)
        text = text.replaceAll("@AshStewart09", "");
        System.out.println("First Phase: " + text + "\n");

        // Delete all RT (retweets flags)
        text = text.replaceAll("RT", "");
        System.out.println("Second Phase: " + text + "\n");

        // Delete all hashtags mentioned
        for (String hashtag : hashtags) {
            text = text.replaceAll(hashtag, "");
        }
        // ... remaining phases (line breaks, double spaces, special characters) not shown
    }
}
```

Solution

(subject to further changes)

In your simple example, how are the hashtags and usernames actually derived from the tweet?

My suggestion is to split the tweet on whitespace first, then look at each individual token to decide whether it should be stored ("Vote") or discarded ("#LittleMonsters").
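The suggestion above could be sketched roughly as follows. The class name `TweetWords`, the method `extractWords`, and the rules used to classify a token (a leading `@` or `#`, a bare `RT`, and "special characters" meaning anything that is not a letter or digit) are assumptions made for illustration, not something taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names and token rules are illustrative assumptions.
class TweetWords {

    /** Splits a tweet on whitespace and keeps only tokens that look like plain words. */
    static List<String> extractWords(String tweet) {
        List<String> words = new ArrayList<>();
        // Splitting on \s+ treats line breaks and repeated spaces alike.
        for (String token : tweet.trim().split("\\s+")) {
            // Discard mentions and hashtags as whole tokens.
            if (token.startsWith("@") || token.startsWith("#")) {
                continue;
            }
            // Assumed meaning of "special characters": anything that is not a letter or digit.
            String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
            // Drop the retweet flag only when it is the entire word, so "ART" survives.
            if (word.isEmpty() || word.equals("RT")) {
                continue;
            }
            words.add(word);
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n\nRT!!\n\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards htt\u2026";
        System.out.println(String.join(" ", extractWords(text)));
    }
}
```

Because each token is classified once, the hashtags and usernames do not need to be known in advance, and the separate passes for line breaks and double spaces disappear. A mention or hashtag preceded by punctuation (for example `(@user)`) would not be caught by this simple prefix check.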

    // Delete all RT (retweets flags)
    text = text.replaceAll("RT", "");

You do realize that this will turn text like "ART!" into just "A!", right? Tokenizing first should remedy this issue.
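A small comparison may make the difference concrete. The class `RetweetFlagDemo` and both method names are hypothetical. The first method mirrors the quoted `replaceAll` call, and the second is one possible way to match the flag only as a whole whitespace-separated token.

```java
// Hypothetical demo class: contrasts substring removal with whole-token removal.
class RetweetFlagDemo {

    /** Removes "RT" wherever it occurs, as the quoted replaceAll call does. */
    static String removeEverywhere(String text) {
        return text.replaceAll("RT", "");
    }

    /** Removes "RT" only when it stands alone as a whitespace-separated token. */
    static String removeWholeTokens(String text) {
        StringBuilder kept = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.equals("RT")) {
                continue; // a genuine retweet flag
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String text = "RT ART! is SMART";
        System.out.println(removeEverywhere(text));   // prints " A! is SMA"
        System.out.println(removeWholeTokens(text));  // prints "ART! is SMART"
    }
}
```

Note that an exact token comparison leaves something like "RT!!" untouched, so in practice punctuation would have to be stripped from the token before the comparison.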

On a related note, Apache Incubator Storm's tutorials usually use tweets as an example to demonstrate the Big Data approach. I'm not suggesting that you need such a set-up in your context, but perhaps you can give those a quick read-through to pick up some tips.

Context

StackExchange Code Review Q#47378, answer score: 2
