HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Finding the longest common substring between two tweets

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thesubstringtwobetweenfindingtweetslongestcommon

Problem

I retrieved 200 tweets using the Jersey API. I want to find two tweets which have the longest common substring.

tweetList is an ArrayList of Tweet objects. The comapreTweets method compares tweet objects for longest substring.

Tweet t1=new Tweet();Tweet t2=new Tweet();
int longestString=0;

for(int i=0;istore){
            store=result;
            comparer=tweetList.get(j);
        }}
    }
    if(longestString<store)
    {
        longestString=store;
        t1=tweetList.get(i); t2=comparer;
    }
}


If I retrieve 200 tweets then this will loop approx. 40000 times. I require a more efficient way.

Solution

The classic solution to this well-known problem is using a Suffix array (see also Suffix tree and Generalized Suffix tree which is a more general data structure for these kinds of problems.). Basically, a Suffix array is a sorted set in which every suffix of every tweet is inserted. By using the prefixes of those sorted suffixes, you can find the longest common substring by iterating through every pair of substrings from different tweets in the Suffix array.

Construction of all suffixes can be performed in O(140 k) time, where k is the the amount of tweets, and 140 is the length of a tweet. These can then be sorted in O(k * log(k)) time. Finally allowing you to find the longest common substring in O(k 140) time. Which results in roughly 57 800 operations.

The algorithm proposed in your orignal post loops through every tweet twice, and has to, worst case, compare all the characters. This leaves a total complexity of O(k^2 * 140) which is 5 600 000 operations.

Example

Imagine you have three strings:

  • ababa



  • bataba



  • dafjkl



We generate all suffixes of every string. Every suffix of the first string is:

  • ababa



  • baba



  • aba



  • ba



  • a



For the second we have:

  • batabak



  • atabak



  • tabak



  • abak



  • bak



  • ak



  • a



And finally for the third:

  • dafjkl



  • afjkl



  • fjkl



  • jkl



  • kl



  • l



If we sort all suffixes for all strings, we get the following sorted set:

  • a



  • a



  • aba



  • ababa



  • abak



  • afjkl



  • ak



  • atabak



  • ba



  • baba



  • bak



  • batabak



  • dafjkl



  • fjkl



  • jkl



  • kl



  • l



  • tabak



Thus we iterate through every pair of suffixes for different strings, to find the longest common prefix in suffixes from two different strings. This is equal to the longest common substring.

First encounter is a (origining from the 1st string) and a (2nd string), thus our longest common substring (LCS) is of length 1 and equal to a.

We go to the next pair from different strings, which is aba/ababa and abak, both of length 3. This is set as the LCS so far. If you keep going through the set using this algorithm, you will find that this is the LCS for these three strings.

I do not know Java, however, I just created a C++ solution to a very similar problem. The solution for your problem is essentially the same, you just have to generate substrings for k strings rather than two.

Context

StackExchange Code Review Q#19200, answer score: 4

Revisions (0)

No revisions yet.