HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Max repeated word in a string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
wordmaxstringrepeated

Problem

I was trying to do a very common interview problem "Finding the max repeated word in a string " and could not find much resources in net for c/c++ implementation. So I coded it myself here. I have tried to do most of the coding from scratch for better understanding. Could you review my code and provide comments on my algorithm. Some people have suggested using hashtables for storing the count, but am not using hashtables here.

#include
#include
#include
#include
#include

using namespace std;
string word[10];

//splitting string into words
int parsestr(string str)
{   
    int index = 0;
    int i = 0;
    int maxlength = str.length();
    int wordcnt = 0;
    while(i  max)
        {
            max = wordcntArr[i];
            index = i;
        }
    }

    return word[index];
}

void countwords(int count)
{
    int wordcnt = 0;
    int wordcntArr[10];
    string maxrepeatedword;
    for(int i = 0; i <= count; i++)
    {
        for(int j = 0; j <= count; j++)
        {
            if(word[i] == word[j])
            {
                wordcnt++;
                //word[j] = "";
            }
            else
            {}
        }
        cout << " word " << word[i] << " occurs " << wordcnt << " times " << endl;
        wordcntArr[i] = wordcnt;
        wordcnt = 0;
    }

    maxrepeatedword = maxrepeatedWord(wordcntArr,count);
    cout << " Max Repeated Word is " << maxrepeatedword;
}

int main()
{
    string str = "I am am am good good";
    int wordcount = 0;
    wordcount = parsestr(str);
    countwords(wordcount);
}

Solution

Here's a long list of feedback. Sorry for the blunt tone, but I'm trying to be brief:

-
parsestr mutates a global variable but returns a value that's relevant to
that variable (the count). That's inconsistent. Either have it return both
the count and the array (or make things easier on yourself and use a
vector that knows its size), or make the count global too.

-
You don't handle punctuation characters. Is that a requirement? Should

"a(b)" be one word? Should "a. a" be two unique words?

-
Do you need to split on characters other than space? What about \t or
\n? Other Unicode whitespace characters?

-
Your while(i

-
Building up the word by incrementally contatenating characters onto a
string is slow. It will periodically require dynamic allocation. A more
efficient solution would be to remember the start index of the word. When
you reach the end, create a string from the substring
(start, end) in one
step.

Going further from there, there's no reason to even store those words
separately since that start, end pair is enough to reconstruct it as needed.

-
Your program will mysteriously crash if you pass a sentence with more than
ten words. :( Since you're using dynamic allocation everywhere else (
string
does that under the hood) there's no reason to use a fixed array for words.
At the very least, you should have a constant for that size and check that
you don't overflow the array.

-
index isn't a helpful variable name. Call it wordIndex.

-
Likewise,
wordcnt would be better as wordCount.

-
maxrepeatedWord is a mixture of camelCase and all lowercase. Be consistent
(and camelCase is generally better since it makes it easier for the reader
to split it into words).

-
wordcntArr would be better as wordCounts. The fact that it's an array is
self-evident.

-
i

-
index -> indexOfMax.

-
You're passing count into countwords even though that count is relevant
to a global variable that it accesses directly. Why not just make the count
global too?

-
The else {} accomplishes nothing. Remove it.

-
If you declare wordcnt inside the for(int i... loop, you won't have to
reinitialize it each iteration.

-
int wordcntArr[10] again duplicates the magical literal 10. Use a
constant or, better, a dynamically-sized container.

-
You're redundantly recounting each word every time it occurs. With
"I am am am good good", you'll get a count array like 1 3 3 3 2 2.
If you had a collection of unique words instead, your \$O(n^2)\$ complexity
would go down to \$O(mn)\$ where \$m\$ is the number of unique words.

A hash table would be a good way to create the set of unique words.

-
Your test string assumes repeated words will be contiguous (as opposed to,
say "I am good am good am."). Is that intentional? Desirable?

At a high level, your algorithm is also not optimal. You've got \$O(n^2)\$ performance, ignoring the dynamic allocation as you incrementally build up those strings. With that, it gets worse.

I believe the canonical solution to this is (roughly):

create a map of words -> counts
set wordStart to 0
iterate through the string
    if the current character is a word delimiter
        set word to the substring from wordStart to here
        set wordStart to here
        if map contains word
            increment the count for that word
        else
            add the word to the map with a count of 1
        end
    end
end

set maxWord to null
set max to 0
iterate through the map
    if count for this word > max
        set maxWord to this word
        set max to count
    end
end

return maxWord


That gives you \$O(n)\$: you only walk the entire string once.

Code Snippets

create a map of words -> counts
set wordStart to 0
iterate through the string
    if the current character is a word delimiter
        set word to the substring from wordStart to here
        set wordStart to here
        if map contains word
            increment the count for that word
        else
            add the word to the map with a count of 1
        end
    end
end

set maxWord to null
set max to 0
iterate through the map
    if count for this word > max
        set maxWord to this word
        set max to count
    end
end

return maxWord

Context

StackExchange Code Review Q#1394, answer score: 7

Revisions (0)

No revisions yet.