patterncppModerate
Text "analyzer" in C++
Viewed 0 times
textanalyzerstackoverflow
Problem
I have few years of experience with web and I recently started learning C++ and I feel a bit lost, so I would like to ask to for some tips how to improve my code overall. This text summarizer should get the sentences with the highest weight (most important ones).
Header file:
Cpp file:
```
#include "stdafx.h"
#include "SmartlyParser.h"
double SmartAnalyzer::sentence_intersection(std::set const& a, std::set const& b)
{
std::vector common;
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(common));
return (double) common.size() / ((a.size() + b.size()) / 2);
}
std::vector SmartAnalyzer::parseSentences(std::string const& text)
{
std::vector output;
std::istringstream iss(text);
std::string token;
while (std::getline(iss, token, '.')) {
output.push_back(token);
}
return output;
}
setWords SmartAnalyzer::stringSets(std::string const& text)
{
setWords output;
std::istringstream iss(text), current;
std::string token;
while (std::getline(iss, token, '.')) {
current.clear();
current.str(token);
output.push_back(std::set((std::istream_iterator(current)), std::istream_iterator()));
}
return output;
}
std::string SmartAnalyzer::format(std::string text, bool const& includeDot)
{
text.erase(std::remove_if(text.begin(), text.end(), includeDot { return c == ',' || c == '!' || c == '"' || (includeDot && c == '.' );
Header file:
#include
#include
#include
#include
#include
#include
#include
#include
typedef std::vector> setWords;
class SmartAnalyzer
{
private:
double sentence_intersection(std::set const& a, std::set const& b);
std::vector parseSentences(std::string const& text);
setWords stringSets(std::string const& text);
std::string format(std::string text, bool const& includeDot = false);
public:
SmartAnalyzer() {}
std::string getSummary(std::string title, std::string const& text, int const& limit);
};Cpp file:
```
#include "stdafx.h"
#include "SmartlyParser.h"
double SmartAnalyzer::sentence_intersection(std::set const& a, std::set const& b)
{
std::vector common;
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(common));
return (double) common.size() / ((a.size() + b.size()) / 2);
}
std::vector SmartAnalyzer::parseSentences(std::string const& text)
{
std::vector output;
std::istringstream iss(text);
std::string token;
while (std::getline(iss, token, '.')) {
output.push_back(token);
}
return output;
}
setWords SmartAnalyzer::stringSets(std::string const& text)
{
setWords output;
std::istringstream iss(text), current;
std::string token;
while (std::getline(iss, token, '.')) {
current.clear();
current.str(token);
output.push_back(std::set((std::istream_iterator(current)), std::istream_iterator()));
}
return output;
}
std::string SmartAnalyzer::format(std::string text, bool const& includeDot)
{
text.erase(std::remove_if(text.begin(), text.end(), includeDot { return c == ',' || c == '!' || c == '"' || (includeDot && c == '.' );
Solution
Using a class instead of a namespace
This class has no data members and therefore no state, nor any invariants to protect. C++ is a "multi-paradigm" language, so you can and should choose the "paradigm" or programming style that suits best for the problem.
I do not see why a class suits well to house an algorithm, that's why I said in the comments "Your class looks like a namespace in disguise.". Even if you plan to add some state later, the current member functions are just algorithms, so you can just implement them as free functions.
If those algorithms are applicable to a wider set of problems, formulate them generically and make them "public" for others to use. If they are private member functions, others cannot reuse them. Otherwise, i.e. if they're not applicable to a wide set of problems, you can put them as free functions in the source file (.cpp) instead of declaring them in the header. In that case, make their linkage internal so their names don't collide with names of functions in other source files.
The O(n²) algorithm
You used the [optimizations] and [performance] tags. As far as I can see, most time-consuming part here is the weighting algorithm:
It should be obvious that this algorithm is in O(n²). To improve the performance of you program, you should try to come up with an algorithm that's asymptotically faster. A first improvement can be achieved by noticing that the correlation is symmetric in
As this is the most time-consuming part, this optimization halves the total run-time of the program.
Size of the intersection
This is a most astonishing piece of code. It calculates the size of the intersection between to sets by producing the intersection and then computing its size.
Unfortunately, I couldn't find a Standard Library algorithm to do this. So I split this into two generic algorithms:
(Even though I've thought that you could use binary search to speed this up, the linear approach is faster in practice; probably because the sets are small and random-access is expensive.)
Note that
Since these alg
class SmartAnalyzer
{
private:
double sentence_intersection(..);
std::vector parseSentences(..);
setWords stringSets(..);
std::string format(..);
public:
SmartAnalyzer() {}
std::string getSummary(..);
};This class has no data members and therefore no state, nor any invariants to protect. C++ is a "multi-paradigm" language, so you can and should choose the "paradigm" or programming style that suits best for the problem.
I do not see why a class suits well to house an algorithm, that's why I said in the comments "Your class looks like a namespace in disguise.". Even if you plan to add some state later, the current member functions are just algorithms, so you can just implement them as free functions.
If those algorithms are applicable to a wider set of problems, formulate them generically and make them "public" for others to use. If they are private member functions, others cannot reuse them. Otherwise, i.e. if they're not applicable to a wide set of problems, you can put them as free functions in the source file (.cpp) instead of declaring them in the header. In that case, make their linkage internal so their names don't collide with names of functions in other source files.
The O(n²) algorithm
You used the [optimizations] and [performance] tags. As far as I can see, most time-consuming part here is the weighting algorithm:
for (int i = 0; i < sentLen; i++)
{
sum = 0;
for (int j = 0; j < sentLen; j++)
{
if (i == j) { /* find intersection with the title, instead of self */
sum += sentence_intersection(sentencesC[i], titles) * 2;
continue;
}
sum += sentence_intersection(sentencesC[i], sentencesC[j]);
}
sentencesWeight.insert({ sum, i });
}It should be obvious that this algorithm is in O(n²). To improve the performance of you program, you should try to come up with an algorithm that's asymptotically faster. A first improvement can be achieved by noticing that the correlation is symmetric in
i and j. I used a std::vector to store the sentences alongside with the weighting and slightly changed the sentence_intersection algorithm, the new name is intersection_weight:for (std::size_t i = 0; i < sentLen; i++)
{
sentences[i].w += 2 * intersection_weight(sentencesC[i].begin(),
sentencesC[i].end(),
titles.begin(), titles.end());
for (auto j = i+1; j < sentLen; j++)
{
auto const res = intersection_weight(sentencesC[i].begin(),
sentencesC[i].end(),
sentencesC[j].begin(),
sentencesC[j].end());
sentences[i].w += res;
sentences[j].w += res;
}
}As this is the most time-consuming part, this optimization halves the total run-time of the program.
Size of the intersection
double SmartAnalyzer::sentence_intersection(std::set const& a,
std::set const& b)
{
std::vector common;
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
std::back_inserter(common));
return (double)common.size() / ((a.size() + b.size()) / 2);
}This is a most astonishing piece of code. It calculates the size of the intersection between to sets by producing the intersection and then computing its size.
Unfortunately, I couldn't find a Standard Library algorithm to do this. So I split this into two generic algorithms:
template
Num count_intersection(FwdIt0 beg0, FwdIt0 end0, FwdIt1 beg1, FwdIt1 end1,
Comp less, Num n)
// requires:
// [beg0, end0) is a sorted range wrt less
// [beg1, end1) is a sorted range wrt less
// less is a strict weak ordering on *beg0, *beg1 and between *beg0 and *beg1
// Num is a numerical data type
// returns:
// the size of the intersection
{
while (beg0 != end0 && beg1 != end1)
{
if (less(*beg0, *beg1)) ++beg0;
else if (less(*beg1, *beg0)) ++beg1;
else
{
++n;
++beg0;
++beg1;
}
}
return n;
}(Even though I've thought that you could use binary search to speed this up, the linear approach is faster in practice; probably because the sets are small and random-access is expensive.)
template
double intersection_weight(FwdIt0 beg0, FwdIt0 end0, FwdIt1 beg1, FwdIt1 end1)
{
auto const mid_size = 0.5 * ( std::distance(beg0, end0)
+ std::distance(beg1, end1));
auto const intsc = count_intersection(beg0, end0, beg1, end1,
std::less<>(), int());
return intsc / mid_size;
}Note that
std::less<> is a C++1y feature, but MSVC12 (2013) already supports it.Since these alg
Code Snippets
class SmartAnalyzer
{
private:
double sentence_intersection(..);
std::vector<std::string> parseSentences(..);
setWords stringSets(..);
std::string format(..);
public:
SmartAnalyzer() {}
std::string getSummary(..);
};for (int i = 0; i < sentLen; i++)
{
sum = 0;
for (int j = 0; j < sentLen; j++)
{
if (i == j) { /* find intersection with the title, instead of self */
sum += sentence_intersection(sentencesC[i], titles) * 2;
continue;
}
sum += sentence_intersection(sentencesC[i], sentencesC[j]);
}
sentencesWeight.insert({ sum, i });
}for (std::size_t i = 0; i < sentLen; i++)
{
sentences[i].w += 2 * intersection_weight(sentencesC[i].begin(),
sentencesC[i].end(),
titles.begin(), titles.end());
for (auto j = i+1; j < sentLen; j++)
{
auto const res = intersection_weight(sentencesC[i].begin(),
sentencesC[i].end(),
sentencesC[j].begin(),
sentencesC[j].end());
sentences[i].w += res;
sentences[j].w += res;
}
}double SmartAnalyzer::sentence_intersection(std::set<std::string> const& a,
std::set<std::string> const& b)
{
std::vector<std::string> common;
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
std::back_inserter(common));
return (double)common.size() / ((a.size() + b.size()) / 2);
}template<typename FwdIt0, typename FwdIt1, typename Comp, typename Num>
Num count_intersection(FwdIt0 beg0, FwdIt0 end0, FwdIt1 beg1, FwdIt1 end1,
Comp less, Num n)
// requires:
// [beg0, end0) is a sorted range wrt less
// [beg1, end1) is a sorted range wrt less
// less is a strict weak ordering on *beg0, *beg1 and between *beg0 and *beg1
// Num is a numerical data type
// returns:
// the size of the intersection
{
while (beg0 != end0 && beg1 != end1)
{
if (less(*beg0, *beg1)) ++beg0;
else if (less(*beg1, *beg0)) ++beg1;
else
{
++n;
++beg0;
++beg1;
}
}
return n;
}Context
StackExchange Code Review Q#48424, answer score: 12
Revisions (0)
No revisions yet.