Fastest way to search istringstream for patterns in around 0.02 seconds

Submitted by: @import:stackexchange-codereview

Problem

I have a stream composed of 2 columns and 1000 lines:

  • Column 1: contains the patterns that I want to find
  • Column 2: contains the values corresponding to the patterns in Column 1

I want to search for the patterns in column 1, extract their corresponding values in column 2, and save them in vectors. There are 3 kinds of pattern that I want to look for: {type, label, name}. Therefore I'll have 3 vectors, one per kind. Each vector holds pairs, where a pair represents column 1 and column 2.

vector<pair<string, string>> types
vector<pair<string, string>> labels
vector<pair<string, string>> names


I read the file sequentially. If I find the pattern that I am looking for in column 1, I extract the value in column 2 and append it to the corresponding vector.

All the other patterns will be saved in a vector called others.

vector<pair<string, string>> others


Sample

Column 1 contains the patterns that I want to look for ({type, label, name}), and Column 2 contains the corresponding values.

rdf-syntax-ns#type base.qualia.topic
rdf-syntax-ns#type common.topic
rdf-syntax-ns#type film.producer
rdf-schema#label  สตีฟ จอบส์
rdf-schema#label  ﺎﺴﺗیﻭ ﺝﺎﺑﺯ
rdf-schema#label  Styvas Džobsas
type.object.name ﺎﺴﺗیﻭ ﺝﺎﺑﺯ
type.object.name Styvas Džobsas
type.object.name Steve Jobs
type.object.name Steve Jobs


What is the fastest way to search and extract the values?

The following source code takes 0.04 seconds to read 1000 lines, find the patterns and extract their corresponding values.

```
void returnValues(const string & file,
                  vector<pair<string, string>> & types,
                  vector<pair<string, string>> & labels,
                  vector<pair<string, string>> & names,
                  vector<pair<string, string>> & others)
{
    istringstream str(file);
    string line;
    // skip first line
    getline(str, line);
    while (getline(str, line))
    {
        vector<string> values;
        // strip double quotes, then split the line on tabs
        line.erase(remove(line.begin(), line.end(), '\"'), line.end());
        boost::split(values, line, boost::is_any_of("\t"));
        if (contains(values[0], "type"))
        {
            pair<string, string> fact = make_pair(values[0], values[1]);
            types.push_back(fact);
        }
        else if (contains(values[0], "label"))
        {
            // (the listing is truncated here in the source)
```

Solution

Without profiling data, I can only guess... so here goes:
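
Since there is no profiling data to go on, a quick measurement before and after each change helps confirm whether it matters. The following is only a minimal sketch using `std::chrono::steady_clock`; `timeMillis` is a hypothetical helper name, not something from the question.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `repeats` times and returns the
// average wall-clock time per run in milliseconds.
template <class F>
double timeMillis(F&& work, int repeats) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        clock::now() - start;
    return elapsed.count() / repeats;
}
```

Wrapping the call to `returnValues` in a lambda and averaging over many repeats should give steadier numbers than a single run of about 0.04 seconds.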

At a quick glance, the biggest inefficiency I can see is that you allocate and deallocate the vector's storage for values in each iteration. This takes some time. Move the vector outside of the loop and call clear() at the head of the loop.

Like this:

vector<string> values;
while(getline(str, line))
{
   values.clear();
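
To illustrate why hoisting can help, here is a minimal sketch that refills one vector repeatedly and counts how often its capacity changes after the first fill. The function name `capacityChanges` and the sample strings are made up for this example. `clear()` destroys the elements but leaves the capacity unchanged, so the count should stay at zero.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical demo: returns how many times the vector's capacity changed
// across `rounds` refills of the same size, ignoring the first fill.
std::size_t capacityChanges(int rounds) {
    std::vector<std::string> values;  // hoisted: constructed once
    std::size_t changes = 0;
    std::size_t lastCapacity = 0;
    for (int i = 0; i < rounds; ++i) {
        values.clear();  // destroys elements, keeps allocated capacity
        values.push_back("rdf-syntax-ns#type");
        values.push_back("common.topic");
        if (i > 0 && values.capacity() != lastCapacity) {
            ++changes;
        }
        lastCapacity = values.capacity();
    }
    return changes;
}
```

Whether the same saving carries over to `boost::split` depends on how that function fills its result container, so it is worth measuring rather than assuming.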


Another thing that you can do (even though I don't believe it will affect your result significantly) is to use emplace_back instead of push_back to avoid the possibility of a copy.

Like this:

types.emplace_back(values[0], values[1]);


This constructs the pair in place and avoids the pair copy constructor (the compiler might have optimized this for you already). While we're at it, note that values is not used after this statement until it is cleared. So we can steal the strings' memory and avoid another two new/delete calls (unless your standard library implements SSO and your strings are small, in which case this is moot). To do so, move-construct the pair's members like this:

types.emplace_back(std::move(values[0]), std::move(values[1]));
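
Putting both suggestions together, a self-contained sketch might look like the following. Several parts are assumptions rather than the original code: `splitTabs` is a hypothetical stand-in for `boost::split`, `endsWith` replaces the question's `contains` helper (which is not shown), and lines with fewer than two fields are skipped. Suffix matching is used because a plain substring test for "type" would also match the key `type.object.name` from the sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Fact = std::pair<std::string, std::string>;

// Hypothetical stand-in for boost::split: fills `out` with the
// tab-separated fields of `line`, reusing the vector's capacity.
void splitTabs(const std::string& line, std::vector<std::string>& out) {
    out.clear();
    std::size_t start = 0;
    for (;;) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            out.emplace_back(line, start);  // last field
            return;
        }
        out.emplace_back(line, start, tab - start);
        start = tab + 1;
    }
}

// Assumption: a key belongs to a category when it ends with that word.
bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void returnValuesSketch(const std::string& file,
                        std::vector<Fact>& types,
                        std::vector<Fact>& labels,
                        std::vector<Fact>& names,
                        std::vector<Fact>& others) {
    std::istringstream str(file);
    std::string line;
    std::vector<std::string> values;  // hoisted out of the loop
    std::getline(str, line);          // skip first line
    while (std::getline(str, line)) {
        line.erase(std::remove(line.begin(), line.end(), '"'), line.end());
        splitTabs(line, values);
        if (values.size() < 2) {
            continue;  // assumption: ignore malformed lines
        }
        std::vector<Fact>& target =
            endsWith(values[0], "type")  ? types :
            endsWith(values[0], "label") ? labels :
            endsWith(values[0], "name")  ? names : others;
        // values is refilled on the next iteration, so moving is safe
        target.emplace_back(std::move(values[0]), std::move(values[1]));
    }
}
```

Selecting the target vector first keeps a single `emplace_back` call, so the move happens in exactly one place.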

Context

StackExchange Code Review Q#94771, answer score: 7
