Reading awkward text fields

Submitted by: @import:stackexchange-codereview··

Problem

I have written nested while loops to read in some of the columns in a text file that looks like this:

comp id subreddit created ranks recorded_at rank_length
0 3ckf7b pics 1436373189.0 [1, 1, 3, 5] [1436392502, 1436396101, 1436399701, 1436403301] 4.0
0 3csv79 UpliftingNews 1436538581.0 [16, 24] [1436558101, 1436594101] 2.0
0 3ccx4y gifs 1436223351.0 [6, 7, 7, 10] [1436259301, 1436262901, 1436266501, 1436273701] 4.0
0 3bldf2 todayilearned 1435636909.0 [4, 3] [1435665301, 1435668901] 2.0
0 3acrl2 pics 1434677487.0 [0, 0, 0, 4] [1434686101, 1434689701, 1434693301, 1434714901] 4.0
0 3cosrl space 1436457300.0 [22, 16, 15, 14, 15, 17, 15, 18, 18] [1436489702, 1436493301, 1436496901, 1436500501, 1436504101, 1436507701, 1436511301, 1436518501, 1436522101] 9.0
0 3d2m5l pics 1436748860.0 [6] [1436781302] 1.0
0 3b5ll4 nottheonion 1435291130.0 [14, 14, 17] [1435326901, 1435330501, 1435334101] 3.0
0 3a7l67 Showerthoughts 1434575878.0 [16, 13] [1434617702, 1434628502] 2.0


I'm parsing/saving the subreddit, ranks, and recorded columns for later use. Here's what I'm doing, but I suspect this could be a lot tighter.

std::ifstream infile("data.tsv");
std::string line;
bool first(true);
std::vector vec;

std::map> SubjectList;
std::map>> SubjectTraits;
int column_counter = 0;

while ( std::getline(infile, line) ) {
std::vector subvec;
std::string i, letter;
std::istringstream iss(line);
std::vector stringvec1;
std::vector stringvec2;
std::string keyString;

// Reads in info from data.tsv, where useful data is in columns
// 2 (reddit thread name), 3 (timestamp),
// 4 (ranks recorded for a given thread), 5 (time ranks recorded)
while (iss >> i) {
if ( column_counter == 2 ) { // Column 2 contains name of subreddit to which a particular thread (row) belongs
keyS

Solution

I see a number of things that I think could help you improve your code.

Separate parsing from processing

One reason your code seems a little more complex than it needs to be is that it's doing both parsing of input data and processing it into your own custom data structures. If the input file structure changes, even slightly, you'll need to rework this whole code. What would be simpler is to separate the parsing and processing functions into two (or more) functions.
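
As a rough illustration of that split, here is a minimal sketch. The names `Row`, `parse_rows`, and `count_by_subreddit` are hypothetical and not from your code, and the parser assumes the layout shown in the question (subreddit in the third column, `rank_length` in the last). Only the parsing function knows about the file layout; the processing function only sees `Row` objects.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record holding just two of the columns, for illustration.
struct Row {
    std::string subreddit;
    double rank_length;
};

// Parsing stage: the only code that knows about the file layout.
// Assumes the last column is numeric; std::stod throws otherwise.
std::vector<Row> parse_rows(std::istream &in) {
    std::vector<Row> rows;
    std::string line;
    std::getline(in, line);  // skip the header line
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::vector<std::string> tokens;
        for (std::string tok; iss >> tok; )
            tokens.push_back(tok);
        if (tokens.size() < 4)
            continue;  // skip blank or obviously short lines
        rows.push_back({tokens[2], std::stod(tokens.back())});
    }
    return rows;
}

// Processing stage: works on Row objects, not on text.
std::map<std::string, int> count_by_subreddit(const std::vector<Row> &rows) {
    std::map<std::string, int> counts;
    for (const auto &row : rows)
        ++counts[row.subreddit];
    return counts;
}
```

If the file format changes, only `parse_rows` should need to change; anything built on `Row` stays as it is.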

Use a custom data structure

Because you're interested in just a few fields from the input file, I'd suggest creating a custom structure that would encapsulate just the fields of interest, and then write a custom extractor. The sketch of it would be something like this:

struct RedditRank
{
    friend std::istream &operator>>(std::istream &in, RedditRank &r) { /* code */ };
    std::string threadname;
    double timestamp;
    std::vector<int> ranks;
    std::vector<int> times;
};


That way, once the function was done, you could use the member fields and do whatever processing was required.

Use std::regex to simplify parsing

The std::regex class and its related functions can greatly simplify parsing of regular, machine-generated data such as you've got. For example, here's a function that takes a string such as "32, 9, 10" and converts it into a std::vector<int>:

std::vector<int> getvect(const std::string &s) {
    std::vector<int> v;
    static const std::regex re{R"x((\d+),?\s*)x"};
    auto begin = std::sregex_iterator(s.begin(), s.end(), re);
    auto end = std::sregex_iterator();
    for (auto i = begin; i != end; ++i) {
        std::smatch m = *i;
        v.emplace_back(std::stoi(m.str()));
    }
    return v;
}


Note that I have used a raw string literal (the R"x(...)x" form) so that the backslashes in the regex don't have to be doubled, which makes it simpler to write and to read. If you're not familiar with std::regex, you could start here.
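
To make the raw string point concrete, here is a small sketch that writes the same pattern both ways. The names `raw_re`, `escaped_re`, and `count_matches` are hypothetical and exist only for this comparison.

```cpp
#include <iterator>
#include <regex>
#include <string>

// The same pattern, written as a raw string literal and as an ordinary one.
// In the ordinary literal every backslash has to be doubled.
const std::regex raw_re{R"x((\d+),?\s*)x"};
const std::regex escaped_re{"(\\d+),?\\s*"};

// Counts the non-overlapping matches of re in s.
int count_matches(const std::string &s, const std::regex &re) {
    auto begin = std::sregex_iterator(s.begin(), s.end(), re);
    auto end = std::sregex_iterator();
    return static_cast<int>(std::distance(begin, end));
}
```

Both regexes should find the same three numbers in "32, 9, 10"; the raw form simply avoids the extra layer of escaping.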

Finishing up

All that remains is to supply the code for the istream extractor mentioned in the outline. Here is how I'd do that:

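friend std::istream &operator>>(std::istream &in, RedditRank &r) {
    std::smatch m;
    // Skip the first two columns, then capture the third column, the
    // timestamp, and the contents of the two bracketed lists.
    static const std::regex re{R"x(\S+\s+\S+\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\[([^\]]+)\])x"};
    std::string line;
    std::getline(in, line);
    std::regex_search(line, m, re);
    if (m.size() != 5) {  // whole match plus four capture groups
        in.setstate(std::ios::failbit);
    } else {
        r.threadname = m[1];
        r.timestamp = std::stod(m[2]);
        // r.getvect compiles only if getvect is a member of RedditRank,
        // for example declared static inside the struct
        r.ranks = r.getvect(m[3]);
        r.times = r.getvect(m[4]);
    }
    return in;
}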
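
To see what each capture group holds, the line regex can be tried on its own. This is a minimal sketch; `capture_fields` is a hypothetical helper that is not part of the struct, and it reuses the pattern from the extractor unchanged.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the four captured fields of one data line,
// or an empty vector if the line does not match the expected layout.
std::vector<std::string> capture_fields(const std::string &line) {
    static const std::regex re{R"x(\S+\s+\S+\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\[([^\]]+)\])x"};
    std::smatch m;
    std::vector<std::string> fields;
    if (std::regex_search(line, m, re)) {
        // m[0] is the whole match; groups 1..4 are the fields of interest.
        for (std::size_t i = 1; i < m.size(); ++i)
            fields.push_back(m[i].str());
    }
    return fields;
}
```

For the UpliftingNews line from the question this should give the subreddit name, the timestamp text, and the two list bodies without their brackets. The header line contains no brackets, so it does not match, which is why it has to be skipped before extracting records.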


Just to be pedantic, we can go ahead and write a stream inserter, too, which can be useful for troubleshooting. It's not brilliant, but sufficient:

friend std::ostream &operator<<(std::ostream &out, const RedditRank &r) {
    out << r.threadname << '\t'
        << r.timestamp << "\t[";
        for (const auto n : r.ranks) 
            out << n << ',';
        out << "]\t[";
        for (const auto n : r.times) 
            out << n << ',';
        return out << "]";
}


Now all we need is a test program. I'm going to trust that once you have this structure, you can do the processing with your own Hist2D class and friends. Here's a simple test program:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <memory>
#include <regex>

// above listed RedditRank class goes here

int main()
{
    std::ifstream infile("data.tsv");
    RedditRank r;
    std::string line;
    std::getline(infile, line);  // burn off header line
    while (infile >> r) {
        std::cout << r << std::endl;
    }
}


Here all I'm doing is reading in the structures and printing them again. Your routine would sling the contents into your own data structures instead of printing them.
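
As one possible shape for that last step, here is a minimal sketch that groups ranks by subreddit. The function `ranks_by_subreddit` and the choice of a map from name to a flat vector are assumptions for illustration, not your actual data structures, and the struct below is a trimmed copy without the stream operators so the example stands alone.

```cpp
#include <map>
#include <string>
#include <vector>

// Trimmed copy of the struct above, without the stream operators.
struct RedditRank {
    std::string threadname;  // filled from the subreddit column
    double timestamp;
    std::vector<int> ranks;
    std::vector<int> times;
};

// Hypothetical processing step: collect every rank seen for each subreddit.
std::map<std::string, std::vector<int>>
ranks_by_subreddit(const std::vector<RedditRank> &records) {
    std::map<std::string, std::vector<int>> result;
    for (const auto &r : records) {
        auto &dest = result[r.threadname];
        dest.insert(dest.end(), r.ranks.begin(), r.ranks.end());
    }
    return result;
}
```

In the test program, the loop body would push each `r` into a `std::vector<RedditRank>` instead of printing it, and that vector would then be handed to a function like this one.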

Code Snippets

struct RedditRank
{
    friend std::istream &operator>>(std::istream &in, RedditRank &r) { /* code */ };
    std::string threadname;
    double timestamp;
    std::vector<int> ranks;
    std::vector<int> times;
};
std::vector<int> getvect(const std::string &s) {
    std::vector<int> v;
    static const std::regex re{R"x((\d+),?\s*)x"};
    auto begin = std::sregex_iterator(s.begin(), s.end(), re);
    auto end = std::sregex_iterator();
    for (auto i = begin; i != end; ++i) {
        std::smatch m = *i;
        v.emplace_back(std::stoi(m.str()));
    }
    return v;
}
friend std::istream &operator>>(std::istream &in, RedditRank &r) {
    std::smatch m;
    static const std::regex re{R"x(\S+\s+\S+\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\[([^\]]+)\])x"}; 
    std::string line;
    std::getline(in, line);
    std::regex_search(line, m, re);
    if (m.size() != 5) {
        in.setstate(std::ios::failbit);
    } else {
        r.threadname = m[1];
        r.timestamp = std::stod(m[2]);
        r.ranks = r.getvect(m[3]);
        r.times = r.getvect(m[4]);
    }
    return in;
}
friend std::ostream &operator<<(std::ostream &out, const RedditRank &r) {
    out << r.threadname << '\t'
        << r.timestamp << "\t[";
        for (const auto n : r.ranks) 
            out << n << ',';
        out << "]\t[";
        for (const auto n : r.times) 
            out << n << ',';
        return out << "]";
}
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <memory>
#include <regex>

// above listed RedditRank class goes here

int main()
{
    std::ifstream infile("data.tsv");
    RedditRank r;
    std::string line;
    std::getline(infile, line);  // burn off header line
    while (infile >> r) {
        std::cout << r << std::endl;
    }
}

Context

StackExchange Code Review Q#102528, answer score: 6
