Reading awkward text fields
Problem
I have written nested while loops to read in some of the columns in a text file that looks like this:
comp id subreddit created ranks recorded_at rank_length
0 3ckf7b pics 1436373189.0 [1, 1, 3, 5] [1436392502, 1436396101, 1436399701, 1436403301] 4.0
0 3csv79 UpliftingNews 1436538581.0 [16, 24] [1436558101, 1436594101] 2.0
0 3ccx4y gifs 1436223351.0 [6, 7, 7, 10] [1436259301, 1436262901, 1436266501, 1436273701] 4.0
0 3bldf2 todayilearned 1435636909.0 [4, 3] [1435665301, 1435668901] 2.0
0 3acrl2 pics 1434677487.0 [0, 0, 0, 4] [1434686101, 1434689701, 1434693301, 1434714901] 4.0
0 3cosrl space 1436457300.0 [22, 16, 15, 14, 15, 17, 15, 18, 18] [1436489702, 1436493301, 1436496901, 1436500501, 1436504101, 1436507701, 1436511301, 1436518501, 1436522101] 9.0
0 3d2m5l pics 1436748860.0 [6] [1436781302] 1.0
0 3b5ll4 nottheonion 1435291130.0 [14, 14, 17] [1435326901, 1435330501, 1435334101] 3.0
0 3a7l67 Showerthoughts 1434575878.0 [16, 13] [1434617702, 1434628502] 2.0
I'm parsing/saving the subreddit, ranks, and recorded columns for later use. Here's what I'm doing, but I suspect this could be a lot tighter.
std::ifstream infile("data.tsv");
std::string line;
bool first(true);
std::vector vec;
std::map> SubjectList;
std::map>> SubjectTraits;
int column_counter = 0;
while ( std::getline(infile, line) ) {
std::vector subvec;
std::string i, letter;
std::istringstream iss(line);
std::vector stringvec1;
std::vector stringvec2;
std::string keyString;
// Reads in info from data.tsv, where useful data is in columns
// 2 (reddit thread name), 3 (timestamp),
// 4 (ranks recorded for a given thread), 5 (time ranks recorded)
while (iss >> i) {
if ( column_counter == 2 ) { // Column 2 contains name of subreddit to which a particular thread (row) belongs
keyS
Solution
I see a number of things that I think could help you improve your code.
Separate parsing from processing
One reason your code seems a little more complex than it needs to be is that it's doing both parsing of input data and processing it into your own custom data structures. If the input file structure changes, even slightly, you'll need to rework this whole code. What would be simpler is to separate the parsing and processing functions into two (or more) functions.
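As a rough illustration of that split, here is a minimal sketch in which the parsing functions know about the file layout and the processing function knows only about already-parsed rows. The names `Row`, `parseRow`, `readRows` and `countBySubreddit` are hypothetical, and only the leading scalar columns are read; the bracketed lists are ignored here.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type holding only two of the scalar columns.
struct Row {
    std::string subreddit;
    double created = 0.0;
};

// Parsing: turn one whitespace-separated line into a Row.
// Returns false if the first four columns cannot be read.
bool parseRow(const std::string &line, Row &row) {
    std::istringstream iss(line);
    std::string comp, id;  // columns 0 and 1 are read and discarded
    return static_cast<bool>(iss >> comp >> id >> row.subreddit >> row.created);
}

// Parsing: read every well-formed row from a stream, skipping the header.
std::vector<Row> readRows(std::istream &in) {
    std::vector<Row> rows;
    std::string line;
    std::getline(in, line);  // discard the header line
    while (std::getline(in, line)) {
        Row row;
        if (parseRow(line, row)) rows.push_back(row);
    }
    return rows;
}

// Processing: count threads per subreddit. Knows nothing about the file format.
std::map<std::string, std::size_t> countBySubreddit(const std::vector<Row> &rows) {
    std::map<std::string, std::size_t> counts;
    for (const auto &row : rows) ++counts[row.subreddit];
    return counts;
}
```

If the column order changed, only `parseRow` would need to be touched; `countBySubreddit` would stay as it is.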
Use a custom data structure
Because you're interested in just a few fields from the input file, I'd suggest creating a custom structure that would encapsulate just the fields of interest, and then write a custom extractor. The sketch of it would be something like this:
struct RedditRank
{
    friend std::istream &operator>>(std::istream &in, RedditRank &r) { /* code */ };
    std::string threadname;
    double timestamp;
    std::vector<int> ranks;
    std::vector<int> times;
};
That way, once the function was done, you could use the member fields and do whatever processing was required.
Use std::regex to simplify parsing
std::regex and its related functions can greatly simplify the parsing of regular, machine-generated data such as you've got. For example, here's a function that takes a string such as "32, 9, 10" and converts it into a std::vector<int>:
std::vector<int> getvect(const std::string &s) {
    std::vector<int> v;
    // one run of digits, optionally followed by a comma and whitespace
    static const std::regex re{R"x((\d+),?\s*)x"};
    auto begin = std::sregex_iterator(s.begin(), s.end(), re);
    auto end = std::sregex_iterator();
    for (auto i = begin; i != end; ++i) {
        std::smatch m = *i;
        // std::stoi stops at the first non-digit, so the trailing ", " is ignored
        v.emplace_back(std::stoi(m.str()));
    }
    return v;
}
Note that I have used a raw string literal to make the regex simpler to write and to read. If you're not familiar with std::regex, you could start here.
Finishing up
All that remains is to supply the code for the istream extractor mentioned in the outline. Here is how I'd do that:
friend std::istream &operator>>(std::istream &in, RedditRank &r) {
    std::smatch m;
    // skip two columns, then capture subreddit, timestamp and the two bracketed lists
    static const std::regex re{R"x(\S+\s+\S+\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\[([^\]]+)\])x"};
    std::string line;
    std::getline(in, line);
    std::regex_search(line, m, re);
    if (m.size() != 5) {
        // no match (or a failed read): report it through the stream state
        in.setstate(std::ios::failbit);
    } else {
        r.threadname = m[1];
        r.timestamp = std::stod(m[2]);
        // calling r.getvect assumes getvect is a member of RedditRank
        r.ranks = r.getvect(m[3]);
        r.times = r.getvect(m[4]);
    }
    return in;
}
Just to be pedantic, we can go ahead and write a stream inserter, too, which can be useful for troubleshooting. It's not brilliant, but sufficient:
friend std::ostream &operator<<(std::ostream &out, const RedditRank &r) {
    out << r.threadname << '\t'
        << r.timestamp << "\t[";
    for (const auto n : r.ranks)
        out << n << ',';
    out << "]\t[";
    for (const auto n : r.times)
        out << n << ',';
    return out << "]";
}
Now all we need is a test script. I'm going to trust that once you have this structure, you can do the processing with your own Hist2D class and friends. Here's a simple test script:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <memory>
#include <regex>
// above listed RedditRank class goes here
int main()
{
    std::ifstream infile("data.tsv");
    RedditRank r;
    std::string line;
    std::getline(infile, line); // burn off header line
    while (infile >> r) {
        std::cout << r << std::endl;
    }
}
Here all I'm doing is reading in the structures and printing them again. Your routine would sling the contents into your own data structures instead of printing them.
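For reference, the pieces above can be assembled into one compilable unit roughly as follows. This is a sketch under a few assumptions rather than the definitive version: `getvect` is made a static member (which is what a call like `r.getvect(...)` requires), the digit regex is simplified to `\d+`, the captures are converted with `.str()`, and a failed `std::getline` returns early. It also assumes every number fits in an `int`, and `std::stod` would still throw on a non-numeric timestamp.

```cpp
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

struct RedditRank {
    std::string threadname;
    double timestamp = 0.0;
    std::vector<int> ranks;
    std::vector<int> times;

    // Assumption: a static member, so it is visible inside the friend below.
    // Collects every run of digits in s as an int.
    static std::vector<int> getvect(const std::string &s) {
        std::vector<int> v;
        static const std::regex re{R"x(\d+)x"};
        for (auto i = std::sregex_iterator(s.begin(), s.end(), re);
             i != std::sregex_iterator(); ++i) {
            v.push_back(std::stoi(i->str()));
        }
        return v;
    }

    // Reads one line; sets failbit if the line does not match the expected layout.
    friend std::istream &operator>>(std::istream &in, RedditRank &r) {
        static const std::regex re{
            R"x(\S+\s+\S+\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\[([^\]]+)\])x"};
        std::string line;
        std::smatch m;
        if (!std::getline(in, line)) return in;  // getline already set the stream state
        if (!std::regex_search(line, m, re)) {
            in.setstate(std::ios::failbit);      // malformed row
            return in;
        }
        r.threadname = m[1].str();
        r.timestamp = std::stod(m[2].str());
        r.ranks = getvect(m[3].str());
        r.times = getvect(m[4].str());
        return in;
    }
};
```

Because the extractor works on any `std::istream`, it can be exercised against a `std::istringstream` holding a few sample rows before pointing it at the real file.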
Code Snippets
struct RedditRank
{
friend std::istream &operator>>(std::istream &in, RedditRank &r) { /* code */ };
std::string threadname;
double timestamp;
std::vector<int> ranks;
std::vector<int> times;
};

std::vector<int> getvect(const std::string &s) {
std::vector<int> v;
static const std::regex re{R"x((\d+),?\s*)x"};
auto begin = std::sregex_iterator(s.begin(), s.end(), re);
auto end = std::sregex_iterator();
for (auto i = begin; i != end; ++i) {
std::smatch m = *i;
v.emplace_back(std::stoi(m.str()));
}
return v;
}

friend std::istream &operator>>(std::istream &in, RedditRank &r) {
std::smatch m;
static const std::regex re{R"x(\S+\s+\S+\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\[([^\]]+)\])x"};
std::string line;
std::getline(in, line);
std::regex_search(line, m, re);
if (m.size() != 5) {
in.setstate(std::ios::failbit);
} else {
r.threadname = m[1];
r.timestamp = std::stod(m[2]);
r.ranks = r.getvect(m[3]);
r.times = r.getvect(m[4]);
}
return in;
}

friend std::ostream &operator<<(std::ostream &out, const RedditRank &r) {
out << r.threadname << '\t'
<< r.timestamp << "\t[";
for (const auto n : r.ranks)
out << n << ',';
out << "]\t[";
for (const auto n : r.times)
out << n << ',';
return out << "]";
}

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <memory>
#include <regex>
// above listed RedditRank class goes here
int main()
{
std::ifstream infile("data.tsv");
RedditRank r;
std::string line;
std::getline(infile, line); // burn off header line
while (infile >> r) {
std::cout << r << std::endl;
}
}
Context
StackExchange Code Review Q#102528, answer score: 6