patterngoMinor
Summing chemical properties in a CSV stream
Viewed 0 times
streamsummingpropertiescsvchemical
Problem
I have written my first program in go. It's rewritten from python.
Program takes csv file, emulates query
It's written messy, but runs 100x faster than same program written in python.
Here is full code.
As you can see, there are a lot of repeating constructions.
What is reasonable way to rid off from constructions like?
And.. How efficient this part in terms of memory usage?
Full code list:
``
Usage:
chemical
chemical
chemical [--chemical=] [--facility=] [--media=] [--restore=]
chemical --help
Options:
--chemical= Query by chemical number
--facility= Query by
Program takes csv file, emulates query
SELECT X, Y, SUM(Conc), SUM(ToxConc) FROM very_big_table GROUP BY X, Y. And runs on terabytes of data. It's written messy, but runs 100x faster than same program written in python.
Here is full code.
As you can see, there are a lot of repeating constructions.
What is reasonable way to rid off from constructions like?
if has_chemical_numbers {
CN := record[4]
i, _ := strconv.Atoi(CN)
if !take_it {
take_it = chemical_numbers[i]
}
}And.. How efficient this part in terms of memory usage?
//Aggregating results
for {
record, err := r2.Read()
if err == io.EOF {
break
}
X, _ := strconv.Atoi(record[0])
Y, _ := strconv.Atoi(record[1])
Conc, _ := strconv.ParseFloat(record[2], 32)
ToxConc, _ := strconv.ParseFloat(record[3], 32)
var key Coordinates
key.X = X
key.Y = Y
conc := results[key]
conc.Conc += Conc
conc.ToxConc += ToxConc
results[key] = conc
}Full code list:
``
package main
import (
"encoding/csv"
"fmt"
"github.com/docopt/docopt-go"
"io"
"log"
"os"
"strconv"
"strings"
)
type Coordinates struct {
X int
Y int
}
type SConc struct {
Conc float64
ToxConc float64
}
func raw_to_map(param interface{}) map[int]bool {
numbers := make(map[int]bool)
if param == nil {
return numbers
}
_numbers := strings.Split(param.(string), ",")
for _, i := range _numbers {
j, err := strconv.Atoi(i)
if err != nil {
panic(err)
}
numbers[j] = true
}
fmt.Println(numbers)
return numbers
}
func main() {
usage := Chemical Sum.Usage:
chemical
chemical
chemical [--chemical=] [--facility=] [--media=] [--restore=]
chemical --help
Options:
--chemical= Query by chemical number
--facility= Query by
Solution
Structure
Functions..... you need more of them.
Your large, monolithic main method makes your code hard to follow, and separating out functions would help a lot.
Oh, and errors ... you need to handle them.
For a start, take this block:
And put it all in to a function:
Now your main function is (with error handling ....):
Right, options gone...
Filtering
Let's look at the filtering now.... and, remember, in Go, functions are "first class" which means they can be a type, or variable.
Create a filter type, which is a function that takes a string slice input, and returns a bool. "Do we filter this string-slice?" Let's take our arguments, and create a system that returns a Filter for each type of supplied argument...
OK, so we have a method, that takes your arguments, and returns a function that filters a CSV record if the right field is matching an argument.
Now, our main method can have:
Even the bove can be simplified with a function to conditionally do the append...
Now, your main method can be:
OK we have some filters (perhaps none, though - it may be empty). Let's make a function that simplifies some logic:
Time for a change in direction...
Double-Read
Why are you performing 2 read loops? One loop filters the data, and stores it in a temp file. The other loop reads the temp file, and aggregates the data.
You only need 1 loop (and then it really will be "streaming"...)
Let's simplify the loops. First up, discarding the header line... don't do that inside the loop, do it as a special case, outside the loop. Also, put your error handling after the statement that produces an error, don't put other logic in between.... this:
should be:
Now, discard the header....
Then, have your loop, and, filter the records... But, let's get our aggregation system set up first, too....
That's pretty much how it should look...
Functions..... you need more of them.
Your large, monolithic main method makes your code hard to follow, and separating out functions would help a lot.
Oh, and errors ... you need to handle them.
For a start, take this block:
usage := `Chemical Sum.
Usage:
chemical
chemical
chemical [--chemical=] [--facility=] [--media=] [--restore=]
chemical --help
Options:
--chemical= Query by chemical number
--facility= Query by facility number
--media= Query by media
--restore= Restore from file
--help Show this screen
`
arguments, _ := docopt.Parse(usage, nil, true, "Naval Fate 2.0", false)And put it all in to a function:
func handleCommandLine() (map[string]interface{}, error) {
....
}Now your main function is (with error handling ....):
func main() {
arguments, err := handleCommandLine()
if err != nil {
log.Fatal(err)
}Right, options gone...
Filtering
Let's look at the filtering now.... and, remember, in Go, functions are "first class" which means they can be a type, or variable.
type Filter func([]string) boolCreate a filter type, which is a function that takes a string slice input, and returns a bool. "Do we filter this string-slice?" Let's take our arguments, and create a system that returns a Filter for each type of supplied argument...
func BuildFilter(option string, field int, arguments map[string]interface{}) Filter {
toadd := arguments[option]
if len(toadd) == 0 {
// nothing specified, no filtering.
return nil
}
mapall := map[string]bool
for _, val := range toadd {
mapall[val] = true
}
// OK, now for the magic. Return a function!
return func(row []string) bool {
data := row[field]
return mapall[data]
}
}OK, so we have a method, that takes your arguments, and returns a function that filters a CSV record if the right field is matching an argument.
Now, our main method can have:
// start with an empty slice of filters.
filters := []Filter{}
if filt := BuildFilter("--chemical", 4, arguments); filt != nil {
filters = append(filters, filt)
}
if filt := BuildFilter("--facility", 5, arguments); filt != nil {
filters = append(filters, filt)
}
if filt := BuildFilter("--media", 6, arguments); filt != nil {
filters = append(filters, filt)
}Even the bove can be simplified with a function to conditionally do the append...
func CheckFilter(filt Filter, filters []Filter) []Filter {
if filt != nil {
return append (filters, filt)
}
return filters
}Now, your main method can be:
filters := []Filter{}
filters = CheckFilter(BuildFilter("--chemical", 4, arguments), filters)
filters = CheckFilter(BuildFilter("--facility", 5, arguments), filters)
filters = CheckFilter(BuildFilter("--media", 6, arguments), filters)OK we have some filters (perhaps none, though - it may be empty). Let's make a function that simplifies some logic:
func KeepRecord(record []string, filters []Filter) bool {
if len(filters) == 0 {
return true
}
for _, filt := range filters {
if filt(record) {
return true;
}
}
return false
}Time for a change in direction...
Double-Read
Why are you performing 2 read loops? One loop filters the data, and stores it in a temp file. The other loop reads the temp file, and aggregates the data.
You only need 1 loop (and then it really will be "streaming"...)
Let's simplify the loops. First up, discarding the header line... don't do that inside the loop, do it as a special case, outside the loop. Also, put your error handling after the statement that produces an error, don't put other logic in between.... this:
rf, err := os.Open(input_file)
r := csv.NewReader(rf)
if err != nil {
log.Fatal("Error: %s", err)
}should be:
rf, err := os.Open(input_file)
if err != nil {
log.Fatal("Error: %s", err)
}
r := csv.NewReader(rf)Now, discard the header....
if _, err := r.Read(); err != nil {
log.Fatalf("Unable to read the header: %v", err)
}Then, have your loop, and, filter the records... But, let's get our aggregation system set up first, too....
results := make(map[Coordinates]SConc)
for {
record, err := r.Read()
if err == io.EOF {
break
}
if err != nil {
log.Fatal(err)
}
if !KeepRecord(record, filters) {
continue
}
X, _ := strconv.Atoi(record[1])
Y, _ := strconv.Atoi(record[2])
Conc, _ := strconv.ParseFloat(record[7], 32)
ToxConc, _ := strconv.ParseFloat(record[8], 32)
var key Coordinates
key.X = X
key.Y = Y
conc := results[key]
conc.Conc += Conc
conc.ToxConc += ToxConc
results[key] = conc
}That's pretty much how it should look...
Code Snippets
usage := `Chemical Sum.
Usage:
chemical
chemical <input_file>
chemical <input_file> <output_file> [--chemical=<ch>] [--facility=<fc>] [--media=<md>] [--restore=<file>]
chemical --help
Options:
--chemical=<ch> Query by chemical number
--facility=<fc> Query by facility number
--media=<md> Query by media
--restore=<file> Restore from file
--help Show this screen
`
arguments, _ := docopt.Parse(usage, nil, true, "Naval Fate 2.0", false)func handleCommandLine() (map[string]interface{}, error) {
....
}func main() {
arguments, err := handleCommandLine()
if err != nil {
log.Fatal(err)
}type Filter func([]string) boolfunc BuildFilter(option string, field int, arguments map[string]interface{}) Filter {
toadd := arguments[option]
if len(toadd) == 0 {
// nothing specified, no filtering.
return nil
}
mapall := map[string]bool
for _, val := range toadd {
mapall[val] = true
}
// OK, now for the magic. Return a function!
return func(row []string) bool {
data := row[field]
return mapall[data]
}
}Context
StackExchange Code Review Q#124444, answer score: 6
Revisions (0)
No revisions yet.