HiveBrain v1.2.0
Get Started
← Back to all entries
patterngoMinor

Summing chemical properties in a CSV stream

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
streamsummingpropertiescsvchemical

Problem

I have written my first program in go. It's rewritten from python.

Program takes csv file, emulates query SELECT X, Y, SUM(Conc), SUM(ToxConc) FROM very_big_table GROUP BY X, Y. And runs on terabytes of data.

It's written messy, but runs 100x faster than same program written in python.

Here is full code.

As you can see, there are a lot of repeating constructions.

What is reasonable way to rid off from constructions like?

if has_chemical_numbers {
    CN := record[4]
    i, _ := strconv.Atoi(CN)
    if !take_it {
        take_it = chemical_numbers[i]
    }
}


And.. How efficient this part in terms of memory usage?

//Aggregating results
for {
    record, err := r2.Read()
    if err == io.EOF {
        break
    }

    X, _ := strconv.Atoi(record[0])
    Y, _ := strconv.Atoi(record[1])
    Conc, _ := strconv.ParseFloat(record[2], 32)
    ToxConc, _ := strconv.ParseFloat(record[3], 32)

    var key Coordinates
    key.X = X
    key.Y = Y

    conc := results[key]
    conc.Conc += Conc
    conc.ToxConc += ToxConc

    results[key] = conc
}


Full code list:

``
package main

import (
"encoding/csv"
"fmt"
"github.com/docopt/docopt-go"
"io"
"log"
"os"
"strconv"
"strings"
)

type Coordinates struct {
X int
Y int
}

type SConc struct {
Conc float64
ToxConc float64
}

func raw_to_map(param interface{}) map[int]bool {
numbers := make(map[int]bool)
if param == nil {
return numbers
}
_numbers := strings.Split(param.(string), ",")
for _, i := range _numbers {
j, err := strconv.Atoi(i)
if err != nil {
panic(err)
}
numbers[j] = true
}
fmt.Println(numbers)
return numbers
}

func main() {

usage :=
Chemical Sum.
Usage:
chemical
chemical
chemical [--chemical=] [--facility=] [--media=] [--restore=]
chemical --help

Options:
--chemical= Query by chemical number
--facility= Query by

Solution

Structure

Functions..... you need more of them.

Your large, monolithic main method makes your code hard to follow, and separating out functions would help a lot.

Oh, and errors ... you need to handle them.

For a start, take this block:

usage := `Chemical Sum.
Usage:
    chemical
    chemical 
    chemical   [--chemical=] [--facility=] [--media=] [--restore=]
    chemical --help

Options:
    --chemical=      Query by chemical number
    --facility=      Query by facility number
    --media=         Query by media
    --restore=     Restore from file
    --help               Show this screen
    `
    arguments, _ := docopt.Parse(usage, nil, true, "Naval Fate 2.0", false)


And put it all in to a function:

func handleCommandLine() (map[string]interface{}, error) {
    ....
}


Now your main function is (with error handling ....):

func main() {
    arguments, err := handleCommandLine()
    if err != nil {
        log.Fatal(err)
    }


Right, options gone...
Filtering

Let's look at the filtering now.... and, remember, in Go, functions are "first class" which means they can be a type, or variable.

type Filter func([]string) bool


Create a filter type, which is a function that takes a string slice input, and returns a bool. "Do we filter this string-slice?" Let's take our arguments, and create a system that returns a Filter for each type of supplied argument...

func BuildFilter(option string, field int, arguments map[string]interface{}) Filter {
    toadd := arguments[option]
    if len(toadd) == 0 {
        // nothing specified, no filtering.
        return nil
    }
    mapall := map[string]bool
    for _, val := range toadd {
        mapall[val] = true
    }

    // OK, now for the magic. Return a function!

    return func(row []string) bool {
        data := row[field]
        return mapall[data]
    }
}


OK, so we have a method, that takes your arguments, and returns a function that filters a CSV record if the right field is matching an argument.

Now, our main method can have:

// start with an empty slice of filters.
filters := []Filter{}
if filt := BuildFilter("--chemical", 4, arguments); filt != nil {
    filters = append(filters, filt)
}
if filt := BuildFilter("--facility", 5, arguments); filt != nil {
    filters = append(filters, filt)
}
if filt := BuildFilter("--media", 6, arguments); filt != nil {
    filters = append(filters, filt)
}


Even the bove can be simplified with a function to conditionally do the append...

func CheckFilter(filt Filter, filters []Filter) []Filter {
    if filt != nil {
        return append (filters, filt)
    }
    return filters
}


Now, your main method can be:

filters := []Filter{}
filters = CheckFilter(BuildFilter("--chemical", 4, arguments), filters)
filters = CheckFilter(BuildFilter("--facility", 5, arguments), filters)
filters = CheckFilter(BuildFilter("--media", 6, arguments), filters)


OK we have some filters (perhaps none, though - it may be empty). Let's make a function that simplifies some logic:

func KeepRecord(record []string, filters []Filter) bool {
    if len(filters) == 0 {
        return true
    }
    for _, filt := range filters {
        if filt(record) {
            return true;
        }
    }
    return false
}


Time for a change in direction...
Double-Read

Why are you performing 2 read loops? One loop filters the data, and stores it in a temp file. The other loop reads the temp file, and aggregates the data.

You only need 1 loop (and then it really will be "streaming"...)

Let's simplify the loops. First up, discarding the header line... don't do that inside the loop, do it as a special case, outside the loop. Also, put your error handling after the statement that produces an error, don't put other logic in between.... this:

rf, err := os.Open(input_file)
r := csv.NewReader(rf)

if err != nil {
    log.Fatal("Error: %s", err)
}


should be:

rf, err := os.Open(input_file)
if err != nil {
    log.Fatal("Error: %s", err)
}

r := csv.NewReader(rf)


Now, discard the header....

if _, err := r.Read(); err != nil {
    log.Fatalf("Unable to read the header: %v", err)
}


Then, have your loop, and, filter the records... But, let's get our aggregation system set up first, too....

results := make(map[Coordinates]SConc)
for {
    record, err := r.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }

    if !KeepRecord(record, filters) {
        continue
    }

    X, _ := strconv.Atoi(record[1])
    Y, _ := strconv.Atoi(record[2])
    Conc, _ := strconv.ParseFloat(record[7], 32)
    ToxConc, _ := strconv.ParseFloat(record[8], 32)

    var key Coordinates
    key.X = X
    key.Y = Y

    conc := results[key]
    conc.Conc += Conc
    conc.ToxConc += ToxConc

    results[key] = conc
}


That's pretty much how it should look...

Code Snippets

usage := `Chemical Sum.
Usage:
    chemical
    chemical <input_file>
    chemical <input_file> <output_file> [--chemical=<ch>] [--facility=<fc>] [--media=<md>] [--restore=<file>]
    chemical --help

Options:
    --chemical=<ch>      Query by chemical number
    --facility=<fc>      Query by facility number
    --media=<md>         Query by media
    --restore=<file>     Restore from file
    --help               Show this screen
    `
    arguments, _ := docopt.Parse(usage, nil, true, "Naval Fate 2.0", false)
func handleCommandLine() (map[string]interface{}, error) {
    ....
}
func main() {
    arguments, err := handleCommandLine()
    if err != nil {
        log.Fatal(err)
    }
type Filter func([]string) bool
func BuildFilter(option string, field int, arguments map[string]interface{}) Filter {
    toadd := arguments[option]
    if len(toadd) == 0 {
        // nothing specified, no filtering.
        return nil
    }
    mapall := map[string]bool
    for _, val := range toadd {
        mapall[val] = true
    }

    // OK, now for the magic. Return a function!

    return func(row []string) bool {
        data := row[field]
        return mapall[data]
    }
}

Context

StackExchange Code Review Q#124444, answer score: 6

Revisions (0)

No revisions yet.