HiveBrain v1.2.0

Print all lines of a text file containing the same duplicated word

Submitted by: @import:stackexchange-codereview

Problem

I'm implementing my very first Go application and I would like a hint on how to make a certain check faster.

I have a huge txt file in which each line is structured like the following:

key text
key text
   .
   .
key text


Where key is ALWAYS 6 HEX digits long and text is ALWAYS 16 HEX digits long.


I need to find and print all the lines which have the same text value.
For instance, suppose we have two lines like the following:

  • 000000 1234567890123456
  • 111111 1234567890123456

They should both be printed.

Here's my code:

var duplicates []string

r, _ := os.Open("store.txt")
scanner := bufio.NewScanner(r)
for scanner.Scan() {
    line := scanner.Text()
    text := line[7:]
    if !contains(duplicates, text) {
        duplicates = append(duplicates, text)
    } else {
        // On every duplicate, re-open and re-scan the whole file.
        t, _ := os.Open("store.txt")
        dupScan := bufio.NewScanner(t)
        for dupScan.Scan() {
            currLine := dupScan.Text()
            currCipher := currLine[7:23]
            if currCipher == text {
                fmt.Println(currLine)
            }
        }
        t.Close()
    }
}
fmt.Println("Done Check")
// Close file
r.Close()

func contains(s []string, e string) bool {
    for _, a := range s {
        if a == e {
            return true
        }
    }
    return false
}


It works just fine, but since I have millions of lines, it's really slow. I'm only able to do it single-threaded, without goroutines.

Do you have any suggestions on how I should change this code to speed up the check?

Solution

You have a bug

I don't think your program works as intended. If a text value occurs N > 1 times, its lines will be printed N(N - 1) times in total. Because:

  • The first time it's seen, it will be added to duplicates and not printed. So far so good.

  • The 2nd, 3rd, ..., Nth time it's seen, all N occurrences will be printed.



When N = 2, the program will appear to work fine.
But when N > 2 there will be excess duplicate output, which is probably not what you intended.

Performance

There are several inefficient operations.

The worst is re-reading the file for every duplicate. If the million lines contain pairs of 500000 values, the file will be read 500001 times.

Another inefficient operation is storing the unique values in a slice instead of a map. The slice-based contains implementation has O(n) time complexity; a map lookup is O(1) on average.
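For illustration (this helper is not from the original answer), a map gives the constant-time membership test that the slice-based contains lacks:

```go
package main

import "fmt"

// seenBefore reports whether v was already recorded in seen,
// recording it as a side effect. Lookup and insert are both O(1)
// on average, versus O(n) for scanning a slice with contains.
func seenBefore(seen map[string]bool, v string) bool {
	if seen[v] {
		return true
	}
	seen[v] = true
	return false
}

func main() {
	seen := make(map[string]bool)
	for _, text := range []string{"aaa", "bbb", "aaa"} {
		if seenBefore(seen, text) {
			fmt.Println("duplicate:", text) // only "aaa" repeats here
		}
	}
}
```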

To improve performance you could use a map of counts. Read the entire file once to build this map. Unique values will have a count of 1, duplicates will have > 1. Read the file again, checking the count in the map, if it's > 1 then print the line.

Resource management and error handling

In Go, it's recommended to use defer to close resources right after opening them, so you don't forget later. So instead of this:

fh, err := os.Open(path)

// large block of code

fh.Close()


It's recommended to write like this:

fh, err := os.Open(path)
if err != nil {
    panic("could not open file: " + path)
}
defer fh.Close()

// rest of the code


Closely related is error handling. You ignored the error value returned by os.Open, which is unwise: the rest of the program makes little sense if opening the file fails. Always remember to handle errors.

Coding style

It's good to run go fmt on your program before asking for a review.
It automatically reformats it following the standard of Go.

Suggested implementation

Putting the above tips together:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    path := "/tmp/store.txt"
    counts := readCounts(path)
    check(path, counts)
}

func check(path string, counts map[string]int) {
    fh, err := os.Open(path)
    if err != nil {
        panic("could not open file: " + path)
    }
    defer fh.Close()

    scanner := bufio.NewScanner(fh)
    for scanner.Scan() {
        line := scanner.Text()
        text := line[7:]
        if counts[text] > 1 {
            fmt.Println(line)
        }
    }
}

func readCounts(path string) map[string]int {
    fh, err := os.Open(path)
    if err != nil {
        panic("could not open file: " + path)
    }
    defer fh.Close()

    scanner := bufio.NewScanner(fh)
    counts := make(map[string]int)

    for scanner.Scan() {
        line := scanner.Text()
        text := line[7:]
        counts[text]++
    }

    return counts
}


Context

StackExchange Code Review Q#145590, answer score: 6

Revisions (0)

No revisions yet.