patterngoMinor
Print all lines of a text file containing the same duplicated word
Viewed 0 times
containingsamefilethealltextduplicatedprintwordlines
Problem
I'm implementing my very first Go application and I would like to receive an hint on how to make a certain check faster.
I have a huge
Where
I need to find and print all the lines which have the same
For instance, suppose we have 2 lines like the following
They should be both printed.
Here's my code:
It works just fine but, since I have million lines, it's really slow. I'm only capable of doing it with a single thread and not using goroutines.
Have you any suggestion on how I should change this code to speed up the check?
I have a huge
txt file in which each line is structured like the following:key text
key text
.
.
key textWhere
key is ALWAYS 6 HEX digits long and text is ALWAYS 16 HEX digits long.I need to find and print all the lines which have the same
text value.For instance, suppose we have 2 lines like the following
- 000000 1234567890123456
- 111111 1234567890123456
They should be both printed.
Here's my code:
r, _ := os.Open("store.txt")
scanner := bufio.NewScanner(r)
for scanner.Scan() {
line := scanner.Text()
text := line[7:]
if !contains(duplicates, text) {
duplicates = append(duplicates, text)
} else {
t, _:= os.Open("store.txt")
dupScan := bufio.NewScanner(t)
//currLine := dupScan.Text()
for dupScan.Scan() {
currLine := dupScan.Text()
currCipher := currLine[7:23]
if( currCipher == text ){
fmt.Println(currLine)
}
}
t.Close()
}
}
fmt.Println("Done Check")
//Close File
r.Close()
func contains(s []string, e string) bool {
for _, a := range s {
if a == e {
return true
}
}
return false
}It works just fine but, since I have million lines, it's really slow. I'm only capable of doing it with a single thread and not using goroutines.
Have you any suggestion on how I should change this code to speed up the check?
Solution
You have a bug
I don't think your program works as intended. If a line exists \$N > 1\$ times, it will be printed \$N (N - 1)\$ times. Because:
When \$N = 2\$, the program will appear to work fine.
But when \$N > 2\$ there will be excess output of duplicates, which is probably not the way you intended.
Performance
There are several inefficient operations.
The worst is re-reading the file for every duplicate. If the million lines contain pairs of 500000 values, the file will be read 500001 times.
Another inefficient operation is storing the unique values in a slice, instead of a map. The
To improve performance you could use a map of counts. Read the entire file once to build this map. Unique values will have a count of 1, duplicates will have > 1. Read the file again, checking the count in the map, if it's > 1 then print the line.
Resource management and error handling
In Go, it's recommended to use
It's recommended to write like this:
Closely related is error handling. You ignored the
Coding style
It's good to run
It automatically reformats it following the standard of Go.
Suggested implementation
Putting the above tips together:
I don't think your program works as intended. If a line exists \$N > 1\$ times, it will be printed \$N (N - 1)\$ times. Because:
- The first time it's seen, it will be added to
duplicatesand not printed. So far so good.
- The 2nd, 3rd, ... Nth time it's seen, all \$N\$ occurrences will be printed.
When \$N = 2\$, the program will appear to work fine.
But when \$N > 2\$ there will be excess output of duplicates, which is probably not the way you intended.
Performance
There are several inefficient operations.
The worst is re-reading the file for every duplicate. If the million lines contain pairs of 500000 values, the file will be read 500001 times.
Another inefficient operation is storing the unique values in a slice, instead of a map. The
contains implementation based on a slice has \$O(n)\$ time complexity, which would be \$O(1)\$ using a map.To improve performance you could use a map of counts. Read the entire file once to build this map. Unique values will have a count of 1, duplicates will have > 1. Read the file again, checking the count in the map, if it's > 1 then print the line.
Resource management and error handling
In Go, it's recommended to use
defer to close resources soon after you opened them, so you don't forget later. So instead of this:fh, err := os.Open(path)
// large block of code
fh.Close()It's recommended to write like this:
fh, err := os.Open(path)
if err != nil {
panic("could not open file: " + path)
}
defer fh.Close()
// rest of the codeClosely related is error handling. You ignored the
error value returned by os.Open, which is not wise. The rest of the program will make little sense if opening the file fails. Always remember and consider to handle errors.Coding style
It's good to run
go fmt on your program before asking for a review.It automatically reformats it following the standard of Go.
Suggested implementation
Putting the above tips together:
func main() {
path := "/tmp/store.txt"
counts := readCounts(path)
check(path, counts)
}
func check(path string, counts map[string]int) {
fh, _ := os.Open(path)
defer fh.Close()
scanner := bufio.NewScanner(fh)
for scanner.Scan() {
line := scanner.Text()
text := line[7:]
if counts[text] > 1 {
fmt.Println(line)
}
}
}
func readCounts(path string) map[string]int {
fh, err := os.Open(path)
if err != nil {
panic("could not open file: " + path)
}
defer fh.Close()
scanner := bufio.NewScanner(fh)
counts := make(map[string]int)
for scanner.Scan() {
line := scanner.Text()
text := line[7:]
counts[text] += 1
}
return counts
}Code Snippets
fh, err := os.Open(path)
// large block of code
fh.Close()fh, err := os.Open(path)
if err != nil {
panic("could not open file: " + path)
}
defer fh.Close()
// rest of the codefunc main() {
path := "/tmp/store.txt"
counts := readCounts(path)
check(path, counts)
}
func check(path string, counts map[string]int) {
fh, _ := os.Open(path)
defer fh.Close()
scanner := bufio.NewScanner(fh)
for scanner.Scan() {
line := scanner.Text()
text := line[7:]
if counts[text] > 1 {
fmt.Println(line)
}
}
}
func readCounts(path string) map[string]int {
fh, err := os.Open(path)
if err != nil {
panic("could not open file: " + path)
}
defer fh.Close()
scanner := bufio.NewScanner(fh)
counts := make(map[string]int)
for scanner.Scan() {
line := scanner.Text()
text := line[7:]
counts[text] += 1
}
return counts
}Context
StackExchange Code Review Q#145590, answer score: 6
Revisions (0)
No revisions yet.