patterngoMinor
Link checker using Go channels
Viewed 0 times
channelscheckerusinglink
Problem
I've started to learn Golang and channels in it.
I decided to write simple application - recursive link checker. Given some URL it tries to retrieve pages, parses them and goes deeper.
Here's a code of first version.
Some questions:
-
I use counter
-
I launch manually several parsers and fetchers. Should I use here WaitGroups?
-
Not sure about error treatment. How to do it better?
-
How effective is variable transmitting between goroutines? Are there some unnecessary copying?
-
Has it some race conditions? For example can program got message from channel
```
package main
import (
"fmt"
"golang.org/x/net/html"
"io/ioutil"
"log"
"net/http"
"strings"
)
type url string
type TaskStatus int
const (
_ TaskStatus = iota
TaskStatusNew
TaskStatusToParse
)
type Task struct {
url
depth int
resp *http.Response
//body *[]byte
}
var chanTasksToFetch = make(chan Task)
var chanTasksToParse = make(chan Task)
var chanFetchersIn = make(chan Task)
var chanTasksFinished = make(chan Task)
var mainWait = make(chan interface{})
var data = make(map[url]Task)
func getHref(t html.Token) (ok bool, href string) {
// iterate over all of the token's attribs
for _, a := range t.Attr {
if a.Key == "href" {
return true, a.Val
}
}
return
}
func extractLinks(page string) (urls []url) {
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(strings.NewReader(page))
for {
token_type := tokenizer.Next()
switch {
case token_type == html.ErrorToken:
log.Println(fmt.Sprintf("Error token_type: %s. Error: %s", token_type, tokenizer.Err()))
return
case token_type == html.St
I decided to write simple application - recursive link checker. Given some URL it tries to retrieve pages, parses them and goes deeper.
Here's a code of first version.
Some questions:
-
I use counter
urlsInProcess to understand when all tasks are done. But it looks a little bit awkward, isn't it?-
I launch manually several parsers and fetchers. Should I use here WaitGroups?
-
Not sure about error treatment. How to do it better?
-
How effective is variable transmitting between goroutines? Are there some unnecessary copying?
-
Has it some race conditions? For example can program got message from channel
chanTasksFinished before than from channel chanTasksToFetch? In this case we will exit before all tasks are done.```
package main
import (
"fmt"
"golang.org/x/net/html"
"io/ioutil"
"log"
"net/http"
"strings"
)
type url string
type TaskStatus int
const (
_ TaskStatus = iota
TaskStatusNew
TaskStatusToParse
)
type Task struct {
url
depth int
resp *http.Response
//body *[]byte
}
var chanTasksToFetch = make(chan Task)
var chanTasksToParse = make(chan Task)
var chanFetchersIn = make(chan Task)
var chanTasksFinished = make(chan Task)
var mainWait = make(chan interface{})
var data = make(map[url]Task)
func getHref(t html.Token) (ok bool, href string) {
// iterate over all of the token's attribs
for _, a := range t.Attr {
if a.Key == "href" {
return true, a.Val
}
}
return
}
func extractLinks(page string) (urls []url) {
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(strings.NewReader(page))
for {
token_type := tokenizer.Next()
switch {
case token_type == html.ErrorToken:
log.Println(fmt.Sprintf("Error token_type: %s. Error: %s", token_type, tokenizer.Err()))
return
case token_type == html.St
Solution
Instead of reading the whole body to transform it later to a Reader (again):
You could simply do
When you parse the URL, you just take the
Instead of an
You could then return this waitgroup, for the caller to do
bytes, _ := ioutil.ReadAll(task.resp.Body)
defer task.resp.Body.Close()
var urls []url = extractLinks(string(bytes))
(...)
func extractLinks(page string) (urls []url) {
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(strings.NewReader(page))You could simply do
var urls []url = extractLinks(resp.Body)
(...)
func extractLinks(body io.ReadCloser) (urls []url) {
defer body.Close()
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(body)When you parse the URL, you just take the
href. If it is a relative link (or a /question/123), you won't be able to parse it further. I might need to use the url.Parse function (beware of names collision with your url type)Instead of an
int for urlsInProcess, you could use a WaitGroup with wg.Add(1) and .Done() instead of ++ and --.You could then return this waitgroup, for the caller to do
wg.Wait() (but then you won't be able to know how many urls are currently in the work).Code Snippets
bytes, _ := ioutil.ReadAll(task.resp.Body)
defer task.resp.Body.Close()
var urls []url = extractLinks(string(bytes))
(...)
func extractLinks(page string) (urls []url) {
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(strings.NewReader(page))var urls []url = extractLinks(resp.Body)
(...)
func extractLinks(body io.ReadCloser) (urls []url) {
defer body.Close()
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(body)Context
StackExchange Code Review Q#157666, answer score: 3
Revisions (0)
No revisions yet.