HiveBrain v1.2.0
Get Started
← Back to all entries
patterngoMinor

Link checker using Go channels

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
channelscheckerusinglink

Problem

I've started to learn Golang and channels in it.
I decided to write simple application - recursive link checker. Given some URL it tries to retrieve pages, parses them and goes deeper.

Here's a code of first version.

Some questions:

-
I use counter urlsInProcess to understand when all tasks are done. But it looks a little bit awkward, isn't it?

-
I launch manually several parsers and fetchers. Should I use here WaitGroups?

-
Not sure about error treatment. How to do it better?

-
How effective is variable transmitting between goroutines? Are there some unnecessary copying?

-
Has it some race conditions? For example can program got message from channel chanTasksFinished before than from channel chanTasksToFetch? In this case we will exit before all tasks are done.

```
package main

import (
"fmt"
"golang.org/x/net/html"
"io/ioutil"
"log"
"net/http"
"strings"
)

type url string
type TaskStatus int

const (
_ TaskStatus = iota
TaskStatusNew
TaskStatusToParse
)

type Task struct {
url
depth int
resp *http.Response
//body *[]byte
}

var chanTasksToFetch = make(chan Task)
var chanTasksToParse = make(chan Task)
var chanFetchersIn = make(chan Task)
var chanTasksFinished = make(chan Task)

var mainWait = make(chan interface{})

var data = make(map[url]Task)

func getHref(t html.Token) (ok bool, href string) {
// iterate over all of the token's attribs
for _, a := range t.Attr {
if a.Key == "href" {
return true, a.Val
}
}
return
}

func extractLinks(page string) (urls []url) {
// extracts links from page and sends them into the url_channel
tokenizer := html.NewTokenizer(strings.NewReader(page))
for {
token_type := tokenizer.Next()
switch {
case token_type == html.ErrorToken:
log.Println(fmt.Sprintf("Error token_type: %s. Error: %s", token_type, tokenizer.Err()))
return
case token_type == html.St

Solution

Instead of reading the whole body to transform it later to a Reader (again):

bytes, _ := ioutil.ReadAll(task.resp.Body)
    defer task.resp.Body.Close()
    var urls []url = extractLinks(string(bytes))

(...)

func extractLinks(page string) (urls []url) {
    // extracts links from page and sends them into the url_channel
    tokenizer := html.NewTokenizer(strings.NewReader(page))


You could simply do

var urls []url = extractLinks(resp.Body)

(...)

func extractLinks(body io.ReadCloser) (urls []url) {
    defer body.Close()
    // extracts links from page and sends them into the url_channel
    tokenizer := html.NewTokenizer(body)


When you parse the URL, you just take the href. If it is a relative link (or a /question/123), you won't be able to parse it further. I might need to use the url.Parse function (beware of names collision with your url type)

Instead of an int for urlsInProcess, you could use a WaitGroup with wg.Add(1) and .Done() instead of ++ and --.
You could then return this waitgroup, for the caller to do wg.Wait() (but then you won't be able to know how many urls are currently in the work).

Code Snippets

bytes, _ := ioutil.ReadAll(task.resp.Body)
    defer task.resp.Body.Close()
    var urls []url = extractLinks(string(bytes))

(...)

func extractLinks(page string) (urls []url) {
    // extracts links from page and sends them into the url_channel
    tokenizer := html.NewTokenizer(strings.NewReader(page))
var urls []url = extractLinks(resp.Body)

(...)

func extractLinks(body io.ReadCloser) (urls []url) {
    defer body.Close()
    // extracts links from page and sends them into the url_channel
    tokenizer := html.NewTokenizer(body)

Context

StackExchange Code Review Q#157666, answer score: 3

Revisions (0)

No revisions yet.