HiveBrain v1.2.0
Get Started
← Back to all entries
patterngoMinor

Web Crawler (A Tour of Go #71)

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
crawlertourweb

Problem

So I was trying to do this exercise from Tour of Go. I managed to get it working. I am not at all sure that it is correctly concurrent or idiomatic (I started learning Go, like, 4 hours ago).

I would really appreciate feedback about my solution. Here's the relevant meat of the code. Full code available below.

type urlToFetch struct {
  Url   string
  Depth int
}

// Group stuff together, so that it's easier to pass around.
type fetchContext struct {
  Queue   chan urlToFetch
  History map[string]int
  Quit    chan int
  Fetcher Fetcher
}

func Crawl(url string, depth int, fetcher Fetcher) {
  context := fetchContext{
    Queue:   make(chan urlToFetch, 500),
    History: make(map[string]int),
    Quit:    make(chan int, 50),
    Fetcher: fetcher,
  }

  go fetchOne(&context)
  context.Queue  0 {
      body, urls, err := ctx.Fetcher.Fetch(utf.Url)
      ctx.History[utf.Url] = 1

      if err != nil {
        fmt.Println(err)
        return
      }
      fmt.Printf("found: %s %q\n", utf.Url, body)

      for _, u := range urls {
        go fetchOne(ctx)
        if _, ok := ctx.History[u]; !ok {
          ctx.Queue <- urlToFetch{u, utf.Depth - 1}
        } else {
        }
      }
    }
  case <-timeout:
    ctx.Quit <- 1
  }
}


The full source is at: http://pastie.org/8443147

Solution

There are two things I see.

-
Don't store a reference to a time out. I think the count down starts when After() is called.

timeout := time.After(2000 * time.Millisecond)
...
case <- timeout:


Most code I've seen writes it like this.

case <- time.After(2000 * time.Millisecond):


-
Use a lock when accessing a map from multiple goroutines because maps aren't thread safe.

Better solution:

func Crawl(url string, depth int, fetcher Fetcher) {
    m := map[string]bool{url: true}
    var mx sync.Mutex
    var wg sync.WaitGroup
    var c2 func(string, int)
    c2 = func(url string, depth int) {
        defer wg.Done()
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        mx.Lock()
        for _, u := range urls {
            if !m[u] {
                m[u] = true
                wg.Add(1)
                go c2(u, depth-1)
            }
        }
        mx.Unlock()
    }
    wg.Add(1)
    c2(url, depth)
    wg.Wait()
}


Useful link:

https://stackoverflow.com/questions/12224962/exercise-web-crawler-concurrency-not-working

Code Snippets

timeout := time.After(2000 * time.Millisecond)
...
case <- timeout:
case <- time.After(2000 * time.Millisecond):
func Crawl(url string, depth int, fetcher Fetcher) {
    m := map[string]bool{url: true}
    var mx sync.Mutex
    var wg sync.WaitGroup
    var c2 func(string, int)
    c2 = func(url string, depth int) {
        defer wg.Done()
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        mx.Lock()
        for _, u := range urls {
            if !m[u] {
                m[u] = true
                wg.Add(1)
                go c2(u, depth-1)
            }
        }
        mx.Unlock()
    }
    wg.Add(1)
    c2(url, depth)
    wg.Wait()
}

Context

StackExchange Code Review Q#33524, answer score: 3

Revisions (0)

No revisions yet.