HiveBrain v1.2.0
Get Started
← Back to all entries
patterngoMinor

Extract title with regex

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
titleregexwithextract

Problem

I'm quite new to Go and I feel like this code could be smaller and cleaner.

I would love any suggestions and/or hints about mistakes and conventional go things!

func getBookTitle(client *http.Client) {
    rsp, err := client.Get(bookSite)
    if err != nil {
        panic(err)
    }
    html, _ := ioutil.ReadAll(rsp.Body)

    //Get div with title
    regTitle := regexp.MustCompile("()[\n+\\s]*()[a-zA-Z–\\-\n\\s:]*()[\n+\\sdd]*()")
    //remove linebreaks regex
    regFormatTitle := regexp.MustCompile("[\r\n]*")
    //apply regex
    title := regFormatTitle.ReplaceAllString(string(regTitle.Find(html)),"")
    //Remove remove html tags and remove whitespaces
    title =  strings.TrimSpace(title[strings.Index(title,"")+len(""):strings.Index(title,"")])

    fmt.Printf("Book title:%s\n",title)

    rsp.Body.Close()
}

Solution

Go practices

It's recommended to close things using defer as early as possible,
so that you won't forget them later.
For example you have:

html, _ := ioutil.ReadAll(rsp.Body)

// some code

defer rsp.Body.Close()


Using defer, you should write like this instead:

html, _ := ioutil.ReadAll(rsp.Body)
defer rsp.Body.Close()

// some code


Another thing, it's recommended to not ignore errors.
This example ignores the err value returned by ioutil.ReadAll,
and that may lead to crashes later in the program.

Single responsibility principle

It's recommended for functions to do just one thing.
The getBookTitle function does do many things:

  • Download a web page from bookSite (a free variable defined elsewhere)



  • Read content from a stream



  • Extract title from an HTML string



The name implies only the 3rd task. It would be better if it did only that,
and the other tasks should be performed in other functions.

Strange regex

This regex looks strange:

()[\n+\\s]*()[a-zA-Z–\\-\n\\s:]*()[\n+\\sdd]*()


For several reasons:

-
[\n+\\s] is odd. \\s already includes \n. You could write simpler as [+\\s]. But are you really expecting a literal + between the starting ` and ? I seriously doubt it. In which case the expression could be reduced to just \\s.

-
[\n+\\sdd] is similarly odd. Again, the \n is unnecessary. Two d are also unnecessary, and if there are literal d between and , that would be invalid HTML.

-
Why the groupings with
(...)`? Those parentheses don't serve any purpose in this program.

All in all, it seems the regex is mistakenly over-complicated,
this simpler regex is probably good enough and much more clear:

\\s*[a-zA-Z–\\s:]*\\s*

Code Snippets

html, _ := ioutil.ReadAll(rsp.Body)

// some code

defer rsp.Body.Close()
html, _ := ioutil.ReadAll(rsp.Body)
defer rsp.Body.Close()

// some code
(<div class=\"dotd-title\">)[\n+\\s]*(<h2>)[a-zA-Z–\\-\n\\s:]*(</h2>)[\n+\\sdd]*(</div>)
<div class=\"dotd-title\">\\s*<h2>[a-zA-Z–\\s:]*</h2>\\s*</div>

Context

StackExchange Code Review Q#158837, answer score: 2

Revisions (0)

No revisions yet.