HiveBrain v1.2.0
Get Started
← Back to all entries
patternrubyMinor

Web scraper for job listings

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
listingsforwebjobscraper

Problem

Is there any room for improvement on this code?

I use mechanize to get the links of a job listing web site. There are pages with pagination (when jobs > 25) and pages without.

If there is, then the link number 5 is labelled after "Next >" and link number 6 is "Last >>".
In order to identify them, I make assertions on their name and act accordingly.
I then check how many pages the pagination occurs, and then, taking the numbers, I scrape the pages one by one.

def scrape_duty_station
    agent = Mechanize.new  
    page=agent.get(self.web_link)

    if page.links[5].text =~ /^next/i
      pagestart=page.links[5].href.match(/\d+/).to_s.to_i
      pageend=page.links[6].href.match(/\d+/).to_s.to_i
      @page=page.links[7...31]
      scrape_single_page
      (pagestart..pageend).each do |i|
        @page=agent.get(self.web_link+"/#{i}")
        @page=@page.links[7...31]
        scrape_single_page
      end  
    else
      @page=page.links[5...30]
      scrape_single_page
    end
  end

  def scrape_single_page
    @page.each do |i|
      p=Position.new
      p.title = i.text
      p.web_link = i.href
      p.job_id = i.href.match(/\d+/).to_s
      p.station = Station.where(:web_link => self.web_link).first
      p.save
    end
  end

Solution

Here are some comments.
  • If you are only @page as a way to communicate with scarpe_single_page, why not make it a parameter?



def scrape_duty_station
    agent = Mechanize.new  
    page = agent.get(web_link)


Avoid extra indent when one of the cases is simpler and can be dealt with easily.

Also, about your ranges. They should really be named constants.

return scrape_single_page(page.links[5...30]) unless page.links[5].text =~ /^next/i
    arr = [page] + (l_i(page,5)..l_i(page,6)).collect{|i|agent.get(web_link+"/#{i}")}
    arr.each {|p| scrape_single_page p.links[7...31] }
  end


If you are using any statement more than once, it may be worthwhile to facter it out to a function.

def l_i(page, link)
    page.links[link].href.match(/\d+/).to_s.to_i
  end


Does position have a constructor? If not, it would be nice to create one for it. If yes, it would be better to use it.

def scrape_single_page(pages)
    pages.each do |p|
      Position.new(p.text,p.href,p.href.match(/\d+/).to_s,
            Station.where(:web_link => self.web_link).first).save
    end
  end

Code Snippets

def scrape_duty_station
    agent = Mechanize.new  
    page = agent.get(web_link)
return scrape_single_page(page.links[5...30]) unless page.links[5].text =~ /^next/i
    arr = [page] + (l_i(page,5)..l_i(page,6)).collect{|i|agent.get(web_link+"/#{i}")}
    arr.each {|p| scrape_single_page p.links[7...31] }
  end
def l_i(page, link)
    page.links[link].href.match(/\d+/).to_s.to_i
  end
def scrape_single_page(pages)
    pages.each do |p|
      Position.new(p.text,p.href,p.href.match(/\d+/).to_s,
            Station.where(:web_link => self.web_link).first).save
    end
  end

Context

StackExchange Code Review Q#14562, answer score: 3

Revisions (0)

No revisions yet.