patternrubyMinor
Web scraper for job listings
Viewed 0 times
listingsforwebjobscraper
Problem
Is there any room for improvement on this code?
I use mechanize to get the links of a job listing web site. There are pages with pagination (when jobs > 25) and pages without.
If there is, then the link number 5 is labelled after "Next >" and link number 6 is "Last >>".
In order to identify them, I make assertions on their name and act accordingly.
I then check how many pages the pagination occurs, and then, taking the numbers, I scrape the pages one by one.
I use mechanize to get the links of a job listing web site. There are pages with pagination (when jobs > 25) and pages without.
If there is, then the link number 5 is labelled after "Next >" and link number 6 is "Last >>".
In order to identify them, I make assertions on their name and act accordingly.
I then check how many pages the pagination occurs, and then, taking the numbers, I scrape the pages one by one.
def scrape_duty_station
agent = Mechanize.new
page=agent.get(self.web_link)
if page.links[5].text =~ /^next/i
pagestart=page.links[5].href.match(/\d+/).to_s.to_i
pageend=page.links[6].href.match(/\d+/).to_s.to_i
@page=page.links[7...31]
scrape_single_page
(pagestart..pageend).each do |i|
@page=agent.get(self.web_link+"/#{i}")
@page=@page.links[7...31]
scrape_single_page
end
else
@page=page.links[5...30]
scrape_single_page
end
end
def scrape_single_page
@page.each do |i|
p=Position.new
p.title = i.text
p.web_link = i.href
p.job_id = i.href.match(/\d+/).to_s
p.station = Station.where(:web_link => self.web_link).first
p.save
end
endSolution
Here are some comments.
Avoid extra indent when one of the cases is simpler and can be dealt with easily.
Also, about your ranges. They should really be named constants.
If you are using any statement more than once, it may be worthwhile to facter it out to a function.
Does position have a constructor? If not, it would be nice to create one for it. If yes, it would be better to use it.
- If you are only @page as a way to communicate with
scarpe_single_page, why not make it a parameter?
def scrape_duty_station
agent = Mechanize.new
page = agent.get(web_link)Avoid extra indent when one of the cases is simpler and can be dealt with easily.
Also, about your ranges. They should really be named constants.
return scrape_single_page(page.links[5...30]) unless page.links[5].text =~ /^next/i
arr = [page] + (l_i(page,5)..l_i(page,6)).collect{|i|agent.get(web_link+"/#{i}")}
arr.each {|p| scrape_single_page p.links[7...31] }
endIf you are using any statement more than once, it may be worthwhile to facter it out to a function.
def l_i(page, link)
page.links[link].href.match(/\d+/).to_s.to_i
endDoes position have a constructor? If not, it would be nice to create one for it. If yes, it would be better to use it.
def scrape_single_page(pages)
pages.each do |p|
Position.new(p.text,p.href,p.href.match(/\d+/).to_s,
Station.where(:web_link => self.web_link).first).save
end
endCode Snippets
def scrape_duty_station
agent = Mechanize.new
page = agent.get(web_link)return scrape_single_page(page.links[5...30]) unless page.links[5].text =~ /^next/i
arr = [page] + (l_i(page,5)..l_i(page,6)).collect{|i|agent.get(web_link+"/#{i}")}
arr.each {|p| scrape_single_page p.links[7...31] }
enddef l_i(page, link)
page.links[link].href.match(/\d+/).to_s.to_i
enddef scrape_single_page(pages)
pages.each do |p|
Position.new(p.text,p.href,p.href.match(/\d+/).to_s,
Station.where(:web_link => self.web_link).first).save
end
endContext
StackExchange Code Review Q#14562, answer score: 3
Revisions (0)
No revisions yet.