snippetrubyModerate
Parse webpage and save fetched images to a directory
Viewed 0 times
directorywebpagefetchedparsesaveandimages
Problem
I had a task for a programmer position which I failed. I am a newbie programmer and I accept that. The only problem is that employer never told me what the actual problem with the code is. So maybe community will be able to give few hints on how to improve it.
The task was to write code which would parse a given webpage, fetch all the images and save it to given directory. Webpage address and directory are command line parameters. Performance was a critical issue for this task.
The task was to write code which would parse a given webpage, fetch all the images and save it to given directory. Webpage address and directory are command line parameters. Performance was a critical issue for this task.
require 'open-uri'
require 'nokogiri'
class Grab
def runner(url, path)
threads = []
doc = Nokogiri::HTML(open("http://#{url}"))
img_srcs = doc.css('img').map{ |i| i['src'] }.uniq
img_srcs = rel_to_abs(url, img_srcs)
img_srcs.each do |img_src|
threads << Thread.new(img_src) do
name = img_src.match(/^http:\/\/.*\/(.*)$/)[1]
image = fetch img_src
save(image, name, path)
end
end
threads.each{ |thread| thread.join }
end
def fetch(img_src)
puts "Fetching #{img_src}\n"
image = open(img_src)
end
def save(image, name, path)
File.open("#{path}/#{name}", "wb"){ |file| file.write(image.read) }
end
def rel_to_abs(url, img_srcs)
img_srcs.each_with_index do |img_src, index|
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
end
img_srcs
end
endSolution
Few minor issues:
-
It will be cool if you support URL parameter with protocol (HTTP/HTTPS):
-
For URL manipulations you can use standard tools - URI::HTTP/URI:HTTPS instead regexps:
-
This is a little unsecure operation, you should escape name:
-
You ignore links with HTTPS protocol:
-
This is not the Ruby way:
A little simpler
-
You load page uneffiсiently:
So for every image temp file created. This may not be an efficient way.
-
It will be cool if you support URL parameter with protocol (HTTP/HTTPS):
doc = Nokogiri::HTML(open("http://#{url}"))-
For URL manipulations you can use standard tools - URI::HTTP/URI:HTTPS instead regexps:
name = img_src.match(/^http:\/\/.*\/(.*)$/)
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)-
This is a little unsecure operation, you should escape name:
File.open("#{path}/#{name}", "wb")-
You ignore links with HTTPS protocol:
img_src.match(/http:\/\//)-
This is not the Ruby way:
img_srcs.each_with_index do |img_src, index|
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
end
img_srcsA little simpler
img_srcs.map do |src|
src.match(/http:\/\//) ? "http://#{url}/#{src}" : src
end-
You load page uneffiсiently:
image = open(img_src)So for every image temp file created. This may not be an efficient way.
open("http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby").class
=> TempfileCode Snippets
doc = Nokogiri::HTML(open("http://#{url}"))name = img_src.match(/^http:\/\/.*\/(.*)$/)
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)File.open("#{path}/#{name}", "wb")img_src.match(/http:\/\//)img_srcs.each_with_index do |img_src, index|
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
end
img_srcsContext
StackExchange Code Review Q#12593, answer score: 12
Revisions (0)
No revisions yet.