HiveBrain v1.2.0
Get Started
← Back to all entries
snippetrubyModerate

Parse webpage and save fetched images to a directory

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
directorywebpagefetchedparsesaveandimages

Problem

I had a task for a programmer position which I failed. I am a newbie programmer and I accept that. The only problem is that employer never told me what the actual problem with the code is. So maybe community will be able to give few hints on how to improve it.

The task was to write code which would parse a given webpage, fetch all the images and save it to given directory. Webpage address and directory are command line parameters. Performance was a critical issue for this task.

require 'open-uri'
require 'nokogiri'

class Grab
    def runner(url, path)
        threads = []
        doc = Nokogiri::HTML(open("http://#{url}"))
        img_srcs = doc.css('img').map{ |i| i['src'] }.uniq
        img_srcs = rel_to_abs(url, img_srcs)
        img_srcs.each do |img_src|
            threads << Thread.new(img_src) do
                name = img_src.match(/^http:\/\/.*\/(.*)$/)[1]
                image = fetch img_src
                save(image, name, path)
            end
        end
        threads.each{ |thread| thread.join }
    end

    def fetch(img_src)
        puts "Fetching #{img_src}\n"
        image = open(img_src)
    end

    def save(image, name, path)
        File.open("#{path}/#{name}", "wb"){ |file| file.write(image.read) }
    end

    def rel_to_abs(url, img_srcs)
        img_srcs.each_with_index do |img_src, index|
          img_srcs[index] = "http://#{url}/#{img_src}" unless   img_src.match(/http:\/\//)
        end
        img_srcs
    end
end

Solution

Few minor issues:

-
It will be cool if you support URL parameter with protocol (HTTP/HTTPS):

doc = Nokogiri::HTML(open("http://#{url}"))


-
For URL manipulations you can use standard tools - URI::HTTP/URI:HTTPS instead regexps:

name = img_src.match(/^http:\/\/.*\/(.*)$/)
img_srcs[index] = "http://#{url}/#{img_src}" unless   img_src.match(/http:\/\//)


-
This is a little unsecure operation, you should escape name:

File.open("#{path}/#{name}", "wb")


-
You ignore links with HTTPS protocol:

img_src.match(/http:\/\//)


-
This is not the Ruby way:

img_srcs.each_with_index do |img_src, index|
  img_srcs[index] = "http://#{url}/#{img_src}" unless   img_src.match(/http:\/\//)
end
img_srcs


A little simpler

img_srcs.map do |src|
  src.match(/http:\/\//) ? "http://#{url}/#{src}" : src
end


-
You load page uneffiсiently:

image = open(img_src)


So for every image temp file created. This may not be an efficient way.

open("http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby").class
=> Tempfile

Code Snippets

doc = Nokogiri::HTML(open("http://#{url}"))
name = img_src.match(/^http:\/\/.*\/(.*)$/)
img_srcs[index] = "http://#{url}/#{img_src}" unless   img_src.match(/http:\/\//)
File.open("#{path}/#{name}", "wb")
img_src.match(/http:\/\//)
img_srcs.each_with_index do |img_src, index|
  img_srcs[index] = "http://#{url}/#{img_src}" unless   img_src.match(/http:\/\//)
end
img_srcs

Context

StackExchange Code Review Q#12593, answer score: 12

Revisions (0)

No revisions yet.