Refactoring a Crawler

Problem

I've recently ported an old project and made it object-oriented. However, I've noticed that rubocop points out the following status: Assignment Branch Condition size for fetch_images is too high. [22.41/15]. Can I make this method be made more efficient, perhaps?

Snippit

```
class WebCrawler
attr_reader :agent, :tumblr, :images

def initialize
@agent = Mechanize.new
@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys
.reject { |agent| agent == 'Mechanize' }.sample
@tumblr = URI('http://www.tumblr.com')
@images = []
end

def fetch_images(blog, proxy)
fail Exceptions::InvalidBlog unless valid_blog?(blog)
fail Exceptions::InvalidProxy unless valid_ip?(proxy)

tokens = proxy.split(':')
proxy_addr = { ip: tokens[0], port: tokens[1].to_i }
@agent.set_proxy(proxy_addr[:ip], proxy_addr[:port])

@tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}")
page = @agent.get(@tumblr)
page.search('img[src$="jpg"], img[src$="png"]').each do |img|
@images << img['src']
end

Solution

Assignment Branch Condition size for fetch_images is too high. [22.41/15]

What that warning is saying is that, according to rubocop's standard, you have too many assignments, branches, and conditionals (ABCs) in that method. ABCs are associated with code that is complex and harder to reason about. It can also indicate that a method is trying to do too much, as I would say it is telling you about this method.

Looking over the method in question, we can clearly see it has several responsibilities:

Guard clauses for validity

Tokenizing a proxy string into a hash data structure

Setting a proxy with the data from that data structure

Tearing apart and rebuilding a URL

Fetching a page over HTTP

Parsing markup for image tags, and shoving them on an array

Filtering that array of images

Returning that array

That's a lot of responsibilities for one method. It's dealing with several different types of objects/behaviors at several different levels of abstraction. We can, and should, break it up to make it easier to manage.

Before I do that, I'm going to look at the rest of the class (the initialize method). In the initializer, I see a lot of state assignment, and I see some work beyond just state assignment.

@agent = Mechanize.new

Here you are getting a new instance of Mechanize, but you are doing it directly in the constructor. I would recommend injecting this as a parameter instead. You could even use a default parameter to give you flexibility:

def initialize(agent = Mechanize.new)
  @agent = agent
  ...

This gives you the ability to change this dependency at runtime, and to mock or stub this dependency in your tests.

@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys                       
  .reject { |agent| agent == 'Mechanize' }.sample

Here is some more of the extra work being done in the initialize method. Ideally, you should get a fully initialized object injected, and you should not need to to any extra configuration with it. In this case, you might want to create a MechanizeFactory class to encapsulate the construction and initialization of Mechanize objects so you don't have to worry about any of this in your constructor:

class MechanizeFactory
  def self.get
    mechanize = Mechanize.new
    mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys                       
      .reject { |agent| agent == 'Mechanize' }.sample
    mechanize
  end
end

...

class WebCrawler
  def initialize(agent = MechanizeFactory.get)
    @agent = agent
    ...

@tumblr = URI('http://www.tumblr.com')

Here you are setting the URL to the Tumblr main site as an instance variable. The way you use it does not really make it appropriate to be an instance variable. I will discuss why and alternatives further down, but for now I am going to say that this line is unnecessary.

@images = []

You are creating an instance variable to hold the list of returned images. I don't see any reason that this object needs to hold this state; the list of images can simply be returned from the fetch_images method. This line is also unnecessary.

Now onto the fetch_images method.

fail Exceptions::InvalidBlog unless valid_blog?(blog)
fail Exceptions::InvalidProxy unless valid_ip?(proxy)

Here are your guard clauses. For now, I'm going to just leave them as they are.

tokens = proxy.split(':')                                                     
proxy_addr = { ip: tokens[0], port: tokens[1].to_i }                          
@agent.set_proxy(proxy_addr[:ip], proxy_addr[:port])

Here is one place where a lot is happening. We are doing string tokenization, data structure juggling, and setting more properties on @agent. I would first have to ask if the proxy really changes with each request. I think it would probably be better to simply set the proxy in the MechanizeFactory.get method by passing it in by parameter. If it does need to change, there are other strategies to deal with it, but for now I'm going to go on my assumption. This means those three lines are effectively gone, with their functionality relocated. The proxy parameter and proxy validation guard clause are also gone.

@tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}")

Now we are statefully changing that URL instance variable, which I said we no longer need. We can use a local variable, but this still deals with a lower level of abstraction than the rest of the method. We should extract it to a private helper and call that from our code. We can also just combine it with the following line to avoid the need for a temporary local variable.

`class MechanizeFactory
def self.get(proxy = '') # Or some logical default for proxy
fail Exceptions::InvalidProxy unless valid_ip?(proxy)

mechanize = Mechanize.new

mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys
.reject { |agent| agent == 'Mechanize' }.sample

proxy_ip, proxy_port = proxy.split(':')
mechaniz

Context

StackExchange Code Review Q#69800, answer score: 16

Revisions (0)

No revisions yet.