Crawling for emails on websites given by Google API
Problem
I'm trying to build an app which crawls a website to find the emails that it has and prints them. I also want to allow the user to type "false" into the console when they want to skip the website (maybe the user has already found 2 emails and doesn't need any more).
Is the way I'm approaching this the best way? If not, then how can I improve, and what am I missing?
require "nokogiri"
require "json"
require 'mechanize'
require 'anemone'
require "typhoeus"
require "timeout"
class String
def to_bool()
return true if self == "true"
return false if self == "false"
return nil
end
end
class Query
def initialize(keyword)
10.times do |n|
num = (n * 10 + 1).to_s
p num
req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key=[my_key]&cx=018020274830505137072:utuofm0ugh0&q=" + keyword + "&start=" + num, followlocation: true)
res = req.run
File.open("file.json","w") do |file|
file.write(res.body)
end
continue = "true"
fs = File.read("file.json");
string = JSON.parse(fs);
string["items"].each do |item|
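p continue.to_s + "<- item"
begin
Anemone.crawl("http://" + item["displayLink"] + "/") do |anemone|
anemone.on_every_page do |page|
if continue.chomp.to_bool == false
raise "no more please"
end
request = Typhoeus::Request.new(page.url, followlocation: true)
response = request.run
email = /[-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}/.match(response.body)
if email.nil?
else
p email
begin
continue = Timeout::timeout(2) do
p "insert now false/nothing"
gets
end
rescue Timeout::Error
continue = "true"
end
end
end
end
rescue Exception => e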
p e
continue = "true"
next
end
p "---------------------------------------------------------"
end
p "new request"
end
end
end
qs = Query.new("texas+web+development")

Solution
As @tokland pointed out, the major thing that jumps out when reading your code is the indentation. In most Ruby code you see, the standard indentation is 2 spaces. It looks like you are using hard tabs, which are generally a bad idea. You should explore the settings of your text editor: most have an option to insert spaces instead of tabs when you press the Tab key. Here is what your code would look like if indented properly:
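require "nokogiri"
require "json"
require 'mechanize'
require 'anemone'
require "typhoeus"
require "timeout"

class String
  def to_bool()
    return true if self == "true"
    return false if self == "false"
    return nil
  end
end

class Query
  def initialize(keyword)
    10.times do |n|
      num = (n * 10 + 1).to_s
      p num
      req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key=[my_key]&cx=018020274830505137072:utuofm0ugh0&q=" + keyword + "&start=" + num, followlocation: true)
      res = req.run
      File.open("file.json","w") do |file|
        file.write(res.body)
      end
      continue = "true"
      fs = File.read("file.json");
      string = JSON.parse(fs);
      string["items"].each do |item|
        p continue.to_s + "<- item"
        begin
          Anemone.crawl("http://" + item["displayLink"] + "/") do |anemone|
            anemone.on_every_page do |page|
              if continue.chomp.to_bool == false
                raise "no more please"
              end
              request = Typhoeus::Request.new(page.url, followlocation: true)
              response = request.run
              email = /[-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}/.match(response.body)
              if email.nil?
              else
                p email
                begin
                  continue = Timeout::timeout(2) do
                    p "insert now false/nothing"
                    gets
                  end
                rescue Timeout::Error
                  continue = "true"
                end
              end
            end
          end
        rescue Exception => e
          p e
          continue = "true"
          next
        end
        p "---------------------------------------------------------"
      end
      p "new request"
    end
  end
end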
qs = Query.new("texas+web+development")

For your to_bool() function, this might be a little nitpicky, but I would rewrite it as a case statement like this:

class String
  def to_bool()
    case self
    when "true"; true
    when "false"; false
    else; nil
    end
  end
end

The URL you're passing to Typhoeus::Request.new is rather long. You might consider doing something like this to shorten your line lengths a little:

base_url = "https://www.googleapis.com/customsearch/v1?key=[my_key]&cx=018020274830505137072:utuofm0ugh0&q="
req_url = base_url + keyword + "&start=" + num
req = Typhoeus::Request.new(req_url, followlocation: true)

Or, if you want to be really diligent about not having long lines, you could do this:

base_url = "https://www.googleapis.com/customsearch/v1"
key = "[my_key]"
cx = "018020274830505137072:utuofm0ugh0"
req_url = "#{base_url}?key=#{key}&cx=#{cx}&q=#{keyword}&start=#{num}"

You have a part of your code that goes if email.nil? (nothing) else (something). A better way to put this would be:

unless email.nil?
  p email
  # etc.

As a general note, you could make your code more concise by using fewer "intermediate" or "throw-away" variables. Take advantage of Ruby's method chaining and try condensing your code like this:

...
# No need to define a variable res; req.run is short enough
req = Typhoeus::Request.new(req_url, followlocation: true)
File.open("file.json", "w") do |file|
  file.write(req.run.body)
end
...
# You can get rid of the fs and string variables and do this:
JSON.parse(File.read("file.json"))["items"].each do |item|
  p continue.to_s + "<- item"
  # etc.
...
# You could change the name of the variable from request to req,
# for consistency -- you named another Typhoeus request "req"
# earlier in the code, and it doesn't look like you still need
# that variable, so there's no harm in re-using the name "req."
# Notice how you can eliminate the need for the variables "response"
# and "email" like this:
req = Typhoeus::Request.new(page.url, followlocation: true)
email_pattern = /[-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}/
unless email_pattern.match(req.run.body).nil?
  # etc.

You might consider changing if continue.chomp.to_bool == false to unless continue.chomp.to_bool, or even if continue.chomp == "false". Be aware that the unless form also triggers when to_bool returns nil, that is, for any input other than "true" or "false". In fact, I think I like the last way the best: you could totally do away with monkey-patching a String#to_bool method and just compare continue.chomp to "true" or "false". It's your call, of course. :)

Lastly, this is just my 2 cents, but I think you could simplify the way you're using continue. If I'm understanding correctly, it starts as "true" and you want the program to keep running unless the user types "false" when prompted. I would consider doing away with your String#to_bool method and just checking whether continue.chomp.downcase equals "stop", "exit" or "quit". You could do something like this:

stop_words = ["stop", "exit", "quit"]
continue = ""
...
if stop_words.include? continue.chomp.downcase
  raise "no more please"
end

This would save you from having to keep doing continue = "true" to make sure the program doesn't stop. As long as the value of continue.chomp.downcase is not one of the stop words, the program will keep running.

I admit, though, that I don't really understand the use of continue in p continue.to_s + "<- item", so maybe there is something I'm missing.

Anyway, hope this helps!
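For what it's worth, the suggestions about the URL, the email pattern and the stop words could be pulled into three small helpers. This is only a rough sketch, not a drop-in replacement: the names build_search_url, extract_emails and stop_requested? are made up for illustration and are not part of the original code. It also assumes the keyword is passed as plain text with spaces (for example "texas web development"), because URI.encode_www_form does the escaping itself and would double-escape a keyword that already contains "+". Using scan instead of match is a swapped-in technique so that every address on a page is returned rather than only the first one.

```ruby
require "uri"

# Hypothetical constants and helper names, introduced only for this sketch.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
EMAIL_PATTERN   = /[-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}/
STOP_WORDS      = %w[stop exit quit].freeze

# Builds the request URL from its parts and escapes each value.
# Assumes keyword is plain text (spaces, not "+").
def build_search_url(key, cx, keyword, start)
  query = URI.encode_www_form(key: key, cx: cx, q: keyword, start: start)
  "#{SEARCH_ENDPOINT}?#{query}"
end

# Returns every distinct address found in the page body.
def extract_emails(body)
  body.scan(EMAIL_PATTERN).uniq
end

# True only when the user typed one of the stop words.
# nil (for example after a timeout) and any other input mean "keep going".
def stop_requested?(input)
  STOP_WORDS.include?(input.to_s.chomp.downcase)
end
```

Inside on_every_page you could then raise as soon as stop_requested?(continue) is true, and print extract_emails(response.body) instead of a single match.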
Code Snippets
require "nokogiri"
require "json"
require 'mechanize'
require 'anemone'
require "typhoeus"
require "timeout"

class String
  def to_bool()
    return true if self == "true"
    return false if self == "false"
    return nil
  end
end

class Query
  def initialize(keyword)
    10.times do |n|
      num = (n * 10 + 1).to_s
      p num
      req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key=[my_key]&cx=018020274830505137072:utuofm0ugh0&q=" + keyword + "&start=" + num, followlocation: true)
      res = req.run
      File.open("file.json","w") do |file|
        file.write(res.body)
      end
      continue = "true"
      fs = File.read("file.json");
      string = JSON.parse(fs);
      string["items"].each do |item|
        p continue.to_s + "<- item"
        begin
          Anemone.crawl("http://" + item["displayLink"] + "/") do |anemone|
            anemone.on_every_page do |page|
              if continue.chomp.to_bool == false
                raise "no more please"
              end
              request = Typhoeus::Request.new(page.url, followlocation: true)
              response = request.run
              email = /[-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}/.match(response.body)
              if email.nil?
              else
                p email
                begin
                  continue = Timeout::timeout(2) do
                    p "insert now false/nothing"
                    gets
                  end
                rescue Timeout::Error
                  continue = "true"
                end
              end
            end
          end
        rescue Exception => e
          p e
          continue = "true"
          next
        end
        p "---------------------------------------------------------"
      end
      p "new request"
    end
  end
end
qs = Query.new("texas+web+development")

class String
  def to_bool()
    case self
    when "true"; true
    when "false"; false
    else; nil
    end
  end
end

base_url = "https://www.googleapis.com/customsearch/v1?key=[my_key]&cx=018020274830505137072:utuofm0ugh0&q="
req_url = base_url + keyword + "&start=" + num
req = Typhoeus::Request.new(req_url, followlocation: true)

base_url = "https://www.googleapis.com/customsearch/v1"
key = "[my_key]"
cx = "018020274830505137072:utuofm0ugh0"
req_url = "#{base_url}?key=#{key}&cx=#{cx}&q=#{keyword}&start=#{num}"

unless email.nil?
  p email
  # etc.

Context
StackExchange Code Review Q#46108, answer score: 4