HiveBrain v1.2.0
Get Started
← Back to all entries
patternrubyMinor

Nokogiri crawler

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
nokogiricrawlerstackoverflow

Problem

The following code works but is a mess. But being totally new to Ruby I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what you are doing?

```
require 'nokogiri'
require 'open-uri'
require_relative 'db.rb' # Not used
require 'mechanize'
require 'digest/md5'
require 'colorize'
require 'win32console'

# Get the Base URL for Groups
# baseURL = ''
# The Initial Search Point for Groups

#Base URL
baseURL = ''
pageURL = ''
doc = Nokogiri::HTML(open(pageURL))

# Search for nodes by css
pages = doc.css('.pages')
thumbs = doc.css('.thumb')
startingPages = pages.css('a').first.text
totalPages = pages.css('a').last.text.to_i
currentPage = 1

## DO THE FIRST PAGE
thumbs.each do |thumb|
imgSrc = thumb.css('.t_img').first['src']
file = open(imgSrc).read.unpack('H*').first
agent = Mechanize.new
image = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
title = thumb.css('.t_img').first['alt']
fileType = image.header["content-type"].to_s
lastModified = image.header["last-modified"].to_s
fileSize = image.header["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = image.filename

# ||, -, *, /, <>, , ,(comma), =, =, ~=, !=, ^=, (, ) Illegal characters for sql queries. Check title for these before inserting into the database
title = title.gsub("'", "")

begin
# connect to the MySQL server
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"

dbh.do( sqlText )
puts "Record has been created for".green + "\t" + sha.yellow
dbh.commit
rescue DBI:

Solution

I can't speak to Ruby but I see an improvement that could be done to your MySQL code. Since this operation loops multiple times and passes a query to MySQL each time, you would considerably improve your database execution by using a stored PROCEDURE and passing parameters to it. The purpose of this (besides simplifying the code) is that MySQL would store the execution plan with the procedure after it runs for the first time.

So instead of this:

sqlText =   "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary) 
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"


You could do this (once) in MySQL (or pass it from Ruby, doesn't matter):

DELIMITER |
DROP PROCEDURE IF EXISTS MyProcedure| -- Use a relevant name
CREATE PROCEDURE MyProcedure -- Use a relevant name
  (
  IN driver VARCHAR, -- Note you can specify character limit if you use VAR(50) or other number
  IN source VARCHAR,
  IN title VARCHAR,
  IN imgSrc VARCHAR,
  IN fileType VARCHAR,
  IN fileName VARCHAR,
  IN lastModified VARCHAR,
  IN [fileSize.to_i] VARCHAR, -- Using brackets due to period character
  IN md VARCHAR,
  IN sha VARCHAR,
  IN binary VARCHAR
  );
AS
BEGIN

INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (driver, source, title, 'NA', imgSrc,  fileType,  fileName,  lastModified, [fileSize.to_i], md, sha, binary);
END|
DELIMITER ;


Note you can use something else as DELIMITER it's a wild card. I've seen // or $$ used frequently, or you can even use GO as long as it is not ;.

Then in your Ruby script you would just do this instead of what you have now:

sqlText =   "CALL MyProcedure('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"

Code Snippets

sqlText =   "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary) 
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"
DELIMITER |
DROP PROCEDURE IF EXISTS MyProcedure| -- Use a relevant name
CREATE PROCEDURE MyProcedure -- Use a relevant name
  (
  IN driver VARCHAR, -- Note you can specify character limit if you use VAR(50) or other number
  IN source VARCHAR,
  IN title VARCHAR,
  IN imgSrc VARCHAR,
  IN fileType VARCHAR,
  IN fileName VARCHAR,
  IN lastModified VARCHAR,
  IN [fileSize.to_i] VARCHAR, -- Using brackets due to period character
  IN md VARCHAR,
  IN sha VARCHAR,
  IN binary VARCHAR
  );
AS
BEGIN

INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (driver, source, title, 'NA', imgSrc,  fileType,  fileName,  lastModified, [fileSize.to_i], md, sha, binary);
END|
DELIMITER ;
sqlText =   "CALL MyProcedure('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}',  '#{fileType}',  '#{fileName}',  '#{lastModified}', #{fileSize.to_i},   '#{md}',   '#{sha}',   '#{binary}'   )"

Context

StackExchange Code Review Q#56081, answer score: 5

Revisions (0)

No revisions yet.