patternrubyMinor
Nokogiri crawler
Viewed 0 times
nokogiricrawlerstackoverflow
Problem
The following code works but is a mess. But being totally new to Ruby I have had big problems trying to refactor it into something resembling clean OOP code. Could you help with this and explain what you are doing?
```
require 'nokogiri'
require 'open-uri'
require_relative 'db.rb' # Not used
require 'mechanize'
require 'digest/md5'
require 'colorize'
require 'win32console'
# Get the Base URL for Groups
# baseURL = ''
# The Initial Search Point for Groups
#Base URL
baseURL = ''
pageURL = ''
doc = Nokogiri::HTML(open(pageURL))
# Search for nodes by css
pages = doc.css('.pages')
thumbs = doc.css('.thumb')
startingPages = pages.css('a').first.text
totalPages = pages.css('a').last.text.to_i
currentPage = 1
## DO THE FIRST PAGE
thumbs.each do |thumb|
imgSrc = thumb.css('.t_img').first['src']
file = open(imgSrc).read.unpack('H*').first
agent = Mechanize.new
image = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
title = thumb.css('.t_img').first['alt']
fileType = image.header["content-type"].to_s
lastModified = image.header["last-modified"].to_s
fileSize = image.header["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = image.filename
# ||, -, *, /, <>, , ,(comma), =, =, ~=, !=, ^=, (, ) Illegal characters for sql queries. Check title for these before inserting into the database
title = title.gsub("'", "")
begin
# connect to the MySQL server
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"
dbh.do( sqlText )
puts "Record has been created for".green + "\t" + sha.yellow
dbh.commit
rescue DBI:
```
require 'nokogiri'
require 'open-uri'
require_relative 'db.rb' # Not used
require 'mechanize'
require 'digest/md5'
require 'colorize'
require 'win32console'
# Get the Base URL for Groups
# baseURL = ''
# The Initial Search Point for Groups
#Base URL
baseURL = ''
pageURL = ''
doc = Nokogiri::HTML(open(pageURL))
# Search for nodes by css
pages = doc.css('.pages')
thumbs = doc.css('.thumb')
startingPages = pages.css('a').first.text
totalPages = pages.css('a').last.text.to_i
currentPage = 1
## DO THE FIRST PAGE
thumbs.each do |thumb|
imgSrc = thumb.css('.t_img').first['src']
file = open(imgSrc).read.unpack('H*').first
agent = Mechanize.new
image = agent.get(imgSrc)
md = Digest::MD5.hexdigest(*[file.to_s])
sha = Digest::SHA2.hexdigest(*[file.to_s])
title = thumb.css('.t_img').first['alt']
fileType = image.header["content-type"].to_s
lastModified = image.header["last-modified"].to_s
fileSize = image.header["content-length"].to_s
driver = "Prowler"
source = $pagesPath.to_s
binary = file
fileName = image.filename
# ||, -, *, /, <>, , ,(comma), =, =, ~=, !=, ^=, (, ) Illegal characters for sql queries. Check title for these before inserting into the database
title = title.gsub("'", "")
begin
# connect to the MySQL server
dbh = DBI.connect('dbi:ODBC:DamoclesProwler', 'user', 'pw')
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"
dbh.do( sqlText )
puts "Record has been created for".green + "\t" + sha.yellow
dbh.commit
rescue DBI:
Solution
I can't speak to Ruby but I see an improvement that could be done to your MySQL code. Since this operation loops multiple times and passes a query to MySQL each time, you would considerably improve your database execution by using a stored
So instead of this:
You could do this (once) in MySQL (or pass it from Ruby, doesn't matter):
Note you can use something else as
Then in your Ruby script you would just do this instead of what you have now:
PROCEDURE and passing parameters to it. The purpose of this (besides simplifying the code) is that MySQL would store the execution plan with the procedure after it runs for the first time.So instead of this:
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"You could do this (once) in MySQL (or pass it from Ruby, doesn't matter):
DELIMITER |
DROP PROCEDURE IF EXISTS MyProcedure| -- Use a relevant name
CREATE PROCEDURE MyProcedure -- Use a relevant name
(
IN driver VARCHAR, -- Note you can specify character limit if you use VAR(50) or other number
IN source VARCHAR,
IN title VARCHAR,
IN imgSrc VARCHAR,
IN fileType VARCHAR,
IN fileName VARCHAR,
IN lastModified VARCHAR,
IN [fileSize.to_i] VARCHAR, -- Using brackets due to period character
IN md VARCHAR,
IN sha VARCHAR,
IN binary VARCHAR
);
AS
BEGIN
INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (driver, source, title, 'NA', imgSrc, fileType, fileName, lastModified, [fileSize.to_i], md, sha, binary);
END|
DELIMITER ;Note you can use something else as
DELIMITER it's a wild card. I've seen // or $$ used frequently, or you can even use GO as long as it is not ;.Then in your Ruby script you would just do this instead of what you have now:
sqlText = "CALL MyProcedure('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"Code Snippets
sqlText = "INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES ('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"DELIMITER |
DROP PROCEDURE IF EXISTS MyProcedure| -- Use a relevant name
CREATE PROCEDURE MyProcedure -- Use a relevant name
(
IN driver VARCHAR, -- Note you can specify character limit if you use VAR(50) or other number
IN source VARCHAR,
IN title VARCHAR,
IN imgSrc VARCHAR,
IN fileType VARCHAR,
IN fileName VARCHAR,
IN lastModified VARCHAR,
IN [fileSize.to_i] VARCHAR, -- Using brackets due to period character
IN md VARCHAR,
IN sha VARCHAR,
IN binary VARCHAR
);
AS
BEGIN
INSERT INTO prowlerRunningResults(ProwlerDriver, SourceURL, Title, DestinationURL, FileSource, FileType, FileName, LastModified, FileSize, HashMD5, HashSHA256, ImageBinary)
VALUES (driver, source, title, 'NA', imgSrc, fileType, fileName, lastModified, [fileSize.to_i], md, sha, binary);
END|
DELIMITER ;sqlText = "CALL MyProcedure('#{driver}', '#{source}', '#{title}', 'NA', '#{imgSrc}', '#{fileType}', '#{fileName}', '#{lastModified}', #{fileSize.to_i}, '#{md}', '#{sha}', '#{binary}' )"Context
StackExchange Code Review Q#56081, answer score: 5
Revisions (0)
No revisions yet.