Node.js web crawler

Submitted by: @import:stackexchange-codereview

Problem

I've been writing a Node.js web crawler. To get it running, I've had to string together rather a lot of different npm modules.

I've done my best to keep the code DRY and well designed, but it has turned into a bit of a tangled mess. In places I feel forced to use global variables to communicate between functions, which makes me very uncomfortable.

I'd really appreciate advice on how to structure this program more sensibly. I've spent hours trying to refactor it and have hit a plateau that my limited skills can't get me over: I can tell the design is bad, but not how to improve it.

```
/* Modules */
// Load all imported modules
var Crawler = require("crawler").Crawler;
var Redis = require("redis");
var _ = require("underscore");
var url = require("url");
var express = require("express");
var app = express();
var httpServer = require('http').createServer(app);
var io = require('socket.io').listen(httpServer);
var robots = require('robots');

// Define inline - i.e. custom - modules

function CreateDataStore(storePort, storeUrl, passwd) {
  var client;
  // Create the redis client
  client = Redis.createClient(storePort, storeUrl);

  // Set client password and create logging function for on connect event.
  client.auth(passwd, function(err, msg) {
    if (err) {
      console.log("redis-error: " + err);
    }
    console.log("redis: " + msg);
    console.log("redis: Connected");
  });

  function createSiteUpdater(site) {

    var testResults = {};

    function CreateUpdaterFunction(redisKey, redisCommand) {
      /* This function creates a property on the dataStore object
         that regularly updates itself with state of that key on the redis
         server.

         These keys can then later be queried by the view layer to see the state
         of the crawl. It requires two arguments, redisKey (string) which is
         the key that we ... (listing truncated in the source) */
```

Solution

Isaacs, the current Node.js maintainer, recently described the Node.js philosophy in this blog post: http://blog.izs.me/post/48281998870/unix-philosophy-and-node-js


"In Node, the basic building block that people share and interact with is not a binary on the command line, but rather a module loaded in by require()."

Use files. Use modules. Don't let your files grow. 100 lines is already too much.

Clearly, your code contains at least three entities that could each be separated into their own module:

  • A data store
  • An analytics tester
  • A crawler
  • Possibly, a socket handler

I'd understand if you want to keep it all in one module, but you should at least split it up into different files, and use require() to your heart's content.
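
As a rough illustration of that split, here is a minimal sketch of two of those entities written as factories that receive their dependencies as arguments, which is one way to avoid the global variables mentioned in the question. Everything in it is an assumption made for illustration, not the original code: the file names in the comments, the names `createDataStore`, `createCrawler`, `countPage`, `pagesCrawled` and `recordVisit`, the `pages:` key prefix, and the in-memory `createFakeClient` that stands in for a Redis client exposing callback-style `incr` and `get`.

```javascript
'use strict';

// --- lib/data-store.js (hypothetical file name) ---
// The store receives its client as an argument instead of reading a global,
// so the caller decides which connection it uses.
function createDataStore(client) {
  return {
    // Increment the per-site page counter.
    countPage: function (site, callback) {
      client.incr('pages:' + site, callback);
    },
    // Read the counter back, e.g. for the view layer.
    pagesCrawled: function (site, callback) {
      client.get('pages:' + site, function (err, value) {
        if (err) { return callback(err); }
        callback(null, Number(value || 0));
      });
    }
  };
}

// --- lib/crawler.js (hypothetical file name) ---
// Depends only on the store's interface, not on Redis itself.
function createCrawler(store) {
  return {
    // Called once per fetched page; the actual fetching is omitted here.
    recordVisit: function (site, callback) {
      store.countPage(site, callback);
    }
  };
}

// --- in-memory stand-in for a Redis client, for illustration only ---
function createFakeClient() {
  var data = {};
  return {
    incr: function (key, callback) {
      data[key] = (data[key] || 0) + 1;
      callback(null, data[key]);
    },
    get: function (key, callback) {
      var found = Object.prototype.hasOwnProperty.call(data, key);
      callback(null, found ? String(data[key]) : null);
    }
  };
}

// --- app.js (hypothetical file name): wire the pieces together ---
var store = createDataStore(createFakeClient());
var crawler = createCrawler(store);

crawler.recordVisit('example.com', function (err) {
  if (err) { return console.log('store-error: ' + err); }
  store.pagesCrawled('example.com', function (err, count) {
    console.log('example.com pages crawled: ' + count);
  });
});
```

In a real layout each factory would live in its own file and end with something like `module.exports = createDataStore;`, and the main file would load it with `require('./lib/data-store')`. Because the crawler only ever sees the store object it is handed, no state has to be shared through globals, and each piece can be exercised on its own with a fake client like the one above.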

Context

StackExchange Code Review Q#25218, answer score: 2
