patternjavascriptMinor
Refactoring a phantom server called Swayzee
Viewed 0 times
refactoringswayzeephantomcalledserver
Problem
I have made a tool that is pretty useful for me. I have made it open source on github. It is a server that listens for requests and returns an HTML static version of a single page app. It works with a convention that when the title of the page changes means that the new page is rendered.
The way it works is very simple but the actual code its very ugly. Here is the main file of the project. I plan to refactor it soon, and it would be great to hear your ideas to improve modularity, response handling and concurrency.
`'use strict';
const
phantom = require('phantom'),
cluster = require('cluster'),
express = require('express'),
async = require('async'),
app = express();
var SCRIPT_REGEX = /)/gi;
var ASSET_REGEX = /\.(jpg|jpeg|png|gif|css|js|woff|\/img|\/css|\/js)/g;
var ESCAPED_REGEX = /escaped_fragment_=/g;
var ERR404_REGEX = /name="prerender-status-code/g;
var ORIGIN = 'http://localhost:8080/';
var pharguments = ["--load-images=false", "--ignore-ssl-errors=true", "--ssl-protocol=tlsv1"];
var processingQueue = [];
phantom.create(function(ph) {
ph.createPage(function(page) {
page.open(ORIGIN, function(status) {
console.log("Swayzee has awakened and is looking for " + ORIGIN + "...");
app.get('*', function(req, res) {
console.log("Receiving :", req.url)
var hasEscapedFragment = req.url.match(ESCAPED_REGEX);
var isAsset = req.url.match(ASSET_REGEX)
if (hasEscapedFragment && !isAsset) {
var hash = '#!' + req.url.split("_escaped_fragment_=")[1];
processingQueue.push({
hash: hash,
response: res,
page: page,
running: false
})
processNext();
} else {
var splited = req.url.split('/');
var redirect_url = ORIGIN + (splited[splited.length - 2]) + "/" + (splited[splited.length - 1])
res.redirect(redirect_url);
}
});
app.listen(1333, function() {
console.lo
The way it works is very simple but the actual code its very ugly. Here is the main file of the project. I plan to refactor it soon, and it would be great to hear your ideas to improve modularity, response handling and concurrency.
`'use strict';
const
phantom = require('phantom'),
cluster = require('cluster'),
express = require('express'),
async = require('async'),
app = express();
var SCRIPT_REGEX = /)/gi;
var ASSET_REGEX = /\.(jpg|jpeg|png|gif|css|js|woff|\/img|\/css|\/js)/g;
var ESCAPED_REGEX = /escaped_fragment_=/g;
var ERR404_REGEX = /name="prerender-status-code/g;
var ORIGIN = 'http://localhost:8080/';
var pharguments = ["--load-images=false", "--ignore-ssl-errors=true", "--ssl-protocol=tlsv1"];
var processingQueue = [];
phantom.create(function(ph) {
ph.createPage(function(page) {
page.open(ORIGIN, function(status) {
console.log("Swayzee has awakened and is looking for " + ORIGIN + "...");
app.get('*', function(req, res) {
console.log("Receiving :", req.url)
var hasEscapedFragment = req.url.match(ESCAPED_REGEX);
var isAsset = req.url.match(ASSET_REGEX)
if (hasEscapedFragment && !isAsset) {
var hash = '#!' + req.url.split("_escaped_fragment_=")[1];
processingQueue.push({
hash: hash,
response: res,
page: page,
running: false
})
processNext();
} else {
var splited = req.url.split('/');
var redirect_url = ORIGIN + (splited[splited.length - 2]) + "/" + (splited[splited.length - 1])
res.redirect(redirect_url);
}
});
app.listen(1333, function() {
console.lo
Solution
So if I read your code correctly, you're using PhantomJS to scrape a page. But the twist is, you don't just want the static HTML (otherwise, we'd be using a simple GET request), but you want the HTML after all the scripts that potentially influence the page run.
when the title of the page changes means that the new page is rendered
A weird choice for checking if the page has loaded. Why not listen for
Also wondering why you'd constrain yourself to one instance of a page. Your queue will only bottleneck your requests. You can always create more page objects. If you want rate limiting, I also suggest you pool up a few instances rather than having just one instance.
Additionally, it appears like you're only after the
If the setup hasn't resolved, the request will still wait. Your server is ready to receive requests, it's just not ready to service them until the setup finishes.
If you plan to spawn multiple instances, you can also use Promises. You can spawn say 10 instances, making that 10 promises, then use
Instead of managing extensions which can vary for the same thing (like
For simple cases, this is fine. However, this won't be enough to prevent scripts from passing through. There will always be ways to inject scripts into a page. I suggest you look into a parser that adheres to the HTML spec to parse, find and remove remove scripts properly. But again, it's probably overkill.
when the title of the page changes means that the new page is rendered
function isRenderReady() {
// console.log(iteration+") Title : ",previousTitle, ' vs ',document.title);
if (previousTitle === document.title && iteration < 40) {
iteration++;
setTimeout(isRenderReady, 50);A weird choice for checking if the page has loaded. Why not listen for
DOMContentReady? Or even better, use PhantomJS's onLoadFinished event. You can listen to onLoadFinished then run evaluate.phantom.create(function(ph) {
ph.createPage(function(page) {
page.open(ORIGIN, function(status) {
console.log("Swayzee has awakened and is looking for " + ORIGIN + "...");Also wondering why you'd constrain yourself to one instance of a page. Your queue will only bottleneck your requests. You can always create more page objects. If you want rate limiting, I also suggest you pool up a few instances rather than having just one instance.
Additionally, it appears like you're only after the
page object. You can abstract all this setup with a Promise object. That way, you don't delay creating your request handlers. var pagePromise = new Promise((resolve, reject) => {
phantom.create(ph => {
ph.createPage(page => {
resolve(page);
}
}
// Call reject when create and createPage fails.
});
app.get('*', (req,res) => {
pagePromise.then(page => {
// Do stuff to page
});
});
app.listen(1333, function() {
console.log("Swayzee is ready to receive requests.");
});If the setup hasn't resolved, the request will still wait. Your server is ready to receive requests, it's just not ready to service them until the setup finishes.
// Assume createPhantomPage is the same promise creator code above
var tenPagePromises = Array(10).fill(0).map(v => createPhantomPage());
Promise.all(tenPagePromises).then(pages => {
// all 10 pages are ready
var page1 = page[0];
});If you plan to spawn multiple instances, you can also use Promises. You can spawn say 10 instances, making that 10 promises, then use
Promise.all to listen if all 10 have resolved. When they all resolve, you have 10 page objects to play with.var ASSET_REGEX = /\.(jpg|jpeg|png|gif|css|js|woff|\/img|\/css|\/js)/g;
var isAsset = req.url.match(ASSET_REGEX)Instead of managing extensions which can vary for the same thing (like
.jpg and .jpeg for a JPEG image) and managing the match yourself, why not use a mimetype module? That way, you keep a list of mimetypes and use the module to get the mimetype of the path. Then you match.var SCRIPT_REGEX = /)/gi;
while (SCRIPT_REGEX.test(html)) {
html = html.replace(SCRIPT_REGEX, "");
}For simple cases, this is fine. However, this won't be enough to prevent scripts from passing through. There will always be ways to inject scripts into a page. I suggest you look into a parser that adheres to the HTML spec to parse, find and remove remove scripts properly. But again, it's probably overkill.
Code Snippets
function isRenderReady() {
// console.log(iteration+") Title : ",previousTitle, ' vs ',document.title);
if (previousTitle === document.title && iteration < 40) {
iteration++;
setTimeout(isRenderReady, 50);phantom.create(function(ph) {
ph.createPage(function(page) {
page.open(ORIGIN, function(status) {
console.log("Swayzee has awakened and is looking for " + ORIGIN + "...");var pagePromise = new Promise((resolve, reject) => {
phantom.create(ph => {
ph.createPage(page => {
resolve(page);
}
}
// Call reject when create and createPage fails.
});
app.get('*', (req,res) => {
pagePromise.then(page => {
// Do stuff to page
});
});
app.listen(1333, function() {
console.log("Swayzee is ready to receive requests.");
});// Assume createPhantomPage is the same promise creator code above
var tenPagePromises = Array(10).fill(0).map(v => createPhantomPage());
Promise.all(tenPagePromises).then(pages => {
// all 10 pages are ready
var page1 = page[0];
});var ASSET_REGEX = /\.(jpg|jpeg|png|gif|css|js|woff|\/img|\/css|\/js)/g;
var isAsset = req.url.match(ASSET_REGEX)Context
StackExchange Code Review Q#114891, answer score: 5
Revisions (0)
No revisions yet.