
Parsing HTML from multiple webpages simultaneously

Submitted by: @import:stackexchange-codereview

Problem

My friend wrote a scraper in Go that takes the results from a house listing webpage and finds listings for houses that he's interested in. The initial search returns listings, which are filtered by city; the details of each remaining listing are then fetched with a separate HTTP request and checked to see whether it's an interesting listing. I also ported this to Node.js.

I've written a version in Rust (version 1.6) that works, but I'm interested in a couple things:

  • Am I doing this in a Rusty way or are there things I should be doing differently?
  • Am I properly taking advantage of concurrency, specifically being able to parse multiple details pages at once? Particularly wondering how large to make the Pool.
  • This version is several times slower than the Go/Node.js versions (they are both about the same speed). Is there something I'm missing that's taking up a lot of additional time? (using cargo run --release or just building for release).
  • I am creating and passing in the redis client and http client to the check_block_and_parking function each time but it would make more sense not to pass these in but declare them globally or perhaps make them properties of an object. Suggestions?
  • Is there any way to limit the number of simultaneous connections from the connection pool?

https://github.com/ajcrites/gainesvillemls-scraper-rust

```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_M
```

Solution

Overall, the code looks close to how I would write it. A few smaller points:

  • Use contains(...) instead of find(...) != None. It has the potential to be faster, but more importantly it's easier to understand and slightly shorter.
  • You can use let () = instead of let _: () =. Both are pretty strange to see, but that appears to be an artifact of how flexible Redis is.
  • The lines that look at the following span are really long, hard to read, and duplicated. I'd recommend creating a helper function.
  • There are many places with unwrap. I can't really judge how valid it is to abort the entire program if things are missing, but the density of unwrap per line is much higher than I'm used to.
  • You could combine items with functions like and_then. Unfortunately, the code often has to combine Result and Option, which makes this uglier.
  • You could make more code conditional using match or if let statements.
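
To make a few of these points concrete, here is a minimal, std-only sketch written for a current Rust toolchain (not 1.6). The names `is_gainesville`, `value_after`, and `parse_price` are hypothetical, and the slice of strings only stands in for the label/value spans on the details page; the real code would do the equivalent lookups through kuchiki selectors.

```rust
/// `contains` states the intent directly, compared with `find(..) != None`.
fn is_gainesville(text: &str) -> bool {
    text.to_lowercase().contains("gainesville, fl")
}

/// Hypothetical helper for the repeated "find a label, read the next span"
/// lookup. `cells` is an assumed stand-in for the spans of a details page.
fn value_after<'a>(cells: &[&'a str], label: &str) -> Option<&'a str> {
    cells
        .iter()
        .position(|cell| cell.contains(label))
        // `get` returns None when the label is the last cell.
        .and_then(|index| cells.get(index + 1))
        .copied()
}

/// Mixing Option and Result: turn the missing value into an error first,
/// then chain the fallible parse with `and_then` instead of `unwrap`.
fn parse_price(text: Option<&str>) -> Result<u32, String> {
    text.ok_or_else(|| "missing price".to_string())
        .and_then(|raw| {
            raw.trim()
                .trim_start_matches('$')
                .replace(",", "")
                .parse::<u32>()
                .map_err(|e| format!("bad price {:?}: {}", raw, e))
        })
}
```

A caller can then use `if let Some(parking) = value_after(&cells, "Parking")` rather than a chain of `unwrap` calls, and decide at one place what a missing or malformed value should mean.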




Am I properly taking advantage of concurrency, specifically being able to parse multiple details pages at once? Particularly wondering how large to make the Pool.

Yes, it looks quite reasonable. In this application, I'd expect it to be more IO bound than CPU bound. There are good guides out there, but generally you'd want to have more working threads than you have CPUs as each thread is likely to be waiting for IO to happen.
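
As a rough illustration of sizing a pool for IO-bound work, the sketch below uses only the standard library instead of scoped_threadpool and assumes a current Rust toolchain. `io_pool_size` and `run_bounded` are hypothetical names, and the multiplier simply mirrors the `num_cpus::get() * 4` in the code; treat it as a starting point to measure against, not a recommendation.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Pool size for IO-bound work: a multiple of the available CPU count.
fn io_pool_size(multiplier: usize) -> usize {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    cpus * multiplier.max(1)
}

/// Runs `work` over `jobs` on at most `workers` threads and returns the
/// results in input order.
fn run_bounded<T, R, F>(jobs: Vec<T>, workers: usize, work: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let total = jobs.len();
    // Shared queue of (index, job) pairs that the workers pull from.
    let queue = Arc::new(Mutex::new(jobs.into_iter().enumerate()));
    let work = Arc::new(work);
    let (tx, rx) = mpsc::channel();

    let handles: Vec<_> = (0..workers.max(1).min(total.max(1)))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let work = Arc::clone(&work);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take the next job.
                let next = queue.lock().unwrap().next();
                match next {
                    Some((index, job)) => tx.send((index, work(job))).unwrap(),
                    None => break,
                }
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends with the workers.
    drop(tx);

    let mut results: Vec<Option<R>> = (0..total).map(|_| None).collect();
    for (index, result) in rx {
        results[index] = Some(result);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    results.into_iter().map(|r| r.unwrap()).collect()
}
```

Calling `run_bounded(listings, io_pool_size(4), fetch_details)` with your own `fetch_details` function would keep at most that many requests running at once; timing a few different multipliers is the practical way to pick one.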


This version is several times slower than the Go/Node.js versions (they are both about the same speed). Is there something I'm missing that's taking up a lot of additional time? (using cargo run --release or just building for release).

You didn't say how much time either of your versions took, so there's not much I can say here. Running the program locally took:

```
real    0m5.146s
user    0m1.139s
sys     0m0.678s
```


Which seems pretty fast.


I am creating and passing in the Redis client and HTTP client to the check_block_and_parking function each time but it would make more sense not to pass these in but declare them globally or perhaps make them properties of an object.

No, it would not make sense to make them global. A major strength of Rust is that you can see where items are valid and have a very good sense of where memory will be allocated and deallocated. When you use globals, you lose that.

Additionally, global mutable items (which may not be needed here) are even more of a pain because they require a mutex or equivalent.

Making a new struct that holds other things could be useful though.
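
A minimal sketch of what such a struct could look like, using only the standard library. `FakeHttp` and `FakeStore` are hypothetical stand-ins for `hyper::Client` and the Redis client so that the example is self-contained; `Scraper` and `check_listing` are likewise illustrative names, not part of the original code.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the HTTP client (assumption: the real code uses hyper).
struct FakeHttp;

impl FakeHttp {
    fn fetch_detail(&self, mls: &str) -> String {
        format!("<html>detail for {}</html>", mls)
    }
}

/// Stand-in for the Redis hash that records listings already seen.
struct FakeStore {
    seen: Mutex<HashMap<String, String>>,
}

impl FakeStore {
    fn new() -> Self {
        FakeStore { seen: Mutex::new(HashMap::new()) }
    }

    /// Records the listing; returns false if it was already present.
    fn insert_if_new(&self, mls: &str, price: &str) -> bool {
        let mut seen = self.seen.lock().unwrap();
        if seen.contains_key(mls) {
            return false;
        }
        seen.insert(mls.to_string(), price.to_string());
        true
    }
}

/// Groups the borrowed clients so they travel together, without globals.
struct Scraper<'a> {
    http: &'a FakeHttp,
    store: &'a FakeStore,
}

impl<'a> Scraper<'a> {
    /// Returns the detail page for a new listing, or None if already seen.
    fn check_listing(&self, mls: &str, price: &str) -> Option<String> {
        if !self.store.insert_if_new(mls, price) {
            return None;
        }
        Some(self.http.fetch_detail(mls))
    }
}
```

The struct only borrows the clients, so their lifetimes remain visible at the call site, which is the property that globals would hide.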


Is there any way to limit the number of simultaneous connections from the connection pool?

I'm not sure what you are asking here. You don't have a connection pool; you have a thread pool. Each thread makes HTTP connections, but these connections are not saved or reused. Currently, your maximum connection count is the number of threads.
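
To illustrate why the thread count acts as the cap, here is a small std-only sketch (current Rust toolchain assumed). `peak_in_flight` is a hypothetical function, and the sleep is only a stand-in for an HTTP request; it counts how many simulated requests overlap when a fixed number of scoped worker threads drain a shared queue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Processes `urls` on `workers` scoped threads and returns the peak number
/// of simulated requests that were in flight at the same time.
fn peak_in_flight(urls: &[&str], workers: usize) -> usize {
    let queue = Mutex::new(urls.iter());
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    thread::scope(|scope| {
        for _ in 0..workers.max(1) {
            scope.spawn(|| loop {
                // Take the next URL; the lock is released before the "request".
                let next = queue.lock().unwrap().next();
                match next {
                    Some(_url) => {
                        let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                        peak.fetch_max(now, Ordering::SeqCst);
                        // Stand-in for the blocking HTTP request.
                        thread::sleep(Duration::from_millis(5));
                        in_flight.fetch_sub(1, Ordering::SeqCst);
                    }
                    None => break,
                }
            });
        }
    });

    peak.load(Ordering::SeqCst)
}
```

Because each worker performs one blocking request at a time, the peak can never exceed `workers`, so lowering the pool size is the simplest way to lower the number of simultaneous connections in this design.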

```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::{NodeDataRef, ElementData};
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_MST_list_prcYNNE=175000&LM_MST_prop_cdYNNL[]=9&LM_MST_sqft_nYNNB=&LM_MST_sqft_nYNNE=&LM_MST_yr_bltYNNB=&LM_MST_yr_bltYNNE=&LM_MST_bdrmsYNNB=3&LM_MST_bdrmsYNNE=&LM_MST_bathsYNNB=2&LM_MST_bathsYNNE=&LM_MST_hbathYNNB=&LM_MST_hbathYNNE=&LM_MST_countyYNCL[]=ALA&LM_MST_str_noY1CS=&LM_MST_str_namY1VZ=&LM_MST_remarksY1VZ=&openHouseStartDt_B=&openHouseStartDt_E=&ve_info=&ve_rgns=1&LM_MST_LATXX6I=&poi=&count=1&isLink=0&custom=", KEY);

    let redis_client = &redis::Client::open(&*env::var("REDIS_DSN").unwrap()).unwrap();

    let res = http_client.post(&format!("{}/gan/idx/search.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(body)
        .send()
        .unwrap();

    let document = kuchiki::parse_html().from_http(res).unwrap();

    let cpus = num_cpus::get() * 4;
    let mut pool = Pool::new(cpus as u32);
    pool.scoped(|scope| {
        for listing in document.select("table.listings").unwrap() {
            let elem = listing.as_node();
            let text = elem.select("tr:nth-of-type(3)").unwrap().next().unwrap().text_contents();
            if text.to_lowercase().contains("gainesville, fl") {
                let mls = elem.select("span.mls").unwrap().next();
                let price = elem.select("span.price").unwrap().next();

                if let (Some(mls), Some(price)) = (mls, price) {
                    let mls = mls.text_contents();
                    let price = price.text_contents();

                    scope.execute(move || {
                        check_block_and_parking(mls, price, http_client, redis_client);
                    });
                }
            }
        }
    });
}
```

Code Snippets

```
real    0m5.146s
user    0m1.139s
sys     0m0.678s
```

```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::{NodeDataRef, ElementData};
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_MST_list_prcYNNE=175000&LM_MST_prop_cdYNNL[]=9&LM_MST_sqft_nYNNB=&LM_MST_sqft_nYNNE=&LM_MST_yr_bltYNNB=&LM_MST_yr_bltYNNE=&LM_MST_bdrmsYNNB=3&LM_MST_bdrmsYNNE=&LM_MST_bathsYNNB=2&LM_MST_bathsYNNE=&LM_MST_hbathYNNB=&LM_MST_hbathYNNE=&LM_MST_countyYNCL[]=ALA&LM_MST_str_noY1CS=&LM_MST_str_namY1VZ=&LM_MST_remarksY1VZ=&openHouseStartDt_B=&openHouseStartDt_E=&ve_info=&ve_rgns=1&LM_MST_LATXX6I=&poi=&count=1&isLink=0&custom=", KEY);

    let redis_client = &redis::Client::open(&*env::var("REDIS_DSN").unwrap()).unwrap();

    let res = http_client.post(&format!("{}/gan/idx/search.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(body)
        .send()
        .unwrap();

    let document = kuchiki::parse_html().from_http(res).unwrap();

    let cpus = num_cpus::get() * 4;
    let mut pool = Pool::new(cpus as u32);
    pool.scoped(|scope| {
        for listing in document.select("table.listings").unwrap() {
            let elem = listing.as_node();
            let text = elem.select("tr:nth-of-type(3)").unwrap().next().unwrap().text_contents();
            if text.to_lowercase().contains("gainesville, fl") {
                let mls = elem.select("span.mls").unwrap().next();
                let price = elem.select("span.price").unwrap().next();

                if let (Some(mls), Some(price)) = (mls, price) {
                    let mls = mls.text_contents();
                    let price = price.text_contents();

                    scope.execute(move || {
                        check_block_and_parking(mls, price, http_client, redis_client);
                    });
                }
            }
        }
    });
}

fn check_block_and_parking(mls: String, price: String, http_client: &Client, redis_client: &RedisClient) {
    let redis_conn = redis_client.get_connection().unwrap();

    if redis_conn.hexists("mls", &*mls).unwrap() {
        return;
    }

    let () = redis_conn.hset("mls", &*mls, &*price).unwrap();

    let res = http_client.post(&format!("{}/gan/idx/detail.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(&format!("key={}&mls={}&gallery=false&custom=", KEY, mls))
        .send()
        .unwrap();

    let document = kuchiki::parse_html().from_http(res).unwrap();
    let mut has_parking = true;
    let mut has_block = false;

    // ... (truncated in the source)
```

```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::{NodeDataRef, ElementData};
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;
use std::error::Error;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    if let Err(e) = inner_main() {
        println!("Processing failed: {}", e);
        std::process::exit(1);
    }
}

fn inner_main() -> Result<(), Box<Error>> {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_MST_list_prcYNNE=175000&LM_MST_prop_cdYNNL[]=9&LM_MST_sqft_nYNNB=&LM_MST_sqft_nYNNE=&LM_MST_yr_bltYNNB=&LM_MST_yr_bltYNNE=&LM_MST_bdrmsYNNB=3&LM_MST_bdrmsYNNE=&LM_MST_bathsYNNB=2&LM_MST_bathsYNNE=&LM_MST_hbathYNNB=&LM_MST_hbathYNNE=&LM_MST_countyYNCL[]=ALA&LM_MST_str_noY1CS=&LM_MST_str_namY1VZ=&LM_MST_remarksY1VZ=&openHouseStartDt_B=&openHouseStartDt_E=&ve_info=&ve_rgns=1&LM_MST_LATXX6I=&poi=&count=1&isLink=0&custom=", KEY);

    let dsn = try!(env::var("REDIS_DSN"));
    let redis_client = &try!(redis::Client::open(&*dsn));

    let res = try!(http_client.post(&format!("{}/gan/idx/search.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(body)
        .send());

    let document = try!(kuchiki::parse_html().from_http(res));

    let cpus = num_cpus::get() * 4;
    let mut pool = Pool::new(cpus as u32);

    pool.scoped(|scope| {
        let listings = try!(document.select("table.listings").map_err(|_| "Could not select listings"));
        for listing in listings {
            let elem = listing.as_node();
            if let Some(text) = try!(elem.select("tr:nth-of-type(3)").map_err(|_| "Could not select text")).next() {
                let text = text.text_contents();

                if text.to_lowercase().contains("gainesville, fl") {
                    let mls = try!(elem.select("span.mls").map_err(|_| "Could not select mls")).next();
                    let price = try!(elem.select("span.price").map_err(|_| "Could not select price")).next();

                    // Using if let
                    if let (Some(mls), Some(price)) = (mls, price) {
                        let mls = mls.text_contents();
                        let price = price.text_contents();

                        scope.execute(move || {
                            check_block_and_parking(mls, price, http_client, redis_client).expect("The inner thread failed");
                        });
                    }
                }
            }
        }

        Ok(())
    })
}

fn check_block_and_parking(mls: String, price: String, http_client: &Client, redis_client: &RedisClient) -> Result<(), Box<Error>> {
    // ... (truncated in the source)
```

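The listing above targets Rust 1.6, where `try!` and `Box<Error>` were the usual spelling. On a current toolchain the same error-propagation shape is normally written with `?` and `Box<dyn Error>`. The following is a minimal std-only sketch of that shape; `port_from_dsn` and the DSN format it parses are hypothetical and chosen only so the example can run without Redis.

```rust
use std::error::Error;

/// Extracts the port from a DSN such as "redis://localhost:6379/".
/// A missing DSN, a missing port, and a non-numeric port all become errors
/// that the caller can report once, instead of panicking via `unwrap`.
fn port_from_dsn(dsn: Option<&str>) -> Result<u16, Box<dyn Error>> {
    // Option -> Result: `?` converts the &str message into Box<dyn Error>.
    let dsn = dsn.ok_or("REDIS_DSN is not set")?;
    let (_, port) = dsn.rsplit_once(':').ok_or("DSN has no port")?;
    // ParseIntError is converted into Box<dyn Error> by `?` as well.
    Ok(port.trim_end_matches('/').parse::<u16>()?)
}
```

A `main` function would call this with something like `std::env::var("REDIS_DSN").ok().as_deref()`, print the error, and exit with a non-zero status, mirroring the `inner_main` pattern above.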
Context

StackExchange Code Review Q#121292, answer score: 3
