Parsing HTML from multiple webpages simultaneously
Problem
My friend wrote a scraper in Go that takes the results from a house listing webpage and finds listings for houses that he's interested in. The initial search returns listings, which are filtered by city; the details of each listing are then fetched with a separate HTTP request and checked to see whether the listing is interesting. I also ported this over to Node.js.
I've written a version in Rust (version 1.6) that works, but I'm interested in a couple of things:
- Am I doing this in a Rusty way, or are there things I should be doing differently?
- Am I properly taking advantage of concurrency, specifically being able to parse multiple details pages at once? I'm particularly wondering how large to make the `Pool`.
- This version is several times slower than the Go/Node.js versions (they are both about the same speed). Is there something I'm missing that's taking up a lot of additional time? (I'm using `cargo run --release` or just building for release.)
- I am creating and passing in the Redis client and HTTP client to the `check_block_and_parking` function each time, but it would make more sense not to pass these in and instead declare them globally or perhaps make them properties of an object. Suggestions?
- Is there any way to limit the number of simultaneous connections from the connection pool?
https://github.com/ajcrites/gainesvillemls-scraper-rust
```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_M
    // (the listing is truncated here in the source)
```
Solution
Overall, the code looks close to how I would write it. A few smaller points:
- Use `contains(...)` instead of `find(...) != None`. It has the potential to be faster, but more importantly it's more understandable and slightly shorter.
- You can use `let () =` instead of `let _: () =`. Both are pretty strange to see, but that appears to be an artifact of how flexible Redis is.
- The lines that look at the following span are really long, hard to read, and duplicated. I'd recommend creating a helper function.
- There are many places with `unwrap`. I can't really judge how valid it is to abort this entire program if things are missing, but the density of `unwrap` per line is much higher than I'm used to.
- You could combine items with functions like `and_then`. Unfortunately, the code often has to combine `Result` and `Option`, which makes this uglier.
- You could make more code conditional using `match` or `if let` statements.

Am I properly taking advantage of concurrency, specifically being able to parse multiple details pages at once? Particularly wondering how large to make the `Pool`.

Yes, it looks quite reasonable. In this application, I'd expect it to be more IO bound than CPU bound. There are good guides out there, but generally you'd want to have more worker threads than you have CPUs, as each thread is likely to be waiting for IO to happen.

This version is several times slower than the Go/Node.js versions (they are both about the same speed). Is there something I'm missing that's taking up a lot of additional time? (using `cargo run --release` or just building for release).

You didn't say how much time either of your versions took, so there's not much I can say here. Running the program locally took:
```
real 0m5.146s
user 0m1.139s
sys 0m0.678s
```
Which seems pretty fast.

I am creating and passing in the Redis client and HTTP client to the `check_block_and_parking` function each time but it would make more sense not to pass these in but declare them globally or perhaps make them properties of an object.

No, it would not make sense to make them global. A giant strength of Rust is that you can see where items are valid and have a very good sense of where memory will be allocated and deallocated. When you make globals, you lose that.
Additionally, global mutable items (which may not be needed here) are even more of a pain because they require a mutex or equivalent.
Making a new struct that holds other things could be useful though.

Is there any way to limit the number of simultaneous connections from the connection pool?

I'm not sure what you are asking here. You don't have a connection pool; you have a thread pool. Each thread makes HTTP connections, but these connections are not saved or reused. Currently, your maximum connection count is the number of threads.
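Because the worker count is the effective cap on simultaneous connections, that cap can be made explicit. The following is only a rough sketch using the standard library rather than `scoped_threadpool`; it assumes a recent toolchain (`std::thread::scope` is not available in Rust 1.6), and `run_bounded`, the job strings, and the closure standing in for an HTTP request are all hypothetical.

```rust
use std::sync::Mutex;
use std::thread;

/// Processes `jobs` on at most `workers` threads and returns the sorted results.
/// Each worker handles one job at a time, so `workers` is also the upper bound
/// on how many requests can be in flight at once.
fn run_bounded<F>(jobs: Vec<String>, workers: usize, work: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    let queue = Mutex::new(jobs);
    let results = Mutex::new(Vec::new());

    thread::scope(|scope| {
        for _ in 0..workers.max(1) {
            scope.spawn(|| loop {
                // Hold the queue lock only long enough to take one job.
                let job = queue.lock().unwrap().pop();
                match job {
                    Some(job) => {
                        let out = work(&job);
                        results.lock().unwrap().push(out);
                    }
                    None => break,
                }
            });
        }
    });

    let mut results = results.into_inner().unwrap();
    results.sort();
    results
}

fn main() {
    // IO-bound work: oversubscribe the CPUs, as the code under review does.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1) * 4;
    let jobs: Vec<String> = (1..=5).map(|n| format!("mls-{:02}", n)).collect();
    // The closure is a placeholder for "fetch and parse one detail page".
    let pages = run_bounded(jobs, workers, |mls| format!("parsed {}", mls));
    println!("{:?}", pages);
}
```

Lowering `workers` lowers the maximum number of concurrent requests without changing anything else in the program.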
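The suggestion to group the clients in a struct, rather than making them global, might look roughly like the sketch below. `HttpClient` and `SeenStore` are hypothetical stand-ins for `hyper::Client` and the Redis client so that the example compiles without external crates; `Scraper` and `check_listing` are likewise invented names, not part of the reviewed code.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for hyper::Client.
struct HttpClient;

impl HttpClient {
    fn fetch_detail(&self, mls: &str) -> String {
        format!("detail page for {}", mls)
    }
}

// Hypothetical stand-in for the Redis client that records seen listings.
struct SeenStore {
    seen: HashSet<String>,
}

impl SeenStore {
    fn already_seen(&self, mls: &str) -> bool {
        self.seen.contains(mls)
    }
}

/// Bundles the shared clients so worker code takes one value instead of two
/// loose parameters. The borrows keep ownership and lifetimes visible.
struct Scraper<'a> {
    http: &'a HttpClient,
    store: &'a SeenStore,
}

impl<'a> Scraper<'a> {
    /// Returns the detail page for a listing, or None if it was seen before.
    fn check_listing(&self, mls: &str) -> Option<String> {
        if self.store.already_seen(mls) {
            return None;
        }
        Some(self.http.fetch_detail(mls))
    }
}

fn main() {
    let http = HttpClient;
    let store = SeenStore { seen: HashSet::new() };
    let scraper = Scraper { http: &http, store: &store };
    println!("{:?}", scraper.check_listing("12345"));
}
```

The clients are still created in one visible place and dropped when that scope ends, which keeps the property the answer values while shortening the function signatures.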
Code Snippets
```
real 0m5.146s
user 0m1.139s
sys 0m0.678s
```
```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::{NodeDataRef, ElementData};
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_MST_list_prcYNNE=175000&LM_MST_prop_cdYNNL[]=9&LM_MST_sqft_nYNNB=&LM_MST_sqft_nYNNE=&LM_MST_yr_bltYNNB=&LM_MST_yr_bltYNNE=&LM_MST_bdrmsYNNB=3&LM_MST_bdrmsYNNE=&LM_MST_bathsYNNB=2&LM_MST_bathsYNNE=&LM_MST_hbathYNNB=&LM_MST_hbathYNNE=&LM_MST_countyYNCL[]=ALA&LM_MST_str_noY1CS=&LM_MST_str_namY1VZ=&LM_MST_remarksY1VZ=&openHouseStartDt_B=&openHouseStartDt_E=&ve_info=&ve_rgns=1&LM_MST_LATXX6I=&poi=&count=1&isLink=0&custom=", KEY);
    let redis_client = &redis::Client::open(&*env::var("REDIS_DSN").unwrap()).unwrap();

    // Fetch and parse the search results page.
    let res = http_client.post(&format!("{}/gan/idx/search.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(body)
        .send()
        .unwrap();
    let document = kuchiki::parse_html().from_http(res).unwrap();

    // Use more threads than CPUs, since each thread mostly waits on IO.
    let cpus = num_cpus::get() * 4;
    let mut pool = Pool::new(cpus as u32);

    pool.scoped(|scope| {
        for listing in document.select("table.listings").unwrap() {
            let elem = listing.as_node();
            let text = elem.select("tr:nth-of-type(3)").unwrap().next().unwrap().text_contents();
            if text.to_lowercase().contains("gainesville, fl") {
                let mls = elem.select("span.mls").unwrap().next();
                let price = elem.select("span.price").unwrap().next();
                if let (Some(mls), Some(price)) = (mls, price) {
                    let mls = mls.text_contents();
                    let price = price.text_contents();
                    // Check each matching listing's detail page on a pool thread.
                    scope.execute(move || {
                        check_block_and_parking(mls, price, http_client, redis_client);
                    });
                }
            }
        }
    });
}

fn check_block_and_parking(mls: String, price: String, http_client: &Client, redis_client: &RedisClient) {
    let redis_conn = redis_client.get_connection().unwrap();
    // Skip listings that are already recorded in Redis.
    if redis_conn.hexists("mls", &*mls).unwrap() {
        return;
    }
    let () = redis_conn.hset("mls", &*mls, &*price).unwrap();

    let res = http_client.post(&format!("{}/gan/idx/detail.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(&format!("key={}&mls={}&gallery=false&custom=", KEY, mls))
        .send()
        .unwrap();
    let document = kuchiki::parse_html().from_http(res).unwrap();

    let mut has_parking = true;
    let mut has_block = false;
    // ... (the rest of this function is truncated in the source)
}
```
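One way to thin out the `unwrap` calls in the listing above is to fold an `Option` and a `Result` into a single `Result` with `ok_or_else` and `and_then`, as the bullet about `and_then` hints. This is a minimal sketch under assumptions: `Listing` is a hypothetical map standing in for a parsed `kuchiki` node, and `price_of` is an invented helper, not something from the reviewed program.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a parsed listing: field name -> raw text.
type Listing = HashMap<&'static str, &'static str>;

/// Looks up the "price" field and parses it as a whole dollar amount.
/// The lookup yields an Option and the parse yields a Result; `ok_or_else`
/// turns the former into a Result so `and_then` can chain the two.
fn price_of(listing: &Listing) -> Result<u32, String> {
    listing
        .get("price")
        .ok_or_else(|| "no price field".to_string())
        .and_then(|raw| {
            raw.trim_start_matches('$')
                .replace(',', "")
                .parse::<u32>()
                .map_err(|e| format!("bad price {:?}: {}", raw, e))
        })
}

fn main() {
    let mut listing = Listing::new();
    listing.insert("price", "$150,000");
    match price_of(&listing) {
        Ok(price) => println!("price: {}", price),
        Err(e) => println!("skipping listing: {}", e),
    }
}
```

The caller then decides in one place whether a missing or malformed field should abort the program or just skip the listing.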
```
extern crate scoped_threadpool;
extern crate hyper;
extern crate kuchiki;
extern crate redis;
extern crate num_cpus;

use hyper::Client;
use hyper::header::ContentType;
use kuchiki::{NodeDataRef, ElementData};
use kuchiki::traits::ParserExt;
use scoped_threadpool::Pool;
use redis::Commands;
use redis::Client as RedisClient;
use std::env;
use std::error::Error;

const KEY: &'static str = "52633f4973cf845e55b18c8e22ab08d5";
const SEARCH_HOST: &'static str = "http://www.gainesvillemls.com";

fn main() {
    // Report any error once, at the top level, and exit with a failure code.
    if let Err(e) = inner_main() {
        println!("Processing failed: {}", e);
        std::process::exit(1);
    }
}

fn inner_main() -> Result<(), Box<Error>> {
    let http_client = &Client::new();
    let body = &format!("key={}&LM_MST_prop_fmtYNNT=1&LM_MST_prop_cdYYNT=1,9,10,11,12,13,14&LM_MST_mls_noYYNT=&LM_MST_list_prcYNNB=&LM_MST_list_prcYNNE=175000&LM_MST_prop_cdYNNL[]=9&LM_MST_sqft_nYNNB=&LM_MST_sqft_nYNNE=&LM_MST_yr_bltYNNB=&LM_MST_yr_bltYNNE=&LM_MST_bdrmsYNNB=3&LM_MST_bdrmsYNNE=&LM_MST_bathsYNNB=2&LM_MST_bathsYNNE=&LM_MST_hbathYNNB=&LM_MST_hbathYNNE=&LM_MST_countyYNCL[]=ALA&LM_MST_str_noY1CS=&LM_MST_str_namY1VZ=&LM_MST_remarksY1VZ=&openHouseStartDt_B=&openHouseStartDt_E=&ve_info=&ve_rgns=1&LM_MST_LATXX6I=&poi=&count=1&isLink=0&custom=", KEY);
    let dsn = try!(env::var("REDIS_DSN"));
    let redis_client = &try!(redis::Client::open(&*dsn));

    let res = try!(http_client.post(&format!("{}/gan/idx/search.php", SEARCH_HOST))
        .header(ContentType::form_url_encoded())
        .body(body)
        .send());
    let document = try!(kuchiki::parse_html().from_http(res));

    let cpus = num_cpus::get() * 4;
    let mut pool = Pool::new(cpus as u32);

    pool.scoped(|scope| {
        let listings = try!(document.select("table.listings").map_err(|_| "Could not select listings"));
        for listing in listings {
            let elem = listing.as_node();
            if let Some(text) = try!(elem.select("tr:nth-of-type(3)").map_err(|_| "Could not select text")).next() {
                let text = text.text_contents();
                if text.to_lowercase().contains("gainesville, fl") {
                    let mls = try!(elem.select("span.mls").map_err(|_| "Could not select mls")).next();
                    let price = try!(elem.select("span.price").map_err(|_| "Could not select price")).next();
                    // Using if let
                    if let (Some(mls), Some(price)) = (mls, price) {
                        let mls = mls.text_contents();
                        let price = price.text_contents();
                        scope.execute(move || {
                            check_block_and_parking(mls, price, http_client, redis_client).expect("The inner thread failed");
                        });
                    }
                }
            }
        }
        Ok(())
    })
}

fn check_block_and_parking(mls: String, price: String, http_client: &Client, redis_client: &RedisClient) -> Result<(), Box<Error>> {
    // ... (the body of this function is truncated in the source)
}
```
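For readers on a current toolchain, the `inner_main` pattern above is usually written with the `?` operator and `Box<dyn Error>` instead of `try!` and `Box<Error>`. The sketch below shows the same idea in isolation; `parse_workers` and `run` are hypothetical names, and the hard-coded input only keeps the example self-contained.

```rust
use std::error::Error;

/// Parses a worker count from an optional raw setting.
/// `?` converts both the &str message and the ParseIntError into
/// Box<dyn Error>, much as try! does in the listing above.
fn parse_workers(raw: Option<&str>) -> Result<u32, Box<dyn Error>> {
    let raw = raw.ok_or("worker count is not set")?;
    let n: u32 = raw.trim().parse()?;
    if n == 0 {
        return Err("worker count must be at least 1".into());
    }
    Ok(n)
}

fn run() -> Result<(), Box<dyn Error>> {
    // A fixed input for illustration; a real program might read an
    // environment variable here instead.
    let workers = parse_workers(Some("8"))?;
    println!("using {} workers", workers);
    Ok(())
}

fn main() {
    // Same shape as the main/inner_main pair above: report once, then exit.
    if let Err(e) = run() {
        eprintln!("Processing failed: {}", e);
        std::process::exit(1);
    }
}
```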
Context
StackExchange Code Review Q#121292, answer score: 3