Spliterator implementation
Problem
I'm trying to post a little tutorial on the new Spliterator class. There are many tutorials these days on using streams starting from a standard Java collection, but I want to explore the creation of a stream using data coming from the web (so no stream fully backed by a collection).
I decided to implement an infinite URL stream coming from a starting address. What I did is scrape the initial page and provide client code with the URLs I found on that page. Then I repeat the scraping with the new URLs. URLs that are returned to the client are not repeated.
UrlScraper
```
package com.stackexchange.codereview.webscrapingwithstream;

import java.io.IOException;
import java.util.*;
import java.util.function.Consumer;
import java.util.stream.*;

import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class UrlScraper {

    private String url;
    private Set<String> index = Collections.synchronizedSet(new TreeSet<>());
    private List<String> startingReferences = new ArrayList<>();

    public UrlScraper(String url) {
        this.url = url;
    }

    public Stream<String> stream() {
        startingReferences.add(url);
        index.add(url);
        return StreamSupport.stream(new UrlSpliterator(startingReferences, index), false);
    }

    public Stream<String> parallelStream() {
        startingReferences.add(url);
        index.add(url);
        return StreamSupport.stream(new UrlSpliterator(startingReferences, index), true);
    }

    private static class UrlSpliterator implements Spliterator<String> {

        private static final int THRESHOLD = 10;

        // variable list of urls in the spliterator
        List<String> refs;
        // distinct set of already known urls
        Set<String> index;
        // # urls returned by the stream
        int examined = 0;
        // position of last scraped url
        int scraped = -1;

        UrlSpliterator(List<String> refs, Set<String> index) {
            this.refs = refs;
            this.index = index;
        }

        @Override
        // …
```
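The rest of the class is cut off above. For orientation, any Spliterator implementation must also provide tryAdvance(), trySplit(), estimateSize() and characteristics(); a minimal sketch of the two bookkeeping methods, assuming the effectively unbounded, de-duplicated stream the question describes (my sketch, not the question's original code):
```
// Sketch only -- the question's own implementations are truncated above.
@Override
public long estimateSize() {
    // The crawl is effectively unbounded, so no meaningful size estimate exists.
    return Long.MAX_VALUE;
}

@Override
public int characteristics() {
    // URLs are de-duplicated via the index set and never null.
    return DISTINCT | NONNULL;
}
```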
Solution
I see a few, actually, a number of problems in this code. I am afraid this will be something of a scathing review, in part because you mention this is intended to be for a tutorial....
I see a number of, call them 'critical', issues, and also a number of lesser issues. The critical ones first:
Use Case
Why not use an Iterator? Since an Iterator would do this job, and is the traditional 'vehicle', you should indicate why a Spliterator is the right tool for the job. Note that Iterators are relatively well understood, and there is a clear wrapper-path from Iterator to Spliterator using Spliterators.spliteratorUnknownSize() (see the sketch after the list below).
The obvious answer to "Why not use an Iterator?" is either:
- it's a tutorial
- for better control of parallel execution
In each of these cases, the code should be 'exemplar', and you should clarify exactly why this code is needed... but, if the answer includes the 'parallel execution' reason, then the concurrency had better be accurate.
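For reference, a minimal sketch of that wrapper-path; the iterator below is a stand-in I made up, not the question's scraper:
```
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class IteratorToStream {
    public static void main(String[] args) {
        // Any Iterator works; this one just fabricates an endless series of URLs.
        Iterator<String> urls = new Iterator<String>() {
            private int n = 0;
            @Override public boolean hasNext() { return true; }
            @Override public String next() { return "http://example.com/page/" + n++; }
        };

        // The wrapper-path: Iterator -> Spliterator of unknown size -> Stream.
        Spliterator<String> spliterator = Spliterators.spliteratorUnknownSize(
                urls, Spliterator.DISTINCT | Spliterator.NONNULL);
        Stream<String> stream = StreamSupport.stream(spliterator, false);

        stream.limit(3).forEach(System.out::println);
    }
}
```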
forEachRemaining()
One of the performance advantages of Spliterator is the forEachRemaining() method, and you have chosen not to implement it. Why? A sketch of an override is shown below.
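This is my sketch, not the answer's: an override that drains the queue in one tight loop instead of going through repeated tryAdvance() calls. The refs and examined fields and analyze() are taken from the question's class:
```
@Override
public void forEachRemaining(Consumer<? super String> consumer) {
    // One loop over everything still queued, including URLs
    // that analyze() appends along the way.
    while (examined < refs.size()) {
        String url = refs.get(examined++);
        analyze(url);
        consumer.accept(url);
    }
}
```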
Your code really boils down to being a queue-processing tool. The refs is a queue: you have content on it, you process it from the beginning, and you add new items to the end. Occasionally you split the queue.
Use a Queue (Deque).
Specifically, you should probably use a LinkedList (which is a Deque).
If you use a queue, there is no need for the examined variable, and code like the trySplit becomes (here, refs is a LinkedList):
```
@Override
public Spliterator<String> trySplit() {
    int mid = refs.size() / 2;
    if (mid < 1) {
        return null;
    }
    List<String> toclear = refs.subList(mid, refs.size());
    LinkedList<String> newrefs = new LinkedList<>(toclear);
    toclear.clear();
    return new UrlSpliterator(newrefs, index);
}
```
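Note that subList() returns a view backed by refs, so toclear.clear() removes that back half from the original queue; the copy into newrefs has to happen before the clear() call.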
Threshold Logic
There is no apparent reason to have the weird logic in the tryAdvance() for the scrape() call. Something simple like:
```
if (examined < refs.size()) {
    String url = refs.get(examined++);
    analyze(url);
    consumer.accept(url);
    return true;
}
return false;
```
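Taking the queue suggestion to its conclusion, tryAdvance() could drop the index bookkeeping entirely. A sketch, assuming refs is a LinkedList as above (my extrapolation, not a snippet from the original answer):
```
@Override
public boolean tryAdvance(Consumer<? super String> consumer) {
    // poll() removes and returns the head of the queue, or null when empty.
    String url = refs.poll();
    if (url == null) {
        return false;
    }
    analyze(url);          // may append newly discovered URLs to the queue
    consumer.accept(url);
    return true;
}
```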
Concurrency bugs
- The custom wrapper class introduces a significant concurrency bug. The UrlScraper contains the control information for the entire spliteration. The use case you present is:
```
UrlScraper urlScraper = new UrlScraper("http://www.wikipedia.org");
urlScraper.parallelStream()
          .forEach(System.out::println);
```
This code can be very easily broken by calling stream() and then parallelStream():
```
UrlScraper urlScraper = new UrlScraper("http://www.wikipedia.org");
urlScraper.stream()
          .forEach(System.out::println);
urlScraper.parallelStream()
          .forEach(System.out::println);
```
I would expect the second stream to produce the same results as the first, but the second stream will produce..... from what I can tell, the same content, plus an extra copy of the start URL. It will also re-scrape the last THRESHOLD number of pages (but, assuming the pages did not change, it will ignore the scraped URLs).
Odder things will happen if you call parallelStream() multiple times, from multiple different threads...
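One way to remove the shared control state would be to build fresh spliterator state on every call, so separate streams cannot interfere. A sketch of mine, not a fix proposed in the answer:
```
// Inside UrlScraper: each stream gets its own queue and its own index.
public Stream<String> stream(boolean parallel) {
    LinkedList<String> refs = new LinkedList<>();
    Set<String> index = Collections.synchronizedSet(new TreeSet<>());
    refs.add(url);
    index.add(url);
    return StreamSupport.stream(new UrlSpliterator(refs, index), parallel);
}
```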
- You have a race condition in your analyze code. The following is broken:
```
for (Element link : links) {
    String newUrl = (String) link.attr("abs:href");
    if (!index.contains(newUrl)) {
        refs.add(newUrl);
        index.add(newUrl);
    }
}
```
You are not supposed to add a newUrl to the refs multiple times. But it is possible that two threads, each processing a page, with both pages linking to the same URL, will hit this code at the same time. Both threads may call !index.contains(newUrl) at 'the same time', and both will add the newUrl to their refs (even though the refs are different lists in each thread, the index is not). Only one of the threads will successfully add the newUrl to the index, though. You can use this to your advantage with the Set.add() return value:
```
for (Element link : links) {
    String newUrl = (String) link.attr("abs:href");
    if (index.add(newUrl)) {
        refs.add(newUrl);
    }
}
```
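This works because add() on the synchronized set is atomic and reports whether the set actually changed: for any given URL, exactly one thread sees true, so each discovered URL is queued once across all threads.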
Code Style
You should declare variables where they are used. try ... catch blocks sometimes mess things up, but in this case, they do not. The following code:
```
Document doc;
Elements links = null;
try {
    doc = Jsoup.connect(aUrl).get();
    links = doc.select("a[href]");
```
should be:
```
try {
    Document doc = Jsoup.connect(aUrl).get();
    Elements links = doc.select("a[href]");
```
When you have 'early-return' methods (which are a good thing...), you should have conditionals without else-blocks.... The code:
```
@Override
// …
```
Context
StackExchange Code Review Q#52050, answer score: 22