HiveBrain v1.2.0

Spliterator implementation

Submitted by: @import:stackexchange-codereview
implementation · spliterator · stackoverflow

Problem

I'm trying to post a little tutorial on the new Spliterator interface. There are many tutorials these days on using streams starting from a standard Java collection, but I want to explore the creation of a stream using data coming from the web (so no stream fully backed by a collection).

I decided to implement an infinite URL stream starting from an initial address. I scrape the initial page and provide client code with the URLs I found in that page. Then I repeat the scraping with the new URLs. URLs that have already been returned to the client are not repeated.

UrlScraper

```
package com.stackexchange.codereview.webscrapingwithstream;

import java.io.IOException;
import java.util.*;
import java.util.function.Consumer;
import java.util.stream.*;

import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class UrlScraper {

    private String url;
    private Set<String> index = Collections.synchronizedSet(new TreeSet<>());
    private List<String> startingReferences = new ArrayList<>();

    public UrlScraper(String url) {
        this.url = url;
    }

    public Stream<String> stream() {
        startingReferences.add(url);
        index.add(url);
        return StreamSupport.stream(new UrlSpliterator(startingReferences, index), false);
    }

    public Stream<String> parallelStream() {
        startingReferences.add(url);
        index.add(url);
        return StreamSupport.stream(new UrlSpliterator(startingReferences, index), true);
    }

    private static class UrlSpliterator implements Spliterator<String> {

        private static final int THRESHOLD = 10;
        // variable list of urls in the spliterator
        List<String> refs;
        // distinct set of already known urls
        Set<String> index;
        // # urls returned by the stream
        int examined = 0;
        // position of last scraped url
        int scraped = -1;

        UrlSpliterator(List<String> refs, Set<String> index) {
            this.refs = refs;
            this.index = index;
        }

        @Override
        // ... (the rest of the class is truncated here)
```

Solution

I see a few, actually, a number of problems in this code. I am afraid this will be something of a scathing review, in part because you mention this is intended to be for a tutorial....

I see a number of, call them, 'critical' issues, and then also a number of lesser issues. The critical ones first:

Use Case

Why not use an Iterator? Since an Iterator would do this job, and is the traditional 'vehicle', you should indicate why a Spliterator is the right tool for the job. Note that Iterators are relatively well understood, and there is a clear wrapper path from Iterator to Spliterator using Spliterators.spliteratorUnknownSize().

The obvious answer to "Why not use an Iterator?" is either:

  • it's a tutorial

  • for better control of parallel execution
In either of these cases, the code should be 'exemplar', and you should clarify exactly why this code is needed... but, if the answer includes the 'parallel execution' reason, then the concurrency had better be accurate.
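For comparison, the wrapper path is short. Here is a minimal sketch of it (my own illustration, not from the question: the `urls()` iterator is a stand-in for whatever iterator a scraper would expose):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class IteratorWrapperDemo {

    // A plain Iterator over some element source; a real scraper would
    // lazily fetch pages and yield discovered URLs here instead.
    static Iterator<String> urls() {
        return List.of("http://a.example", "http://b.example", "http://c.example").iterator();
    }

    // Wrap the Iterator in a Spliterator of unknown size, then build a Stream.
    static Stream<String> urlStream() {
        Spliterator<String> split = Spliterators.spliteratorUnknownSize(
                urls(), Spliterator.ORDERED | Spliterator.NONNULL);
        return StreamSupport.stream(split, false);
    }

    public static void main(String[] args) {
        System.out.println(urlStream().collect(Collectors.toList()));
    }
}
```

The characteristics flags (`ORDERED | NONNULL` here) are whatever actually holds for the source; a wrapped iterator like this can still run in parallel, just with less control over how the work is partitioned.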

forEachRemaining()

One of the performance advantages of Spliterator is the forEachRemaining() method, and you have chosen not to implement it. Why?
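For reference, the default forEachRemaining() just calls tryAdvance() in a loop; overriding it lets an implementation drain its remaining elements directly. A hypothetical sketch (not the poster's class), using a queue-backed spliterator:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

public class ForEachRemainingDemo {

    // A queue-backed spliterator that overrides forEachRemaining()
    // to drain the queue in one tight loop instead of going through
    // the per-element tryAdvance() protocol.
    static class QueueSpliterator extends Spliterators.AbstractSpliterator<String> {
        private final Deque<String> queue;

        QueueSpliterator(Deque<String> queue) {
            super(Long.MAX_VALUE, Spliterator.ORDERED);
            this.queue = queue;
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            String next = queue.poll();
            if (next == null) {
                return false;
            }
            action.accept(next);
            return true;
        }

        @Override
        public void forEachRemaining(Consumer<? super String> action) {
            // Drain everything that's left - no per-element call overhead.
            for (String s; (s = queue.poll()) != null; ) {
                action.accept(s);
            }
        }
    }

    static List<String> drain() {
        Deque<String> q = new ArrayDeque<>(List.of("x", "y", "z"));
        List<String> out = new ArrayList<>();
        new QueueSpliterator(q).forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain());
    }
}
```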

Queues are Queues

Your code really boils down to being a queue-processing tool. The refs list is a queue: you have content on it, you process it from the beginning, and you add stuff to the end. Occasionally you split the queue.

Use a Queue (Deque).

Specifically, you should probably use a LinkedList (which is a Deque).

If you use a queue, there is no need for the examined variable, and code like the trySplit becomes (here, refs is a LinkedList):

```
@Override
public Spliterator<String> trySplit() {
    int mid = refs.size() / 2;
    if (mid < 1) {
        return null;
    }

    List<String> toclear = refs.subList(mid, refs.size());
    LinkedList<String> newrefs = new LinkedList<>(toclear);
    toclear.clear();
    return new UrlSpliterator(newrefs, index);
}
```


Threshold Logic

There is no apparent reason to have the weird logic in the
tryAdvance() for the 'scrape() call.

Something simple like:

```
if (examined < refs.size()) {
    String url = refs.get(examined++);
    analyze(url);
    consumer.accept(url);
    return true;
}
return false;
```


Concurrency bugs

The custom wrapper class introduces a significant concurrency bug. The UrlScraper contains the control information for the entire spliteration. The use case you present is:

```
UrlScraper urlScraper = new UrlScraper("http://www.wikipedia.org");
urlScraper.parallelStream()
          .forEach(System.out::println);
```


This code can be very easily broken by calling stream() and then parallelStream().....

```
UrlScraper urlScraper = new UrlScraper("http://www.wikipedia.org");
urlScraper.stream()
          .forEach(System.out::println);
urlScraper.parallelStream()
          .forEach(System.out::println);
```


I would expect the second stream to produce the same results as the first, but the second stream will produce... from what I can tell, the same content, plus an extra copy of the start URL. It will also re-scrape the last THRESHOLD number of pages (but, assuming the pages did not change, it will ignore the scraped URLs).

Odder things will happen if you call parallelStream multiple times, from multiple different threads...
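One way to avoid this whole class of bug (my suggestion, not something the answer spells out) is to build fresh control structures on every call, so no two streams share state. A toy sketch, with the actual scraping replaced by a deterministic "discover two children per item" rule so it is self-contained and finite:

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import java.util.Spliterator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class FreshStateDemo {

    private final String start;

    public FreshStateDemo(String start) {
        this.start = start;
    }

    // Every call builds its own queue and index, so one stream's
    // progress can never leak into another stream.
    public Stream<String> stream() {
        LinkedList<String> refs = new LinkedList<>();
        Set<String> index = ConcurrentHashMap.newKeySet();
        refs.add(start);
        index.add(start);
        return StreamSupport.stream(new ToySpliterator(refs, index), false);
    }

    // A toy stand-in for the poster's UrlSpliterator: instead of scraping
    // pages, each item "links to" two children, up to a fixed name length.
    private static class ToySpliterator implements Spliterator<String> {
        private final LinkedList<String> refs;
        private final Set<String> index;

        ToySpliterator(LinkedList<String> refs, Set<String> index) {
            this.refs = refs;
            this.index = index;
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            String next = refs.poll();
            if (next == null) {
                return false;
            }
            if (next.length() < 3) {                     // depth limit for the toy
                for (String child : List.of(next + "a", next + "b")) {
                    if (index.add(child)) {              // atomic dedupe
                        refs.add(child);
                    }
                }
            }
            action.accept(next);
            return true;
        }

        @Override
        public Spliterator<String> trySplit() { return null; }  // sequential-only sketch

        @Override
        public long estimateSize() { return Long.MAX_VALUE; }

        @Override
        public int characteristics() { return NONNULL; }
    }

    public static void main(String[] args) {
        FreshStateDemo demo = new FreshStateDemo("r");
        List<String> first = demo.stream().collect(Collectors.toList());
        List<String> second = demo.stream().collect(Collectors.toList());
        System.out.println(first.equals(second));   // independent streams agree
    }
}
```

With per-call state, stream() followed by parallelStream() simply does the traversal twice from scratch, which is what a caller would expect.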

You have a race condition in your analyze code. The following is broken:

```
for (Element link : links) {
    String newUrl = (String) link.attr("abs:href");
    if (!index.contains(newUrl)) {
        refs.add(newUrl);
        index.add(newUrl);
    }
}
```


You are not supposed to add a newUrl to the refs multiple times. But, it is possible that two threads, each processing a page, both pages with a link to the same URL, will hit this code at the same time.

Both threads may call !index.contains(newUrl) at 'the same time', and both threads will add the newUrl to the refs (even though the refs are different lists in each thread, the index is not). Only one of the threads will successfully add the newUrl to the index, though.

You can use this to your advantage with the Set.add() return value:

```
for (Element link : links) {
    String newUrl = (String) link.attr("abs:href");
    if (index.add(newUrl)) {
        refs.add(newUrl);
    }
}
```
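The reason this works: Set.add() is a single test-and-insert operation (atomic for a concurrent set) that returns true only for the caller that actually inserted the element. A small standalone illustration (the dedupe() helper is hypothetical, and single-threaded for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SetAddDemo {

    // Only the caller whose add() returned true enqueues the URL;
    // every later add() of the same value returns false.
    static List<String> dedupe(List<String> found) {
        Set<String> index = ConcurrentHashMap.newKeySet();
        List<String> refs = new ArrayList<>();
        for (String url : found) {
            if (index.add(url)) {       // atomic test-and-insert
                refs.add(url);
            }
        }
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("a", "b", "a", "c", "b")));
    }
}
```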


Code Style

You should declare variables where they are used. try ... catch blocks sometimes mess things up, but in this case, they do not. The following code:

```
Document doc;
Elements links = null;
try {

    doc = Jsoup.connect(aUrl).get();

    links = doc.select("a[href]");
```

should be:

```
try {
    Document doc = Jsoup.connect(aUrl).get();
    Elements links = doc.select("a[href]");
```


When you have 'early-return' methods (which are a good thing...), you should have conditionals without else-blocks.


Context

StackExchange Code Review Q#52050, answer score: 22
