HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Parsing an HTML table using jsoup

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
jsoupparsingusinghtmltable

Problem

I am trying to parse HTML using "jsoup". This is my first time working with "jsoup" and I read some tutorials on it as well.

If you see my table, it has three ` as of now (I have shortened it down to have three table rows just for understanding purpose, but in general it will be more). Now I would like to extract Cluster Name from my table and its corresponding host name; so for example, I would extract Titan as cluster name and all its hostname whose status are down.

For
Titan cluster name, I have two hostnames machineA.abc.com and machineB.abc.com in which machineA status is up but machineB status is down.

So, I will print out
Titan as cluster name and print out machineB.abc.com as the hostname, since it is down.

I have two cluster names: one is
Titan and the other is Goldy; so I want to find all the machines which are down for Titan cluster name only.



 
 
Alert
Cluster Name
IP addr
Host Name
Type
Status
Free
Version
Restart Time
UpTime(Days)
Last probed
Last up


Hist
VI
&nbsp
Titan
10.100.111.77
machineA.abc.com

up
88%
2.0.5-SNAPSHOT
2014-07-04 01:49:08,220
381
07-14 20:01:59
07-14 20:01:59


Hist
VI
&nbsp

10.200.192.99
machineB.abc.com

down
85%
2.0.5-SNAPSHOT
2014-07-04 01:52:20,613
103
07-14 20:01:59
07-14 20:01:59


Hist
VI
&nbsp
Goldy
10.100.111.77
machineH.pqr.com

up
88%
2.0.5-SNAPSHOT
2014-07-04 01:49:08,220
381
07-14 20:01:59
07-14 20:01:59




I'd like to know if I can improve this slightly:

public static void main(String[] args) throws JSONException, IOException {
URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000

Solution

If you don't need implementation specific features, declare variables using interface types. That is, instead of:

ArrayList downServers = new ArrayList<>();


Use the List interface:

List downServers = new ArrayList<>();


You can remove the JSONException throws declaration from the main, as it will never happen.

Your main loop can be written a bit simpler, for example:

Iterator rowIterator = rows.iterator();
rowIterator.next();
boolean wasMatch = false;

while (rowIterator.hasNext()) {
    Element row = rowIterator.next();
    Elements cols = row.select("td");
    String clusterName = cols.get(3).text();

    if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
        wasMatch = true;
        if (cols.get(7).text().equals("down")) {
            downServers.add(cols.get(5).text());
        }
    } else {
        wasMatch = false;
    }
}


I would also wrap a row in a class with convenient accessors like this:

class ServerInfo {
    final Elements cols;

    ServerRowWrapper(Elements cols) {
        this.cols = cols;
    }

    String getClusterName() {
        return cols.get(3).text();
    }

    String getServerName() {
        return cols.get(5).text();
    }

    boolean isDown() {
        return cols.get(7).text().equals("down");
    }
}


which will make the main loop much more intuitive and free from hardcoded numbers like 3, 5, 7:

while (rowIterator.hasNext()) {
    ServerInfo serverInfo = new ServerInfo(rowIterator.next().select("td"));
    String clusterName = serverInfo.getClusterName();

    if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
        wasMatch = true;
        if (serverInfo.isDown()) {
            downServers.add(serverInfo.getServerName());
        }
    } else {
        wasMatch = false;
    }
}


Finally, to make future changes easier to test, add some unit tests.

As the first step, extract the middle part from your main method, to take a Document as parameter and return the list of servers that are down:

static List findServersDown(Document document) {
    List downServers = new ArrayList<>();
    // ... rest of the code from main
    return downServers;
}


Change the main method to use the new findServersDown method:

public static void mainx(String[] args) throws IOException {
    URL url = new URL("some_url");
    Document doc = Jsoup.parse(url, 3000);
    System.out.println(ServerInfoParser.findServersDown(doc));
}


Add a unit test (or more!), for example:

public class ServerInfoParserTest {
    @Test
    public void testSecondServerDown() throws IOException {
        Document doc = Jsoup.parse(new File("src/test/resources/test-data/table1.html"), "utf-8");
        List downServers = ServerInfoParser.findServersDown(doc);
        assertEquals(Arrays.asList("machineB.abc.com"), downServers);
    }
}


UPDATE (in response to your comment)

Instead of hardcoding Titan, it's better to generalize. It's easy enough to do.

Pass in a set of names to findServersDown and adjust the if condition in the loop:

static List findServersDown(Document document, Set names) {
    // ...

    while (rowIterator.hasNext()) {
        // ...

        if (wasMatch && clusterName.isEmpty() || names.contains(clusterName)) {
            // ...


Then in the unit test:

Set names = new HashSet<>(Arrays.asList("Titan"));
List downServers = ServerInfoParser.findServersDown(doc, names);


This way the method will return the list of server names in any of the clusters you specified.

If instead of a flat list of server names, if you want them grouped by the name of the cluster, that should be easy to do as well, just rework the return type to a Map> instead.

Code Snippets

ArrayList<String> downServers = new ArrayList<>();
List<String> downServers = new ArrayList<>();
Iterator<Element> rowIterator = rows.iterator();
rowIterator.next();
boolean wasMatch = false;

while (rowIterator.hasNext()) {
    Element row = rowIterator.next();
    Elements cols = row.select("td");
    String clusterName = cols.get(3).text();

    if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
        wasMatch = true;
        if (cols.get(7).text().equals("down")) {
            downServers.add(cols.get(5).text());
        }
    } else {
        wasMatch = false;
    }
}
class ServerInfo {
    final Elements cols;

    ServerRowWrapper(Elements cols) {
        this.cols = cols;
    }

    String getClusterName() {
        return cols.get(3).text();
    }

    String getServerName() {
        return cols.get(5).text();
    }

    boolean isDown() {
        return cols.get(7).text().equals("down");
    }
}
while (rowIterator.hasNext()) {
    ServerInfo serverInfo = new ServerInfo(rowIterator.next().select("td"));
    String clusterName = serverInfo.getClusterName();

    if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
        wasMatch = true;
        if (serverInfo.isDown()) {
            downServers.add(serverInfo.getServerName());
        }
    } else {
        wasMatch = false;
    }
}

Context

StackExchange Code Review Q#57436, answer score: 5

Revisions (0)

No revisions yet.