patternjavaMinor
Parsing an HTML table using jsoup
Viewed 0 times
jsoupparsingusinghtmltable
Problem
I am trying to parse HTML using "jsoup". This is my first time working with "jsoup" and I read some tutorials on it as well.
If you see my table, it has three `
Alert
Cluster Name
IP addr
Host Name
Type
Status
Free
Version
Restart Time
UpTime(Days)
Last probed
Last up
Hist
VI
 
Titan
10.100.111.77
machineA.abc.com
up
88%
2.0.5-SNAPSHOT
2014-07-04 01:49:08,220
381
07-14 20:01:59
07-14 20:01:59
Hist
VI
 
10.200.192.99
machineB.abc.com
down
85%
2.0.5-SNAPSHOT
2014-07-04 01:52:20,613
103
07-14 20:01:59
07-14 20:01:59
Hist
VI
 
Goldy
10.100.111.77
machineH.pqr.com
up
88%
2.0.5-SNAPSHOT
2014-07-04 01:49:08,220
381
07-14 20:01:59
07-14 20:01:59
URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000
If you see my table, it has three `
as of now (I have shortened it down to have three table rows just for understanding purpose, but in general it will be more). Now I would like to extract Cluster Name from my table and its corresponding host name; so for example, I would extract Titan as cluster name and all its hostname whose status are down.
For Titan cluster name, I have two hostnames machineA.abc.com and machineB.abc.com in which machineA status is up but machineB status is down.
So, I will print out Titan as cluster name and print out machineB.abc.com as the hostname, since it is down.
I have two cluster names: one is Titan and the other is Goldy; so I want to find all the machines which are down for Titan cluster name only.
Alert
Cluster Name
IP addr
Host Name
Type
Status
Free
Version
Restart Time
UpTime(Days)
Last probed
Last up
Hist
VI
 
Titan
10.100.111.77
machineA.abc.com
up
88%
2.0.5-SNAPSHOT
2014-07-04 01:49:08,220
381
07-14 20:01:59
07-14 20:01:59
Hist
VI
 
10.200.192.99
machineB.abc.com
down
85%
2.0.5-SNAPSHOT
2014-07-04 01:52:20,613
103
07-14 20:01:59
07-14 20:01:59
Hist
VI
 
Goldy
10.100.111.77
machineH.pqr.com
up
88%
2.0.5-SNAPSHOT
2014-07-04 01:49:08,220
381
07-14 20:01:59
07-14 20:01:59
I'd like to know if I can improve this slightly:
public static void main(String[] args) throws JSONException, IOException {URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000
Solution
If you don't need implementation specific features, declare variables using interface types. That is, instead of:
Use the
You can remove the
Your main loop can be written a bit simpler, for example:
I would also wrap a row in a class with convenient accessors like this:
which will make the main loop much more intuitive and free from hardcoded numbers like 3, 5, 7:
Finally, to make future changes easier to test, add some unit tests.
As the first step, extract the middle part from your
Change the
Add a unit test (or more!), for example:
UPDATE (in response to your comment)
Instead of hardcoding
Pass in a set of names to
Then in the unit test:
This way the method will return the list of server names in any of the clusters you specified.
If instead of a flat list of server names, if you want them grouped by the name of the cluster, that should be easy to do as well, just rework the return type to a
ArrayList downServers = new ArrayList<>();Use the
List interface:List downServers = new ArrayList<>();You can remove the
JSONException throws declaration from the main, as it will never happen.Your main loop can be written a bit simpler, for example:
Iterator rowIterator = rows.iterator();
rowIterator.next();
boolean wasMatch = false;
while (rowIterator.hasNext()) {
Element row = rowIterator.next();
Elements cols = row.select("td");
String clusterName = cols.get(3).text();
if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
wasMatch = true;
if (cols.get(7).text().equals("down")) {
downServers.add(cols.get(5).text());
}
} else {
wasMatch = false;
}
}I would also wrap a row in a class with convenient accessors like this:
class ServerInfo {
final Elements cols;
ServerRowWrapper(Elements cols) {
this.cols = cols;
}
String getClusterName() {
return cols.get(3).text();
}
String getServerName() {
return cols.get(5).text();
}
boolean isDown() {
return cols.get(7).text().equals("down");
}
}which will make the main loop much more intuitive and free from hardcoded numbers like 3, 5, 7:
while (rowIterator.hasNext()) {
ServerInfo serverInfo = new ServerInfo(rowIterator.next().select("td"));
String clusterName = serverInfo.getClusterName();
if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
wasMatch = true;
if (serverInfo.isDown()) {
downServers.add(serverInfo.getServerName());
}
} else {
wasMatch = false;
}
}Finally, to make future changes easier to test, add some unit tests.
As the first step, extract the middle part from your
main method, to take a Document as parameter and return the list of servers that are down:static List findServersDown(Document document) {
List downServers = new ArrayList<>();
// ... rest of the code from main
return downServers;
}Change the
main method to use the new findServersDown method:public static void mainx(String[] args) throws IOException {
URL url = new URL("some_url");
Document doc = Jsoup.parse(url, 3000);
System.out.println(ServerInfoParser.findServersDown(doc));
}Add a unit test (or more!), for example:
public class ServerInfoParserTest {
@Test
public void testSecondServerDown() throws IOException {
Document doc = Jsoup.parse(new File("src/test/resources/test-data/table1.html"), "utf-8");
List downServers = ServerInfoParser.findServersDown(doc);
assertEquals(Arrays.asList("machineB.abc.com"), downServers);
}
}UPDATE (in response to your comment)
Instead of hardcoding
Titan, it's better to generalize. It's easy enough to do. Pass in a set of names to
findServersDown and adjust the if condition in the loop:static List findServersDown(Document document, Set names) {
// ...
while (rowIterator.hasNext()) {
// ...
if (wasMatch && clusterName.isEmpty() || names.contains(clusterName)) {
// ...Then in the unit test:
Set names = new HashSet<>(Arrays.asList("Titan"));
List downServers = ServerInfoParser.findServersDown(doc, names);This way the method will return the list of server names in any of the clusters you specified.
If instead of a flat list of server names, if you want them grouped by the name of the cluster, that should be easy to do as well, just rework the return type to a
Map> instead.Code Snippets
ArrayList<String> downServers = new ArrayList<>();List<String> downServers = new ArrayList<>();Iterator<Element> rowIterator = rows.iterator();
rowIterator.next();
boolean wasMatch = false;
while (rowIterator.hasNext()) {
Element row = rowIterator.next();
Elements cols = row.select("td");
String clusterName = cols.get(3).text();
if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
wasMatch = true;
if (cols.get(7).text().equals("down")) {
downServers.add(cols.get(5).text());
}
} else {
wasMatch = false;
}
}class ServerInfo {
final Elements cols;
ServerRowWrapper(Elements cols) {
this.cols = cols;
}
String getClusterName() {
return cols.get(3).text();
}
String getServerName() {
return cols.get(5).text();
}
boolean isDown() {
return cols.get(7).text().equals("down");
}
}while (rowIterator.hasNext()) {
ServerInfo serverInfo = new ServerInfo(rowIterator.next().select("td"));
String clusterName = serverInfo.getClusterName();
if (wasMatch && clusterName.isEmpty() || clusterName.equals("Titan")) {
wasMatch = true;
if (serverInfo.isDown()) {
downServers.add(serverInfo.getServerName());
}
} else {
wasMatch = false;
}
}Context
StackExchange Code Review Q#57436, answer score: 5
Revisions (0)
No revisions yet.