pattern, java, Minor
Simplifying HTML parsing
parsing, html, simplifying
Problem
I'm working on an app for my school (not homework; an app that's going to be used by students) that's supposed to display our week schedules. I get the data from a web app, but it has no API that I can use, so I have to parse the HTML. This HTML is a serious mess (and on top of that, they use Finnish in their pages), and it seems my code has become a mess too. I'd like to simplify this code, but I don't know jsoup well enough to use the little tricks that make code simple. Most of the code is probably fine, but I'm concerned about the `parseRows()` method.
It's an Android app, and I use jsoup to parse the HTML. It's an AsyncTask; the compiler required me to do that. I've translated all variable and function names, so they might be inaccurate descriptions. In Dutch they seem fine to me though. I also hope the comments tell enough about the code to be decipherable.
Here is an example page that I parse. And here is the relevant part of the HTML, with a cleaner structure, on pastebin since the HTML was too long.
And here is the relevant code:
```
public class DownloadScheduleTask extends AsyncTask
{
    private LesroostersActivity activity;
    private Schedule schedule;
    private String debugTag = "Pxl App";

    public DownloadScheduleTask(LesroostersActivity activity)
    {
        //keep a reference to the activity
        this.activity = activity;
    }

    @Override
    protected Void doInBackground(String... URLs)
    {
        //doInBackground requires a varargs parameter, but I only ever pass one URL, so I select the first.
        String URL = URLs[0];
        try
        {
            //Download the HTML document.
            Document document = Jsoup.connect(URL).get();
            schedule = new Schedule();

            //Get the first day of this week out of the table
            SimpleDateTime firstDay = SimpleDateTime.parseDate(document.select("table th span.hdr_date font").first().text());
            schedule.setFirstDay(firstDay);
            // ...
```
Solution
At some point everyone ends up trying to parse HTML. And, in some of those cases, it is even unavoidable...
... but, some suggestions (in order of preference):
- Communicate with the server-side developers and try to create a better API for accessing the data (perhaps even direct read-only access to their database?).
- Decide whether the mobile application is really worth it. If the data is available on the web, can you not just browse to it using your device's web browser and not build anything? Doing so would:
  - save you a lot of headaches
  - give you some time in your life to answer questions on CodeReview!
  - be iPhone compatible too, as a bonus!
- Perhaps collaborate with the actual host side of the system, so that the functionality of the web side is extended and covers the features you need. Special bonuses here are:
  - you don't need to maintain the application, it is 'theirs'.
  - it is also multi-platform compatible (including iOS, PC, and anything in the future)
- Suck it up and do the hard work of parsing the HTML...
OK, so you decide to dive into the HTML. I have some recommendations for that too:
Abstract things away
HTML and web pages in general are not an API... they are a display mechanism. Believing that the web interface will stay constant is a failing belief.... so, you need to create your own API.
Your application needs to access this API to get the data. The API needs to be as simple as possible, but likely something like:
```
public abstract class ScheduleFactory {

    public static ScheduleFactory newInstance(String datasource) {
        // datasource may be something like a URL, whatever....
        // Placeholder so the sketch compiles: return the concrete factory for this datasource here.
        throw new UnsupportedOperationException("No ScheduleFactory for: " + datasource);
    }

    public abstract Course[] getCourses();

    public abstract Timetable getTimeTable(Course course);
}
```
Then, create an implementation of the factory for the current version of the web site (because you will need a new version when it changes).
OK, so now you have a simple factory implementation for the data interface. The benefits of this are:
- You can easily test your real application/system with a simple (non-web-based) test factory
- your code is in distinct chunks, application layer, and data layer, and you don't need to change the application unless there are functionality changes....
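To make the first benefit concrete, here is a minimal sketch of a non-web test factory. `Course`, `Timetable`, `InMemoryScheduleFactory` and `ScheduleSummary` are all hypothetical names invented for illustration (the API above names `Course` and `Timetable` but never defines them), so treat the shapes as assumptions rather than the real data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal data types: the answer names Course and Timetable but never defines them.
final class Course {
    final String name;
    Course(String name) { this.name = name; }
}

final class Timetable {
    final List<String> lessons;
    Timetable(List<String> lessons) { this.lessons = lessons; }
}

// Same abstract methods as the API above; the static newInstance lookup is left out for brevity.
abstract class ScheduleFactory {
    public abstract Course[] getCourses();
    public abstract Timetable getTimeTable(Course course);
}

// Test factory: serves fixed data and never touches the network or any HTML.
final class InMemoryScheduleFactory extends ScheduleFactory {
    private final Map<String, Timetable> byCourse = new LinkedHashMap<>();

    void add(String courseName, String... lessons) {
        byCourse.put(courseName, new Timetable(Arrays.asList(lessons)));
    }

    @Override
    public Course[] getCourses() {
        List<Course> courses = new ArrayList<>();
        for (String name : byCourse.keySet()) {
            courses.add(new Course(name));
        }
        return courses.toArray(new Course[0]);
    }

    @Override
    public Timetable getTimeTable(Course course) {
        Timetable found = byCourse.get(course.name);
        return found != null ? found : new Timetable(Collections.<String>emptyList());
    }
}

// Stand-in for application code: it only ever sees the abstract factory.
final class ScheduleSummary {
    static int countLessons(ScheduleFactory factory) {
        int total = 0;
        for (Course course : factory.getCourses()) {
            total += factory.getTimeTable(course).lessons.size();
        }
        return total;
    }
}
```

Because `ScheduleSummary` depends only on `ScheduleFactory`, the same application code should work unchanged once an HTML-backed factory is swapped in.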
The HTML Parser
OK, so you have a Factory class to implement which will access your web-page for the data you need. What will this look like?
For this, I strongly recommend a multi-level approach. You do not want to be changing the code every time the web page changes... so, create a 'dictionary' for your web page. It is a 'resource locator'. Using JSoup you have a query language (`select`), so I recommend creating a property file that identifies data points, and the select statement required to get them.
In this way you can abstract the actual location data from the code, and have 'simple' methods that return simple abstractions like `String`, or `List`, for JSoup queries like `document.select("table th span.hdr_date font").first().text()` and `table.asio_basic > tbody > tr`.
As an aside (and this is not a recommendation, necessarily...):
I maintain the JDOM XML library. As a result my natural inclination is to use XPath/XQuery for accessing data. It is more expressive, it seems, than the JSoup (CSS) select. There are ways to convert JSoup to a DOM document, and with DOM, you can use XPath. If you want, you can also convert the DOM to JDOM, and from that, XPath is easy, but that is a lot of conversion to do.
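For a feel of what the XPath route looks like once you have a DOM, here is a small sketch using only the standard `javax.xml` APIs. It skips the JSoup-to-DOM conversion and parses a tiny well-formed snippet directly; the class name `XPathAsideDemo` and the sample markup are made up for illustration. Note that `[@class='hdr_date']` only matches when the class attribute is exactly that value, which is stricter than the CSS `span.hdr_date`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathAsideDemo {

    // Rough XPath counterpart of the CSS query "table th span.hdr_date font".
    static final String FIRST_DAY_XPATH = "//table//th//span[@class='hdr_date']//font";

    // Returns the text of the first matching node, or "" when nothing matches.
    static String firstHeaderDate(String wellFormedMarkup) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(FIRST_DAY_XPATH, dom);
        } catch (Exception e) {
            // Real-world HTML is rarely well-formed XML, hence the conversion step mentioned above.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```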
So, I recommend you create a 'HTMLScheduleFactory', and that HTMLScheduleFactory has a special configuration file for each version of your source-page formats.
That configuration file tells the HTMLScheduleFactory where in the HTML to locate all the things you may need.
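As a rough sketch of that configuration idea, the class below loads a property file that maps data points to select statements. The class names, the key names (`schedule.firstDay`, `schedule.rows`) and the inline config text are assumptions made up for illustration; only the two selector strings come from the queries mentioned above. JSoup itself is left out so that the sketch runs with just the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical 'resource locator': maps logical data points to JSoup select statements.
final class SelectorDictionary {
    private final Properties selectors = new Properties();

    SelectorDictionary(Reader source) {
        try {
            selectors.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read selector configuration", e);
        }
    }

    // Fails loudly when a data point is missing, so a stale config is noticed early.
    String selector(String dataPoint) {
        String query = selectors.getProperty(dataPoint);
        if (query == null) {
            throw new IllegalArgumentException("No selector configured for: " + dataPoint);
        }
        return query;
    }
}

final class SelectorDictionaryDemo {
    // Inline stand-in for a per-version property file; the key names are invented.
    static final String CONFIG =
            "schedule.firstDay = table th span.hdr_date font\n"
          + "schedule.rows = table.asio_basic > tbody > tr\n";

    public static void main(String[] args) {
        SelectorDictionary dictionary = new SelectorDictionary(new StringReader(CONFIG));
        // With JSoup on the classpath, the factory would then run something like:
        //   document.select(dictionary.selector("schedule.firstDay")).first().text()
        System.out.println(dictionary.selector("schedule.rows"));
    }
}
```

When the page layout changes, only the property file should need a new version; the factory code keeps asking for the same keys.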
Conclusion
I recommend that:
- you separate the data layer from the application layer
- you create a factory implementation for the data layer, so that you can get data from different/multiple sources
- you create a generic HTML-based factory.
- you create a 'mapping' configuration for the HTML-based factory that maps the required data to the HTML locations for that data.
Each of these layers can be put together relatively independently, and can be tested in isolation.
As the web-site changes, if you do it right, you can:
- easily create a new factory if the data becomes available in a better format (XML/JSON/Direct-to-DB/whatever)
- relatively easily create a new configuration file if the website layout changes
- even keep the actual configuration for the website remote (you do not need to change any code if the config changes), so you can host a configuration file on a different server, and only change it if the main source changes.
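One way the remote-configuration point could be sketched: layer a remotely hosted property file over defaults bundled with the app, and fall back to the defaults when the remote copy cannot be read. `LayeredSelectorConfig` and its `Source` interface are hypothetical names, and this is only one possible arrangement, not something the approach above prescribes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class LayeredSelectorConfig {

    // Anything that can open a stream, e.g. () -> new java.net.URL(configUrl).openStream()
    interface Source {
        InputStream open() throws IOException;
    }

    // Remote entries win; keys the remote copy lacks are answered by the bundled defaults.
    static Properties load(Properties bundledDefaults, Source remote) {
        Properties merged = new Properties(bundledDefaults);
        try (InputStream in = remote.open()) {
            merged.load(in);
        } catch (IOException e) {
            // Remote config unreachable: keep working with the bundled defaults only.
        }
        return merged;
    }
}
```

Read values with `getProperty`, since that is the lookup that consults the defaults.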
Code Snippets
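```
public abstract class ScheduleFactory {

    public static ScheduleFactory newInstance(String datasource) {
        // datasource may be something like a URL, whatever....
        // Placeholder so the sketch compiles: return the concrete factory for this datasource here.
        throw new UnsupportedOperationException("No ScheduleFactory for: " + datasource);
    }

    public abstract Course[] getCourses();

    public abstract Timetable getTimeTable(Course course);
}
```
Context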
StackExchange Code Review Q#38227, answer score: 3