Parsing dates from an OCR application
Problem
I wrote this code to parse dates from the output of the OCR, which means that the obtained date can be literally anything, so I put some restrictions in place:
- The date is in the format `field1?field2?field3`, where each field is the day, month or year and any delimiter may separate the numbers. This is also called the short date format.
- The fields consist of digits only (so no months as text).
- The locale is known.

I first played around with `DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale)`, but it turned out to be useful only for formatting and not for parsing, as it gives only one specific format per locale.

So I decided to roll my own code whilst still intending to use as many Java library features as possible (mainly from `java.util.Locale` and `java.time`).

The test class:

```
import static org.junit.Assert.assertEquals;

import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

import org.junit.Test;

public class DateParserTest {
    @Test
    public void testParseDutchDate() {
        List<String> dates = Arrays.asList(
            "02-10-2014",
            "2-10-2014",
            "02-10-14",
            "2-10-14",
            "02/10/2014",
            "02 10 2014"
        );
        for (String date : dates) {
            Locale locale = new Locale("nl");
            LocalDate localDate = DateParser.parseShortDate(date, locale);
            assertEquals(LocalDate.of(2014, 10, 2), localDate);
        }
    }

    @Test
    public void testParseAmericanDate() {
        List<String> dates = Arrays.asList(
            "10-02-2014",
            "10-2-2014",
            "10-02-14",
            "10-2-14",
            "10/02/2014",
            "10 02 2014"
        );
        for (String date : dates) {
            // new Locale("en-US") would create the language "en-us" with no country,
            // so the language tag is parsed instead.
            Locale locale = Locale.forLanguageTag("en-US");
            LocalDate localDate = DateParser.parseShortDate(date, locale);
            assertEquals(LocalDate.of(2014, 10, 2), localDate);
        }
    }
}
```

The parser class:

```
public final class DateParser {
    private DateParser() {
        throw new UnsupportedOperationException();
```
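The limitation of `ofLocalizedDate` described in the question can be reproduced with a small sketch. The class `LocalizedShortDateDemo` and its `tryParse` helper are hypothetical names introduced only for this illustration and are not part of the parser under review. It assumes the `en-US` short date pattern is `M/d/yy`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.FormatStyle;
import java.util.Locale;
import java.util.Optional;

class LocalizedShortDateDemo {
    /**
     * Hypothetical helper: tries to parse the text with the single SHORT date
     * format of the given locale and returns empty when the text does not fit.
     */
    static Optional<LocalDate> tryParse(String text, Locale locale) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale);
        try {
            return Optional.of(LocalDate.parse(text, formatter));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Written exactly in the en-US short format, so the formatter accepts it.
        System.out.println(tryParse("10/2/14", Locale.US));
        // Same date with a different delimiter and a four-digit year: not accepted.
        System.out.println(tryParse("10-02-2014", Locale.US));
    }
}
```

This is why a formatter built this way is too strict for OCR output, where the delimiter and the number of digits per field can vary.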
Solution
I would recommend adding a test that checks, for every `Locale` available on the platform, whether you can parse a date formatted using that locale.

The corresponding test would be:
```
@Test
public void testAllLocales() {
    LocalDate specificLocalDate = LocalDate.of(2014, 10, 2);
    Locale[] locales = Locale.getAvailableLocales();
    for (Locale locale : locales) {
        DateTimeFormatter dateTimeFormatter = DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale);
        String date = specificLocalDate.format(dateTimeFormatter);
        LocalDate localDate = DateParser.parseShortDate(date, locale);
        assertEquals("for " + date + " using " + locale, specificLocalDate, localDate);
    }
}
```

Using this we find a few sneaky bugs:
- It does not work for the locale `hr_HR`, as it uses the date format `dd.MM.yy.`, which has a dot at the end. To fix this we need to change the `DATE_PATTERN_EXTRACTION_PATTERN` to `Pattern.compile("^\\W*(\\w+)\\W+(\\w+)\\W+(\\w+)\\W*$")`. Note that we now allow any number of non-word characters at the start and end of the date format.
- The `DATE_PATTERN_EXTRACTION_PATTERN` should work with Unicode, as languages may use Unicode characters as separators, so it should be `Pattern.compile("(?iuU)^\\W*(\\w+)\\W+(\\w+)\\W+(\\w+)\\W*$")`.
- There is a bug with the locale `zh_HK`, supposedly due to Unicode characters in the formatted date, as the parser is not able to extract the three numbers from the date.
- The `ja_JP_JP_#u-ca-japanese` locale does not work, but this may be due to a JDK bug: https://stackoverflow.com/questions/26169008/the-ja-jp-jp-u-ca-japanese-locale-cannot-be-reparsed-by-its-own-pattern-using-t
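To make the first two fixes concrete, here is a minimal sketch that applies the adjusted pattern to two short date formats. The class `ExtractionPatternDemo` and the method `extractFields` are hypothetical names for this illustration. How the real `DateParser` uses `DATE_PATTERN_EXTRACTION_PATTERN` is not shown in the question, so treat this as an assumption about its role.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractionPatternDemo {
    // Adjusted pattern: optional non-word characters are allowed at both ends,
    // and the (?iuU) flags enable Unicode-aware matching.
    static final Pattern DATE_PATTERN_EXTRACTION_PATTERN =
            Pattern.compile("(?iuU)^\\W*(\\w+)\\W+(\\w+)\\W+(\\w+)\\W*$");

    /** Returns the three fields in order, or null when the input does not match. */
    static String[] extractFields(String datePattern) {
        Matcher matcher = DATE_PATTERN_EXTRACTION_PATTERN.matcher(datePattern);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        // A format with a trailing dot, as described for hr_HR.
        System.out.println(String.join(",", extractFields("dd.MM.yy.")));
        // A format without leading or trailing separators still matches.
        System.out.println(String.join(",", extractFields("M/d/yy")));
    }
}
```

Returning `null` for a non-match keeps the sketch short. In real code an `Optional` or an exception with the offending input would probably be easier to debug.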
Context
StackExchange Code Review Q#64570, answer score: 2