Parsing dates from an OCR application

Problem

I wrote this code to parse dates from the output of the OCR, which means that the obtained date can be literally anything, so I put some restrictions in place:

  • The date is in the format field1?field2?field3, where each field is a day, month or year and any delimiter may separate the numbers. This is also called the short date format.
  • The fields consist of numbers only (so no months as text).
  • The locale is known.

I first played around with DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale), but it turned out to be only of use for formatting, and not for parsing, as it only gives one specific format per locale.

So I decided to roll my own code while still using as many Java library features as possible (mainly from java.util.Locale and java.time).
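
To make the limitation above concrete, here is a minimal sketch that looks up the single SHORT pattern the JDK associates with a locale and checks whether a differently delimited date is accepted. The class and method names are hypothetical and not part of the original code; the exact pattern string depends on the JDK version and its locale data, so no specific output is claimed.

```java
import java.time.LocalDate;
import java.time.chrono.IsoChronology;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical demo class, not part of the code under review.
final class ShortPatternDemo {

    // Returns the one SHORT date pattern the JDK associates with the locale.
    static String shortPattern(Locale locale) {
        return DateTimeFormatterBuilder.getLocalizedDateTimePattern(
                FormatStyle.SHORT, null, IsoChronology.INSTANCE, locale);
    }

    // True if the locale's SHORT formatter accepts the text as a date.
    static boolean parsesWithLocalizedFormatter(String text, Locale locale) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale);
        try {
            LocalDate.parse(text, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl");
        System.out.println("SHORT pattern for nl: " + shortPattern(dutch));
        // A date whose delimiter differs from the pattern's literal is rejected.
        System.out.println("02/10/2014 accepted: "
                + parsesWithLocalizedFormatter("02/10/2014", dutch));
    }
}
```

Because the formatter matches delimiters literally, OCR output with any other separator needs a more tolerant parser.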

The test class:

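// Imports assume JUnit 4.
import static org.junit.Assert.assertEquals;

import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

import org.junit.Test;

public class DateParserTest {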
    @Test
    public void testParseDutchDate() {
        List<String> dates = Arrays.asList(
            "02-10-2014",
            "2-10-2014",
            "02-10-14",
            "2-10-14",
            "02/10/2014",
            "02 10 2014"
        );

        for (String date : dates) {
            Locale locale = new Locale("nl");

            LocalDate localDate = DateParser.parseShortDate(date, locale);

            assertEquals(LocalDate.of(2014, 10, 2), localDate);
        }
    }

    @Test
    public void testParseAmericanDate() {
        List<String> dates = Arrays.asList(
            "10-02-2014",
            "10-2-2014",
            "10-02-14",
            "10-2-14",
            "10/02/2014",
            "10 02 2014"
        );

        for (String date : dates) {
            Locale locale = Locale.forLanguageTag("en-US");

            LocalDate localDate = DateParser.parseShortDate(date, locale);

            assertEquals(LocalDate.of(2014, 10, 2), localDate);
        }
    }
}


The parser class:

```
public final class DateParser {
    private DateParser() {
        throw new UnsupportedOperationException();
    }

    // ... the remainder of the class is truncated in the source
}
```

Solution

I recommend adding a test that checks, for every Locale available on the platform, whether you can parse a date that was formatted using that locale.

The corresponding test would be:

@Test
public void testAllLocales() {
    LocalDate specificLocalDate = LocalDate.of(2014, 10, 2);

    Locale[] locales = Locale.getAvailableLocales();
    for (Locale locale : locales) {
        DateTimeFormatter dateTimeFormatter = DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale);
        String date = specificLocalDate.format(dateTimeFormatter);

        LocalDate localDate = DateParser.parseShortDate(date, locale);

        assertEquals("for " + date + " using " + locale, specificLocalDate, localDate);
    }
}
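
A test like this stops at the first failing locale. As a complementary sketch, the hypothetical helper below collects every failing locale in one pass, so all problem locales show up in a single run. The class name, method name and the use of a BiFunction as the parser are assumptions made for illustration; they are not part of the original code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper, not part of the code under review.
final class LocaleRoundTripReport {

    // Formats the date with each locale's SHORT formatter, hands the text to the
    // parser, and records every locale whose result differs or throws.
    static Map<Locale, String> failures(LocalDate expected,
                                        BiFunction<String, Locale, LocalDate> parser) {
        Map<Locale, String> failures = new LinkedHashMap<>();
        for (Locale locale : Locale.getAvailableLocales()) {
            String text = null;
            try {
                text = expected.format(
                        DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale));
                LocalDate actual = parser.apply(text, locale);
                if (!expected.equals(actual)) {
                    failures.put(locale, text + " -> " + actual);
                }
            } catch (RuntimeException e) {
                failures.put(locale, text + " -> " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Baseline parser for the demo: the locale's own SHORT formatter.
        BiFunction<String, Locale, LocalDate> baseline = (text, locale) ->
                LocalDate.parse(text,
                        DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale));
        failures(LocalDate.of(2014, 10, 2), baseline)
                .forEach((locale, detail) -> System.out.println(locale + ": " + detail));
    }
}
```

With the parser under review, the call would presumably look like LocaleRoundTripReport.failures(LocalDate.of(2014, 10, 2), DateParser::parseShortDate), and the test could then assert that the returned map is empty.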


Using this we find a few sneaky bugs:

  • It does not work for the locale hr_HR, as it uses the date format dd.MM.yy., which has a dot at the end. To fix this, change DATE_PATTERN_EXTRACTION_PATTERN to Pattern.compile("^\\W*(\\w+)\\W+(\\w+)\\W+(\\w+)\\W*$"). Note that this now allows any number of non-word characters at the start and end of the date format.
  • The DATE_PATTERN_EXTRACTION_PATTERN should be Unicode-aware, as languages may use Unicode characters as separators, so it should be: Pattern.compile("(?iuU)^\\W*(\\w+)\\W+(\\w+)\\W+(\\w+)\\W*$").
  • There is a bug with the locale zh_HK, presumably due to Unicode characters in the formatted date: the parser is not able to extract the three numbers from the date.
  • The ja_JP_JP_#u-ca-japanese locale does not work, but this may be due to a JDK bug: https://stackoverflow.com/questions/26169008/the-ja-jp-jp-u-ca-japanese-locale-cannot-be-reparsed-by-its-own-pattern-using-t
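
The effect of the lenient pattern from the first two points can be checked in isolation. The sketch below is a minimal, hypothetical demo (the class and method names are not from the original code, and because the parser class is not shown in full it is an assumption whether the pattern is applied to the format string, the date text, or both). It extracts the three fields from inputs with and without a trailing dot.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the code under review.
final class FieldExtractionDemo {

    // Optional non-word runs at both ends; (?U) makes \w and \W Unicode-aware.
    static final Pattern EXTRACTION_PATTERN =
            Pattern.compile("(?iuU)^\\W*(\\w+)\\W+(\\w+)\\W+(\\w+)\\W*$");

    // Returns the three fields, or an empty list if the input does not
    // consist of exactly three word runs separated by non-word characters.
    static List<String> fields(String input) {
        Matcher matcher = EXTRACTION_PATTERN.matcher(input);
        if (!matcher.matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(matcher.group(1), matcher.group(2), matcher.group(3));
    }

    public static void main(String[] args) {
        System.out.println(fields("dd.MM.yy."));  // prints [dd, MM, yy]
        System.out.println(fields("02.10.14."));  // prints [02, 10, 14]
        System.out.println(fields("02-10-2014")); // prints [02, 10, 2014]
        System.out.println(fields("02-10"));      // prints []
    }
}
```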

Context

StackExchange Code Review Q#64570, answer score: 2
