HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaModerate

"ONCE", "UPON", "A", "TIME"

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
onceupontime

Problem

I'm working on a small program to perform various tasks on text content, primarily at the word level. I wrote these methods as ways to help prepare the raw text file into something more malleable, like a List, where at a later point I can perform various routines such as counting and sorting words, and so on.

Concerns:

-
In splitTextStringIntoWordList I found myself having to split textString param into String[] array, and immediately afterwards adding the elements from the array one at a time, parsing with regex, into a List. Is there a better way to do this that might not need as much manipulation?

-
Are some of my methods just doing too many things between parameters and return?

-
Is the JavaDoc clear, concise and descriptive?

-
What are some beginner mistakes I might be making?

```
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.Collections;
import java.util.regex.*;
import java.net.URL;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

public class TextFileWordSplitter {

/**
* Remove punctuation marks from a String using regular expressions.
*
* This will account for contractions such as "it's" and "can't", as well as
* hyphenated words such as "first-class" and "low-budget", which in both cases
* will be considered as whole words.
*
* @param input The String from which to remove punctuation
* @return The String with the punctuation removed, or empty String
*/
static String removePunctuationFromString(String input) {

Pattern regex = Pattern.compile("([A-Za-z]?[\\-']?[A-Za-z])+");
Matcher matcher = regex.matcher(input);
if (matcher.find()) {
return matcher.group();
} else {
return "";
}
}

/**
* Create a String by fetching a text file at the provided URL.
* @param url The URL where the text file is loca

Solution

Problems

I'm not sure that TextFileWordSplitter is a good name for the class, especially since the source of the text is, in general, a network resource rather than a java.io.File.

List wordList = new ArrayList(); should be List wordList = new ArrayList<>(); to suppress a compiler warning.

The readUrlTextContent() function declares that it throws IOException, but instead it catches almost every IOException and returns null. (The only IOException that can get thrown is a MalformedURLException.) You should just let all IOExceptions propagate naturally.

Make up your mind which functions are public and which ones are private. The default access is almost never a good choice.

Nitpicks

Your regex does not need the capturing parentheses. It also does not need the backslash to quote the hyphen in the character class, since the hyphen it taken literally if it is the first or the last character in the character class.

The JavaDoc should avoid documenting implementation details such as "using regular expressions" — that's none of the caller's business, unless you want to also document exactly what regex is used. JavaDoc should be written in the third-person indicative mood rather than the imperative.

Decomposition

I like that you have split up the work into functions, but I would decompose the work differently.

Having the .toUpperCase() call buried within splitTextStringIntoWordList() is surprising. What does the uppercase transformation have to do with splitting? I would move it into the removePunctuationFromString() function and rename the function to normalizeWord().

readUrlTextContent() is probably harmful. This task can be accomplished without random access to the stream, so you don't need to buffer the entire text into a string. Just let the BufferedReader do its job: it will buffer just enough to enhance performance, and discard the parts of the text that you have already processed.

Instead of calling String.split(), I recommend using a Scanner, which is a convenient way to fetch a word at a time. I would define two versions of the words() function: one that accepts a Scanner, and another that accepts a URL.

import java.io.*;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    private static final Pattern WORD_PATTERN =
        Pattern.compile("[A-Za-z]*[-']?[A-Za-z]+");

    /**
     * Extracts an alphabetic word, possibly containing up to one hyphen or
     * apostrophe, and returns it in uppercase.
     *
     * @return The extracted word, or an empty string if the input is all
     *         punctuation.
     */
    private static String normalizeWord(String s) {
        Matcher m = WORD_PATTERN.matcher(s);
        return m.find() ? m.group().toUpperCase() : "";
    }

    public static List words(URL url) throws IOException {
        try ( InputStream is = url.openStream();
              BufferedInputStream bis = new BufferedInputStream(is);
              Scanner scanner = new Scanner(bis) ) {
            return words(scanner);
        }
    }

    public static List words(Scanner scanner) {
        List results = new ArrayList<>();
        scanner.reset();
        while (scanner.hasNext()) {
            String word = normalizeWord(scanner.next());
            if (!word.isEmpty()) {
                results.add(word);
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://textfiles.com/stories/antcrick.txt");
        int i = 0;
        for (String word : words(url)) {
            System.out.printf("[%d]: %s\n", i++, word);
        }
    }
}

Code Snippets

import java.io.*;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    private static final Pattern WORD_PATTERN =
        Pattern.compile("[A-Za-z]*[-']?[A-Za-z]+");

    /**
     * Extracts an alphabetic word, possibly containing up to one hyphen or
     * apostrophe, and returns it in uppercase.
     *
     * @return The extracted word, or an empty string if the input is all
     *         punctuation.
     */
    private static String normalizeWord(String s) {
        Matcher m = WORD_PATTERN.matcher(s);
        return m.find() ? m.group().toUpperCase() : "";
    }

    public static List<String> words(URL url) throws IOException {
        try ( InputStream is = url.openStream();
              BufferedInputStream bis = new BufferedInputStream(is);
              Scanner scanner = new Scanner(bis) ) {
            return words(scanner);
        }
    }

    public static List<String> words(Scanner scanner) {
        List<String> results = new ArrayList<>();
        scanner.reset();
        while (scanner.hasNext()) {
            String word = normalizeWord(scanner.next());
            if (!word.isEmpty()) {
                results.add(word);
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://textfiles.com/stories/antcrick.txt");
        int i = 0;
        for (String word : words(url)) {
            System.out.printf("[%d]: %s\n", i++, word);
        }
    }
}

Context

StackExchange Code Review Q#102645, answer score: 15

Revisions (0)

No revisions yet.