HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaModerate

Fast alternative for String.split by whitespace

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fastalternativewhitespacesplitforstring

Problem

After profiling of one project I've found a bottleneck in the code, specifically the line:

String columns[] = line.split("\\s+");


I've tried to find a workaround, but all ready made solutions, such as StringTokenizer and Scanner seem to be less efficient (for example, look at the answer on SO). So I decided to write a custom function.

private List fastSplitWS(final String text)
{
  final List result = new ArrayList();

  if(text != null)
  {
    final int n = text.length();
    if(n > 0)
    {
      int index1 = 0;
      char x = 0;
      boolean y = false; // "preceding char is separator" flag
      for(int i = 0; i = 9 && x <= 13) || x == ' ')
        {
          if(!y)
          {
            String token = text.substring(index1, i);
            result.add(token);
          }
          index1 = i + 1;
          y = true;
        }
        else
        {
          y = false;
        }
      }

      if(index1 < n - 1)
      {
        result.add(text.substring(index1));
      }
    }
  }

  return result;
}


The problem is that the code performs even worse than String.split() although I don't see how to improve it. According to profiler, most problematic parts are:

  • self (lines of the plain code in this method) - 58%



  • String.charAt - 21%



  • String.subString - 12%



  • ArrayList.add - 7%



Could someone suggest how to optimize the self code?

Solution

Bug #1

You have an off-by-one error here:

if(index1 < n - 1)
  {
    result.add(text.substring(index1));
  }


The consequence of this is that "a b c" will be split to ["a", "b"], poor "c" is left out.

Bug #2

I would expect the same output for these calls:

fastSplitWS("");
fastSplitWS("   ");


But that's not the case. The first returns an empty list, the second a list with a single empty string in it.
This behavior is also inconsistent with String.split.

This is caused by this:

if(!y)


One way to fix:

if(!y && i > 0)


Another way is to initialize y to true.
(Notice that with this variable name,
it's hard to understand the logical meaning of this.)

Readability

@TopinFrassi was too kind.
I find this code extremely hard to read,
mostly due to the poor naming throughout.
The unconventional indenting and brace placement don't help.

Performance

It's worth nothing the trade-offs of using .charAt and .toCharArray.
Take a look at those links, of the implementations in OpenJDK 8:

  • toCharArray does an arraycopy, using double the space



  • charAt does a boundary check on the index parameter



Which one is more efficient? I've no idea...

Improvement ideas

The indentation levels can be reduced here:

if(text != null)
  {
    final int n = text.length();
    if(n > 0)
    {


The first one can be reduced using an early return:

if(text == null) return result;
final int n = text.length();
if(n > 0)
{


The second can be reduced by eliminating the if(n > 0) completely,
as the loop logic takes care of the n == 0 case automatically.

Suggested implementation

Applying the above suggestions and bugfixes, the implementation becomes:

private List fastSplitWS(final String text) {
    if (text == null) {
        throw new IllegalArgumentException("the text to split should not be null");
    }

    final List result = new ArrayList();

    final int len = text.length();
    int tokenStart = 0;
    boolean prevCharIsSeparator = true;  // "preceding char is separator" flag

    char[] chars = text.toCharArray();

    for (int pos = 0; pos < len; ++pos) {
        char c = chars[pos];

        if (c == '\t' || c == ' ') {
            if (!prevCharIsSeparator) {
                result.add(text.substring(tokenStart, pos));
                prevCharIsSeparator = true;
            }
            tokenStart = pos + 1;
        } else {
            prevCharIsSeparator = false;
        }
    }

    if (tokenStart < len) {
        result.add(text.substring(tokenStart));
    }

    return result;
}

Code Snippets

if(index1 < n - 1)
  {
    result.add(text.substring(index1));
  }
fastSplitWS("");
fastSplitWS("   ");
if(!y && i > 0)
if(text != null)
  {
    final int n = text.length();
    if(n > 0)
    {
if(text == null) return result;
final int n = text.length();
if(n > 0)
{

Context

StackExchange Code Review Q#111201, answer score: 10

Revisions (0)

No revisions yet.