patternjavaModerate
Fast alternative for String.split by whitespace
Viewed 0 times
fastalternativewhitespacesplitforstring
Problem
After profiling of one project I've found a bottleneck in the code, specifically the line:
I've tried to find a workaround, but all ready made solutions, such as
The problem is that the code performs even worse than
Could someone suggest how to optimize the
String columns[] = line.split("\\s+");I've tried to find a workaround, but all ready made solutions, such as
StringTokenizer and Scanner seem to be less efficient (for example, look at the answer on SO). So I decided to write a custom function.private List fastSplitWS(final String text)
{
final List result = new ArrayList();
if(text != null)
{
final int n = text.length();
if(n > 0)
{
int index1 = 0;
char x = 0;
boolean y = false; // "preceding char is separator" flag
for(int i = 0; i = 9 && x <= 13) || x == ' ')
{
if(!y)
{
String token = text.substring(index1, i);
result.add(token);
}
index1 = i + 1;
y = true;
}
else
{
y = false;
}
}
if(index1 < n - 1)
{
result.add(text.substring(index1));
}
}
}
return result;
}The problem is that the code performs even worse than
String.split() although I don't see how to improve it. According to profiler, most problematic parts are:self(lines of the plain code in this method) - 58%
String.charAt- 21%
String.subString- 12%
ArrayList.add- 7%
Could someone suggest how to optimize the
self code?Solution
Bug #1
You have an off-by-one error here:
The consequence of this is that
Bug #2
I would expect the same output for these calls:
But that's not the case. The first returns an empty list, the second a list with a single empty string in it.
This behavior is also inconsistent with
This is caused by this:
One way to fix:
Another way is to initialize
(Notice that with this variable name,
it's hard to understand the logical meaning of this.)
Readability
@TopinFrassi was too kind.
I find this code extremely hard to read,
mostly due to the poor naming throughout.
The unconventional indenting and brace placement don't help.
Performance
It's worth nothing the trade-offs of using
Take a look at those links, of the implementations in OpenJDK 8:
Which one is more efficient? I've no idea...
Improvement ideas
The indentation levels can be reduced here:
The first one can be reduced using an early return:
The second can be reduced by eliminating the
as the loop logic takes care of the
Suggested implementation
Applying the above suggestions and bugfixes, the implementation becomes:
You have an off-by-one error here:
if(index1 < n - 1)
{
result.add(text.substring(index1));
}The consequence of this is that
"a b c" will be split to ["a", "b"], poor "c" is left out.Bug #2
I would expect the same output for these calls:
fastSplitWS("");
fastSplitWS(" ");But that's not the case. The first returns an empty list, the second a list with a single empty string in it.
This behavior is also inconsistent with
String.split.This is caused by this:
if(!y)One way to fix:
if(!y && i > 0)Another way is to initialize
y to true.(Notice that with this variable name,
it's hard to understand the logical meaning of this.)
Readability
@TopinFrassi was too kind.
I find this code extremely hard to read,
mostly due to the poor naming throughout.
The unconventional indenting and brace placement don't help.
Performance
It's worth nothing the trade-offs of using
.charAt and .toCharArray.Take a look at those links, of the implementations in OpenJDK 8:
toCharArraydoes anarraycopy, using double the space
charAtdoes a boundary check on the index parameter
Which one is more efficient? I've no idea...
Improvement ideas
The indentation levels can be reduced here:
if(text != null)
{
final int n = text.length();
if(n > 0)
{The first one can be reduced using an early return:
if(text == null) return result;
final int n = text.length();
if(n > 0)
{The second can be reduced by eliminating the
if(n > 0) completely,as the loop logic takes care of the
n == 0 case automatically.Suggested implementation
Applying the above suggestions and bugfixes, the implementation becomes:
private List fastSplitWS(final String text) {
if (text == null) {
throw new IllegalArgumentException("the text to split should not be null");
}
final List result = new ArrayList();
final int len = text.length();
int tokenStart = 0;
boolean prevCharIsSeparator = true; // "preceding char is separator" flag
char[] chars = text.toCharArray();
for (int pos = 0; pos < len; ++pos) {
char c = chars[pos];
if (c == '\t' || c == ' ') {
if (!prevCharIsSeparator) {
result.add(text.substring(tokenStart, pos));
prevCharIsSeparator = true;
}
tokenStart = pos + 1;
} else {
prevCharIsSeparator = false;
}
}
if (tokenStart < len) {
result.add(text.substring(tokenStart));
}
return result;
}Code Snippets
if(index1 < n - 1)
{
result.add(text.substring(index1));
}fastSplitWS("");
fastSplitWS(" ");if(!y && i > 0)if(text != null)
{
final int n = text.length();
if(n > 0)
{if(text == null) return result;
final int n = text.length();
if(n > 0)
{Context
StackExchange Code Review Q#111201, answer score: 10
Revisions (0)
No revisions yet.