Simplify splitting a String into alpha and numeric parts
Problem
Requirement: Parse a String into chunks of numeric characters and alpha characters. Alpha characters should be separated from the numeric, other characters should be ignored.
Example Data:
Input          Desired Output
1A             [1, A]
12             [12]
12G            [12, G]
12ABC-SFS513   [12, ABC, SFS, 513]
AGE+W#FE       [AGE, W, FE]
-12WE-         [12, WE]
-12- &%3WE-    [12, 3, WE]
Question:
The code below accomplishes this. However, I am looking for any suggestions as to a better way to accomplish this (maybe a crazy regex using String.split()?) or any changes that could make this code more readable/easy to follow.
Code:
private static String VALID_PATTERN = "[0-9]+|[A-Z]+";

private List<String> parse(String toParse){
    List<String> chunks = new LinkedList<String>();
    toParse = toParse + "$"; //Added invalid character to force the last chunk to be chopped off
    int beginIndex = 0;
    int endIndex = 0;
    while(endIndex < toParse.length()){
        while(toParse.substring(beginIndex, endIndex + 1).matches(VALID_PATTERN)){
            endIndex++;
        }
        if(beginIndex != endIndex){
            chunks.add(toParse.substring(beginIndex, endIndex));
        } else {
            endIndex++;
        }
        beginIndex = endIndex;
    }
    return chunks;
}
Solution
First of all, yes, there is a crazy regex you can give to String.split:
"[^A-Z0-9]+|(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])"
What this means is to split on any sequence of characters which aren't digits or capital letters, as well as between any occurrence of a capital letter followed by a digit or any digit followed by a capital letter. The trick here is to match the space between a capital letter and a digit (or vice versa) without consuming the letter or the digit. For this we use look-behind to match the part before the split and look-ahead to match the part after the split.
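As a rough illustration of the split approach, here is a minimal sketch. The class name SplitDemo and the helper splitChunks are hypothetical names chosen for this example and are not part of the original answer. One detail worth noting: String.split returns a leading empty string when the input starts with a delimiter (as in "-12WE-"), so the sketch filters empty pieces out.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitDemo {
    // The split regex quoted above: runs of ignored characters, or the
    // zero-width boundary between a capital letter and a digit.
    static final String SPLIT_REGEX =
            "[^A-Z0-9]+|(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])";

    // Hypothetical helper (name is an assumption for this sketch).
    static List<String> splitChunks(String input) {
        List<String> chunks = new ArrayList<>();
        for (String piece : input.split(SPLIT_REGEX)) {
            // split() yields a leading "" when the input starts with a
            // delimiter (e.g. "-12WE-"), so empty pieces are skipped.
            if (!piece.isEmpty()) {
                chunks.add(piece);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(splitChunks("12ABC-SFS513")); // [12, ABC, SFS, 513]
        System.out.println(splitChunks("-12- &%3WE-"));  // [12, 3, WE]
    }
}
```

The extra filtering step is part of why splitting is an awkward fit here: the regex has to describe everything that is not wanted, and the result still needs cleaning up.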
However, as you've probably noticed, the above regex is quite a bit more complicated than your VALID_PATTERN. This is because what you're really doing is trying to extract certain parts from the string, not to split it.
So finding all the parts of the string which match the pattern and putting them in a list is the more natural approach to the problem. This is what your code does, but it does so in a needlessly complicated way. You can greatly simplify your code by using
Pattern.matcher like this:
private static final Pattern VALID_PATTERN = Pattern.compile("[0-9]+|[A-Z]+");

private List<String> parse(String toParse) {
    List<String> chunks = new LinkedList<String>();
    Matcher matcher = VALID_PATTERN.matcher(toParse);
    while (matcher.find()) {
        chunks.add( matcher.group() );
    }
    return chunks;
}
If you do something like this more than once, you might want to refactor the body of this method into a method
findAll which takes the string and the pattern as arguments, and then call it as findAll(toParse, VALID_PATTERN) in parse.
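One way that refactoring might look is sketched below. The wrapper class ChunkParser is a hypothetical name used only to make the example self-contained, and ArrayList is used in place of the LinkedList from the code above as an arbitrary choice for this sketch. The findAll signature follows the call findAll(toParse, VALID_PATTERN) suggested in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkParser {
    private static final Pattern VALID_PATTERN = Pattern.compile("[0-9]+|[A-Z]+");

    // Collects every non-overlapping match of the pattern, in input order.
    static List<String> findAll(String input, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // parse is now a one-line call to the reusable helper.
    static List<String> parse(String toParse) {
        return findAll(toParse, VALID_PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(parse("AGE+W#FE")); // [AGE, W, FE]
        System.out.println(parse("-12WE-"));   // [12, WE]
    }
}
```

Because findAll takes the pattern as a parameter, the same helper can be reused with any other precompiled Pattern.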
Context
StackExchange Code Review Q#2345, answer score: 15