patternjavaMinor
Thread-safe large text buffer
Viewed 0 times
textlargethreadsafebuffer
Problem
This is part of one of my projects called largetext, which actually stemmed from a question on Stack Overflow. The goal is to provide access to a very large text file as a
In the next version, I want to provide a thread safe version of the main class,
To give an idea of the impact of this method, looking for all lines more than 10 characters long with a
Some explanation on how this all works:
The text file is divided into "text ranges" by a decoding process, mapping one byte range (into the file) to one character range; therefore only the needed text is loaded. This class also handles callers to
The
Note: copyright header and imports omitted for "brevity".
Here is the non thread safe version (link):
```
/**
* A large text file as a {@link CharSequence}: non thread safe version
*
* Despite the not really reassuring name, this is the class you will use the
* most often.
*
* This class's {@code .charAt()
CharSequence so that it be usable with not only java.util.regex but also grappa.In the next version, I want to provide a thread safe version of the main class,
LargeText. I have written and tested a thread safe implementation, but now I'd like to know whether it can be done faster... I have a worst case loss of performance of 40% in calls to .charAt().To give an idea of the impact of this method, looking for all lines more than 10 characters long with a
Matcher and the simple Pattern ^.{10,}$ (in multiline mode) will make one call to .charAt() for each character in the file; and the problem is even worse if you use lookarounds, of course. As such this method is pretty critical! On my machine, this method is called more than 80 million times per second with the non thread safe variant belowSome explanation on how this all works:
The text file is divided into "text ranges" by a decoding process, mapping one byte range (into the file) to one character range; therefore only the needed text is loaded. This class also handles callers to
.charAt() having required a number of characters greater than what has been decoded at a given moment, waking them up when appropriate etc.The
LargeText class uses the principle of locality: when a caller calls .charAt(), it is very likely that the next call to .charAt() will hit the same character range, therefore it keeps the current CharBuffer and text range at hand, and only loads a new one if .charAt() gets out of range.Note: copyright header and imports omitted for "brevity".
Here is the non thread safe version (link):
```
/**
* A large text file as a {@link CharSequence}: non thread safe version
*
* Despite the not really reassuring name, this is the class you will use the
* most often.
*
* This class's {@code .charAt()
Solution
How large buffer size are you using? What is the hit/miss frequency on your buffer in
If the hit frequency is less than 90% and you're doing as you say 80E6
So I would change:
to:
and change the visibility in
Also, if there is any reason to believe that
Edit:
So with the stats from the comments, the lines executed 99% of the time is:
Not thread safe:
Thread safe:
Set speculation hat: On
The JIT should inline expand the function calls to
Now if you doubt your JVM is good, you can try to inline the two function calls manually and restructure the thread safe variant like this:
just in case the JVM is unable to deduce that those stack variables shouldn't be initialized unless you actually get there.
charAt()?If the hit frequency is less than 90% and you're doing as you say 80E6
charAt() per second that means you have 8 million buffer allocations per second which is going to cause the GC significant head aches. AFAIR the JVM can stall all threads while doing GC under certain circumstances (I had a similar problem developing a high-performance parallel java application).So I would change:
CURRENT.set(new CurrentBuffer(range, buffer));to:
buf.range = range;
buf.buffer = buffer;and change the visibility in
CurrentBuffer, it's a private nested class so you're not breaking encapsulation any way. This will put less strain on the GC.Also, if there is any reason to believe that
charAt() is progressing even roughly linearly through the indices, a pre-fetch running in the background would speed things up.Edit:
So with the stats from the comments, the lines executed 99% of the time is:
Not thread safe:
if (!range.contains(index)) {
// Branch not taken
}
return buffer.charAt(index - range.getLowerBound());Thread safe:
final CurrentBuffer buf = CURRENT_BUFFER.get();
if (buf.containsIndex(index)) // Branch taken
return buf.charAt(index);
... stuff on the stack ...Set speculation hat: On
The JIT should inline expand the function calls to
buf.containsIndex() and buf.charAt, it should also not do anything with the stack variables (not even initializing). So the remaining difference is CURRENT_BUFFER.get() and the book keeping associated with the buf variable which could make or break here as you're only doing a few instructions to fetch the data every time. Adding a few book keeping instructions on the buf variable could have a big impact in your case. I would investigate if I could get rid of it somehow, maybe encapsulate it somehow so the same instance is used in all calls. Now if you doubt your JVM is good, you can try to inline the two function calls manually and restructure the thread safe variant like this:
final CurrentBuffer buf = CURRENT_BUFFER.get();
if (!buf.containsIndex(index)){
final TextRange textRange = decoder.getRange(index);
final IntRange range = textRange.getCharRange();
final CharBuffer buffer = loader.load(textRange);
CURRENT.set(new CurrentBuffer(range, buffer));
return buffer.charAt(index - range.getLowerBound());
}
return buf.charAt(index);just in case the JVM is unable to deduce that those stack variables shouldn't be initialized unless you actually get there.
Code Snippets
CURRENT.set(new CurrentBuffer(range, buffer));buf.range = range;
buf.buffer = buffer;if (!range.contains(index)) {
// Branch not taken
}
return buffer.charAt(index - range.getLowerBound());final CurrentBuffer buf = CURRENT_BUFFER.get();
if (buf.containsIndex(index)) // Branch taken
return buf.charAt(index);
... stuff on the stack ...final CurrentBuffer buf = CURRENT_BUFFER.get();
if (!buf.containsIndex(index)){
final TextRange textRange = decoder.getRange(index);
final IntRange range = textRange.getCharRange();
final CharBuffer buffer = loader.load(textRange);
CURRENT.set(new CurrentBuffer(range, buffer));
return buffer.charAt(index - range.getLowerBound());
}
return buf.charAt(index);Context
StackExchange Code Review Q#48411, answer score: 3
Revisions (0)
No revisions yet.