Parsing BibTeX files in Java
Problem
As part of a larger project to generate a website with a personal list of publications, I implemented a parser for BibTeX files.
The entry point is the parseFile method in PublicationListParser. This method scans the file for entries (and tags, a custom extension specific to my project). Each entry is read into a String by Tokenizer, and then parsed by BibItemParser.
I'm looking for all kinds of feedback: style issues, missed corner cases, performance, logical organization, etc.
PublicationListParser.java
```
package publy.io.bibtexparser;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import publy.Console;
import publy.data.Author;
import publy.data.bibitem.BibItem;

public class PublicationListParser {

    public static List parseFile(Path file) throws IOException, ParseException {
        Console.debug("Parsing publication list \"%s\"", file);
        PublicationListParser parser = new PublicationListParser();
        parser.parseFileInternal(file);
        AbbreviationHandler.handleAbbreviationsAndAuthors(parser.items, parser.abbreviations, parser.authors);
        return parser.items;
    }

    private final List items = new ArrayList<>();
    private final Map abbreviations = new HashMap<>();
    private final Map authors = new HashMap<>();

    private PublicationListParser() {
    }

    private void parseFileInternal(Path file) throws IOException, ParseException {
        try (BufferedReader in = Files.newBufferedReader(file, Charset.forName("UTF-8"))) {
            for (String l = in.readLine(); l != null; l = in.readLine()) {
                String line = l.trim();
                if (line.startsWith("@")) {
                    // A Bibitem
                    BibItem item = BibItemParser.parseBibItem(Tokenizer.collectBibItem(in, line).rep
```
Solution
.substring()
Note that .substring() values are no longer shared with the string they were created from (this changed in Java 7 update 6).
That means calls like body = body.substring(...).trim() will allocate two new strings: one for the .substring(...) and another for the .trim(), which is itself a form of substring.
So if you are concerned about efficiency, avoid marching through a string by repeatedly taking substrings.
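One way to avoid the repeated copies is to keep an index into the original string and cut out each piece only once. The following is a minimal sketch of that idea; the class and method names (FieldScanner, splitOnCommas) are made up for illustration and are not part of the reviewed code, and the split is deliberately naive (it does not respect braces or quotes).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of the reviewed project.
final class FieldScanner {

    // Splits "a , b, c" into trimmed pieces. Each piece costs one substring()
    // call; the remaining input is never re-sliced or re-trimmed.
    static List<String> splitOnCommas(String body) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos <= body.length()) {
            int comma = body.indexOf(',', pos);
            int end = (comma == -1) ? body.length() : comma;

            // Narrow [start, end) past surrounding whitespace without copying.
            int start = pos;
            while (start < end && Character.isWhitespace(body.charAt(start))) {
                start++;
            }
            while (end > start && Character.isWhitespace(body.charAt(end - 1))) {
                end--;
            }
            fields.add(body.substring(start, end)); // single copy per field

            if (comma == -1) {
                break;
            }
            pos = comma + 1; // advance the index instead of shrinking the string
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitOnCommas(" author = {Knuth} ,  year = 1984 "));
    }
}
```

Whether this matters in practice depends on the size of the input; for a personal publication list the difference is probably negligible, so treat it as an option rather than a requirement.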
use a library
I understand this is Code Review, but since this seems to be a real project, consider using a third-party library like jbibtex.
jbibtex is distributed under a liberal license and looks like a very robust BibTeX file parser.
use a tokenizer
You use several different methods for consuming input:
- reading a file line by line: in.readLine()
- reading it one character at a time: in.read()
- indexing into a string, e.g. body.codePointAt(i)
- searching through a string, e.g. body.indexOf(',') followed by body = body.substring(...)
I think you could make your life easier by tokenizing the input and parsing the file at the token level.
This would, for instance, get rid of the multiple .trim() calls all over the place, since ignoring white space would be the job of the tokenizer and thus handled in one place.
An example of a tokenizer is the class java.io.StreamTokenizer. It has two main methods:
- .nextToken()
- .pushBack()
.pushBack() pushes the last token returned by .nextToken() back onto the stream so that it will be returned again by the next call to .nextToken().
A tokenizer makes it easy to manually write a recursive descent parser for simple grammars like a BibTeX file.
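To make this concrete, here is a rough sketch of a recursive descent parser built on StreamTokenizer. It assumes a heavily simplified entry syntax, @type{key, name = "value", name = word, ...}, and the class name TinyBibParser and the "@type"/"@key" map keys are invented for the example. Real BibTeX (brace-delimited values, nested braces, LaTeX backslashes, @string abbreviations) would need more work; in particular, StreamTokenizer interprets backslash escapes inside quoted strings.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parses one simplified entry, not full BibTeX.
final class TinyBibParser {
    private final StreamTokenizer tokens;

    TinyBibParser(String input) {
        tokens = new StreamTokenizer(new StringReader(input));
        tokens.resetSyntax();            // start with every character "ordinary"
        tokens.whitespaceChars(0, ' ');  // whitespace is skipped here, in one place
        tokens.wordChars('a', 'z');
        tokens.wordChars('A', 'Z');
        tokens.wordChars('0', '9');      // digits are words, so 1984 stays a String
        tokens.wordChars('_', '_');
        tokens.wordChars('-', '-');
        tokens.wordChars(':', ':');
        tokens.quoteChar('"');           // "..." arrives as a single token
    }

    // entry := '@' WORD '{' WORD (',' field)* [','] '}'
    Map<String, String> parseEntry() {
        Map<String, String> entry = new LinkedHashMap<>();
        expect('@');
        entry.put("@type", word());
        expect('{');
        entry.put("@key", word());
        while (next() == ',') {
            if (next() == '}') {         // trailing comma right before '}'
                break;
            }
            tokens.pushBack();           // not '}', so the token starts a field
            parseField(entry);
        }
        tokens.pushBack();               // re-read the token that ended the loop
        expect('}');
        return entry;
    }

    // field := WORD '=' (QUOTED | WORD)
    private void parseField(Map<String, String> entry) {
        String name = word();
        expect('=');
        int type = next();
        if (type != '"' && type != StreamTokenizer.TT_WORD) {
            throw error("a value");
        }
        entry.put(name, tokens.sval);
    }

    private String word() {
        if (next() != StreamTokenizer.TT_WORD) {
            throw error("a word");
        }
        return tokens.sval;
    }

    private void expect(char c) {
        if (next() != c) {
            throw error("'" + c + "'");
        }
    }

    private int next() {
        try {
            return tokens.nextToken();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private IllegalArgumentException error(String expected) {
        return new IllegalArgumentException("expected " + expected + " but got " + tokens);
    }

    public static void main(String[] args) {
        String input = "@article{knuth84,\n  author = \"Donald E. Knuth\",\n  year = 1984,\n}";
        System.out.println(new TinyBibParser(input).parseEntry());
    }
}
```

Note that the parser methods never call .trim(): skipping whitespace is configured once in the constructor, and .pushBack() gives the one token of lookahead needed to handle the optional trailing comma.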
Here is a blog post on writing a recursive descent parser
with StreamTokenizer to parse boolean expressions:
https://unnikked.ga/how-to-evaluate-a-boolean-expression/
Update
Here is an example of how to roll your own parser - written in Python:
https://gist.github.com/erantapaa/5a2614adde0526d25c03
Context
StackExchange Code Review Q#106055, answer score: 3