Parsing BibTeX files in Java

Submitted by: @import:stackexchange-codereview

Problem

As part of a larger project to generate a website with a personal list of publications, I implemented a parser for BibTeX files.

The entry point is the parseFile method in PublicationListParser. This method scans the file for entries (and tags - a custom extension specific to my project). Each entry is read into a String by Tokenizer, and then parsed by BibItemParser.

I'm looking for all kinds of feedback: style issues, missed corner cases, performance, logical organization, etc.

PublicationListParser.java

```
package publy.io.bibtexparser;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import publy.Console;
import publy.data.Author;
import publy.data.bibitem.BibItem;

public class PublicationListParser {

    public static List<BibItem> parseFile(Path file) throws IOException, ParseException {
        Console.debug("Parsing publication list \"%s\"", file);
        PublicationListParser parser = new PublicationListParser();

        parser.parseFileInternal(file);
        AbbreviationHandler.handleAbbreviationsAndAuthors(parser.items, parser.abbreviations, parser.authors);

        return parser.items;
    }

    private final List<BibItem> items = new ArrayList<>();
    private final Map<String, String> abbreviations = new HashMap<>();
    private final Map<String, Author> authors = new HashMap<>();

    private PublicationListParser() {
    }

    private void parseFileInternal(Path file) throws IOException, ParseException {
        try (BufferedReader in = Files.newBufferedReader(file, Charset.forName("UTF-8"))) {
            for (String l = in.readLine(); l != null; l = in.readLine()) {
                String line = l.trim();

                if (line.startsWith("@")) {
                    // A Bibitem
                    BibItem item = BibItemParser.parseBibItem(Tokenizer.collectBibItem(in, line).rep
```

Solution

.substring()

Note that since Java 7 update 6, the string returned by .substring() no longer
shares its character array with the string it was created from; the characters are copied.
That means a call like body = body.substring(...).trim()
can allocate two new strings: one for the .substring(...) and another
for the .trim(), which is itself a form of substring.

So if you are concerned about efficiency, avoid marching through a string
by repeatedly taking substrings.
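
As a rough illustration of the alternative, here is a minimal sketch that walks the input with an index and creates exactly one string per field. The class and method names (FieldSplitter, splitOnCommas) are made up for this example and are not part of your code. It is also deliberately naive: it splits on every comma and does not know about braces or quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the reviewed project.
final class FieldSplitter {

    // Splits on commas by advancing an index through the original string.
    // Whitespace is skipped by moving the indices, so each field costs one
    // substring() call and no "rest of the input" strings are ever created.
    static List<String> splitOnCommas(String body) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos <= body.length()) {
            int comma = body.indexOf(',', pos);
            int end = (comma < 0) ? body.length() : comma;

            // Trim by index instead of calling trim() on a temporary string.
            int from = pos;
            while (from < end && Character.isWhitespace(body.charAt(from))) {
                from++;
            }
            int to = end;
            while (to > from && Character.isWhitespace(body.charAt(to - 1))) {
                to--;
            }

            if (to > from) {
                parts.add(body.substring(from, to));
            }
            pos = end + 1; // continue after the comma (or past the end)
        }
        return parts;
    }
}
```

The point is only the shape of the loop: the position moves forward, the input string stays untouched.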

use a library

I understand this is Code Review, but since this seems to be a real
project, consider using a third-party library like jbibtex.
jbibtex is distributed under a liberal license and looks like
a very robust BibTeX file parser.

use a tokenizer

You use several different methods for consuming input:

  • reading a file line by line, e.g. in.readLine()
  • reading it one character at a time, e.g. in.read()
  • indexing into a string, e.g. body.codePointAt(i)
  • searching through a string, e.g. body.indexOf(','), followed by body = body.substring(...)

I think you could make your life easier by tokenizing the
input and parsing the file at the token level.
This would, for instance, get rid of the multiple .trim() calls
all over the place, since ignoring white space would be
the job of the tokenizer and thus handled in one place.

An example of a tokenizer is the class java.io.StreamTokenizer.
It has two main methods:

  • .nextToken()
  • .pushBack()

.pushBack() pushes the last token returned by .nextToken() back
onto the stream so that it will be returned again by the
next .nextToken().
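
To make the interplay concrete, here is a small sketch that reads a token, pushes it back, and reads it again. The class and method names (PushBackDemo, peekThenRead) are invented for the example. It uses the default StreamTokenizer syntax, where runs of letters are returned as word tokens in sval and white space is skipped.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not taken from the reviewed project.
final class PushBackDemo {

    // Reads the first token, un-reads it, reads it again, then reads the
    // following token. Returns the three values joined with '|'.
    static String peekThenRead(String input) {
        try {
            StreamTokenizer st = new StreamTokenizer(new StringReader(input));

            st.nextToken();          // first token
            String peeked = st.sval;

            st.pushBack();           // next nextToken() returns the same token
            st.nextToken();
            String again = st.sval;

            st.nextToken();          // now the tokenizer moves on
            String following = st.sval;

            return peeked + "|" + again + "|" + following;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling peekThenRead("  author   title") gives "author|author|title". This one-token lookahead is what a hand-written parser needs to decide which rule to apply next.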

A tokenizer makes it easy to manually write a
recursive descent parser for simple grammars
like that of a BibTeX file.
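
As a rough sketch of what that could look like, the class below parses a single, heavily simplified entry with StreamTokenizer. Everything in it is an assumption made for illustration: the name TinyBibParser, the "@type" and "@key" map keys, and the grammar, which only accepts quoted or bare-word values (no braced values, no @string, no concatenation with #). Note also that StreamTokenizer interprets backslash escapes inside quoted strings, which would get in the way of real LaTeX content.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the reviewed code and not a full BibTeX parser.
final class TinyBibParser {
    private final StreamTokenizer st;

    private TinyBibParser(Reader in) {
        st = new StreamTokenizer(in);
        st.resetSyntax();               // start with every char "ordinary"
        st.whitespaceChars(0, ' ');     // white space is skipped in one place
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');         // so 1984 comes back as the word "1984"
        st.wordChars('_', '_');
        st.wordChars('-', '-');
        st.quoteChar('"');              // "..." is returned as a single token
    }

    // Parses one entry; the entry type and citation key are stored under
    // the made-up keys "@type" and "@key".
    static Map<String, String> parseEntry(String text) {
        try {
            return new TinyBibParser(new StringReader(text)).entry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // entry := '@' WORD '{' WORD (',' field)* [','] '}'
    private Map<String, String> entry() throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        expect('@');
        result.put("@type", word());
        expect('{');
        result.put("@key", word());
        while (st.nextToken() == ',') {
            if (st.nextToken() == '}') {
                return result;          // trailing comma before the closing brace
            }
            st.pushBack();              // not '}', so it starts a field
            field(result);
        }
        st.pushBack();                  // re-read the token that ended the loop
        expect('}');
        return result;
    }

    // field := WORD '=' (QUOTED | WORD)
    private void field(Map<String, String> out) throws IOException {
        String name = word();
        expect('=');
        int type = st.nextToken();
        if (type != '"' && type != StreamTokenizer.TT_WORD) {
            throw error("a value");
        }
        out.put(name, st.sval);
    }

    private String word() throws IOException {
        if (st.nextToken() != StreamTokenizer.TT_WORD) {
            throw error("a word");
        }
        return st.sval;
    }

    private void expect(char c) throws IOException {
        if (st.nextToken() != c) {
            throw error("'" + c + "'");
        }
    }

    private IllegalArgumentException error(String expected) {
        return new IllegalArgumentException("Expected " + expected + " but found " + st);
    }
}
```

Each grammar rule becomes one method, and there is not a single .trim() or .substring() call in the parser itself.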

Here is a blog post on writing a recursive descent parser
with StreamTokenizer to parse boolean expressions:

https://unnikked.ga/how-to-evaluate-a-boolean-expression/

Update

Here is an example of how to roll your own parser - written in Python:

https://gist.github.com/erantapaa/5a2614adde0526d25c03

Context

StackExchange Code Review Q#106055, answer score: 3
