Get bibtex entries from metadata of PDF files

Submitted by: @import:stackexchange-codereview

Problem

When I have to write a report or an article, I usually have my bibliography as PDFs in a specific folder, so I wrote this script to automate the generation of bibtex entries.

It works so far, but I'd like suggestions on what I could improve.

#!/usr/bin/env bash

FOLDER=$1
BIBTEX_FILE=$2

cd $FOLDER
for f in *; do
    echo "Examining $f"
    TITLE=`pdfinfo $f | grep Title | cut -c17-`

    echo "Title found in PDF metadata is : $TITLE"

    case "$TITLE" in
        doi*)
            echo "This is a DOI, nice!"
            DOI=`echo $TITLE | cut -c5-`;;
        *)
            case "$TITLE" in
                "untitled")
                    echo "Let's see what we can find in the text"
                    DOI=`pdftotext $f /tmp/pdf | cat /tmp/pdf | grep -i 'doi\|Digital Object Identifier' | awk 'NF>1{print $NF}'`;;
                *)
                    echo "Let's find a DOI from this title"
                    TITLE=`echo $TITLE | sed 's/ /%20/g'`
                    URL="http://api.crossref.org/works?query=$TITLE&rows=1"
                    DOI=`curl -s $URL | jq '.message.items[].DOI' | sed 's/"//g'`;;
            esac
    esac

    echo "The DOI is $DOI"
    echo "----------------------------------"
    URL="http://api.crossref.org/works/$DOI/transform/application/x-bibtex"
    curl -s $URL >> $BIBTEX_FILE
    echo >> $BIBTEX_FILE
    echo >> $BIBTEX_FILE
done

Solution

Always double-quote variables if they might contain spaces

These commands will fail if the variables contain spaces:

cd $FOLDER

# ...

echo >> $BIBTEX_FILE


Even if you never intend to use your script with paths containing spaces,
it's good to make it a habit to double-quote variables that might contain spaces:

cd "$FOLDER"

# ...

echo >> "$BIBTEX_FILE"
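To see what the quotes change, here is a minimal sketch that only counts arguments. The folder name and the count_args helper are made up for this illustration, and nothing is created on disk:

```bash
#!/usr/bin/env bash

# Hypothetical folder name containing a space; it does not need to exist.
FOLDER="my papers"

# Demo helper (not part of the original script): print the number of arguments received.
count_args() { echo "$#"; }

unquoted=$(count_args $FOLDER)     # word splitting yields two words: "my" and "papers"
quoted=$(count_args "$FOLDER")     # the value stays a single argument

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

With the unquoted form, cd would receive two arguments instead of one path, which is why the command fails.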


Error checking

What will happen if this command fails?

cd $FOLDER


If this fails, the program will happily continue,
download a likely invalid URL with curl,
and create a bogus $BIBTEX_FILE,
essentially putting garbage in the current directory.

Look for possible points of failure where the program should abort.
If something goes wrong with this cd command you definitely don't want execution to continue.
Here's a simple way to abort:

cd "$FOLDER" || exit 1


Here's a more user-friendly way:

if ! cd "$FOLDER"; then
    # Send the error message to stderr so it is not mixed with normal output
    echo "error: could not cd into $FOLDER" >&2
    exit 1
fi
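The cd is probably not the only point of failure. As a rough sketch of the same idea applied elsewhere, the two helpers below check the arguments and the external tools before any work is done. Both function names are hypothetical and not part of the original script:

```bash
#!/usr/bin/env bash

# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Hypothetical helper: expect exactly two arguments, the first an existing directory.
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: $0 FOLDER BIBTEX_FILE" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "error: $1 is not a directory" >&2
        return 1
    fi
}
```

In the script, these could be called near the top, for example check_args "$@" || exit 1, followed by a test that missing_tools pdfinfo pdftotext curl jq prints nothing.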


Use the modern $(...) instead of the obsolete `...`

For example:

TITLE=$(pdfinfo "$f" | grep Title | cut -c17-)


Notice that I also double-quoted $f, as it should be.
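One practical reason to prefer $(...) is nesting. The sketch below uses a made-up path that does not need to exist, and compares the two forms:

```bash
#!/usr/bin/env bash

# Hypothetical path, used only for illustration.
pdf="/tmp/papers/article.pdf"

# $(...) nests cleanly, and each level can be quoted on its own.
parent_new=$(basename "$(dirname "$pdf")")

# The backtick form needs the inner backticks escaped with backslashes.
parent_old=`basename \`dirname "$pdf"\``

echo "new style: $parent_new, old style: $parent_old"
```

Both forms give the same result here, but the escaping in the backtick version gets harder to read with every extra level.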

Bogus pipeline

This is strange:

pdftotext $f /tmp/pdf | cat /tmp/pdf | ...


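In a pipeline, the output of one command is usually passed as input to the next through the stdout and stdin file handles. That's not what happens here: the data is passed through a file, and the pipe itself carries nothing. This is unusual and confusing. It is also fragile: the commands in a pipeline run concurrently, so cat may read /tmp/pdf before pdftotext has finished writing it.
Naming a text file /tmp/pdf is confusing too.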

You don't need a temporary file here:
pdftotext writes to stdout when the output file is given as -,
and then you don't need cat either:

pdftotext "$f" - | ...
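To illustrate data flowing through stdin rather than a file, here is a small sketch that wraps the original grep and awk filter in a function. The function name find_doi and the sample text are assumptions made for this example; in the real script, the input would come from pdftotext "$f" -.

```bash
#!/usr/bin/env bash

# Hypothetical helper: read text on stdin and print the last field
# of every line that mentions a DOI (same filter as in the original script).
find_doi() {
    grep -iE 'doi|Digital Object Identifier' | awk 'NF>1{print $NF}'
}

# Made-up sample text standing in for the output of: pdftotext "$f" -
sample_text='An Example Article
DOI: 10.1000/xyz123
Some body text.'

doi=$(printf '%s\n' "$sample_text" | find_doi)
echo "The DOI is $doi"
```

Because the filter only reads stdin, it can be tried on any text without touching /tmp.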


Removing double-quotes from the output of jq

Instead of this:

curl -s $URL | jq '.message.items[].DOI' | sed 's/"//g'


You can remove the double-quotes by using the --raw-output flag:

curl -s "$URL" | jq -r '.message.items[].DOI'


-r is a shortcut for --raw-output.
Notice also, once again, that I added the necessary double-quotes around $URL.
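The difference can be tried offline. The sketch below assumes jq is installed and uses a made-up, heavily trimmed JSON document in place of a real Crossref response:

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for the JSON returned by the API (not a real response).
response='{"message":{"items":[{"DOI":"10.1000/xyz123"}]}}'

# Default output is JSON-encoded, so the string keeps its double-quotes.
quoted=$(printf '%s\n' "$response" | jq '.message.items[].DOI')

# With -r (--raw-output) jq prints the string itself.
raw=$(printf '%s\n' "$response" | jq -r '.message.items[].DOI')

echo "without -r: $quoted"
echo "with -r:    $raw"
```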

Redirecting multiple statements

This kind of repetition quickly becomes annoying:

curl -s $URL >> $BIBTEX_FILE
echo >> $BIBTEX_FILE
echo >> $BIBTEX_FILE


You can dispense with that using grouping:

{
    curl -s "$URL"
    echo
    echo
} >> "$BIBTEX_FILE"
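As a quick check that the grouped form appends everything to the same file, here is a sketch that swaps curl for a hypothetical fetch_entry function, so it runs without network access. The entry keys are made up as well:

```bash
#!/usr/bin/env bash

# Temporary file standing in for the real bibliography file.
BIBTEX_FILE=$(mktemp)

# Hypothetical stand-in for: curl -s "$URL"
fetch_entry() { printf '@misc{%s}\n' "$1"; }

for key in first second; do
    {
        fetch_entry "$key"
        echo
        echo
    } >> "$BIBTEX_FILE"
done

# Each iteration appends one entry line and two blank lines.
line_count=$(wc -l < "$BIBTEX_FILE" | tr -d ' ')
echo "$line_count lines written"
rm -f "$BIBTEX_FILE"
```

The redirection is written once per group, so adding another command inside the braces does not require another >> "$BIBTEX_FILE".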

Context

StackExchange Code Review Q#147350, answer score: 4
