Get bibtex entries from metadata of PDF files
Problem
When I have to write a report or an article, I usually have my bibliography as PDFs in a specific folder, so I wrote this script to automate the generation of bibtex entries.
It works so far, but I'd like to have suggestions about what I could improve
#!/usr/bin/env bash
FOLDER=$1
BIBTEX_FILE=$2
cd $FOLDER
for f in *; do
echo "Examining $f"
TITLE=`pdfinfo $f | grep Title | cut -c17-`
echo "Title found in PDF metadata is : $TITLE"
case "$TITLE" in
doi*)
echo "This is a DOI, nice!"
DOI=`echo $TITLE | cut -c5-`;;
*)
case "$TITLE" in
"untitled")
echo "Let's see what we can find in the text"
DOI=`pdftotext $f /tmp/pdf | cat /tmp/pdf | grep -i 'doi\|Digital Object Identifier' | awk 'NF>1{print $NF}'`;;
*)
echo "Let's find a DOI from this title"
TITLE=`echo $TITLE | sed 's/ /%20/g'`
URL="http://api.crossref.org/works?query=$TITLE&rows=1"
DOI=`curl -s $URL | jq '.message.items[].DOI' | sed 's/"//g'`;;
esac
esac
echo "The DOI is $DOI"
echo "----------------------------------"
URL="http://api.crossref.org/works/$DOI/transform/application/x-bibtex"
curl -s $URL >> $BIBTEX_FILE
echo >> $BIBTEX_FILE
echo >> $BIBTEX_FILE
done
Solution
Always double-quote variables if they might contain spaces
These commands will fail if the variables contain spaces:
cd $FOLDER
# ...
echo >> $BIBTEX_FILE
Even if you never intend to use your script with paths containing spaces,
it's good to make it a habit to double-quote variables that might contain spaces:
cd "$FOLDER"
# ...
echo >> "$BIBTEX_FILE"
Error checking
What will happen if this command fails?
cd $FOLDER
If this fails, the program will happily continue,
download a likely invalid URL with curl,
and create a bogus $BIBTEX_FILE,
essentially putting garbage in the current directory.
Look for possible points of failure where the program should abort.
If something goes wrong with this cd command, you definitely don't want execution to continue.
Here's a simple way to abort:
cd "$FOLDER" || exit 1
Here's a more user-friendly way:
if ! cd "$FOLDER"; then
echo "error: could not cd into $FOLDER" >&2
exit 1
fi
Use the modern $(...) instead of the obsolete `...`
For example:
TITLE=$(pdfinfo "$f" | grep Title | cut -c17-)
Notice that I double-quoted $f, as it should be.
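To see why the quotes around $f matter, here is a minimal sketch. The helper count_args, the sample file name, and the /tmp path are made up for illustration and are not part of the original script. It also shows that $(...) nests without escaping, whereas nested backticks must be backslash-escaped.

```bash
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

# Hypothetical file name containing a space.
f="my paper.pdf"

# Unquoted, $f is split on whitespace into two separate arguments.
unquoted=$(count_args $f)

# Quoted, "$f" is passed as a single argument.
quoted=$(count_args "$f")

# $(...) nests as-is; the inner substitution needs no escaping.
parent=$(basename "$(dirname "/tmp/my papers/$f")")

echo "unquoted: $unquoted, quoted: $quoted, parent: $parent"
```

A command such as pdfinfo would see two nonexistent files in the unquoted case, which is exactly the failure the quotes prevent.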
Bogus pipeline
This is strange:
pdftotext $f /tmp/pdf | cat /tmp/pdf | ...
In a pipeline, the output of one command is usually passed as input to the next through the stdout and stdin file handles. That's not what happens here: the data is passed through a file. This is unusual and confusing.
Also confusing is to name a text file /tmp/pdf.
You don't need a temporary file here,
pdftotext can produce output on stdout,
and then you don't need cat either:
pdftotext "$f" - | ...
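As a rough sketch of the stdout-based approach, the text search can live in a small function that reads stdin. The function name extract_doi, the head -n 1 step, and the sample text are assumptions for illustration. The grep and awk filters follow the original script, with grep -E used for the alternation.

```bash
#!/usr/bin/env bash
# extract_doi (hypothetical name): reads text on stdin and prints the last
# field of the first multi-field line that mentions a DOI.
extract_doi() {
    grep -iE 'doi|Digital Object Identifier' \
        | awk 'NF > 1 { print $NF }' \
        | head -n 1
}

# Sample text standing in for the output of: pdftotext "$f" -
sample_doi=$(printf '%s\n' 'Some Journal, 2016' 'DOI: 10.1000/xyz123' | extract_doi)
echo "$sample_doi"
```

In the script this would be called as pdftotext "$f" - | extract_doi, with no temporary file and no cat.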
Removing double-quotes from the output of jq
Instead of this:
curl -s $URL | jq '.message.items[].DOI' | sed 's/"//g'
You can remove the double-quotes by using the --raw-output flag:
curl -s "$URL" | jq -r '.message.items[].DOI'
-r is a shortcut for --raw-output.
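The difference is easy to check offline. The sketch below uses a hand-written JSON string that mimics only the part of the response the script reads. The DOI value is a placeholder, not real Crossref output, and jq is assumed to be installed.

```bash
#!/usr/bin/env bash
# Hand-written JSON with the same shape as the fields the script accesses.
json='{"message":{"items":[{"DOI":"10.1000/xyz123"}]}}'

# Default output is JSON-encoded, so the string keeps its double quotes.
with_quotes=$(printf '%s' "$json" | jq '.message.items[].DOI')

# With -r the raw string is printed, without the quotes.
raw=$(printf '%s' "$json" | jq -r '.message.items[].DOI')

echo "default: $with_quotes"
echo "raw:     $raw"
```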
Notice also, once again, that I added the necessary double-quotes around $URL.
Redirecting multiple statements
This kind of repetition quickly becomes annoying:
curl -s $URL >> $BIBTEX_FILE
echo >> $BIBTEX_FILE
echo >> $BIBTEX_FILE
You can dispense with that using grouping:
{
curl -s "$URL"
echo
echo
} >> "$BIBTEX_FILE"
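To try the grouping without touching the network, here is a small sketch. The function fetch_entry is a hypothetical stand-in for curl -s "$URL", and the output goes to a temporary file rather than a real bibliography.

```bash
#!/usr/bin/env bash
# Temporary output file, so the demo does not touch real files.
out=$(mktemp) || exit 1

# fetch_entry is a hypothetical stand-in for: curl -s "$URL"
fetch_entry() { echo '@article{example, title = {Example}}'; }

# All three commands share one redirection, and the file is opened once.
{
    fetch_entry
    echo
    echo
} >> "$out"

# Count the lines written: one entry line plus two blank lines.
line_count=$(wc -l < "$out" | tr -d ' ')
echo "lines written: $line_count"
rm -f "$out"
```

Besides removing the repetition, the group leaves a single place to change if the output destination ever changes.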
Context
StackExchange Code Review Q#147350, answer score: 4