Small program to download Wikipedia articles as PDFs

Problem

I made a small app to download Wikipedia articles (and, optionally, the articles they link to) as PDFs to take on the go. As a next step, I'd eventually like to add a text-to-speech option that saves the article as an MP3. Any thoughts would be helpful.

import wikipedia
import pdfkit
import os
from gtts import gTTS

class Page:

    def __init__(self):
        """Set default pdfkit options"""
        self.pdfOptions = {
            'page-size': 'Letter',
            'margin-top': '0.75in',
            'margin-right': '0.75in',
            'margin-bottom': '0.75in',
            'margin-left': '0.75in',
            'javascript-delay' : 2000,
            'minimum-font-size': 512
        }

        self.targetDir = os.path.dirname(os.path.realpath(__file__))
        self.includeLinks = False

    def getArticle(self, articleTitle):
        """fetch the article from wiki by title"""
        self.page = wikipedia.page(articleTitle)
        try:
            self.page.summary
        except wikipedia.exceptions.DisambiguationError as e:
            print "Multiple articles with that name: " + e.options

    def setURL(self,URL):
        """fetch the article by URL"""
        pass

    def download(self):
        """download the article (and maybe the articles it links to"""
        if self.includeLinks == False:
            filename = self.targetDir+"/"+self.page.title+'.pdf'
            pdfkit.from_url(self.page.url, filename, options = self.pdfOptions)
        else:
            for link in self.page.links:
                linkedPage = wikipedia.page(link)
                print "Downloading " + linkedPage.url
                filename = self.targetDir+"/"+linkedPage.title+'.pdf'
                pdfkit.from_url(linkedPage.url, filename, options=self.pdfOptions)

    def speak(self):
        pass

Solution

Here are some style and code-organization points:

  • fix the variable naming: PEP8 recommends the lower_case_with_underscores style for variable and method names, e.g. pdf_options and get_article() instead of pdfOptions and getArticle()

  • organize your imports as per the PEP8 guidelines: first the standard-library imports, then third-party ones, then your "local" imports. Also, remove the unused from gtts import gTTS import:

import os

import pdfkit
import wikipedia


  • docstrings should start with a capital letter and end with a period

  • I think you should define pdfOptions as a module-level (or configuration-layer-level) constant instead of defining it as an instance attribute

  • if self.includeLinks == False: can be simplified to if not self.includeLinks:

  • you should have spaces around the operators inside expressions (also a PEP8 recommendation)

  • I think you can handle both cases in your download() method in a unified way:

def download(self):
    # Work with titles in both cases: wikipedia.page() expects a title,
    # and self.page.links is a list of titles.
    titles = self.page.links if self.includeLinks else [self.page.title]
    for title in titles:
        page = wikipedia.page(title)
        print("Downloading " + page.url)
        filename = self.targetDir + "/" + page.title + '.pdf'
        pdfkit.from_url(page.url, filename, options=self.pdfOptions)


  • I would also use os.path.join() instead of string-concatenating filename paths
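
Putting the naming, module-level constant and os.path.join() points together, a minimal sketch might look like the one below. The names PDF_OPTIONS, pdf_path and titles_to_download are hypothetical and not part of the original code; the option values are copied from the question. Replacing "/" in titles is an added assumption, made because a title such as "AC/DC" would otherwise be read as a subdirectory.

```python
import os

# Module-level constant instead of a per-instance attribute
# (values copied unchanged from the question).
PDF_OPTIONS = {
    'page-size': 'Letter',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'javascript-delay': 2000,
    'minimum-font-size': 512,
}


def pdf_path(target_dir, title):
    """Return the output path of the PDF for an article title."""
    # Assumption: "/" in a title is replaced so that the title
    # cannot introduce an extra directory level.
    safe_title = title.replace('/', '_')
    return os.path.join(target_dir, safe_title + '.pdf')


def titles_to_download(page_title, linked_titles, include_links):
    """Return the linked titles if requested, else only the page title."""
    return list(linked_titles) if include_links else [page_title]
```

In download(), the loop could then iterate over titles_to_download(self.page.title, self.page.links, self.include_links) and call pdfkit.from_url(page.url, pdf_path(self.target_dir, page.title), options=PDF_OPTIONS).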

Context

StackExchange Code Review Q#160784, answer score: 4
