HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Get Wikimedia attributions for images

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
attributionsgetforimageswikimedia

Problem

We are using images from Wikimedia commons for some of Cardshifter's game artwork. I selected some art for each card to make a .jpg file to use in the game clients, and added a URL to the original file as a comment for reference. This is what one card looks like: (with our DSL made from Groovy)

// https://en.wikipedia.org/wiki/Sun_Wukong
// https://commons.wikimedia.org/wiki/File:Sun_Wukong_and_Jade_Rabbit.jpg
card('MONKING') {
creature 'Chinese Hero'
flavor 'Monkey King Warrior of Immense Strength.'
imagePath 'mythos/chinese/monking.jpg'
maxInDeck 2
health 5
sickness 0
manaCost 15
attack 5
denyCounterAttack()
onEndOfTurn {
change ATTACK by 1 on {
thisCard()
}
}
}

However, the problem I was facing is that while I was gathering images, I forgot to also get the attribution information for each image. So, I wrote a Python script and a few helper classes to do this work.

The script uses Selenium webdriver to control Firefox (as suggested by @jacwah), along with some read & write instructions and a bit of regex to isolate URLs to be used by Selenium to automate getting the information from Wikimedia.

In the resulting output files, the above card's header will look like this with the added attribution text (or an error if Selenium is unable to resolve it, by design)

// https://en.wikipedia.org/wiki/Sun_Wukong
// https://commons.wikimedia.org/wiki/File:Sun_Wukong_and_Jade_Rabbit.jpg
// Attribution: By Yoshitoshi Tsukioka (http://www.japaneseprints.net/viewitem.cfm?ID=2182) [Public domain], via Wikimedia Commons// License: Public Domain
card('MONKING') {
creature 'Chinese Hero'
flavor 'Monkey King Warrior of Immense Strength.'
imagePath 'mythos/chinese/monking.jpg'
...

This is my first rodeo with Python, please help me improve any aspect of the code that you feel is not "Pythonic", not efficient, or just plain terrible.

Note: The script takes about 5 minutes to run, keeping in min

Solution

Python

I know you're used to Java, which is probably why you've put everything in classes in separate files. @SuperBiasedMan commented on this, but I want to stress this.

Python is not a strictly object oriented language like Java, you can mix different approaches as you see fit. ListOfFilesInDirectory is a good candidate to be a function, because it doesn't need any internal state except parameters. See Classes vs. Functions.

Object initialization vs creation

def __init__(self, directory, extension=".*"):
    """
    Get a list of file names from a directory.
        :param directory: Where to look for files
        :param extension: The desired file extension (optional)
        :return: list of file names as strings
        :rtype: object
    """


A note: __init__ doesn't return anything. This method initializes an already created object. The method that creates objects is called __new__, but there's seldom a reason to touch it.

Filename matching

Python comes with a module for filename matching called fnmatch. It supports so-called glob patterns like *.cardset, which would match any file with the .cardset extension. Example from the docs:

import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        print file


I suggest that instead of keeping track of dots and extensions you take a fnmatch pattern as an argument to your function returning file names.

Arrow shaped code

WikimediaAttributionsFromFilesInDirectory.py is a bit hard to read due to the high level of indentation and nested loops. I think you should extract each level of the loop into its own function: handle_file(name) and handle_line.

You can also extract the print statements into a function like write_comment(str, stream) where stream is either the output file or sys.stdout (which print writes to behind-the-scenes).

get_attribution(url, driver) could also be a function that encapsulates the following logic:

firefox.get(url) 
firefox.find_element_by_css_selector("a[title=\"Use this file on the web\"]").click()
firefox.find_element_by_id("stockphoto_attribution_html").click()
attribution_text = firefox.find_element_by_id("stockphoto_attribution").get_attribute("value")


Variable naming

I think this is generally good, except for firefox. Names should explain what a variable is used for, not how it does it. I consider the fact that the web browser being emulated is Firefox an implementation detail, and something that you might want to change. Using webdriver.Chrome would probably give the exact same effects, but then the name wouldn't make sense any more. I would name the variable driver or browser.

Code Snippets

def __init__(self, directory, extension=".*"):
    """
    Get a list of file names from a directory.
        :param directory: Where to look for files
        :param extension: The desired file extension (optional)
        :return: list of file names as strings
        :rtype: object
    """
import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        print file
firefox.get(url) 
firefox.find_element_by_css_selector("a[title=\"Use this file on the web\"]").click()
firefox.find_element_by_id("stockphoto_attribution_html").click()
attribution_text = firefox.find_element_by_id("stockphoto_attribution").get_attribute("value")

Context

StackExchange Code Review Q#104197, answer score: 6

Revisions (0)

No revisions yet.