patternpythonMinor
Get Wikimedia attributions for images
Viewed 0 times
attributionsgetforimageswikimedia
Problem
We are using images from Wikimedia commons for some of Cardshifter's game artwork. I selected some art for each card to make a
// https://en.wikipedia.org/wiki/Sun_Wukong
// https://commons.wikimedia.org/wiki/File:Sun_Wukong_and_Jade_Rabbit.jpg
card('MONKING') {
creature 'Chinese Hero'
flavor 'Monkey King Warrior of Immense Strength.'
imagePath 'mythos/chinese/monking.jpg'
maxInDeck 2
health 5
sickness 0
manaCost 15
attack 5
denyCounterAttack()
onEndOfTurn {
change ATTACK by 1 on {
thisCard()
}
}
}
However, the problem I was facing is that while I was gathering images, I forgot to also get the attribution information for each image. So, I wrote a Python script and a few helper classes to do this work.
The script uses Selenium webdriver to control Firefox (as suggested by @jacwah), along with some read & write instructions and a bit of regex to isolate URLs to be used by Selenium to automate getting the information from Wikimedia.
In the resulting output files, the above card's header will look like this with the added attribution text (or an error if Selenium is unable to resolve it, by design)
// https://en.wikipedia.org/wiki/Sun_Wukong
// https://commons.wikimedia.org/wiki/File:Sun_Wukong_and_Jade_Rabbit.jpg
// Attribution: By Yoshitoshi Tsukioka (http://www.japaneseprints.net/viewitem.cfm?ID=2182) [Public domain], via Wikimedia Commons// License: Public Domain
card('MONKING') {
creature 'Chinese Hero'
flavor 'Monkey King Warrior of Immense Strength.'
imagePath 'mythos/chinese/monking.jpg'
...
This is my first rodeo with Python, please help me improve any aspect of the code that you feel is not "Pythonic", not efficient, or just plain terrible.
Note: The script takes about 5 minutes to run, keeping in min
.jpg file to use in the game clients, and added a URL to the original file as a comment for reference. This is what one card looks like: (with our DSL made from Groovy)// https://en.wikipedia.org/wiki/Sun_Wukong
// https://commons.wikimedia.org/wiki/File:Sun_Wukong_and_Jade_Rabbit.jpg
card('MONKING') {
creature 'Chinese Hero'
flavor 'Monkey King Warrior of Immense Strength.'
imagePath 'mythos/chinese/monking.jpg'
maxInDeck 2
health 5
sickness 0
manaCost 15
attack 5
denyCounterAttack()
onEndOfTurn {
change ATTACK by 1 on {
thisCard()
}
}
}
However, the problem I was facing is that while I was gathering images, I forgot to also get the attribution information for each image. So, I wrote a Python script and a few helper classes to do this work.
The script uses Selenium webdriver to control Firefox (as suggested by @jacwah), along with some read & write instructions and a bit of regex to isolate URLs to be used by Selenium to automate getting the information from Wikimedia.
In the resulting output files, the above card's header will look like this with the added attribution text (or an error if Selenium is unable to resolve it, by design)
// https://en.wikipedia.org/wiki/Sun_Wukong
// https://commons.wikimedia.org/wiki/File:Sun_Wukong_and_Jade_Rabbit.jpg
// Attribution: By Yoshitoshi Tsukioka (http://www.japaneseprints.net/viewitem.cfm?ID=2182) [Public domain], via Wikimedia Commons// License: Public Domain
card('MONKING') {
creature 'Chinese Hero'
flavor 'Monkey King Warrior of Immense Strength.'
imagePath 'mythos/chinese/monking.jpg'
...
This is my first rodeo with Python, please help me improve any aspect of the code that you feel is not "Pythonic", not efficient, or just plain terrible.
Note: The script takes about 5 minutes to run, keeping in min
Solution
Python
I know you're used to Java, which is probably why you've put everything in classes in separate files. @SuperBiasedMan commented on this, but I want to stress this.
Python is not a strictly object oriented language like Java, you can mix different approaches as you see fit.
Object initialization vs creation
A note:
Filename matching
Python comes with a module for filename matching called
I suggest that instead of keeping track of dots and extensions you take a
Arrow shaped code
You can also extract the
Variable naming
I think this is generally good, except for
I know you're used to Java, which is probably why you've put everything in classes in separate files. @SuperBiasedMan commented on this, but I want to stress this.
Python is not a strictly object oriented language like Java, you can mix different approaches as you see fit.
ListOfFilesInDirectory is a good candidate to be a function, because it doesn't need any internal state except parameters. See Classes vs. Functions.Object initialization vs creation
def __init__(self, directory, extension=".*"):
"""
Get a list of file names from a directory.
:param directory: Where to look for files
:param extension: The desired file extension (optional)
:return: list of file names as strings
:rtype: object
"""A note:
__init__ doesn't return anything. This method initializes an already created object. The method that creates objects is called __new__, but there's seldom a reason to touch it.Filename matching
Python comes with a module for filename matching called
fnmatch. It supports so-called glob patterns like *.cardset, which would match any file with the .cardset extension. Example from the docs:import fnmatch
import os
for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*.txt'):
print fileI suggest that instead of keeping track of dots and extensions you take a
fnmatch pattern as an argument to your function returning file names.Arrow shaped code
WikimediaAttributionsFromFilesInDirectory.py is a bit hard to read due to the high level of indentation and nested loops. I think you should extract each level of the loop into its own function: handle_file(name) and handle_line. You can also extract the
print statements into a function like write_comment(str, stream) where stream is either the output file or sys.stdout (which print writes to behind-the-scenes).get_attribution(url, driver) could also be a function that encapsulates the following logic:firefox.get(url)
firefox.find_element_by_css_selector("a[title=\"Use this file on the web\"]").click()
firefox.find_element_by_id("stockphoto_attribution_html").click()
attribution_text = firefox.find_element_by_id("stockphoto_attribution").get_attribute("value")Variable naming
I think this is generally good, except for
firefox. Names should explain what a variable is used for, not how it does it. I consider the fact that the web browser being emulated is Firefox an implementation detail, and something that you might want to change. Using webdriver.Chrome would probably give the exact same effects, but then the name wouldn't make sense any more. I would name the variable driver or browser.Code Snippets
def __init__(self, directory, extension=".*"):
"""
Get a list of file names from a directory.
:param directory: Where to look for files
:param extension: The desired file extension (optional)
:return: list of file names as strings
:rtype: object
"""import fnmatch
import os
for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*.txt'):
print filefirefox.get(url)
firefox.find_element_by_css_selector("a[title=\"Use this file on the web\"]").click()
firefox.find_element_by_id("stockphoto_attribution_html").click()
attribution_text = firefox.find_element_by_id("stockphoto_attribution").get_attribute("value")Context
StackExchange Code Review Q#104197, answer score: 6
Revisions (0)
No revisions yet.