Parse Twine HTML to JSON
Problem
For those who don't know, Twine is a simple tool for making interactive fiction. It lets you easily create a series of passages that are hyperlinked to each other, making a choose-your-own-adventure style structure. It exports to HTML, but if you just want to use Twine to write nodes to use elsewhere, it lacks any other export format. I thought JSON would be a more valuable format to use, so I decided to make this parser.
The source data is a bit of a mess though, here's how it looks:
[[Passage B]]
[[Go to passage C|Passage C]]This is passage B
[[Passage B]]
[[Passage A]] This passage goes nowhere.
In case it's not clear (as it wasn't to me at first), the line breaks only occur where the actual passage text contains newline characters. Otherwise all the tags just run on and on, on the same line. This is not at all ideal for parsing, especially if I want to read line by line. So the first step of the process is calling my reformat_html function, which separates tags to one per line and puts passages on a line by themselves:
[[Passage B]]
[[Go to passage C|Passage C]]
This is passage B
[[Passage B]]
[[Passage A]]
This passage goes nowhere.
Now I can easily read it line by line, parsing the key-value pairs from starting tags, parsing the passage text separately from the tags, and knowing when each tag is closed. This tidied-up HTML can now be read into JSON with my read_as_json function, producing this:
{
    "style": {
        "type": "text/twine-css",
        "role": "stylesheet",
        "id": "twine-user-stylesheet"
    },
    "script": {
        "type": "text/twine-javascript",
        "role": "script",
        "id": "twine-user-script"
    },
    "tw-passagedata": [
        {
            "position": "197,62",
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]\n",
            "pid": "1",
            "name": "Passage_A",
            "tags": ""
        },
        {
            "position": "114,225
Solution
Don't reinvent the wheel. You want to parse HTML/XML, use an HTML/XML parser. No matter how tricky the layout seems, as long as well-formed data are fed into them, they should handle it. It’s their job.
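To illustrate that point, here is a minimal sketch showing that an XML parser does not care whether the markup is spread over many lines or crammed onto one. The SAMPLE string is a made-up, cut-down stand-in for a Twine export (two passages, with attribute values borrowed from the example data), not the real file.

```python
import xml.etree.ElementTree as ETree

# Hypothetical, cut-down stand-in for a Twine export: all tags run together,
# and the only newline is the one inside the first passage's text.
SAMPLE = (
    '<tw-storydata name="Sample" startnode="1">'
    '<tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">'
    '[[Passage B]]\n[[Go to passage C|Passage C]]'
    '</tw-passagedata>'
    '<tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">'
    'This is passage B'
    '</tw-passagedata>'
    '</tw-storydata>'
)

# The parser finds tag boundaries itself, so no line-by-line tidying is needed.
root = ETree.fromstring(SAMPLE)
passages = [(p.get('name'), p.text) for p in root.iter('tw-passagedata')]
for name, text in passages:
    print(name, repr(text))
```

The newline inside the first passage survives as part of its text, which is exactly the information the line-splitting step was trying to preserve.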
Based on your example input, I'll assume that Twine produces well-formed XML files. You can therefore get rid of your custom tag splitting/parsing and use the parser of your choice.
For instance, xml.etree.ElementTree ships with the standard library. You can use it to parse your files like this:
import xml.etree.ElementTree as ETree

inpath = r'Sample Data\TwineInput.html'
xml = ETree.parse(inpath)
# Iterate over the direct children of the root tw-storydata element
for element in xml.getroot():
    print(element.tag, element.attrib)

which prints:
style {'role': 'stylesheet', 'id': 'twine-user-stylesheet', 'type': 'text/twine-css'}
script {'role': 'script', 'id': 'twine-user-script', 'type': 'text/twine-javascript'}
tw-passagedata {'position': '197,62', 'name': 'Passage_A', 'pid': '1', 'tags': ''}
tw-passagedata {'position': '114,225', 'name': 'Passage_B', 'pid': '2', 'tags': 'tag-2'}
tw-passagedata {'position': '314,225', 'name': 'Passage_C', 'pid': '3', 'tags': 'tag-1 tag-2'}
Pretty close to what you are looking for.
The next thing to do is to handle multiple tw-passagedata tags, add a text attribute to each of them, handle the root tw-storydata tag and, possibly, reject duplicate tags with your MULTIPLE_TAG_ERROR message:
import xml.etree.ElementTree as ETree
from json import dump

PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"


def parse_twine_tag(element, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError.
    """
    tagname = element.tag
    attributes = element.attrib
    if tagname == PASSAGE_TAG:
        # Passage text lives in the element body, not in an attribute
        attributes['text'] = element.text
        data.setdefault(PASSAGE_TAG, []).append(attributes)
    elif tagname in data:
        raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
    else:
        data[tagname] = attributes
    # Recurse so that the children of tw-storydata are handled too
    for child in element:
        parse_twine_tag(child, data)


def parse_twine_file(filepath):
    """Return a dictionary of data from the parsed file at filepath"""
    xml = ETree.parse(filepath)
    data = dict()
    parse_twine_tag(xml.getroot(), data)
    return data


if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'
    data = parse_twine_file(inpath)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

outpath, as expected, contains:
{
    "style": {
        "role": "stylesheet",
        "id": "twine-user-stylesheet",
        "type": "text/twine-css"
    },
    "tw-passagedata": [
        {
            "position": "197,62",
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]",
            "name": "Passage_A",
            "pid": "1",
            "tags": ""
        },
        {
            "position": "114,225",
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] ",
            "name": "Passage_B",
            "pid": "2",
            "tags": "tag-2"
        },
        {
            "position": "314,225",
            "text": "This passage goes nowhere.",
            "name": "Passage_C",
            "pid": "3",
            "tags": "tag-1 tag-2"
        }
    ],
    "script": {
        "role": "script",
        "id": "twine-user-script",
        "type": "text/twine-javascript"
    },
    "tw-storydata": {
        "startnode": "1",
        "name": "Sample",
        "creator-version": "2.0.8",
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9",
        "format": "Harlowe",
        "options": "",
        "creator": "Twine"
    }
}
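One caveat on the well-formed XML assumption: if a file turns out not to be valid XML, ETree.parse raises a ParseError. In that case the standard library's html.parser.HTMLParser is a more forgiving option. What follows is only a sketch under that assumption: the class name PassageCollector and the SAMPLE string are made up for illustration, and it collects passages only, not the style, script or tw-storydata tags.

```python
from html.parser import HTMLParser


class PassageCollector(HTMLParser):
    """Collect tw-passagedata attributes and text (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.passages = []
        self._current = None  # the passage currently being read, if any

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'tw-passagedata':
            self._current = dict(attrs)
            self._current['text'] = ''

    def handle_data(self, data):
        # Only keep text that appears inside a passage element
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'tw-passagedata' and self._current is not None:
            self.passages.append(self._current)
            self._current = None


# Made-up input that is NOT well-formed XML: the <br> is never closed.
SAMPLE = (
    '<tw-storydata name="Sample"><br>'
    '<tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" '
    'position="314,225">This passage goes nowhere.</tw-passagedata>'
    '</tw-storydata>'
)

collector = PassageCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.passages)
```

The resulting list of dictionaries has the same shape as the "tw-passagedata" list above, so it could be passed to json.dump in the same way.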
Context
StackExchange Code Review Q#109988, answer score: 2