Parse Twine HTML to JSON

Submitted by: @import:stackexchange-codereview
Tags: json, html, twine, parse

Problem

For those who don't know, Twine is a simple tool for making interactive fiction. It lets you easily create a series of passages that are hyperlinked to each other, giving a choose-your-own-adventure structure. It exports to HTML, but if you just want to use Twine to write nodes for use elsewhere, it lacks any other export format. I thought JSON would be a more valuable format, so I decided to write this parser.

The source data is a bit of a mess, though. Here's how it looks:

[[Passage B]]
[[Go to passage C|Passage C]]This is passage B
[[Passage B]]
[[Passage A]] This passage goes nowhere.


In case it's not clear (as it wasn't to me at first), line breaks only occur where the passage text itself contains newline characters. Otherwise all the tags run on and on along the same line. This is far from ideal for parsing, especially if I want to read line by line. So the first step of the process is calling my reformat_html function, which puts each tag on its own line and each passage's text on lines by themselves:



[[Passage B]]
[[Go to passage C|Passage C]]

This is passage B
[[Passage B]]
[[Passage A]]

This passage goes nowhere.
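
The reformat_html function itself is not shown here. As a rough sketch of the tag-splitting idea described above, assuming a single regular expression is enough to recognise a tag, something like the following could work. The name split_tags_onto_lines and the pattern are illustrative assumptions, not the original implementation, and the exact blank-line layout of the real output is not reproduced.

```python
import re

# Assumption: anything between "<" and the next ">" is a tag.
TAG_PATTERN = re.compile(r"(<[^>]+>)")


def split_tags_onto_lines(html):
    """Put every tag on its own line and keep the text between tags separate."""
    # The capture group makes re.split keep the tags in the result list.
    parts = TAG_PATTERN.split(html)
    # Drop the empty or whitespace-only pieces left between adjacent tags.
    return "\n".join(part for part in parts if part.strip())


print(split_tags_onto_lines('<tw-passagedata pid="1">Hello</tw-passagedata>'))
```

Newlines inside the passage text are preserved, so a multi-line passage still spans several lines after the split.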



Now I can easily read it line by line, parsing the key-value pairs from opening tags, parsing the passage text separately from the tags, and knowing when each tag is closed. This tidied-up HTML can now be read into JSON with my read_as_json function, producing this:

{
"style": {
"type": "text/twine-css",
"role": "stylesheet",
"id": "twine-user-stylesheet"
},
"script": {
"type": "text/twine-javascript",
"role": "script",
"id": "twine-user-script"
},
"tw-passagedata": [
{
"position": "197,62",
"text": "[[Passage B]]\n[[Go to passage C|Passage C]]\n",
"pid": "1",
"name": "Passage_A",
"tags": ""
},
{
"position": "114,225

Solution

Don't reinvent the wheel. If you want to parse HTML/XML, use an HTML/XML parser. No matter how tricky the layout seems, a parser should handle it as long as it is fed well-formed data. That's its job.

Based on your example input, I'll assume that Twine produces well-formed XML files. You can therefore get rid of your custom tag splitting and parsing and use the parser of your choice.
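
Since everything that follows rests on that well-formedness assumption, it may be worth checking it up front. Here is a minimal sketch, where is_well_formed is a hypothetical helper name: it simply tries to parse the markup and reports whether the XML parser accepted it.

```python
import xml.etree.ElementTree as ETree


def is_well_formed(markup):
    """Return True if markup parses as XML, False otherwise."""
    try:
        ETree.fromstring(markup)
    except ETree.ParseError:
        return False
    return True


print(is_well_formed('<story><passage pid="1">text</passage></story>'))
print(is_well_formed('<story><br></story>'))      # unclosed element
print(is_well_formed('<story hidden></story>'))   # attribute without a value
```

HTML-only constructs such as unclosed elements or valueless attributes are rejected, so a file containing them would need an HTML parser instead.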

For instance, xml.etree.ElementTree ships with the standard library. You can use it to parse your files like this:

import xml.etree.ElementTree as ETree

inpath = r'Sample Data\TwineInput.html'
xml = ETree.parse(inpath)
for element in xml.getroot():
    print(element.tag, element.attrib)


which prints:

style {'role': 'stylesheet', 'id': 'twine-user-stylesheet', 'type': 'text/twine-css'}
script {'role': 'script', 'id': 'twine-user-script', 'type': 'text/twine-javascript'}
tw-passagedata {'position': '197,62', 'name': 'Passage_A', 'pid': '1', 'tags': ''}
tw-passagedata {'position': '114,225', 'name': 'Passage_B', 'pid': '2', 'tags': 'tag-2'}
tw-passagedata {'position': '314,225', 'name': 'Passage_C', 'pid': '3', 'tags': 'tag-1 tag-2'}


Pretty close to what you are looking for.
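
The passage body is not part of attrib; it is available as the element's text. Here is a small in-memory sketch of that, using a made-up SAMPLE string that only approximates the structure suggested by the output above (the real file is not reproduced here):

```python
import xml.etree.ElementTree as ETree

# Hypothetical, shortened markup modelled on the printed attributes above.
SAMPLE = """<tw-storydata name="Sample" startnode="1">
<style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css"></style>
<tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">[[Passage B]]</tw-passagedata>
</tw-storydata>"""

root = ETree.fromstring(SAMPLE)

# Collect tag name, "name" attribute and text content of each direct child.
children = [(child.tag, child.attrib.get("name"), child.text) for child in root]
for entry in children:
    print(entry)
```

An element with no content, like the empty style tag here, has a text of None rather than an empty string.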

The next thing to do is to take care of multiple tw-passagedata tags, add a text attribute to each of them, handle the root tw-storydata tag and, possibly, handle duplicate tags with your MULTIPLE_TAG_ERROR message:

import xml.etree.ElementTree as ETree
from json import dump

PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"

def parse_twine_tag(element, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key-value
    pairs. Passage tags are stored in a list; as of now no other tag should
    be stored this way, and having multiple such tags raises a ValueError.
    """

    tagname = element.tag
    attributes = element.attrib

    if tagname == PASSAGE_TAG:
        attributes['text'] = element.text
        data.setdefault(PASSAGE_TAG, []).append(attributes)
    elif tagname in data:
        raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
    else:
        data[tagname] = attributes

    for child in element:
        parse_twine_tag(child, data)

def parse_twine_file(filepath):
    """Return a dictionary of data from the parsed file at filepath"""

    xml = ETree.parse(filepath)
    data = dict()
    parse_twine_tag(xml.getroot(), data)
    return data

if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'

    data = parse_twine_file(inpath)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)


outpath, as expected, contains:

{
    "style": {
        "role": "stylesheet", 
        "id": "twine-user-stylesheet", 
        "type": "text/twine-css"
    }, 
    "tw-passagedata": [
        {
            "position": "197,62", 
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]", 
            "name": "Passage_A", 
            "pid": "1", 
            "tags": ""
        }, 
        {
            "position": "114,225", 
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] ", 
            "name": "Passage_B", 
            "pid": "2", 
            "tags": "tag-2"
        }, 
        {
            "position": "314,225", 
            "text": "This passage goes nowhere.", 
            "name": "Passage_C", 
            "pid": "3", 
            "tags": "tag-1 tag-2"
        }
    ], 
    "script": {
        "role": "script", 
        "id": "twine-user-script", 
        "type": "text/twine-javascript"
    }, 
    "tw-storydata": {
        "startnode": "1", 
        "name": "Sample", 
        "creator-version": "2.0.8", 
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9", 
        "format": "Harlowe", 
        "options": "", 
        "creator": "Twine"
    }
}
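
To exercise the duplicate-tag branch without touching the file system, here is a self-contained sketch. It repeats a condensed copy of parse_twine_tag so that it runs on its own; parse_twine_string and the GOOD and BAD strings are hypothetical additions for illustration, and the copy builds new dictionaries instead of modifying element.attrib in place.

```python
import xml.etree.ElementTree as ETree

PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"


def parse_twine_tag(element, data):
    """Condensed copy of the function above, repeated so the sketch runs alone."""
    if element.tag == PASSAGE_TAG:
        # Copy the attributes and add the passage body under "text".
        passage = dict(element.attrib, text=element.text)
        data.setdefault(PASSAGE_TAG, []).append(passage)
    elif element.tag in data:
        raise ValueError(MULTIPLE_TAG_ERROR.format(element.tag))
    else:
        data[element.tag] = dict(element.attrib)
    for child in element:
        parse_twine_tag(child, data)


def parse_twine_string(markup):
    """Hypothetical helper: same as parse_twine_file, but for a string."""
    data = {}
    parse_twine_tag(ETree.fromstring(markup), data)
    return data


GOOD = ('<tw-storydata name="Sample">'
        '<tw-passagedata pid="1">A</tw-passagedata>'
        '<tw-passagedata pid="2">B</tw-passagedata>'
        '</tw-storydata>')
BAD = ('<tw-storydata name="Sample">'
       '<style id="a"></style><style id="b"></style>'
       '</tw-storydata>')

data = parse_twine_string(GOOD)
print(len(data[PASSAGE_TAG]))  # repeated passage tags are collected in a list

try:
    parse_twine_string(BAD)
except ValueError as error:
    print(error)  # any other repeated tag is rejected
```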

Context

StackExchange Code Review Q#109988, answer score: 2