String parsing with multiple delimiters

Submitted by: @import:stackexchange-codereview

Problem

My data is in this format:


龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n

And I want to return:

('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')


In C I could do this in one line with sscanf, but I seem to be f̶a̶i̶l̶i̶n̶g̶ writing code like a schoolkid with Python:

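def parse(line):
    # Split on spaces: the first two fields are the traditional and simplified forms.
    working = line.rstrip().split(" ")
    trad, simp = working[0], working[1]
    # Rejoin the rest and split at "]" to separate the pinyin from the English.
    working = " ".join(working[2:]).split("]")
    pinyin = working[0][1:]   # drop the leading "["
    english = working[1][1:]  # drop the leading space
    return trad, simp, pinyin, english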


Can I improve?

Solution

You can use regular expressions via the re module. For example, the following regular expression works with both byte strings and Unicode strings (I'm not sure which version of Python you use).

For Python 2.7.3:

>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> u = s.decode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('\xe9\xbe\x8d\xe8\x88\x9f', '\xe9\xbe\x99\xe8\x88\x9f', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", u).groups()
(u'\u9f8d\u821f', u'\u9f99\u821f', u'long2 zhou1', u'/dragon boat/imperial boat/')


For Python 3.2.3:

>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> b = s.encode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(br"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", b).groups()
(b'\xe9\xbe\x8d\xe8\x88\x9f', b'\xe9\xbe\x99\xe8\x88\x9f', b'long2 zhou1', b'/dragon boat/imperial boat/')
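
If you parse many lines, it may be worth compiling the pattern once and wrapping the match in a small function. The following is only a sketch for Python 3 text strings: the names ENTRY_RE and parse_entry are hypothetical, the named groups are added purely for readability, and returning None for a non-matching line is an assumption about how you want to handle bad input.

```python
import re

# Same regular expression as above, compiled once, with named groups.
# ENTRY_RE and parse_entry are illustrative names, not part of any library.
ENTRY_RE = re.compile(
    r"^(?P<trad>[^ ]+) (?P<simp>[^ ]+) \[(?P<pinyin>[^]]+)\] (?P<english>.+)"
)


def parse_entry(line):
    """Return (trad, simp, pinyin, english), or None if the line does not match."""
    match = ENTRY_RE.match(line)
    if match is None:
        # Assumption: malformed lines are signalled with None rather than an exception.
        return None
    # "." does not match a newline, so the trailing "\n" is not captured.
    return match.groups()


if __name__ == "__main__":
    sample = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
    print(parse_entry(sample))
```

The named groups also let you call match.group("pinyin") or match.groupdict() if you would rather have labelled fields than a tuple.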


Context

StackExchange Code Review Q#21539, answer score: 5
