String parsing with multiple delimiters

Submitted by: @import:stackexchange-codereview · Mar 10, 2026

Problem

My data is in this format:

龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n

And I want to return:

('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')

In C I could do this in one line with sscanf, but in Python I seem to be writing code like a schoolkid:

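def parse_line(line):  # function name assumed; the original shows only the body
    # The first two space-separated fields are the traditional and simplified forms.
    working = line.rstrip().split(" ")
    trad, simp = working[0], working[1]
    # Rejoin the rest, then split at "]" to separate pinyin from the definitions.
    working = " ".join(working[2:]).split("]")
    pinyin = working[0][1:]   # drop the leading "["
    english = working[1][1:]  # drop the leading space
    return trad, simp, pinyin, english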

Can I improve?

Solution

You can use regular expressions via the re module. For example, the following regular expression works with both byte strings and Unicode strings (I'm not sure which version of Python you use).

For Python 2.7.3:

>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> u = s.decode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('\xe9\xbe\x8d\xe8\x88\x9f', '\xe9\xbe\x99\xe8\x88\x9f', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", u).groups()
(u'\u9f8d\u821f', u'\u9f99\u821f', u'long2 zhou1', u'/dragon boat/imperial boat/')

For Python 3.2.3:

>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> b = s.encode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(br"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", b).groups()
(b'\xe9\xbe\x8d\xe8\x88\x9f', b'\xe9\xbe\x99\xe8\x88\x9f', b'long2 zhou1', b'/dragon boat/imperial boat/')
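The same pattern can be wrapped in a small helper for Python 3. This is a minimal sketch, not part of the original answer: the names ENTRY_RE and parse_entry are hypothetical, and the named groups, re.VERBOSE layout, and None return for non-matching lines are assumptions added for readability.

```python
import re

# Hypothetical names (ENTRY_RE, parse_entry), chosen for illustration only.
# Same pattern as above, spread out with re.VERBOSE and named groups.
# In verbose mode a literal space must sit inside a character class: [ ]
ENTRY_RE = re.compile(
    r"""
    ^(?P<trad>[^ ]+)[ ]       # traditional form, up to the first space
    (?P<simp>[^ ]+)[ ]        # simplified form, up to the next space
    \[(?P<pinyin>[^]]+)\][ ]  # pinyin between square brackets
    (?P<english>.+)           # rest of the line; "." stops before the newline
    """,
    re.VERBOSE,
)


def parse_entry(line):
    """Return (trad, simp, pinyin, english), or None if the line does not match."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return match.group("trad", "simp", "pinyin", "english")


print(parse_entry("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"))
```

Returning None for a malformed line avoids the AttributeError that calling .groups() directly on a failed match would raise; whether to skip or report such lines depends on the data.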

Context

StackExchange Code Review Q#21539, answer score: 5
