String parsing with multiple delimiters
Problem
My data is in this format:
龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n
And I want to return:
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
In C I could do this in one line with sscanf, but I seem to be f̶a̶i̶l̶i̶n̶g̶ writing code like a schoolkid with Python:
working = line.rstrip().split(" ")
trad, simp = working[0], working[1]
working = " ".join(working[2:]).split("]")
pinyin = working[0][1:]
english = working[1][1:]
return trad, simp, pinyin, english
Can I improve?
Solution
You can use regular expressions with the re module. For example, the following regular expression works with both byte strings and Unicode strings (I'm not sure which version of Python you use).
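As a rough sketch of how that pattern could replace the function in the question, assuming Python 3 and str input, it can be compiled once and wrapped in a small helper. The names ENTRY_RE and parse_entry and the ValueError raised for a non-matching line are illustrative choices, not part of the original answer.

```python
import re

# Same pattern as in the answer: two space-separated words,
# a bracketed pinyin field, then the rest of the line.
ENTRY_RE = re.compile(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)")


def parse_entry(line):
    """Split one dictionary line into (trad, simp, pinyin, english)."""
    match = ENTRY_RE.match(line)
    if match is None:
        # Assumed error handling; the answer does not cover malformed input.
        raise ValueError("unrecognised line: {!r}".format(line))
    return match.groups()


print(parse_entry("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"))
```

Because `.` does not match a newline by default, the trailing "\n" is left out of the last group, so no rstrip() call is needed here.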
For Python 2.7.3:
>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> u = s.decode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('\xe9\xbe\x8d\xe8\x88\x9f', '\xe9\xbe\x99\xe8\x88\x9f', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", u).groups()
(u'\u9f8d\u821f', u'\u9f99\u821f', u'long2 zhou1', u'/dragon boat/imperial boat/')
For Python 3.2.3:
>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> b = s.encode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(br"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", b).groups()
(b'\xe9\xbe\x8d\xe8\x88\x9f', b'\xe9\xbe\x99\xe8\x88\x9f', b'long2 zhou1', b'/dragon boat/imperial boat/')
Context
StackExchange Code Review Q#21539, answer score: 5