Korean word segmentation using frequency heuristic
Problem
This is a continuation of a previous question. I want to thank Joe Wallis for his help with improving the readability of my code. Although his changes also made the code faster, the speed improvement isn't enough for my purposes.
I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases, such that each phrase is split into constituent words in a way that maximizes its frequency score.
The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):
A 56
AB 7342
ABC 3
BC 116
C 5
CD 10
BCD 502
ABCD 23
D 132
DD 6
The list of phrases, or "wordlist", looks like this (ignore the numbers):
AAB 1123
DCDD 83
The output of the script would be:
Original Pois Makeup Freq_Max_Delta
AAB A AB [AB, 7342][A, 56] 7398
DCDD D C DD [D, 132][DD, 6][C, 5] 143
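As a sanity check on the sample output, the Freq_Max_Delta value in each row matches the sum of the corpus frequencies of the words listed under Makeup. This is only an inference from the two rows shown. The sketch below verifies that arithmetic and nothing more; `split_score` is a hypothetical helper that is not part of the script, and it does not search for the best split.

```python
# Sample corpus from above: word -> frequency.
corpus = {
    'A': 56, 'AB': 7342, 'ABC': 3, 'BC': 116, 'C': 5,
    'CD': 10, 'BCD': 502, 'ABCD': 23, 'D': 132, 'DD': 6,
}


def split_score(parts, corpus):
    """Sum the corpus frequencies of the given words (hypothetical helper)."""
    return sum(corpus[part] for part in parts)


print(split_score(['A', 'AB'], corpus))       # AAB  -> A AB    : 7398
print(split_score(['D', 'C', 'DD'], corpus))  # DCDD -> D C DD  : 143
```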
In the previous question, there are some sample inputs which I am using. The biggest problem is the size of the data sets. The corpus and wordlist have 1M+ entries in each file. It's currently taking on average 1-2 seconds to process each word in the wordlist, which in total will take 250 hours+ to process.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, collections, operator, itertools
from argparse import ArgumentParser

sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)

def read_corpa(file_name):
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}

def read_words(file_name):
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]

def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) -
Solution
It will be much faster to use Python's in operator for your substring search rather than using your custom-built "contains" function.
In [15]: %timeit 'brown' in 'blah'*1000 + 'the quick brown fox'
100000 loops, best of 3: 2.79 µs per loop
In [16]: %timeit contains('brown','blah'*1000 + 'the quick brown fox')
1000 loops, best of 3: 870 µs per loop
Also I wonder if you could rewrite some of your custom functions as dictionary comprehensions, something like this:
for word in read_words(args.wordlist):
    # corpus is assumed to be the dict returned by read_corpa,
    # so iterate over its (word, frequency) pairs
    combos = {k: v for k, v in corpus.items() if k in word}
    results = {'Original': word,
               'Pois': list(combos.keys()),
               'Makeup': combos.items(),
               'Freq_Max_Delta': sum(combos.values())}
    print(results)
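To see what the substring filter does on the sample data from the question, here is a self-contained sketch. The hard-coded `corpus` dict and `phrases` list stand in for the files read by `read_corpa` and `read_words`, and `candidate_words` is a hypothetical name used only for illustration. The filter collects every corpus word that occurs anywhere in the phrase, so it yields candidates rather than a split into non-overlapping words.

```python
# Sample corpus and wordlist from the question (stand-ins for the real files).
corpus = {
    'A': 56, 'AB': 7342, 'ABC': 3, 'BC': 116, 'C': 5,
    'CD': 10, 'BCD': 502, 'ABCD': 23, 'D': 132, 'DD': 6,
}
phrases = ['AAB', 'DCDD']


def candidate_words(phrase, corpus):
    """Return {word: freq} for every corpus word that is a substring of phrase."""
    return {word: freq for word, freq in corpus.items() if word in phrase}


for phrase in phrases:
    combos = candidate_words(phrase, corpus)
    total = sum(combos.values())
    print('%s %s %d' % (phrase, sorted(combos), total))
```

For AAB this gives A and AB with a total of 7398, matching the question. For DCDD it picks up CD as well as C, D and DD, so the total is 153 rather than the 143 in the expected output; a further step would still be needed to choose which candidates make up the final split.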
Context
StackExchange Code Review Q#121507, answer score: 3