Korean word segmentation using frequency heuristic

Submitted by: @import:stackexchange-codereview

Problem

This is a continuation of a previous question. I want to thank Joe Wallis for his help with improving the readability of my code. Although his changes did speed the code up, the improvement isn't enough for my purposes.

I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases, such that each phrase is split into constituent words in a way that maximizes its frequency score.

The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):

A 56
AB 7342
ABC 3
BC 116
C 5
CD 10
BCD 502
ABCD 23
D 132
DD 6


The list of phrases, or "wordlist", looks like this (ignore the numbers):

AAB 1123
DCDD 83


The output of the script would be:

Original  Pois    Makeup                  Freq_Max_Delta
AAB       A AB    [AB, 7342][A, 56]       7398
DCDD      D C DD  [D, 132][DD, 6][C, 5]   143
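
In both sample rows, Freq_Max_Delta equals the sum of the frequencies listed under Makeup (7342 + 56 = 7398 and 132 + 6 + 5 = 143). A minimal sketch of that arithmetic follows; the function name freq_total and the list-of-pairs representation are illustrative assumptions, not part of the original script:

```python
def freq_total(makeup):
    """Sum the frequencies of (word, frequency) pairs."""
    return sum(freq for _word, freq in makeup)

# Makeup columns copied from the two sample output rows.
aab_makeup = [("AB", 7342), ("A", 56)]
dcdd_makeup = [("D", 132), ("DD", 6), ("C", 5)]

print(freq_total(aab_makeup))
print(freq_total(dcdd_makeup))
```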


In the previous question, there are some sample inputs which I am using. The biggest problem is the size of the data sets. The corpus and wordlist have 1M+ entries in each file. It's currently taking on average 1-2 seconds to process each word in the wordlist, which in total will take 250 hours+ to process.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, collections, operator, itertools
from argparse import ArgumentParser

sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)

def read_corpa(file_name):
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}

def read_words(file_name):
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]

def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) -

Solution

It will be much faster to use Python's built-in "in" operator for your substring search than your custom-built "contains" function. In the timings below it is roughly 300 times faster (2.79 µs versus 870 µs per call):

In [15]: %timeit 'brown' in  'blah'*1000 + 'the quick brown fox'
100000 loops, best of 3: 2.79 µs per loop

In [16]: %timeit contains('brown','blah'*1000 + 'the quick brown fox')
1000 loops, best of 3: 870 µs per loop
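
The comparison can also be run outside IPython with the standard timeit module. Because the question's contains function is cut off in the listing above, naive_contains below is a hypothetical stand-in that compares a slice at every offset; it is an assumption about the general approach, not the original code. Absolute timings will differ by machine and Python version.

```python
import timeit

def naive_contains(small, big):
    """Hypothetical stand-in for the question's truncated contains():
    compare a slice of big against small at every offset."""
    n = len(small)
    for i in range(len(big) - n + 1):
        if big[i:i + n] == small:
            return True
    return False

haystack = 'blah' * 1000 + 'the quick brown fox'

# Both approaches must agree before their speed is compared.
assert naive_contains('brown', haystack) == ('brown' in haystack)

t_in = timeit.timeit(lambda: 'brown' in haystack, number=1000)
t_naive = timeit.timeit(lambda: naive_contains('brown', haystack), number=100)

print('in operator:    %.2f us per call' % (t_in / 1000 * 1e6))
print('naive_contains: %.2f us per call' % (t_naive / 100 * 1e6))
```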


Also I wonder if you could rewrite some of your custom functions as dictionary comprehensions, something like this:

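for word in read_words(args.wordlist):
    # Keep every corpus entry that occurs somewhere in the word.
    # Assumes corpus is the {word: frequency} dict returned by read_corpa.
    combos = {k: v for k, v in corpus.items() if k in word}
    results = {'Original': word,
               'Pois': list(combos.keys()),
               'Makeup': combos.items(),
               'Freq_Max_Delta': sum(combos.values())}

    print(results)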
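
Note that the comprehension above still scans the entire corpus once per word, which remains expensive when both files have 1M+ entries. One possible alternative, which is not part of the original answer, is to invert the search: enumerate the substrings of each word and look each one up in the corpus dict. A word of length n has at most n(n+1)/2 substrings, so short phrases need only a handful of dictionary lookups. The function name find_corpus_substrings is an illustrative assumption, and the corpus literal is the sample from the question. Like the comprehension, this collects every matching corpus entry rather than choosing a single segmentation, so its total can differ from the expected Freq_Max_Delta (DCDD also matches CD, for example).

```python
def find_corpus_substrings(word, corpus):
    """Return {substring: frequency} for every substring of word in corpus."""
    found = {}
    n = len(word)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = word[start:end]
            if piece in corpus:  # dict membership, O(1) on average
                found[piece] = corpus[piece]
    return found

# Sample corpus from the question.
corpus = {'A': 56, 'AB': 7342, 'ABC': 3, 'BC': 116, 'C': 5,
          'CD': 10, 'BCD': 502, 'ABCD': 23, 'D': 132, 'DD': 6}

for word in ['AAB', 'DCDD']:
    combos = find_corpus_substrings(word, corpus)
    print(word, sorted(combos.items()))
```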

Context

StackExchange Code Review Q#121507, answer score: 3
