Korean word segmentation using frequency heuristic

Submitted by: @import:stackexchange-codereview · Mar 10, 2026

Problem

This is a continuation of a previous question. I want to thank Joe Wallis for his help with improving the readability of my code. Although his changes did make the code faster, the improvement isn't enough for my purposes.

I'll reiterate the problem, but please feel free to look at the previous question. The algorithm uses a corpus to analyze a list of phrases, such that each phrase is split into constituent words in a way that maximizes its frequency score.

The corpus is represented as a list of Korean words and their frequencies (pretend that each letter represents a Korean character):

A       56
AB      7342
ABC     3
BC      116
C       5
CD      10
BCD     502
ABCD    23
D       132
DD      6

The list of phrases, or "wordlist", looks like this (ignore the numbers):

AAB     1123
DCDD    83

The output of the script would be:

Original    Pois        Makeup                    Freq_Max_Delta
AAB         A AB        [AB, 7342][A, 56]         7398
DCDD        D C DD      [D, 132][DD, 6][C, 5]     143
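
Judging from the sample rows, Freq_Max_Delta appears to be the sum of the corpus frequencies of the words listed under Makeup. The following is a minimal Python 3 sketch that checks this arithmetic against the two rows above. The corpus dict is copied from the example corpus, and score is a hypothetical helper name, not a function from the original script.

```python
# Sample corpus from the example above (word -> frequency).
corpus = {
    "A": 56, "AB": 7342, "ABC": 3, "BC": 116, "C": 5,
    "CD": 10, "BCD": 502, "ABCD": 23, "D": 132, "DD": 6,
}

def score(words, corpus):
    """Sum the corpus frequencies of the given words (hypothetical helper)."""
    return sum(corpus[w] for w in words)

print(score(["A", "AB"], corpus))       # 56 + 7342 = 7398
print(score(["D", "C", "DD"], corpus))  # 132 + 5 + 6 = 143
```

This only verifies how the score of a given split is computed; it does not choose the split.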

In the previous question, there are some sample inputs which I am using. The biggest problem is the size of the data sets. The corpus and wordlist have 1M+ entries in each file. It's currently taking on average 1-2 seconds to process each word in the wordlist, which in total will take 250 hours+ to process.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, codecs, collections, operator, itertools
from argparse import ArgumentParser

sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)

def read_corpa(file_name):
    print 'Reading Corpa....'
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}

def read_words(file_name):
    with codecs.open(file_name, 'r', 'UTF-8') as f:
        for word in f:
            yield word.split('\t')[0]

def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) -

Solution

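Python's built-in in operator will be much faster for your substring search than your custom-built "contains" function, because the search runs in C rather than in a Python-level loop. In the timings below the difference is roughly 300x.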

In [15]: %timeit 'brown' in  'blah'*1000 + 'the quick brown fox'
100000 loops, best of 3: 2.79 µs per loop

In [16]: %timeit contains('brown','blah'*1000 + 'the quick brown fox')
1000 loops, best of 3: 870 µs per loop
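
To reproduce this comparison outside IPython, the standard timeit module can be used. The sketch below is written for Python 3. Because the original contains function is cut off in the question, naive_contains is a hypothetical stand-in that compares a slice at each offset, so absolute timings will differ from those above and will vary by machine.

```python
import timeit

def naive_contains(small, big):
    """Hypothetical stand-in: compare a slice of `big` at each offset."""
    n = len(small)
    for i in range(len(big) - n + 1):
        if big[i:i + n] == small:
            return True
    return False

haystack = "blah" * 1000 + "the quick brown fox"

# Both approaches must agree before the timings mean anything.
assert naive_contains("brown", haystack) == ("brown" in haystack)

t_in = timeit.timeit(lambda: "brown" in haystack, number=1000)
t_naive = timeit.timeit(lambda: naive_contains("brown", haystack), number=100)

print("in operator:    %.2f us per call" % (t_in / 1000 * 1e6))
print("naive_contains: %.2f us per call" % (t_naive / 100 * 1e6))
```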

Also I wonder if you could rewrite some of your custom functions as dictionary comprehensions, something like this:

for word in read_words(args.wordlist):
    # Keep every corpus entry that occurs as a substring of the phrase.
    combos = {k: v for k, v in corpus.items() if k in word}
    results = {'Original': word,
               'Pois': list(combos.keys()),
               'Makeup': combos.items(),
               'Freq_Max_Delta': sum(combos.values())}

    print(results)
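
Scanning the whole corpus for every phrase still costs one substring test per corpus entry, which adds up with 1M+ entries. One possible alternative, offered here as an assumption rather than something from the original answer, is to enumerate the substrings of each phrase and look them up in the corpus dict; a phrase of length n has at most n(n+1)/2 substrings. The sketch is written for Python 3, find_candidates is a hypothetical helper name, and the corpus is the sample from the question.

```python
# Sample corpus from the question (word -> frequency).
corpus = {
    "A": 56, "AB": 7342, "ABC": 3, "BC": 116, "C": 5,
    "CD": 10, "BCD": 502, "ABCD": 23, "D": 132, "DD": 6,
}

def find_candidates(word, corpus):
    """Return {substring: frequency} for every substring of `word` in `corpus`."""
    found = {}
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in corpus:  # O(1) average-case dict lookup
                found[piece] = corpus[piece]
    return found

for word in ["AAB", "DCDD"]:
    combos = find_candidates(word, corpus)
    # Same result as the comprehension that scans the whole corpus.
    assert combos == {k: v for k, v in corpus.items() if k in word}
    print(word, sorted(combos.items()))
```

This yields the candidate words only. Selecting the final split (for example, D C DD rather than one that uses CD) is still left to the original scoring logic.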

Context

StackExchange Code Review Q#121507, answer score: 3
