How do I learn the most important nodes in a tree?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

machine-learning cs trees data-structures stackoverflow strings natural-language-processing

Problem

I have a list of 20000 words and how often each appeared in a set of 500 newspaper articles. I am trying to build a stemmer that chops suffixes off each word, so that walked, walking, and walks are treated as the same word.

In English, the Porter Stemmer is a rule-based system where you repeatedly chop off suffixes:

CONNECTIONS
    CONNECTION
    CONNECT

I am concerned that if I do this for my collection of Spanish words and articles, my list of rules may be incomplete or prone to other kinds of error. So I propose to learn the suffixes instead.

Right now, I just count the appearances of each suffix of up to 4 letters. Here are the counts for the most common last letters in my vocabulary list:

u'a': 58189
 u'd': 3183
 u'e': 62971
 u'i': 1725
 u'l': 26374
 u'n': 37823
 u'o': 46786
 u'r': 16833
 u's': 57396
 u'u': 2639
 u'y': 2212
 u'z': 1968
 u'\xe1': 1813
 u'\xf3': 6722
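
Counts like the ones above could be produced with something like the following. This is only a minimal sketch: the name `count_suffixes`, the toy `word_counts` dictionary, and the choice to weight each suffix by the word's frequency are assumptions for illustration, not the exact script behind these numbers.

```python
from collections import Counter


def count_suffixes(word_counts, max_len=4):
    """Return {suffix_length: Counter of suffixes}, weighted by word frequency."""
    by_length = {n: Counter() for n in range(1, max_len + 1)}
    for word, freq in word_counts.items():
        for n in range(1, max_len + 1):
            # Skip suffixes that would swallow the entire word.
            if len(word) > n:
                by_length[n][word[-n:]] += freq
    return by_length


# Hypothetical toy data, not the real corpus.
word_counts = {"caminando": 3, "hablando": 2, "naciones": 4, "canciones": 1}
suffixes = count_suffixes(word_counts)
print(suffixes[1].most_common(2))
print(suffixes[4].most_common(2))
```

Keeping one counter per suffix length makes it easy to compare suffixes of the same size, though it does not by itself say whether a 1-letter or a 4-letter suffix is the better one to strip.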

The last letters a and o are obvious things to stem since they indicate feminine and masculine gender, respectively. However, o could also mark the 1st person singular of a verb, and a could mark the 3rd person singular.

e and s are also obvious choices to stem. Let's look at the last 4 letters:

u'ados': 1826,
 u'ales': 1633,
 u'ando': 1291,
 u'ante': 1062,
 u'aron': 1027,
 u'ci\xf3n': 5355,
 u'ente': 3084,
 u'ento': 1690,
 u'erto': 1061,
 u'idad': 1749,
 u'ncia': 1362,
 u'ntes': 1511,
 u'ones': 2845,
 u'ores': 1050,
 u'si\xf3n': 1127

These are very common Spanish suffixes, appearing more than 1000 times in my corpus. Should I stem them?

How do I choose a data set which handles suffixes of different lengths and decides which ones are the most "significant"?

Solution

The first thing to consider is that stemming rules are language-specific: a stemmer for English will not work for Spanish.

That said, an existing implementation is available. It's called Snowball, and it already includes a Spanish stemmer.

The only thing you need to figure out is how to install and use it: how you feed it the raw data and what you want to do with the output (store it somewhere, run post-processing on it, etc.). There's no point in reinventing the wheel. I used Snowball to write a Romanian stemmer back in the day (about seven years ago), and I must warn you that it's not easy to do from scratch even when you have all the tools (I had Snowball and the stemmer Dana Cojocaru wrote back then, but I wanted to do it on my own).
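
As a rough illustration of feeding in the word list and post-processing the output, here is a minimal Python sketch. It assumes you reach Snowball through NLTK's port (`nltk.stem.snowball.SnowballStemmer`), which is just one of several ways to use it. The helper names `make_spanish_stemmer` and `merge_by_stem` are made up for this example.

```python
from collections import defaultdict


def make_spanish_stemmer():
    """Return a word -> stem function backed by NLTK's Snowball port.

    Assumes NLTK is installed; it is imported lazily so that the rest of
    this sketch can be used with any other stemming function.
    """
    from nltk.stem.snowball import SnowballStemmer
    return SnowballStemmer("spanish").stem


def merge_by_stem(word_counts, stem):
    """Sum the frequencies of all words that share a stem.

    Returns (totals, members): stem -> summed count, stem -> list of words.
    """
    totals = defaultdict(int)
    members = defaultdict(list)
    for word, freq in word_counts.items():
        key = stem(word)
        totals[key] += freq
        members[key].append(word)
    return dict(totals), dict(members)


# Usage, with NLTK installed (word_counts is your word -> frequency dict):
#   totals, members = merge_by_stem(word_counts, make_spanish_stemmer())
```

Passing the stemmer in as a plain function keeps the merging step independent of the library, so you can swap in a different Snowball binding or your own experimental rules and compare the groupings.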

Best of luck in your endeavors!

Context

StackExchange Computer Science Q#24731, answer score: 2