How do I learn the most important nodes in a tree?
Problem
I have a list of 20000 words and how often they appeared in a set of 500 newspaper articles. I am trying to build a stemmer which chops off suffixes from each word, so that walked, walking, and walks are treated as the same word.
In English, the Porter Stemmer is a rule-based system where you repeatedly chop off suffixes:
CONNECTIONS
CONNECTION
CONNECT
I am concerned that if I do this for my collection of Spanish words and articles, I may not have a complete list of rules, or the result may be prone to other kinds of error. So I propose to learn the suffixes from the data instead.
Right now, I just count the appearances of each suffix of up to 4 letters. Here are the counts for the most common last letters in my vocabulary list:
u'a': 58189
u'd': 3183
u'e': 62971
u'i': 1725
u'l': 26374
u'n': 37823
u'o': 46786
u'r': 16833
u's': 57396
u'u': 2639
u'y': 2212
u'z': 1968
u'\xe1': 1813
u'\xf3': 6722
The last letters a and o are obvious things to stem, since they mark feminine and masculine gender respectively. However, o could also be the 1st person singular of a verb, and a could be the 3rd person singular. e and s are also obvious choices to stem. Let's look at the last 4 letters:
u'ados': 1826,
u'ales': 1633,
u'ando': 1291,
u'ante': 1062,
u'aron': 1027,
u'ci\xf3n': 5355,
u'ente': 3084,
u'ento': 1690,
u'erto': 1061,
u'idad': 1749,
u'ncia': 1362,
u'ntes': 1511,
u'ones': 2845,
u'ores': 1050,
u'si\xf3n': 1127
These are very common Spanish suffixes, each appearing more than 1000 times in my corpus. Should I stem them?
How do I choose a data set which handles suffixes of different sizes and decides which ones are the most "significant"?
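The counting step described above can be sketched in a few lines of Python. This is only a minimal sketch: the function name count_suffixes, the word-to-frequency dictionary, and the toy words are assumptions made for illustration, and it assumes each suffix is weighted by how often its word occurs in the corpus.

```python
from collections import Counter


def count_suffixes(word_freqs, max_len=4):
    """Return {suffix_length: Counter(suffix -> weighted count)}.

    word_freqs maps each word to the number of times it appeared in the
    corpus (a hypothetical input format; adapt it to your own data).
    """
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for word, freq in word_freqs.items():
        for n in range(1, max_len + 1):
            # Skip suffixes that would swallow the whole word.
            if len(word) > n:
                counts[n][word[-n:]] += freq
    return counts


# Toy data, invented for illustration only.
toy = {"naciones": 3, "canciones": 2, "nación": 5, "canto": 1}
suffixes = count_suffixes(toy)
print(suffixes[1].most_common(3))
print(suffixes[4].most_common(3))
```

Keeping one counter per suffix length makes it easy to compare suffixes of the same size, but it does not by itself say whether a 4-letter suffix is more "significant" than the 1-letter suffix it ends with.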
Solution
The first thing to consider is that stemming rules are language-specific: a stemmer for English will not work for Spanish.
That being said, there is already an implementation you can use. It is called Snowball, and it already includes a Spanish stemmer.
The only thing you need to figure out is how to install and use it: how you feed it the raw data and what you want to do with the output (store it somewhere, run post-processing on it, etc.). There's no point in reinventing the wheel. I used Snowball to write a Romanian stemmer back in the day (about seven years ago), and I must warn you that it is not easy to do from scratch even when you have all the tools (I had Snowball and the stemmer Dana Cojocaru wrote back then, but I wanted to do it on my own).
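As a rough illustration of feeding data in and post-processing the output, here is a minimal sketch in Python. It assumes the third-party snowballstemmer package (one of several ways to run Snowball, and my choice here rather than anything prescribed above); the helper names group_by_stem and load_spanish_stemmer and the sample words are hypothetical.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Group words under the stem returned by the `stem` callable."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


def load_spanish_stemmer():
    """Return a word -> stem function, or None if the package is missing."""
    try:
        import snowballstemmer  # third-party: pip install snowballstemmer
    except ImportError:
        return None
    return snowballstemmer.stemmer("spanish").stemWord


# Sample vocabulary, invented for illustration only.
words = ["naciones", "nación", "canciones", "canción"]
stem = load_spanish_stemmer()
if stem is None:
    print("snowballstemmer is not installed; skipping the demo")
else:
    for root, members in group_by_stem(words, stem).items():
        print(root, members)
```

Passing the stemmer in as a plain callable keeps the grouping step independent of Snowball, so the same code could be run against a hand-written or learned stemmer for comparison.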
Best of luck in your endeavors!
Context
StackExchange Computer Science Q#24731, answer score: 2