Generate word list based on Spanish text file
Problem
I'm a beginner and wrote a program that takes a text file and writes a downcased vocabulary list to another text file. I intend to use it mainly to work with text in Spanish, so I added a line to downcase capitalized-accented words. I'm wondering if there is a more efficient way of reading from the original file, as well as removing non-letters and sorting for unique items.
f = File.open("/.../quijote.txt")
words = f.read.split.map(&:downcase)
f.close
#remove numbers and non-letters
words = words.map {|item| item.tr('0-9.,;:¿¡?!«»\‘\“\”\–\]\[\-\(\)\'\"', '')}
#downcase capitalized accented words
words = words.map {|item| item.tr('ÁÉÍÓÚÑ', 'áéíóúñ')}
words = words.uniq.sort
# write each word on a separate line in the file...
File.open("/.../quijotewords.txt", "w+") do |f|
  words.each { |element| f.puts(element) }
end
Solution
Some notes:
- open+read+close: Better to use the block form: contents = File.open(path) { |fd| fd.read }, or simply contents = File.read(path).
- words = words.something: Don't re-use variable names. New values, new names. For example: sorted_words = words.sort.
- Use File.write to write the output file.
- Instead of removing the chars that you don't want, I'd keep only the chars that you do want.
- You can apply the processing to the whole file (or to each line) and then split.
- string.tr(something, '') -> string.delete(something).
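The last three notes can be tried on an in-memory string before touching any file. This is only a small sketch: the sample sentence and the variable names are made up for illustration, and it assumes a Ruby version (2.4 or later) where downcase also handles accented letters.

```ruby
# Sample text invented for this illustration; it is not from the original file.
text = "¡Hola, Mundo! ¿Qué tal? Adiós, mundo."

# tr with an empty replacement and delete remove the same characters.
removed_with_tr     = text.tr("¡!¿?,.", "")
removed_with_delete = text.delete("¡!¿?,.")

# A leading ^ negates the set: everything except the listed characters is deleted,
# so only lowercase letters, spaces and newlines survive.
kept_only = text.downcase.delete("^a-záéíóúüñ \n")

# Process the whole string first, then split it into words once.
vocabulary = kept_only.split.uniq.sort
```

Listing the characters to keep is shorter than listing every punctuation mark to drop, and it does not silently miss a symbol nobody thought of.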
I'd write:
# Keep only lowercase letters, spaces and newlines, then split into unique sorted words.
words = File.read("quijote.txt").downcase.
  tr("ÁÉÍÓÚÑ", "áéíóúñ").delete("^a-záéíóúüñ \n").
  split.uniq.sort
File.write("quijote-words.txt", words.join("\n"))
If your input file is not UTF-8 encoded but, let's say, ISO8859-15, you'd write:
File.read("quijote.txt", encoding: "iso8859-15").encode("utf-8")
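To see what the transcoding step does without needing an ISO-8859-15 file on disk, the same conversion can be sketched on a hand-built byte string. The word "señor" and the variable names are assumptions chosen for illustration; in ISO-8859-15 the letter ñ is the single byte 0xF1, while in UTF-8 it takes two bytes.

```ruby
# "señor" as raw ISO-8859-15 bytes (ñ is 0xF1), labelled with its real encoding.
# String#b gives a binary copy; force_encoding only changes the label, not the bytes.
latin9_text = "se\xF1or".b.force_encoding("ISO-8859-15")

# encode actually rewrites the bytes so the text is valid UTF-8.
utf8_text = latin9_text.encode("UTF-8")

puts utf8_text           # the readable word
puts utf8_text.bytesize  # one byte more than the ISO-8859-15 version
```

Without this step, accented characters read from a non-UTF-8 file would not match the UTF-8 literals in the tr and delete sets.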
Context
StackExchange Code Review Q#122000, answer score: 3