Generate word list based on Spanish text file

Problem

I'm a beginner and wrote a program that takes a text file and writes a downcased vocabulary list to another text file. I intend to use it mainly to work with text in Spanish, so I added a line to downcase capitalized accented words. I'm wondering if there is a more efficient way to read the original file, remove non-letters, and sort the unique items.

f = File.open("/.../quijote.txt")
words = f.read.split.map(&:downcase)
f.close

#remove numbers and non-letters
words = words.map {|item| item.tr('0-9.,;:¿¡?!«»\‘\“\”\–\]\[\-\(\)\'\"', '')}

#downcase capitalized accented words
words = words.map {|item| item.tr('ÁÉÍÓÚÑ', 'áéíóúñ')}

words = words.uniq.sort

# write each word on a separate line in the file...
File.open("/.../quijotewords.txt", "w+") do |f|
  words.each { |element| f.puts(element) }
end

Solution

Some notes:

  • open + read + close: Prefer the block form, which closes the file for you: contents = File.open(path) { |fd| fd.read }. Simpler still: contents = File.read(path).
  • words = words.something: Don't re-use variable names. New values, new names. For example: sorted_words = words.sort.
  • Use File.write(path, string) to write the output in one call, instead of File.open with "w+" and a loop.
  • Instead of listing the chars that you don't want, I'd list the chars that you do want and delete everything else.
  • You can apply the processing to the whole file (or to each line) first and split into words afterwards.
  • string.tr(something, '') -> string.delete(something).
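
The file-handling and delete notes above can be tried in isolation. The following is only a sketch: the temporary directory, the file name sample.txt and the sample sentence are made up for illustration and are not part of the original program.

```ruby
require "tmpdir"

from_block, from_read, via_tr, via_delete = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt") # hypothetical file name

  # File.write opens, writes and closes the file in one call.
  File.write(path, "Hola, Sancho!\n")

  # Block form: the file is closed when the block returns.
  block_contents = File.open(path) { |fd| fd.read }

  # File.read does the same in a single call.
  read_contents = File.read(path)

  # tr with an empty replacement and delete remove the same characters.
  [block_contents, read_contents,
   read_contents.tr(",!", ""), read_contents.delete(",!")]
end
```

Both reads return the same string, and both removal calls return "Hola Sancho\n"; delete simply states the intent more directly.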



I'd write:

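# Downcase, keep only Spanish letters and whitespace, then split into sorted unique words.
# tr covers accented capitals on older Rubies whose downcase is ASCII-only.
# Note: delete takes tr-style sets, so a range is written a-z, without brackets.
words = File.read("quijote.txt").downcase.
  tr("ÁÉÍÓÚÜÑ", "áéíóúüñ").delete("^a-záéíóúüñ \n").
  split.uniq.sort
File.write("quijote-words.txt", words.join("\n"))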
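
As a variation on the same "keep what you want" idea, the letters can be extracted with a regular expression instead of deleting everything else. This is a different technique from the delete call above, shown only as a sketch: the method name vocabulary and the sample sentence are hypothetical, and it assumes Ruby 2.4 or later, where String#downcase also lowercases accented letters.

```ruby
# Hypothetical helper: sorted, unique, downcased words of a text.
# Assumes Ruby >= 2.4 so that downcase handles accented capitals.
def vocabulary(text)
  # \p{L}+ matches runs of Unicode letters, so digits and
  # punctuation never make it into the result.
  text.downcase.scan(/\p{L}+/).uniq.sort
end

sample = "En un lugar de la Mancha, ¿de cuyo NOMBRE no quiero acordarme?"
puts vocabulary(sample)
```

One behavioral difference: punctuation inside a token splits it into two words here, whereas delete removes the punctuation and joins the pieces. In both versions sort compares code points, so words starting with an accented letter or ñ come after those starting with z.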


If your input file is not UTF-8 encoded but, let's say, ISO-8859-15, you'd read it with: File.read("quijote.txt", encoding: "iso8859-15").encode("utf-8").
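
To see the encoding conversion at work without a real Latin-9 file, one can write a few ISO-8859-15 bytes to a temporary file and read them back. This is a sketch only: the file name latin9.txt and the sample text are invented for the example.

```ruby
require "tmpdir"

text = Dir.mktmpdir do |dir|
  path = File.join(dir, "latin9.txt") # hypothetical file name

  # Create a sample file whose bytes are ISO-8859-15, not UTF-8.
  File.binwrite(path, "señor Quijote\n".encode("ISO-8859-15"))

  # Tag the bytes with their real encoding on read, then transcode to UTF-8.
  File.read(path, encoding: "ISO-8859-15").encode("utf-8")
end

puts text.encoding
```

Once the text is UTF-8, the tr and delete calls with accented characters work on it as in the main snippet.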

Context

StackExchange Code Review Q#122000, answer score: 3
