patternpythonMinor
.txt word-counter
Viewed 0 times
wordtxtcounter
Problem
I'm starting coding with Python (only able to use & create easy code!) and just created a word-counter. It reads a .txt file and enters the words in a dictionary with dict[word]="number of appearances in the text". As far as I can see, the code works well.
Questions:
- Nevertheless, I'm not satisfied with the fact that I needed to create 3 "wordlist" since I wanted to delete special characters or empty words from the first wordlist. Afaik, modifying the list while looping over it with "for" is not possible (?).
- Moreover, I only succeeded printing out the most popular words by iterating over the maximum of the dict, printing and deleting it. Thus, the original dictionary will be altered within the printing process.
Questions:
- Are there any (simple) improvements to make the points mentioned above more elegant?
- In general, what's your opinion about the code?
book = open("bibel.txt", "r")
dict = {}
lines = book.readlines()
wordlist=[]
#adding every word into the wordlist
for line in lines:
words = line.split(" ")
for word in words:
wordlist.append(word)
book.close()
#wordlist2 = cleaned wordlist (delete non-alphabetical characters)
def clean(word):
for char in word:
if char.lower() not in "abcdefghijklmnopqrstuvwxyzäüöß":
word = word.replace(char,'')
word=word.lower()
return word
wordlist2=[]
for word in wordlist:
wordlist2.append(clean(word))
#wordlist3 = delete empty words ("")
wordlist3=[]
for word in wordlist2:
if len(word)>0:
wordlist3.append(word)
#count wordlist3 into dictionary:
def count(word):
if word in dict:
dict[word]=dict[word]+1
else:
dict[word]=1
for word in wordlist3:
count(word)
#print out the first 10 words and values from dictionary:
for i in range(100):
topword = max(dict, key=dict.get)
print(topword, dict[topword])
del dict[topword]Solution
You are mixing function declarations and code execution. Do not.
Your programme should look like:
Stating the obvious in comments is not good
Comments should explain why not what, if you need comments to describe the what you need better names and maybe more functions.
The following for example should be declared as a function and called later
I added in
Also it should be noted that
You should use
You wrote
The following is more idiomatic then a for loop with append but may be a little harder to understand. Using it or not is up to personal style.
When a list comprehension is ease you should prefer it over append:
should become
Checking if a thing is empty in Python is done like
Puttting
Clean is not a clear name. You should call that function alphabet_chars_only.
Just to show the sheer power of list comprehension, you may or may not use the following, it is personal preference:
Down there there must be a typo
You do not use
You may print the top words like the following:
(
Python has an official style guide that you should follow when writing good and readable code, it is called Pep8. The best and faster option is to write your code without thinking about it and then run Autopep8 on your script, it will make your script Pep8/compliant with no effort.
Docstrings are triple quoted strings put at the start of a function to give some info about it. It is up to you to decide when a function is simple that it doesn't need one.
Putting all my advice together:
```
import operator
FILENAME = "bibel.txt"
ALPHABET = "abcdefghijklmnopqrstuvwxyzäüöß"
def get_wordlist(filename):
with open(filename) as f:
lines = f.readlines()
wordlist = [word for line in lines for word in line.split(" ")]
return wordlist
def alphabet_chars(word):
word = word.lower()
return ''.join([char for char in word if char in ALPHABET])
def count(wordlist):
"""
Returns a dict where the words are the keys
and their frequencies are the values.
"""
dict_ = {}
for word in word
Your programme should look like:
def fun1():
# do_stuff
def fun2()
# do_stuff
def main():
fun1()
fun2()
if __name__ == "__main__":
main()Stating the obvious in comments is not good
#adding every word into the wordlist
#wordlist2 = cleaned wordlist (delete non-alphabetical characters)
#wordlist3 = delete empty words ("")', '#count wordlist3 into dictionary:
#print out the first 10 words and values from dictionary:Comments should explain why not what, if you need comments to describe the what you need better names and maybe more functions.
The following for example should be declared as a function and called later
book = open("bibel.txt", "r")
dict = {}
lines = book.readlines()
wordlist=[]
#adding every word into the wordlist
for line in lines:
words = line.split(" ")
for word in words:
wordlist.append(word)
book.close()
def get_wordlist(filename):
with open(filename) as f:
lines = f.readlines()
# More exaplanation about the following line later
wordlist = [word for line in lines for word in line.split(" ")]
return wordlistI added in
if __name__ == "__main__": because it allows you to import your file without actually running it.dict = {} is a global, do not use changing globals, (Global Constants are ok).Also it should be noted that
dict is reserved, it is better to write word_dict or dict_.You should use
with open(filename) as f: do_stuff(f.read()) when you open file, it is simpler and handles closing automatically..You wrote
wordlist=[]
#adding every word into the wordlist
for line in lines:
words = line.split(" ")
for word in words:
wordlist.append(word)The following is more idiomatic then a for loop with append but may be a little harder to understand. Using it or not is up to personal style.
wordlist = [word for line in lines for word in line.split(" ")]When a list comprehension is ease you should prefer it over append:
wordlist2=[]
for word in wordlist:
wordlist2.append(clean(word))should become
wordlist2 = [(clean(word) for word in wordlist]Checking if a thing is empty in Python is done like
if not thingwordlist3 = [word for word in wordlist2 if word]Puttting
word = word.lower() at the start simplifies things a bitdef clean(word):
word = word.lower()
for char in word:
if char not in "abcdefghijklmnopqrstuvwxyzäüöß":
word = word.replace(char,'')
return wordClean is not a clear name. You should call that function alphabet_chars_only.
"abcdefghijklmnopqrstuvwxyzäüöß" is the alphabet, it is easy to guess it, anyway a global consant at the start of your file would be better:ALPHABET = "abcdefghijklmnopqrstuvwxyzäüöß"
def clean(word):
word = word.lower()
for char in word:
if char not in ALPHABET:
word = word.replace(char,'')
return wordJust to show the sheer power of list comprehension, you may or may not use the following, it is personal preference:
def clean(word):
word = word.lower()
return ''.join([char for char in word if char in alphabet])Down there there must be a typo
#print out the first 10 words and values from dictionary: # --- TEN
for i in range(100): # -- A HUNDREAD
topword = max(dict, key=dict.get)
print(topword, dict[topword])
del dict[topword]You do not use
i in the above loop, it is convention to mark an unused variable as _ or __You may print the top words like the following:
# Credit goes to http://stackoverflow.com/questions/613183/sort-a-python-dictionary-by-value
import operator
topword = max(dict, key=dict.get)
sorted_topword = sorted(x.items(), key=operator.itemgetter(1))
print(sorted_topword[:10])(
lst[:x] means the first x elements of lst)Python has an official style guide that you should follow when writing good and readable code, it is called Pep8. The best and faster option is to write your code without thinking about it and then run Autopep8 on your script, it will make your script Pep8/compliant with no effort.
Docstrings are triple quoted strings put at the start of a function to give some info about it. It is up to you to decide when a function is simple that it doesn't need one.
Putting all my advice together:
```
import operator
FILENAME = "bibel.txt"
ALPHABET = "abcdefghijklmnopqrstuvwxyzäüöß"
def get_wordlist(filename):
with open(filename) as f:
lines = f.readlines()
wordlist = [word for line in lines for word in line.split(" ")]
return wordlist
def alphabet_chars(word):
word = word.lower()
return ''.join([char for char in word if char in ALPHABET])
def count(wordlist):
"""
Returns a dict where the words are the keys
and their frequencies are the values.
"""
dict_ = {}
for word in word
Code Snippets
def fun1():
# do_stuff
def fun2()
# do_stuff
def main():
fun1()
fun2()
if __name__ == "__main__":
main()#adding every word into the wordlist
#wordlist2 = cleaned wordlist (delete non-alphabetical characters)
#wordlist3 = delete empty words ("")', '#count wordlist3 into dictionary:
#print out the first 10 words and values from dictionary:book = open("bibel.txt", "r")
dict = {}
lines = book.readlines()
wordlist=[]
#adding every word into the wordlist
for line in lines:
words = line.split(" ")
for word in words:
wordlist.append(word)
book.close()
def get_wordlist(filename):
with open(filename) as f:
lines = f.readlines()
# More exaplanation about the following line later
wordlist = [word for line in lines for word in line.split(" ")]
return wordlistwordlist=[]
#adding every word into the wordlist
for line in lines:
words = line.split(" ")
for word in words:
wordlist.append(word)wordlist = [word for line in lines for word in line.split(" ")]Context
StackExchange Code Review Q#73318, answer score: 7
Revisions (0)
No revisions yet.