
Histogram of a string

Submitted by: @import:stackexchange-codereview
Tags: histogram, string, stackoverflow

Problem

I'm teaching myself Python and when a friend posted this sentence


Only the fool would take trouble to verify that his sentence was
composed of ten a's, three b's, four c's, four d's, forty-six e's,
sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's,
four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's,
forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four
x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven
hyphens and, last but not least, a single !

I thought, as a fool, I would try to verify it by plotting a histogram. This is my code:

import matplotlib.pyplot as plt
import numpy as np

sentence = "Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !".lower()

# Convert the string to an array of integers
numbers = np.array([ord(c) for c in sentence])
u = np.unique(numbers)
# Make the integers range from 0 to n so there are no gaps in the histogram
# [0][0] was a hack to make sure `np.where` returned an int instead of an array.
ind = [np.where(u==n)[0][0] for n in numbers]
bins = range(0,len(u)+1)
hist, bins = np.histogram(ind, bins)

plt.bar(bins[:-1], hist, align='center')
plt.xticks(np.unique(ind), [str(unichr(n)) for n in set(numbers)])
plt.grid()
plt.show()


Which generates

Please let me know how to improve my code. Also, please let me know what I did wrong with plt.xticks that resulted in the gaps at the beginning and the end (or is that just a case of incorrect axis limits?).

Solution

Your code is pretty good! I have only one substantive suggestion and a few stylistic ones.

Style

  • Since sentence is a hard-coded constant, PEP 8 convention is to name it in all uppercase, i.e. SENTENCE is a better name.



  • What are u and n in your code? It's hard to figure out what those variables mean. Could you be more descriptive with your naming?



  • Your call to .lower() on sentence is hidden after the very long sentence. For readability I wouldn't hide any function calls at the end of very long strings.



  • Python supports multi-line strings using the """ delimiters. Using them makes the sentence and the code more readable, at the expense of introducing newline (\n) characters that would show up on the histogram if they are not removed. In my code below I use the """ delimiters and remove the \n characters I introduced to break the string into screen-width-sized chunks. PEP 8 recommends limiting code lines to 79 characters.



  • You should consider breaking this code up into two functions, one to generate the data and one to make the graph, but we can leave that for another time.
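
As a small illustration of the newline point above, here is a sketch using a made-up two-line string as a stand-in for the long sentence. It replaces each newline with a space (a variation on the approach described, chosen here so the result does not depend on trailing spaces before each line break) and also shows parenthesized adjacent literals as one alternative that never introduces newlines.

```python
# Hypothetical short text standing in for the long sentence.
TRIPLE_QUOTED = """ten a's,
three b's"""

# The line break inside the literal is stored as a real newline character.
has_newline = '\n' in TRIPLE_QUOTED

# Replacing each newline with a space recovers the single-line text.
flattened = TRIPLE_QUOTED.replace('\n', ' ')

# Alternative: adjacent literals inside parentheses are joined into one
# string, so no newline is introduced in the first place.
CONCATENATED = (
    "ten a's, "
    "three b's"
)
```

Both forms keep source lines short; the second avoids the clean-up step at the cost of extra quote characters.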



Substance

  • Since your sentence is a Python string (not a NumPy character array), you can generate the data for your histogram quite easily with the Counter class from the collections module. It's designed for exactly this kind of application. Using it lets you entirely avoid the bin-edge vs. bin-center complications that come with np.histogram.
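
For readers who have not used Counter before, a minimal sketch of how it behaves on a string (the word "banana" is just an illustrative input):

```python
from collections import Counter

# Counter accepts any iterable; a string is iterated character by character.
char_counts = Counter("banana")

print(char_counts['a'])            # 3
print(char_counts['z'])            # 0 (a missing key counts as zero, no KeyError)
print(char_counts.most_common(1))  # [('a', 3)]
```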



Putting all these ideas together:

import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

SENTENCE = """Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, 
four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, 
twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, 
eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single !"""

# generate histogram
letters_hist = Counter(SENTENCE.lower().replace('\n', ''))
# list() turns the dict views returned on Python 3 into plain sequences
counts = list(letters_hist.values())
letters = list(letters_hist.keys())

# graph data
bar_x_locations = np.arange(len(counts))
plt.bar(bar_x_locations, counts, align='center')
plt.xticks(bar_x_locations, letters)
plt.grid()
plt.show()
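
The two-function split suggested under Style could look roughly like the sketch below. The function names are hypothetical, and sorting the characters is an added design choice (a Counter otherwise keeps characters in order of first appearance), so treat this as one possible arrangement rather than the definitive one.

```python
from collections import Counter


def count_characters(text):
    """Return (characters, counts) as parallel lists, sorted by character."""
    histogram = Counter(text.lower().replace('\n', ''))
    characters = sorted(histogram)
    counts = [histogram[ch] for ch in characters]
    return characters, counts


def plot_character_counts(characters, counts):
    """Draw one bar per character, labelled with that character."""
    # Imported here so count_characters can be used without matplotlib.
    import matplotlib.pyplot as plt

    bar_x_locations = range(len(counts))
    plt.bar(bar_x_locations, counts, align='center')
    plt.xticks(bar_x_locations, characters)
    plt.grid()
    plt.show()


# Example use (opens a plot window):
# plot_character_counts(*count_characters("Hello, world!"))
```

Keeping the counting separate from the plotting means the data step can be checked on its own without opening a figure.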


Other

It wasn't anything you did with plt.xticks that led to the gaps. That's the matplotlib default. If you want a "tight" border to the graph, try adding plt.xlim(-0.5, len(counts) - 0.5) before plt.show().
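
To see where those limits come from, the arithmetic can be sketched as follows. The helper names are hypothetical; the only outside fact assumed is matplotlib's default bar width of 0.8.

```python
def bar_extent(n_bars, width=0.8):
    """Return the (left, right) edges spanned by bars centered at 0..n_bars-1."""
    half_width = width / 2
    return (-half_width, (n_bars - 1) + half_width)


def tight_xlim(n_bars):
    """Return the axis limits suggested above for n_bars bars."""
    return (-0.5, n_bars - 0.5)


# For three bars, the bars span roughly -0.4 to 2.4, and the suggested
# limits of -0.5 and 2.5 leave a small margin on each side.
extent = bar_extent(3)
limits = tight_xlim(3)
```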


Context

StackExchange Code Review Q#129412, answer score: 5
