HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Find all needles in the haystack and return their sum

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
sumtheneedlesallreturnfindandhaystacktheir

Problem

This is based on a programming challenge I recently solved. However, since the writer of the challenge asked me not to disclose the question and/or it's answer, I modified it a little. If it doesn't make sense: my mistake.

The target is to find all numbers in a file and sum those. So if the following is found:

Sed ut 11 perspiciatis, unde omnis 11 iste natus 20 error sit


The result should be 42.

Floats like 0.42 will be considered equal to 42 in this challenge.

To generate example input, I took the revised version of my 'n x on the y' generator and made it run from 999. That's almost 5k lines of code, a nice amount to test against. The old code produces the same output, for those interested.

I was quite content with the performance. Now I'm wondering whether I could've written it better. Readability and maintainability are key, as usual.

The next iteration of this code will likely try to match keywords instead of arbitrary numbers. More on that in a later question.

haystack.py

# Sum all numbers in a haystack
import re

def getNumbers(inp):
with open(inp, 'r') as f:
return re.findall('[0-9]+', str(f.readlines()))

def calculateSum(results):
total = 0
for result in results:
total += int(result)
return total

def main(input_file='haystack.txt'):
print calculateSum(getNumbers(input_file))

if __name__ == "__main__":
main()

Solution

First, you should have a look at PEP8 again, specifically the part about function names. It recommends using lower_case_with_underscores. It also recommends leaving two blank lines between function definitions on the module level.

Next, open already opens the file read-only by default, there is no need for 'r'. Also, f.readlines() returns a list, which you then convert into a string representation of that list. However, you can just use f.read() here, which directly returns the full string. Note that reading in the full file content at once might be quite memory-consuming, so don't do it with files larger than your RAM.

Last but not least, your function calculate_sum could just be the built-in sum, if the elements of results were already numbers. So I would change get_numbers to return actual numbers (like the name implies), by using map:

import re

def get_numbers(file_name):
    with open(file_name) as f:
        return map(int, re.findall('[0-9]+', f.read()))

def main(file_name='haystack.txt'):
    print sum(get_numbers(file_name))

if __name__ == "__main__":
    main()


In Python 3.x, you could use the yield from keyword to make your get_numbers into a generator and avoid the memory issue, because the file is read one line at a time (at the cost of executing the regexp for every line):

def get_numbers(file_name):
    with open(file_name) as f:
        for line in f:
            yield from map(int, re.findall('[0-9]+', line))


In Python 2.x this could be implemented like this:

def get_numbers(file_name):
    with open(file_name) as f:
        for line in f:
            for item in map(int, re.findall('[0-9]+', line)):
                yield item

Code Snippets

import re

def get_numbers(file_name):
    with open(file_name) as f:
        return map(int, re.findall('[0-9]+', f.read()))


def main(file_name='haystack.txt'):
    print sum(get_numbers(file_name))

if __name__ == "__main__":
    main()
def get_numbers(file_name):
    with open(file_name) as f:
        for line in f:
            yield from map(int, re.findall('[0-9]+', line))
def get_numbers(file_name):
    with open(file_name) as f:
        for line in f:
            for item in map(int, re.findall('[0-9]+', line)):
                yield item

Context

StackExchange Code Review Q#150995, answer score: 3

Revisions (0)

No revisions yet.