Retrieving lists of consecutive capitalised words from a list

Submitted by: @import:stackexchange-codereview

Problem

Ok, so given the string:

s = "Born in Honolulu Hawaii Obama is a graduate of Columbia University and Harvard Law School"


I want to retrieve:

[ ["Born"], ["Honolulu", "Hawaii", "Obama"], ["Columbia", "University"] ...]


Assuming that we have successfully tokenised the original string, my first pseudocode-ish attempt was:

def retrieve(tokens):
    results = []
    i = 0
    while i < len(tokens):
        if tokens[i][0].isupper():
            group = [tokens[i]]
            j = i + 1
            # extend the group while the following tokens are capitalised
            while j < len(tokens) and tokens[j][0].isupper():
                group.append(tokens[j])
                j += 1
            results.append(group)
            i = j  # continue after the group just collected
        else:
            i += 1
    return results


This is actually quite fast (well, compared to some of my trying-to-be-Pythonic attempts):


Timeit: 0.0160551071167 (1000 cycles)

Playing around with it, the quickest I can get is:

def retrieve(tokens):
    results = []
    group = []
    for i in xrange(len(tokens)):
        if tokens[i][0].isupper():
            group.append(tokens[i])
        else:
            results.append(group)
            group = []
    results.append(group)
    return filter(None, results)



Timeit: 0.0116229057312 (1000 cycles)

Are there any more concise, pythonic ways to go about this (with similar execution times)?
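For reference, timings like the ones quoted above can be taken with the standard `timeit` module. The following is only a minimal sketch: it assumes the tokens come from a plain `s.split()`, and it uses a Python 3 port of the second function under the hypothetical name `retrieve_groups`. The absolute figures will differ from machine to machine.

```python
import timeit

s = ("Born in Honolulu Hawaii Obama is a graduate of "
     "Columbia University and Harvard Law School")
tokens = s.split()  # assumption: simple whitespace tokenisation


def retrieve_groups(tokens):
    """Python 3 port of the second attempt (hypothetical name)."""
    results = []
    group = []
    for token in tokens:
        if token[0].isupper():
            group.append(token)
        else:
            results.append(group)
            group = []
    results.append(group)
    # drop the empty groups, as filter(None, results) did in Python 2
    return [g for g in results if g]


# total time in seconds for 1000 calls
elapsed = timeit.timeit(lambda: retrieve_groups(tokens), number=1000)
print("Timeit: %f (1000 cycles)" % elapsed)
```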

Solution

A trivial optimisation is to iterate over the tokens directly instead of by index. Python lists are iterable, so looping over indices only to look each element up again is unpythonic:

def retrieve(tokens):
    results = []
    group = []
    for token in tokens:
        if token[0].isupper():
            group.append(token)
        else:
            if group:  # group is not empty
                results.append(group)
            group = []  # reset group
    if group:  # keep a group that runs to the end of the list
        results.append(group)
    return results


A solution like @JeremyK's with a list comprehension and regular expressions is always going to be more compact. I am only giving this answer to point out how lists should be iterated.
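@JeremyK's answer is not reproduced here. As a separate illustration of a more compact form, the same grouping can be sketched with `itertools.groupby` from the standard library. This is one possible approach, not the regular-expression solution referred to above; the name `retrieve_grouped` is hypothetical, and whitespace tokenisation via `s.split()` is assumed.

```python
from itertools import groupby


def retrieve_grouped(tokens):
    """Return runs of consecutive capitalised tokens as lists."""
    return [
        list(run)
        # groupby yields (key, run) pairs for consecutive tokens sharing a key;
        # t[:1] rather than t[0] so an empty token does not raise IndexError
        for is_capitalised, run in groupby(tokens, key=lambda t: t[:1].isupper())
        if is_capitalised
    ]


s = ("Born in Honolulu Hawaii Obama is a graduate of "
     "Columbia University and Harvard Law School")
print(retrieve_grouped(s.split()))
# [['Born'], ['Honolulu', 'Hawaii', 'Obama'], ['Columbia', 'University'], ['Harvard', 'Law', 'School']]
```

Because `groupby` only merges adjacent items with equal keys, the lowercase runs act as separators and are discarded by the `if is_capitalised` filter. No speed claim is made for this version; it would need to be timed against the explicit loop.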

Context

StackExchange Code Review Q#18965, answer score: 4
