Given a random section of text delimited by line breaks, get the first paragraph

Submitted by: @import:stackexchange-codereview

Problem

Requirements:


Given a long section of text, where the only indication that a paragraph has ended is a shorter line, make a guess about the first paragraph. The lines are hard-wrapped, and the wrapping is consistent for the entire text.

The code below assumes that a paragraph ends with a line that is shorter than the average of all of the lines. It also checks whether the line is shorter merely because of word wrapping, by looking at the first word of the next line and checking whether appending it would have pushed the line beyond the "maximum" width for the paragraph.

def get_first_paragraph(source_text):
    lines = source_text.splitlines()
    lens = [len(line) for line in lines]
    avglen = sum(lens)/len(lens)
    maxlen = max(lens)
    newlines = []
    for line_idx, line in enumerate(lines):
        newlines.append(line)
        try:
            word_in_next_line = lines[line_idx+1].split()[0]
        except IndexError:
            break # we've reached the last line
        if len(line) < avglen and len(line) + 1 + len(word_in_next_line) < maxlen: # 1 is for space between words
            break
    return '\n'.join(newlines)


Sample #1

Input:

This is a sample paragaraph. It goes on and on for several sentences.
Many OF These Remarkable Sentences are Considerable in Length.
It has a variety of words with different lengths, and there is not a
consistent line length, although it appears to hover
supercalifragilisticexpialidociously around the 70 character mark.
Ideally the code should recognize that one line is much shorter than
the rest, and is shorter not because of a much longer word following
it which has wrapped the line, but because we have reached the end of
a paragraph.
This is the next paragraph, and continues onwards for
more and more sentences.


Output:

This is a sample paragaraph. It goes on and on for several sentences.
Many OF These Remarkable Sentences are Considerable in Length.
It has a variety of words with different le

Solution

Stating Your Requirements

It's important to have a clear definition of what you want to achieve before you start writing code. There are many ways to arrive at one, for instance by practicing Test-Driven Development or by writing a formal specification.

The important part is that without a clear definition, you can't validate whether you're done. In your case, the question contains a description that is completely different from the code and quite unclear.

The above is vital even if you're only writing the code as an exercise for personal use or learning.
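
One way to pin the requirement down, in the spirit of the Test-Driven Development mentioned above, is to write the expected behaviour as input/output pairs before touching the implementation. The sketch below is only an illustration: the function body is copied from the question so the block runs on its own, and the names CASES and check, as well as the two expectations themselves, are assumptions about what the requirement means rather than something the question states.

```python
def get_first_paragraph(source_text):
    # Copied from the question so that this sketch runs on its own.
    lines = source_text.splitlines()
    lens = [len(line) for line in lines]
    avglen = sum(lens) / len(lens)
    maxlen = max(lens)
    newlines = []
    for line_idx, line in enumerate(lines):
        newlines.append(line)
        try:
            word_in_next_line = lines[line_idx + 1].split()[0]
        except IndexError:
            break
        if len(line) < avglen and len(line) + 1 + len(word_in_next_line) < maxlen:
            break
    return '\n'.join(newlines)


# Each case pairs an input with the first paragraph we expect back.
# These expectations are assumptions, not part of the original question.
CASES = [
    # A single line is its own first paragraph.
    ("only one line", "only one line"),
    # A clearly shorter line ("gg") ends the first paragraph.
    ("aaaa bbbb cccc\ndddd eeee ffff\ngg\nhhhh iiii jjjj\nkkkk llll mmmm",
     "aaaa bbbb cccc\ndddd eeee ffff\ngg"),
]


def check(func):
    """Run every case against func and fail loudly on the first mismatch."""
    for text, expected in CASES:
        actual = func(text)
        assert actual == expected, (text, actual)


check(get_first_paragraph)
```

Once the cases exist, "done" has a concrete meaning: every new edge case you think of becomes another entry in the list.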

Testing and Edge Cases

In the following code:

word_in_next_line = lines[li+1].split()[0]


Why are you assuming that

  • there will be a next line? What if the text consists of only one paragraph?
  • the next line will not be empty?

These assumptions are unreasonable and when I first tried out your code on some text, it immediately threw an exception.
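
Both cases can be handled explicitly instead of being left to an exception. The following is a minimal sketch, not a drop-in fix: the helper name first_word_of_next_line is hypothetical, and returning None to mean "nothing usable follows" is an assumption about how the caller would want to treat these cases.

```python
def first_word_of_next_line(lines, index):
    """Return the first word of the line after `index`, or None.

    None covers both cases raised above: there is no next line, or the
    next line is empty or contains only whitespace.
    """
    if index + 1 >= len(lines):
        return None  # no next line: the text ends here
    words = lines[index + 1].split()
    if not words:
        return None  # the next line is blank
    return words[0]


lines = "first line\n\nthird line".splitlines()
print(first_word_of_next_line(lines, 0))  # the next line is empty
print(first_word_of_next_line(lines, 1))  # the next line starts with a word
print(first_word_of_next_line(lines, 2))  # there is no next line
```

A caller could then decide what None means for the paragraph guess, for example treating it as the end of the paragraph.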

Naming

- Be careful with historically significant terms such as ss (Google it if you don't know what I mean).

- Expressive names are better than abbreviations! Replace:

  • ss with source_text
  • ll with line (this looks like the number 11!)
  • lens with line_lengths
  • avglen with average_length
  • maxlen with maximum_length
  • in the for loop, li with index and ll with line
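
Applied to the listing in the question, the renames above might look like the sketch below. The listing as shown here already uses source_text and line, so the visible changes are lens, avglen, maxlen and the loop index. The names paragraph_lines and wrapped_length are not part of the suggestions above; they are assumptions added for readability. Behaviour is intended to be unchanged.

```python
def get_first_paragraph(source_text):
    lines = source_text.splitlines()
    line_lengths = [len(line) for line in lines]
    average_length = sum(line_lengths) / len(line_lengths)
    maximum_length = max(line_lengths)
    paragraph_lines = []  # assumed name, replaces "newlines"
    for index, line in enumerate(lines):
        paragraph_lines.append(line)
        try:
            word_in_next_line = lines[index + 1].split()[0]
        except IndexError:
            break  # no next line, or the next line is blank
        # Length the line would have had if the next word had fitted on it;
        # the extra 1 accounts for the space between the two words.
        wrapped_length = len(line) + 1 + len(word_in_next_line)
        if len(line) < average_length and wrapped_length < maximum_length:
            break
    return '\n'.join(paragraph_lines)
```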



Conclusion

Without a clear explanation of what you are trying to accomplish, what the input data looks like and how you define a paragraph, it's impossible to show you a better way of solving the problem.

Context

StackExchange Code Review Q#24431, answer score: 6
