
extract string from body of text

Submitted by: @import:stackexchange-codereview··

Problem

I am trying to extract the values below from an RSS feed. I have sorted out the RSS feed side of things, but I am not sure whether what I am doing is the best way to extract the data I need. I have looked at BeautifulSoup and regular expressions (re.search), but I am not sure which is the best way to do this.

Values that I need (both of these change daily):

4 Nov 2013

LOW-MODERATE

This is a cut-down version of the full body of text, containing the data I need to get from the RSS feed:

<p>Fire Danger Ratings
<br />Bureau of Meteorology forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>

<p>Central: LOW-MODERATE</p>


My code to fetch the RSS feed and extract the data at the moment:

import re

import feedparser  # https://wiki.python.org/moin/RssLibraries

cfa_rss_url = "http://www.cfa.vic.gov.au/restrictions/central-firedistrict_rss.xml"

d = feedparser.parse( cfa_rss_url )

#get the data from the first item only, as it is only updated daily.
data = d.entries[0].description
print (data)
print('---------')

getdate = re.compile('forecast issued at: (.*?)</p>')
getrating = re.compile('Central: (.*?)</p>')

m = getdate.search(data)
n = getrating.search(data)

print(m.group(1))
print(n.group(1))


This is the full data that is returned:

Total Fire Ban StatusToday, Mon, 4 Nov 2013 is not currently a day of Total Fire Ban in the Central (includes Melbourne and Geelong) fire district.Fire Danger RatingsBureau of Meteorology forecast issued at: Mon, 4 Nov 2013 05:30 AMCentral: LOW-MODERATE Displays when Total Fire Ban in forceRestrictions may apply

---------
Mon, 4 Nov 2013 05:30 AM
LOW-MODERATE


As you can see, my code returns the correct values at the moment, but is this the most efficient way of doing this, or should I be using a different approach?

Solution

You can do the whole thing with a single regexp:

import re

data = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>

<p>Central: LOW-MODERATE</p>
"""

# Group 1 is the label, group 2 is the value up to the closing </p>.
rgx = re.compile("(forecast issued at: |<p>Central: )(.*?)</p>")
results = rgx.findall(data)
print(results[0][1])
print(results[1][1])


Output:

$ python rss_parse.py
Mon, 4 Nov 2013 05:30 AM
LOW-MODERATE
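
To make the `[0][1]` and `[1][1]` indexing less mysterious: when a pattern has two capture groups, `findall` returns a list of tuples, one per match. The sketch below uses the sample data with its `<p>` markup, as in the Code Snippets listing, and assumes the feed description still carries those tags.

```python
import re

data = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>

<p>Central: LOW-MODERATE</p>
"""

rgx = re.compile("(forecast issued at: |<p>Central: )(.*?)</p>")
results = rgx.findall(data)

# Each element is a (label, value) tuple, in the order the matches appear.
for label, value in results:
    print(repr(label), repr(value))
```

So `results[0][1]` is the value that followed the first label found, and `results[1][1]` the value that followed the second.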


If this only runs once or twice a day, pre-compiling the regex is not important, and you can call re.findall directly:

...
results = re.findall("(forecast issued at: |<p>Central: )(.*?)</p>", data)
print(results[0][1])
print(results[1][1])
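
As a rough illustration of why skipping `re.compile` costs little here: the module-level functions compile the pattern on first use and keep recently used patterns in an internal cache, so both forms return the same result. The name `PATTERN` is only for this sketch, and the data assumes the `<p>` markup is still present.

```python
import re

PATTERN = "(forecast issued at: |<p>Central: )(.*?)</p>"

data = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>

<p>Central: LOW-MODERATE</p>
"""

# Explicitly compiled pattern object.
compiled = re.compile(PATTERN)
via_compiled = compiled.findall(data)

# Module-level call; re compiles and caches the pattern internally.
via_module = re.findall(PATTERN, data)

print(via_compiled == via_module)
```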


We can simplify how the result is handled:

...
[when, rating] = [x[1] for x in
                  re.findall("(forecast issued at: |<p>Central: )(.*?)</p>",
                             data)]
print("%s\n%s" % (when, rating))
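
One caveat with unpacking straight into two names: it raises ValueError if the feed ever yields fewer or more than two matches. The question also asks for just `4 Nov 2013` rather than the full timestamp. Here is a minimal sketch covering both points. The helper name `extract_forecast` is hypothetical, and the code assumes the timestamp always has the form `Mon, 4 Nov 2013 05:30 AM` with English month and day names.

```python
import re
from datetime import datetime

# Hypothetical helper for illustration; not part of the original answer.
PATTERN = re.compile("(forecast issued at: |<p>Central: )(.*?)</p>")


def extract_forecast(description):
    """Return (issue_date, rating), or None if either field is missing."""
    # Map each matched label to its value instead of relying on position.
    found = {label: value.strip()
             for label, value in PATTERN.findall(description)}
    issued = found.get("forecast issued at: ")
    rating = found.get("<p>Central: ")
    if issued is None or rating is None:
        return None
    # Assumed timestamp layout: "Mon, 4 Nov 2013 05:30 AM".
    when = datetime.strptime(issued, "%a, %d %b %Y %I:%M %p")
    # Build the day by hand so it has no leading zero.
    return "%d %s" % (when.day, when.strftime("%b %Y")), rating


sample = ("<p>Fire Danger Ratings<br />Bureau of Meteorology "
          "forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>"
          "<p>Central: LOW-MODERATE</p>")
print(extract_forecast(sample))
```

Returning None keeps the caller in charge of what to do when the feed layout changes, rather than failing inside the unpacking.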

Code Snippets

import re

data = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology 
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>

<p>Central: LOW-MODERATE</p>         
"""

rgx = re.compile("(forecast issued at: |<p>Central: )(.*?)</p>")
results = rgx.findall(data)
print(results[0][1])
print(results[1][1])

$ python rss_parse.py
Mon, 4 Nov 2013 05:30 AM
LOW-MODERATE

...
results = re.findall("(forecast issued at: |<p>Central: )(.*?)</p>", data)
print(results[0][1])
print(results[1][1])

...
[when, rating] = [x[1] for x in
                  re.findall("(forecast issued at: |<p>Central: )(.*?)</p>",
                             data)]
print("%s\n%s" % (when, rating))

Context

StackExchange Code Review Q#35339, answer score: 2
