extract string from body of text
Problem
I am trying to extract the values below from an RSS feed. I have the feed retrieval working, but I am not sure whether the way I extract the data is the best one. I have looked at BeautifulSoup and at regular expressions (re.search), but I am not sure which is the better fit.
Values that I need (both change daily):
4 Nov 2013
LOW-MODERATE
This is a cut-down version of the body text that contains the data I need from the RSS feed:
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>
<p>Central: LOW-MODERATE</p>
My code to extract the RSS feed and the data at the moment:
import re

import feedparser  # https://wiki.python.org/moin/RssLibraries

cfa_rss_url = "http://www.cfa.vic.gov.au/restrictions/central-firedistrict_rss.xml"
d = feedparser.parse(cfa_rss_url)
# Get the data from the first item only, as the feed is only updated daily.
data = d.entries[0].description
print(data)
print('---------')
# The description is HTML, so each value ends at a closing </p> tag.
getdate = re.compile('forecast issued at: (.*?)</p>')
getrating = re.compile('<p>Central: (.*?)</p>')
m = getdate.search(data)
n = getrating.search(data)
print(m.group(1))
print(n.group(1))
This is the full data that is returned:
Total Fire Ban StatusToday, Mon, 4 Nov 2013 is not currently a day of Total Fire Ban in the Central (includes Melbourne and Geelong) fire district.Fire Danger RatingsBureau of Meteorology forecast issued at: Mon, 4 Nov 2013 05:30 AMCentral: LOW-MODERATE Displays when Total Fire Ban in forceRestrictions may apply
---------
Mon, 4 Nov 2013 05:30 AM
LOW-MODERATE
As you can see, my code currently returns the correct values, but is this the most efficient way of doing it, or should I be using a different approach?
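Note that the first value listed as needed is only the date (4 Nov 2013), while the captured string is the full timestamp. A minimal sketch of trimming it, assuming the timestamp always has the form shown in the output above and that the day and month names are parsed in an English locale (the variable names are illustrative only):

```python
from datetime import datetime

# Timestamp exactly as printed in the output above.
issued_at = "Mon, 4 Nov 2013 05:30 AM"

# Assumed format: abbreviated weekday, day, abbreviated month, year, 12-hour time.
parsed = datetime.strptime(issued_at, "%a, %d %b %Y %I:%M %p")

# Build the date without a leading zero on the day.
date_only = "{} {}".format(parsed.day, parsed.strftime("%b %Y"))
print(date_only)
```

If the feed ever changes its timestamp layout, strptime raises ValueError, which makes the change visible instead of silently returning a wrong value.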
Solution
You can do the whole thing with a single regexp:
import re
data = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>
<p>Central: LOW-MODERATE</p>
"""
# Group 1 matches either label; group 2 captures the value up to the closing </p>.
rgx = re.compile("(forecast issued at: |<p>Central: )(.*?)</p>")
results = rgx.findall(data)
print(results[0][1])
print(results[1][1])
Output:
$ python rss_parse.py
Mon, 4 Nov 2013 05:30 AM
LOW-MODERATE
If this only runs once or twice a day, pre-compiling the regex is not important:
...
results = re.findall("(forecast issued at: |<p>Central: )(.*?)</p>", data)
print(results[0][1])
print(results[1][1])
We can simplify how the result is handled:
...
[when, rating] = [x[1] for x in
                  re.findall("(forecast issued at: |<p>Central: )(.*?)</p>",
                             data)]
print("%s\n%s" % (when, rating))
Code Snippets
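Building on the single-regexp approach from the Solution, here is a minimal sketch that looks the two values up by their label rather than by position, and returns None when either one is missing instead of raising an IndexError. The names SAMPLE, PATTERN and extract_forecast are hypothetical, and the sample text assumes the feed description keeps the HTML layout quoted in the snippets.

```python
import re

# Hypothetical sample mirroring the feed description quoted in the snippets.
SAMPLE = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>
<p>Central: LOW-MODERATE</p>
"""

# Same pattern as in the Solution: group 1 is the label, group 2 the value.
PATTERN = re.compile(r"(forecast issued at: |<p>Central: )(.*?)</p>")


def extract_forecast(description):
    """Return (issued_at, rating), or None if either value is missing."""
    found = {label: value for label, value in PATTERN.findall(description)}
    issued_at = found.get("forecast issued at: ")
    rating = found.get("<p>Central: ")
    if issued_at is None or rating is None:
        return None
    return issued_at, rating


print(extract_forecast(SAMPLE))
print(extract_forecast("<p>no forecast today</p>"))
```

Keying on the label means the result does not depend on the order in which the two paragraphs appear in the feed.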
import re
data = """
<p>Fire Danger Ratings
<br />Bureau of Meteorology
forecast issued at: Mon, 4 Nov 2013 05:30 AM</p>
<p>Central: LOW-MODERATE</p>
"""
rgx = re.compile("(forecast issued at: |<p>Central: )(.*?)</p>")
results = rgx.findall(data)
print(results[0][1])
print(results[1][1])

$ python rss_parse.py
Mon, 4 Nov 2013 05:30 AM
LOW-MODERATE

...
results = re.findall("(forecast issued at: |<p>Central: )(.*?)</p>", data)
print(results[0][1])
print(results[1][1])

...
[when, rating] = [x[1] for x in
                  re.findall("(forecast issued at: |<p>Central: )(.*?)</p>",
                             data)]
print("%s\n%s" % (when, rating))

Context
StackExchange Code Review Q#35339, answer score: 2