Content filtering of webpage
Problem
This is the code that filters webpage data for the web crawler I made for my project. I know Python scripts can be slower than code in other languages, but this takes a long time to process even a single page.
I don't want to use any other external libraries for filtering content. Is there any way my current code can be improved to be cleaner and faster?
```
# -*- coding: utf-8 -*-
import urllib
url='http://designingadam2.wordpress.com'
def content(page,url):#FILTERS THE CONTENT OF THE REMAINING PORTION
    flg=0
    #REMOVES &nbsp LIKE CHARACTERS
    while page.find("&",flg)!=-1:
        page.replace(' ','')
        start=page.find("&",flg)
        end=page.find(";",start+1)
        if (end-start)REMOVE IF NOT NEEDED
            pageO=page[:start]
            pageT=page[end+1:]
            page=pageO+pageT
            flg=start+1#TO CONTINUE FROM NEXT POS
        else:
            flg+=1
    flg=0
#REMOVES CONTENT BETWEEN SCRIPT TAGS
while page.find("",flg)
end=end+9
i,k=0,end-start
page=list(page)
while i",flg)
end=end+9
i,k=0,end-start
page=list(page)
while i':# and i!=(len(s_list)-1):
# remove everything between the
s_list.pop(i)
# make sure we get rid of the > to
s_list.pop(i)
else:
i=i+1
    #-------------------------------------------------------------------
    #REMOVES WHITESPACES
    s_list="".join(s_list)
    lst=s_list.split()
    #CONVERT TO LOWERCASE
    i=0
    while i<len(lst):
        lst[i]=lst[i].lower()
        i+=1
    #REMOVES DUPLICATES
    lst=list(set(lst))
    #REMOVE COMMON WORDS
    phrase=['to','a','an','the',\
            'for','from','that','their',\
            'i','my','your','you','mine',\
            'we','okay','yes','no','as',\
            'if','but','why','can','now',\
            'are','is','also',',','.',';',\
```
Solution
Per my comment, follow the style guide (PEP 8) - for example, there should be whitespace around `=` when assigning, and after commas:

```
i, k = 0, end - start
```

`content` is not a good name for your function. You should be more descriptive of what it actually does (perhaps `filter_content`?) and add a docstring providing more information. Throughout your code there are temporary variables with cryptic names (`s_list`? `lst`?) that could be changed to make things much clearer - I was wondering why `flg` isn't Boolean, and it turns out that it isn't actually a flag.

Your approach to removing HTML tags (picking through the whole page character by character) is particularly prone to error; what if one of the attributes within a tag contains `'>'`? For a good standard library solution, look at the `HTMLParser` module (`html.parser` in Python 3).

The conversion to lowercase is, frankly, ludicrous:

```
i=0
while i<len(lst):
    lst[i]=lst[i].lower()
    i+=1
```

You had the whole string (called, confusingly, `s_list`) to hand just two lines beforehand, and

```
s_list = s_list.lower()
```

is so much simpler.

As you're making a `set` to remove duplication:

```
lst=list(set(lst))
```

why not keep the set, instead of converting back to `list`, and use it to do the filtering, too? For example, use `set.difference_update`:

```
>>> words = set('this is a sentence to filter'.split())
>>> words.difference_update(['a', 'to', 'this', 'is'])
>>> words
set(['sentence', 'filter'])
```

Your other function could be simplified significantly:

```
from contextlib import closing

def page_content(url):
    # Python 2's urllib.urlopen result is not a context manager; closing() makes it one.
    with closing(urllib.urlopen(url)) as f:
        return f.read()
```
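To make the point about tag removal concrete, here is a minimal sketch of a standard-library approach. It assumes Python 3 (the code in the question targets Python 2, where the module is called `HTMLParser`), and the names `TextExtractor` and `strip_tags` are hypothetical, chosen only for this illustration. It is one possible shape, not a drop-in replacement for the question's function.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; and &amp;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def strip_tags(page):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()
```

Because the parser understands quoted attribute values, a tag such as `<p title="a > b">` is handled as a single tag. Joining the chunks with a space is a design choice that suits later word splitting; it would separate a word that is interrupted by an inline tag.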
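The lowercase and set suggestions can be combined into one small function. The sketch below assumes Python 3 and that the input is already plain text; `filter_words` and `STOP_WORDS` are hypothetical names, and the stop-word list is only the portion of the question's `phrase` list that is visible above.

```python
# Hypothetical stop-word set, copied from the visible part of the
# question's `phrase` list (the original list continues beyond this).
STOP_WORDS = frozenset([
    'to', 'a', 'an', 'the', 'for', 'from', 'that', 'their',
    'i', 'my', 'your', 'you', 'mine', 'we', 'okay', 'yes', 'no', 'as',
    'if', 'but', 'why', 'can', 'now', 'are', 'is', 'also', ',', '.', ';',
])


def filter_words(text, stop_words=STOP_WORDS):
    """Return the unique lowercase words in text, minus stop_words."""
    # Lowercase once on the whole string, then split on whitespace.
    words = set(text.lower().split())
    # Remove the common words in place; no conversion back to a list.
    words.difference_update(stop_words)
    return words
```

Keeping the result as a set means duplicate removal and stop-word removal each take a single call, with no explicit index loop.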
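On Python 3 the same download function can be written with `urllib.request.urlopen`, whose return value can be used directly in a `with` statement. This is a sketch under that assumption; it returns raw bytes and leaves decoding to the caller.

```python
from urllib.request import urlopen


def page_content(url):
    """Return the raw bytes served at url (Python 3 sketch)."""
    # The response is closed automatically when the with block exits.
    with urlopen(url) as response:
        return response.read()
```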
Context
StackExchange Code Review Q#59497, answer score: 4