HiveBrain v1.2.0

Content filtering of webpage

Submitted by: @import:stackexchange-codereview
Tags: content, webpage, filtering

Problem

This is code for filtering the data of a webpage, for the web crawler I made for my project. I know Python scripts can be slower than other languages, but this takes a lot of time even when processing a single page.

I don't want to use any other external libraries for filtering content. Is there any way my current code can be improved to be cleaner and faster?

```
# -*- coding: utf-8 -*-
import urllib

url = 'http://designingadam2.wordpress.com'

def content(page, url):  # FILTERS THE CONTENT OF THE REMAINING PORTION
    flg = 0
    # REMOVES &nbsp; LIKE CHARACTERS
    while page.find("&", flg) != -1:
        start = page.find("&", flg)
        end = page.find(";", start + 1)
        if (end - start) < 10:  # REMOVE IF NOT NEEDED
            pageO = page[:start]
            pageT = page[end + 1:]
            page = pageO + pageT
            flg = start + 1  # TO CONTINUE FROM NEXT POS
        else:
            flg += 1
    flg = 0

    # REMOVES CONTENT BETWEEN SCRIPT TAGS
    while page.find("<script", flg) != -1:
        start = page.find("<script", flg)
        end = page.find("</script>", start + 1)
        end = end + 9
        i, k = 0, end - start
        page = list(page)
        while i < k:
            page.pop(start)
            i += 1
        page = "".join(page)
    flg = 0

    # REMOVES TAGS
    s_list = list(page)
    i = 0
    while i < len(s_list):
        if s_list[i] == '<':
            while s_list[i] != '>':  # and i != (len(s_list)-1):
                # remove everything between the < and >
                s_list.pop(i)
            # make sure we get rid of the > too
            s_list.pop(i)
        else:
            i = i + 1
    # -----------------------------------------------------------------

    # REMOVES WHITESPACES
    s_list = "".join(s_list)
    lst = s_list.split()
    # CONVERT TO LOWERCASE
    i = 0
    while i < len(lst):
        lst[i] = lst[i].lower()
        i += 1

    # REMOVES DUPLICATES
    lst = list(set(lst))

    # REMOVE COMMON WORDS
    phrase = ['to', 'a', 'an', 'the',
              'for', 'from', 'that', 'their',
              'i', 'my', 'your', 'you', 'mine',
              'we', 'okay', 'yes', 'no', 'as',
              'if', 'but', 'why', 'can', 'now',
              'are', 'is', 'also', ',', '.', ';']
```

Solution
Solution

Per my comment, follow the PEP 8 style guide: for example, there should be whitespace around = when assigning, and after commas:

i, k = 0, end - start


content is not a good name for your function. You should be more descriptive of what it actually does (perhaps filter_content?) and add a docstring providing more information. Throughout your code there are temporary variables with cryptic names (s_list? lst?) that could be changed to make things much clearer - I was wondering why flg isn't Boolean, and it turns out that it isn't actually a flag.

Your approach to removing HTML tags (picking through the whole page character by character) is particularly prone to error; what if one of the attributes within a tag contains '>'? For a good standard library solution, see the HTMLParser module.
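As a sketch of that standard-library approach (shown here with Python 3's html.parser; on Python 2 the same class lives in the HTMLParser module, and the class name TextExtractor is illustrative, not from the original):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> tags."""
    def __init__(self):
        # convert_charrefs decodes entities like &nbsp; for us
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

def strip_tags(page):
    parser = TextExtractor()
    parser.feed(page)
    return " ".join(parser.parts)
```

The parser does the tokenising, so an attribute containing '>' no longer breaks the logic.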

The conversion to lowercase is, frankly, ludicrous:

i=0
while i<len(lst):
    lst[i]=lst[i].lower()
    i+=1


you had the whole string (called, confusingly, s_list) to hand just two lines beforehand, and

s_list = s_list.lower()


is so much simpler.

As you're making a set to remove duplication:

lst=list(set(lst))


why not keep the set, instead of converting back to list, and use it to do the filtering, too? For example, use set.difference_update:

>>> words = set('this is a sentence to filter'.split())
>>> words.difference_update(['a', 'to', 'this', 'is'])
>>> words
set(['sentence', 'filter'])


Your other function could be simplified significantly (note that on Python 2 the object returned by urllib.urlopen is not a context manager, so wrap it in contextlib.closing):

    import contextlib

    def page_content(url):
        with contextlib.closing(urllib.urlopen(url)) as f:
            return f.read()


Context

StackExchange Code Review Q#59497, answer score: 4
