Extracting and normalizing URLs in an HTML document
Problem
I have written code to collect all the URLs on a webpage and put them in a set, and I would like tips on simple changes I can make to improve its performance.
soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    url = link.get('href')
    if url is None or ' ' in url or '<' in url or '>' in url:
        continue
    if url.startswith('//'):
        url = url.replace('//', 'http://')
    if url.startswith('/'):
        url = hostname + url
    if '?' in url:
        url = url.split('?')[0]
    if '#' in url:
        url = url.split('#')[0]
    if url.endswith('/'):
        url = url[:-1]
    if url.endswith(excluded_extensions):
        continue
    if url.startswith(hostname):
        urls_set.add(url)
Solution
Some stuff you could perhaps do differently:
# your code
if url is None or ' ' in url or '<' in url or '>' in url:
    continue

# the alternative
if url is None or any(char in url for char in ' <>'):
    continue

Also, you can call the split method directly, without the if statement, as it will return a single-item list with the full string inside if the character is not in the string:

# your code
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]

# the alternative
for splitter in '?#':
    url = url.split(splitter, 1)[0]

Notice the micro-optimization of using the second argument of split, so that the string is only split at the first occurrence if there is more than one.
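As a rough sketch, the two suggestions could be wrapped into small helpers so they are easy to try in isolation. The function names has_forbidden_char and strip_query_and_fragment, and the example.com URLs, are placeholders I made up for illustration; they are not part of the original code.

```python
def has_forbidden_char(url, forbidden=' <>'):
    """Return True if url contains any character from forbidden."""
    return any(char in url for char in forbidden)


def strip_query_and_fragment(url):
    """Drop everything from the first '?' or '#' onward."""
    for splitter in '?#':
        # maxsplit=1 stops at the first occurrence. If splitter is absent,
        # split returns [url], so [0] is simply the unchanged string.
        url = url.split(splitter, 1)[0]
    return url


# Hypothetical inputs, only to show the behaviour.
print(strip_query_and_fragment('http://example.com/page?x=1#top'))
print(strip_query_and_fragment('http://example.com/page'))
print(has_forbidden_char('http://example.com/a b'))
```

The first two calls should both print http://example.com/page, and the last one should print True because of the space in the URL.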
Context
StackExchange Code Review Q#100490, answer score: 3