Extracting and normalizing URLs in an HTML document

Submitted by: @import:stackexchange-codereview

Problem

I have written code to collect all the URLs on a web page into a set, and I would like tips on simple changes I can make to improve its performance.

soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    url = link.get('href')
    if url is None or ' ' in url or '<' in url or '>' in url:
        continue
    if url.startswith('//'):
        url = url.replace('//', 'http://')
    if url.startswith('/'):
        url = hostname + url
    if '?' in url:
        url = url.split('?')[0]
    if '#' in url:
        url = url.split('#')[0]
    if url.endswith('/'):
        url = url[:-1]
    if url.endswith(excluded_extensions):
        continue
    if url.startswith(hostname):
        urls_set.add(url)

Solution

A few things you could do differently. First, the chained membership tests can be collapsed into a single any() call:

# your code
if url is None or ' ' in url or '<' in url or '>' in url:
    continue

# the alternative
if url is None or any(char in url for char in ' <>'):
    continue
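
To check that the two forms agree, here is a minimal sketch. The names FORBIDDEN_CHARS and should_skip are hypothetical helpers introduced only for this illustration; they are not part of the original code. The `url is None` test has to stay first, because `char in None` would raise a TypeError.

```python
# Hypothetical helper names, used only for this illustration.
FORBIDDEN_CHARS = ' <>'


def should_skip(url):
    """Return True when the href is missing or contains a forbidden character."""
    # The None check must come first: `char in None` would raise TypeError.
    return url is None or any(char in url for char in FORBIDDEN_CHARS)


def should_skip_chained(url):
    """The original chained form, kept here for comparison."""
    return url is None or ' ' in url or '<' in url or '>' in url


if __name__ == '__main__':
    samples = [None, '/about', '/a b', '<script>', 'http://example.com/x']
    for href in samples:
        # Both forms should agree on every sample.
        print(repr(href), should_skip(href), should_skip_chained(href))
```

The any() version is mainly easier to extend (add a character to the string); for three characters it is unlikely to be faster than the chained `or`.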


Also, you can call split directly, without the if statement: when the separator is not in the string, split returns a single-item list containing the whole string, so taking item [0] still gives you the unchanged URL:

# your code
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]

# the alternative
for splitter in '?#':
    url = url.split(splitter, 1)[0]


Note the micro-optimization of passing 1 as the second argument (maxsplit) to split, so that the string is split only at the first occurrence of the separator, even if it appears more than once.
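
The split behavior described above can be checked with a small sketch. The function name strip_query_and_fragment and the example.com URLs are assumptions made for the illustration, not part of the original code.

```python
def strip_query_and_fragment(url):
    """Drop everything from the first '?' or '#' onward (hypothetical helper)."""
    for splitter in '?#':
        # maxsplit=1 stops after the first occurrence; [0] keeps the part before it.
        url = url.split(splitter, 1)[0]
    return url


if __name__ == '__main__':
    # No separator present: split returns a one-item list holding the whole string.
    print('http://example.com/page'.split('?', 1))
    # Several separators: only the first one is used when maxsplit is 1.
    print('a?b?c'.split('?', 1))
    print(strip_query_and_fragment('http://example.com/page?x=1#top'))
```

Because the loop handles '?' before '#', a URL such as 'http://example.com/a#top?x' is still reduced to 'http://example.com/a'.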

Code Snippets

# your code
if url is None or ' ' in url or '<' in url or '>' in url:
    continue

# the alternative
if url is None or any(char in url for char in ' <>'):
    continue

# your code
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]

# the alternative
for splitter in '?#':
    url = url.split(splitter, 1)[0]

Context

StackExchange Code Review Q#100490, answer score: 3