HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

YouTube Search Result Scraper

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
youtuberesultscrapersearch

Problem

This is a program I wrote in Python using the BeautifulSoup library. The program scrapes YouTube search results for a given query and extracts data from the channels returned in the search results.

I'm just looking for some tips on how to make my code look (and function) better. I removed most of the redundancies but to me the code still feels ugly.

Suggestions?

```
#!/usr/bin/python
# http://docs.python-requests.org/en/latest/user/quickstart/
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/

import csv
import re
import requests
import time
from bs4 import BeautifulSoup

# scrapes the title
def getTitle():
d = soup.find_all("h1", "branded-page-header-title")
for i in d:
name = i.text.strip().replace('\n',' ').replace(',','').encode("utf-8")
f.write(name+',')
print('\t\t%s') % (name)

# scrapes the subscriber and view count
def getStats():
b = soup.find_all("li", "about-stat ") # trailing space is required.
for i in b:
value = i.b.text.strip().replace(',','')
name = i.b.next_sibling.strip().replace(',','')
f.write(value+',')
print('\t\t%s = %s') % (name, value)

# scrapes the description
def getDescription():
c = soup.find_all("div", "about-description")
for i in c:
description = i.text.strip().replace('\n',' ').replace(',','').encode("utf-8")
f.write(description+',')
#print('\t\t%s') % (description)

# scrapes all the external links
def getLinks():
a = soup.find_all("a", "about-channel-link ") # trailing space is required.
for i in a:
url = i.get('href')
f.write(url+',')
print('\t\t%s') % (url)

# scrapes the related channels
def getRelated():
s = soup.find_all("h3", "yt-lockup-title")
for i in s:
t = i.find_all(href=re.compile("user"))
for i in t:
url = 'https://www.youtube.com'+i.get('href')
rCSV.write(url+'\n')
print('\t\t%s,%s') % (i.text, url)

f = open

Solution

I'm pretty much a n00b to programming myself, so take my advice with a grain of salt... but I would try making each of your "get..." functions into a method of a class (let's say YoutubeVid). It's __init__ would set all the attributes at once, without printing. A seperate function, let's say print_attributes could do the printing. Once you code that part, you would replace this:

else:
                r = requests.get(url)
                soup = BeautifulSoup(r.text)
                f.write(url+',')
                print('\t%s') % (url)
                getTitle()
                getStats()
                getDescription()
                getLinks()
                getRelated()
                f.write('\n')   
                print('\n')
                visited.append(url)
                time.sleep(3)


With something like this:

else:
                r = requests.get(url)
                soup = BeautifulSoup(r.text)
                video_page = YoutubeVid(soup)
                print_attributes(video_page)


I'm sorry I don't have the time to work out a more detailed example, but if that makes any sense to you, maybe you can give it a try and post what you come up with.

Also, a minor detail regarding function names... Mixed case like getTitle() is depricated. Lowercase with underscores like get_title() is prefered. See the PEP Style Guide.

Code Snippets

else:
                r = requests.get(url)
                soup = BeautifulSoup(r.text)
                f.write(url+',')
                print('\t%s') % (url)
                getTitle()
                getStats()
                getDescription()
                getLinks()
                getRelated()
                f.write('\n')   
                print('\n')
                visited.append(url)
                time.sleep(3)
else:
                r = requests.get(url)
                soup = BeautifulSoup(r.text)
                video_page = YoutubeVid(soup)
                print_attributes(video_page)

Context

StackExchange Code Review Q#92001, answer score: 3

Revisions (0)

No revisions yet.