The YouTube crawler

Submitted by: @import:stackexchange-codereview

Problem

I have coded a program to scrape YouTube data (for educational purposes). When the link of the channel is entered it scrapes the channel name, description of the channel, the videos posted by the channel, number of views and links to those videos.

But I hardly managed to make these things work. I am not sure it is the right way to do it. I know the YouTube API could help but for learning I prefer using only requests and Beautiful Soup.

```
from tkinter import *
from bs4 import BeautifulSoup
import re
import requests


def channelInfo():
    Link = link.get()
    r = requests.get(Link)
    soup = BeautifulSoup(r.content)
    channelName = "Channel Name: " + soup.title.string
    firrt = Label(text=channelName,fg='yellow',bg='black').place(x=0,y=0)
    var = None
    var1 = None
    var3 = None
    var4 = None
    placer =0
    placer1 =0
    adjust = 0
    for i in soup.find_all('a',class_="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2"):
        var = i.text
        second = Label(text=var,fg='black',bg='white').place(x=200,y=40+adjust)
        adjust+=20
    desc = soup.find_all(attrs={"name":"description"})
    DESC = desc[0]['content'].encode('utf-8')
    third = Label(text=DESC,fg='black',bg='yellow').place(x=0,y=20)
    for j in soup.find_all('li'):
        var1=j.text
        varr = re.findall('[0-9]+,[0-9]+ views',var1)
        for views in varr:
            var3 = Label(text=views,fg='blue').place(x=650,y=40+placer)
            placer+=20
    for k in soup.find_all('a',class_="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2"):
        links = k.get("href")
        final = Link+links
        var4 = Label(text=final).place(x=750,y=40+placer1)
        placer1+=20



gui = Tk()
gui.geometry('500x400')
gui.title('The Youtube Crawler')
label = Label(text='Paste the link below to crawl Youtube',fg='blue')
label.pack()
link = StringVar()
entry = Entry(gui,textvariable=link)
entry.pack()
```

Solution

```
from tkinter import *
```

Importing everything from a module is generally frowned upon because it hampers readability, can cause confusion ("where does `Label` come from?") and breaks useful tools like pyflakes. However, it is acceptable with tkinter because it was designed to work like this. Just keep in mind that tkinter is not an example of a good Python API.

```
from bs4 import BeautifulSoup
import re
import requests
```

Nitpick: consider importing standard library modules (`re` here) first.

```
def channelInfo():
```

PEP 8: `channel_info`, not `channelInfo`.

```
    Link = link.get()
```

PEP 8: `link`, not `Link`. Variable names that start with an uppercase character are conventionally reserved for classes, like `BeautifulSoup`. I'm going to stop reporting all PEP 8 violations. Use a tool like flake8: it will help you write better code which will be easier to read by other Python developers. Consider a tool like yapf, which will try to reformat the code for you.

```
    r = requests.get(Link)
```

Check the status code. The `Link` variable was not useful here.

```
    soup = BeautifulSoup(r.content)
    channelName = "Channel Name: " + soup.title.string
    firrt = Label(text=channelName,fg='yellow',bg='black').place(x=0,y=0)
```

`firrt`? What does this mean? Also, this shows that you're completely mixing your interface with the crawling. It's fine here, but for a larger program you would want to separate concerns: first retrieve the data structure, and then show it to your user. A number of design patterns exist for this, one example being MVC in web applications. Using a better GUI framework (other than tkinter) would probably make this easier, but I don't know any good GUI framework in Python (maybe some exist though).

```
    var = None
    var1 = None
    var3 = None
    var4 = None
```

You can do `var, var1, var3, var4 = None, None, None, None`. But when I see this I ask myself many questions. Why so many `var`? What do they mean? Why did you not choose a descriptive name? Why no `var2`? Are they different variables or should they go into a list? Your main goal when writing code should be to make sure that anyone reading it (including your future self) does not have to wonder what's going on.

```
    placer =0
    placer1 =0
    adjust = 0
    for i in soup.find_all('a',class_="yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2"):
```

`i` is a bad name: it should only be used for numerical loops, and those should be rare in Python. Also, use a constant such as `VIDEO_LINK` to make your class filtering clearer and shorter instead of hardcoding the value. The selector is very brittle, too: if someone at YouTube decides they need another class, your code will break. What you can do instead is choose one class that you think is less likely to change; `spf-link` looks like it would be a good choice. But these things are brittle anyway, so if you want to ensure this continues to work, you'll need to write tests that check that, for a specific channel, you still get the things you expect.

```
        var = i.text
```

Oh, so maybe they are different variables after all. `i.text` is better than `var`, but you can probably find a better name.

```
        second = Label(text=var,fg='black',bg='white').place(x=200,y=40+adjust)
        adjust+=20
```

Do you need to assign `second` here?

```
    desc = soup.find_all(attrs={"name":"description"})
    DESC = desc[0]['content'].encode('utf-8')
```

`desc` and `DESC` are too similar. Use `desc_list` and `desc`, maybe. The `encode('utf-8')` here is not needed. If you really need it, then something is wrong: Tkinter accepts Unicode strings. See the Unicode HOWTO to understand better how Unicode works. It's a very important skill for a programmer.

```
    third = Label(text=DESC,fg='black',bg='yellow').place(x=0,y=20)
    for j in soup.find_all('li'):
```

`j` is a bad name.

```
        var1=j.text
```

Sorry, but `var1` is also worse than `j.text`.

```
        varr = re.findall('[0-9]+,[0-9]+ views',var1)
```

Seriously? `varr`? I'll stop whining about bad names, but please do something about it. :)

```
        for views in varr:
            var3 = Label(text=views,fg='blue').place(x=650,y=40+placer)
            placer+=20
    for k in soup.find_all('a',class_="yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2"):
        links = k.get("href")
        final = Link+links
        var4 = Label(text=final).place(x=750,y=40+placer1)
        placer1+=20

gui = Tk()
gui.geometry('500x400')
gui.title('The Youtube Crawler')
label = Label(text='Paste the link below to crawl Youtube',fg='blue')
label.pack()
link = StringVar()
entry = Entry(gui,textvariable=link)
entry.pack()
channel = Button(text='Crawl this channel',fg='white',bg='black',width=30,command=channelInfo)
```

The whole interface is going to freeze while the request is in progress, because `channelInfo` runs in the GUI thread. I think using threads could help, but that's probably overkill here.
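
If the blocking ever does become a problem, one common approach is to run the request in a worker thread and hand the result back through a queue that the Tk main loop polls. The sketch below is only an illustration under assumptions, not part of the reviewed program: `start_crawl`, `poll_results`, `fetch` and `on_done` are hypothetical names, and `widget` stands for any tkinter widget (only its `after` method is used).

```python
import queue
import threading


def start_crawl(fetch, url, results):
    """Run fetch(url) in a worker thread and put the outcome on `results`.

    `fetch` is a hypothetical callable that does the slow network work.
    """
    def worker():
        try:
            results.put(("ok", fetch(url)))
        except Exception as error:  # report failures instead of losing them
            results.put(("error", error))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def poll_results(widget, results, on_done, interval_ms=100):
    """Check the queue from the GUI thread; reschedule until a result arrives."""
    try:
        status, payload = results.get_nowait()
    except queue.Empty:
        # Nothing yet: ask Tk to call this function again a little later.
        widget.after(interval_ms, poll_results, widget, results, on_done, interval_ms)
    else:
        on_done(status, payload)
```

With this shape, the button callback would create a `queue.Queue()`, call `start_crawl(...)` and then `poll_results(gui, results, show_result)`, so every `Label` is still created in the main thread inside the `on_done` callback.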
```
channel.place(x=10,y=45)
'''
specific = Button(text='Inform about
```
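
Several of the remarks above (check the status code, use a named constant for the class filter, keep the crawling apart from the interface) can be sketched together. This is a minimal illustration under assumptions, not the reviewer's code: `VIDEO_LINK_CLASS`, `Video`, `fetch_channel_html` and `parse_videos` are hypothetical names, the single class `spf-link` is the one the review proposes, and YouTube's markup may no longer match it.

```python
from dataclasses import dataclass
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Assumption: a single class from the original selector, as the review suggests.
VIDEO_LINK_CLASS = "spf-link"


@dataclass
class Video:
    title: str
    url: str


def fetch_channel_html(channel_url):
    """Download the channel page, raising on HTTP error statuses."""
    response = requests.get(channel_url, timeout=10)
    response.raise_for_status()  # fail early instead of parsing an error page
    return response.text


def parse_videos(html, base_url):
    """Return the videos found in `html` as plain data; no GUI code involved."""
    soup = BeautifulSoup(html, "html.parser")
    videos = []
    for anchor in soup.find_all("a", class_=VIDEO_LINK_CLASS):
        videos.append(Video(
            title=anchor.get_text(strip=True),
            # hrefs are site-relative, so resolve them against the page URL
            url=urljoin(base_url, anchor.get("href", "")),
        ))
    return videos
```

Because `parse_videos` takes an HTML string and returns a list, it can be tested against a saved copy of a channel page, and the tkinter code only has to loop over the result and create its labels.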

Code Snippets

```
from tkinter import *
from bs4 import BeautifulSoup
import re
import requests
def channelInfo():
    Link = link.get()
    r = requests.get(Link)
```

Context

StackExchange Code Review Q#128936, answer score: 2
