Regex-guided crawler that downloads regex-matching images up to a crawling level
Problem
This is a simple crawler that downloads images from websites. The URL of each page to be crawled must match a regex, and so must the URL of each image to download.
(Also, I know, I made my own thread pool, because I wanted to keep my crawler lightweight...)
Code:
```
import BeautifulSoup
import requests
import re
import json
import os.path
import urlparse
import os
import ntpath
import urllib
import threading
import Queue
import time


class MultiThreadQueue(object):
    def __init__(self, max_simultaneous_threads):
        self.thread_queue = Queue.Queue()
        self.max_simultaneous_threads = max_simultaneous_threads
        self.executing_threads = 0
        self._threads_executed = []

    def add_thread(self, thread):
        return self.thread_queue.put(thread)

    def set_max_simultaneous_threads(self, max_simultaneous_threads):
        self.max_simultaneous_threads = max_simultaneous_threads

    def execute_last_thread(self):
        if self.executing_threads >= self.max_simultaneous_threads:
            return False
        thread = self.thread_queue.get_nowait()
        self._threads_executed.append(thread)
        self._threads_executed[-1].start()
        self.executing_threads += 1
        return True

    def execute_threads(self):
        if self.max_simultaneous_threads == 0:
            while True:
                try:
                    self.thread_queue.get().start()
                except Queue.Empty:
                    return
            return
        while True:
            try:
                if self.execute_last_thread():
                    thread = self._threads_executed[-1]
                    threading.Thread(name="Temporary internal TJT",
                                     target=self._join_thread,
                                     args=(thread,)).start()
                if self.executing_threads 0:  # (garbled; the code between here
                                              #  and the crawl loop below is
                                              #  missing from this extract)

        for link in soup.findAll("a", {"href": True}):
            if ends_with_any(urlparse.urljoin(link_url, link["href"]).lower(),
                             ("png", "jpeg", "jpg", "tga", "
```
Solution
Kudos for choosing a terrific HTML parser. I recommend invoking version 4.6 in this way:

```
from bs4 import BeautifulSoup
```

Using `any()` could turn `ends_with_any()` into a (perfectly clear) one-liner.
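For instance, a sketch of that one-liner (the original `ends_with_any` isn't shown in this extract, so its exact signature is assumed):

```python
def ends_with_any(url, suffixes):
    # True if the URL ends with any of the given suffixes
    return any(url.endswith(suffix) for suffix in suffixes)
```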
In `fetch_from_url`, the default keyword args are perfect. I would love to see a docstring for the function. I'm reading a `return []` a few times, which gives me a hint about its signature, but I'd much rather see the function's author put a stake in the sand. `download()`'s `rewrite_file=False` is also nicely readable.

From the `print` I see you're using Python 2. Maybe consider 3? Also, maybe this explains the use of a downrev BeautifulSoup.

The expression `"|" + ("-" * this_level)` appears a few times, which suggests the opportunity to define a helper function. If you nest it within `fetch_from_url` then you won't even have to pass in `this_level` as an explicit argument.

Repeated copy-and-paste calls to `urljoin` suggest that you may want to cache results in temp variables.
The code starting with `parsed = json.load ...` looks like it wants to be wrapped within `def main()`, as Mathias observed. Always strive for `import` to silently succeed, just in case someone wants to reuse your code later. It's too easy, may as well just do it, it's very low-hanging fruit. `image_kind_or_folder` is nicely descriptive, and so is this:

```
downloads = MultiThreadQueue(parsed[1])
```

You had an opportunity to name `parsed[0]` there, as well, so the Gentle Reader would understand what sort of dict it is.
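A sketch of both suggestions together, assuming (a guess based on the snippets above) that the JSON config is a two-element list: a URL-rules dict, then the thread count:

```python
import json


def main(config_text):
    """Hypothetical entry point: the module-level statements move in here,
    so importing this file runs nothing."""
    parsed = json.loads(config_text)
    url_rules, max_threads = parsed[0], parsed[1]  # parsed[0] finally has a name
    return url_rules, max_threads  # the crawl itself would start here


if __name__ == "__main__":
    main('[{"http://example.com": "images"}, 4]')  # hypothetical config
```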
The `set_max_simultaneous_threads` setter is an anti-pattern and is not used; I recommend you delete it, since the constructor has that covered already. Please add a docstring to the constructor - the semantics of `max_simultaneous_threads == 0` are not obvious.

`execute_threads()` works for you, so I'll mostly refrain from criticizing it, but `_threads_executed` seems a little odd; perhaps `_threads_executing` would be more natural (so `_join_thread()` would delete an entry). That way you could use `len()` rather than maintaining the `executing_threads` count.
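A sketch of that renaming (Python 3 here; the rest of the class is assumed to stay as posted):

```python
import threading


class MultiThreadQueue(object):
    def __init__(self, max_simultaneous_threads):
        """Queue threads and run at most max_simultaneous_threads at once
        (0 is assumed here to mean "no limit")."""
        self.max_simultaneous_threads = max_simultaneous_threads
        self._threads_executing = []  # renamed: the threads currently running

    @property
    def executing_threads(self):
        # derived from the list, so there is no separate counter to keep in sync
        return len(self._threads_executing)

    def _join_thread(self, thread):
        thread.join()
        self._threads_executing.remove(thread)  # one entry removed per finished thread
```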
Context
StackExchange Code Review Q#133723, answer score: 2