HiveBrain v1.2.0
pattern · python · Minor

Regex-guided crawler that downloads regex-matching images up to a crawling level

Submitted by: @import:stackexchange-codereview

Problem

This is a simple crawler that downloads images from websites. A website's URL must match the regex before it is crawled, and the same applies to the URL of any image to be downloaded.

(Also, I know I made my own thread pool; I wanted to keep my crawler lightweight...)

Code:

```
import BeautifulSoup
import requests
import re
import json
import os.path
import urlparse
import os
import ntpath
import urllib
import threading
import Queue
import time

class MultiThreadQueue(object):
    def __init__(self, max_simultaneous_threads):
        self.thread_queue = Queue.Queue()
        self.max_simultaneous_threads = max_simultaneous_threads
        self.executing_threads = 0
        self._threads_executed = []

    def add_thread(self, thread):
        return self.thread_queue.put(thread)

    def set_max_simultaneous_threads(self, max_simultaneous_threads):
        self.max_simultaneous_threads = max_simultaneous_threads

    def execute_last_thread(self):
        if self.executing_threads >= self.max_simultaneous_threads:
            return False

        thread = self.thread_queue.get_nowait()
        self._threads_executed.append(thread)
        self._threads_executed[-1].start()
        self.executing_threads += 1
        return True

    def execute_threads(self):
        if self.max_simultaneous_threads == 0:
            while True:
                try:
                    self.thread_queue.get().start()

                except Queue.Empty:
                    return

            return

        while True:
            try:
                if self.execute_last_thread():
                    thread = self._threads_executed[-1]
                    threading.Thread(name="Temporary internal TJT",
                                     target=self._join_thread,
                                     args=(thread,)).start()

# ... (the listing is truncated here in this copy: the remainder of
# execute_threads(), including a garbled `if self.executing_threads ... 0:`
# line, plus _join_thread(), ends_with_any(), download() and most of
# fetch_from_url() did not survive extraction; the fragment below appears
# to come from the body of fetch_from_url)

for link in soup.findAll("a", {"href": True}):
    if ends_with_any(urlparse.urljoin(link_url, link["href"]).lower(),
                     ("png", "jpeg", "jpg", "tga", "
```

Solution

Kudos for choosing a terrific HTML parser. I recommend invoking version 4.6 in this way:

from bs4 import BeautifulSoup


Using any() could turn ends_with_any() into a (perfectly clear) one-liner.
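For example, a one-liner might look like this (assuming `ends_with_any(string, suffixes)` is the signature the original uses):

```python
def ends_with_any(string, suffixes):
    # True if `string` ends with at least one of the given suffixes
    return any(string.endswith(suffix) for suffix in suffixes)
```

In fact, `str.endswith` already accepts a tuple of suffixes, so `string.endswith(tuple(suffixes))` would do the same job without the explicit `any()`.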

In fetch_from_url, the default keyword args are perfect. I would love to see a docstring for the function. I'm reading a return [] a few times, which gives me a hint about its signature, but I'd much rather see the function's author put a stake in the ground. download()'s rewrite_file=False is also nicely readable.
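A sketch of what such a docstring might say (every parameter name besides `url` is invented for illustration, and the body is a stub):

```python
def fetch_from_url(url, this_level=0, max_level=1):
    """Crawl `url`, downloading matching images, recursing up to `max_level`.

    Returns a list of results for the crawled page, or [] when the URL
    does not match the configured regex or the request fails.
    """
    return []  # stub body; the real function does the crawling
```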

From the print I see you're using Python 2. Maybe consider 3? That might also explain the use of a downrev BeautifulSoup.

The expression "|" + ("-" * this_level) appears a few times, which suggests the opportunity to define a helper function. If you nest it within fetch_from_url then you won't even have to pass in this_level as an explicit argument.
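A sketch of that nesting (the `prefix` name and the returned line are invented so the example is self-contained; the real function would go on to crawl and recurse with `this_level + 1`):

```python
def fetch_from_url(url, this_level=0):
    def prefix():
        # nested helper: this_level comes from the closure, not an argument
        return "|" + ("-" * this_level)

    line = prefix() + " Fetching " + url
    print(line)
    return line  # returned here only so the sketch has an observable result
```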

Repeated copy-n-paste calls to urljoin suggest that you may want to cache results in temp variables.
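Something along these lines (shown with the Python 3 `urllib.parse` module; Python 2 spells it `urlparse.urljoin`, and the example URL is made up):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

link_url = "http://example.com/gallery/"
href = "img/cat.png"

# Call urljoin once and reuse the result instead of repeating the call
full_url = urljoin(link_url, href)
print(full_url)  # http://example.com/gallery/img/cat.png
```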

The code starting with parsed = json.load ... looks like it wants to be wrapped within def main(), as Mathias observed. Always strive to make import silently succeed, in case someone wants to reuse your code later. It's very low-hanging fruit; may as well just do it.
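A sketch of that wrapping (the config filename and the JSON layout are assumptions inferred from the parsed[0] / parsed[1] accesses):

```python
import json

def load_config(text):
    # Element 0 looks like the rules dict, element 1 the thread limit
    # (layout inferred from parsed[0] / parsed[1] in the original)
    parsed = json.loads(text)
    return parsed[0], parsed[1]

def main():
    try:
        with open("config.json") as config_file:  # hypothetical filename
            rules, max_threads = load_config(config_file.read())
    except IOError:
        return  # bail out politely when the config is missing
    # ... build MultiThreadQueue(max_threads) and start crawling ...

if __name__ == "__main__":
    main()  # importing this module now runs nothing
```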

image_kind_or_folder is nicely descriptive, and so is this:

downloads = MultiThreadQueue(parsed[1])


You had an opportunity to name parsed[0] there, as well, so the Gentle Reader would understand what sort of dict it is.

The set_max_simultaneous_threads setter is an anti-pattern and is unused; I recommend deleting it, since the constructor already has that covered. Please add a docstring to the constructor - the semantics of max_simultaneous_threads == 0 are not obvious.
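Something like this, perhaps (the wording is only a suggestion, and the zero-means-unlimited reading is inferred from the execute_threads() branch; shown with the Python 3 `queue` module, which Python 2 calls `Queue`):

```python
import queue  # Python 2: import Queue

class MultiThreadQueue(object):
    def __init__(self, max_simultaneous_threads):
        """Run queued threads, at most `max_simultaneous_threads` at a time.

        A limit of 0 is special: execute_threads() then starts every
        queued thread immediately and returns without joining any of them.
        """
        self.thread_queue = queue.Queue()
        self.max_simultaneous_threads = max_simultaneous_threads
        self.executing_threads = 0
        self._threads_executed = []
```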

execute_threads() works for you, so I'll mostly refrain from criticizing it, but _threads_executed seems a little odd; perhaps _threads_executing would be more natural (so _join_thread() would delete an entry). That way you could use len() rather than maintaining the executing_threads count.
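A minimal sketch of that rename (stripped down to just the bookkeeping; the queue and execute methods are omitted, and the `@property` is my own suggestion for keeping the old attribute name readable):

```python
import threading

class MultiThreadQueue(object):
    def __init__(self, max_simultaneous_threads):
        self.max_simultaneous_threads = max_simultaneous_threads
        self._threads_executing = []  # threads started but not yet joined

    @property
    def executing_threads(self):
        # derived from the list, so there is no separate counter to keep in sync
        return len(self._threads_executing)

    def _join_thread(self, thread):
        thread.join()
        self._threads_executing.remove(thread)  # joined, so no longer executing
```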

Code Snippets

from bs4 import BeautifulSoup
downloads = MultiThreadQueue(parsed[1])

Context

StackExchange Code Review Q#133723, answer score: 2

Revisions (0)

No revisions yet.