Expanding url from shortened url obtained from tweet
Problem
I have a Twitter data set. I have extracted all the expanded URLs from the JSON and am now trying to resolve the shortened ones. I also need to check which URLs still work and keep only those.
I am parsing over 5 million URLs. The problem is that the code below is slow. Can anyone suggest how to make it faster? Is there a better way to do this?
```
import csv
import pandas as pd
from urllib2 import urlopen
import urllib2
import threading
import time


def urlResolution(url, tweetId, w):
    try:
        print "Entered Function"
        print "Original Url:", url
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        # header has been added since some sites give an error otherwise
        req = urllib2.Request(url, headers=hdr)
        temp = urlopen(req)
        newUrl = temp.geturl()
        print "Resolved Url:", newUrl
        if newUrl != 'None':
            print "in if condition"
            w.writerow([tweetId, newUrl])
    except Exception, e:
        print "Throwing exception"
        print str(e)
        return None


def urlResolver(urlFile):
    df = pd.read_csv(urlFile, delimiter="\t")
    df['Url']
    df2 = df[["Tweet ID", "Url"]].copy()
    start = time.time()
    df3 = df2[df2.Url != "None"]
    list_url = []
    n = 0
    w = csv.writer(open("OUTPUT_FILE.tsv", "w"), delimiter='\t')
    w.writerow(["Tweet ID", "Url"])
    maxC = 0
    # The loop condition is garbled in the source; a bound of len(df3) is assumed.
    while maxC < len(df3):
        # Resolve the URLs in batches of 40 threads.
        threads = [threading.Thread(target=urlResolution, args=(df3.iloc[n]['Url'], df3.iloc[n]['Tweet ID'], w)) for n in range(maxC, maxC + 40)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        # (the source is truncated here)
```
Solution
Couple things I'd try:
- switch to the `requests` module, reusing a `requests.Session()` so that it reuses the same TCP connection:
  "..if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase"
- use the "HEAD" HTTP method (in case of `requests` you may need `allow_redirects=True`)
- try out the `Scrapy` web-scraping framework, which is of an asynchronous nature and is based on the `twisted` network library. You would also move the CSV output part to an output pipeline.
- another thing to try is the `grequests` library (`requests` on `gevent`)
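
The first two suggestions could be combined roughly as follows. This is only a sketch in Python 3, not a drop-in replacement for the code in the question: `resolve_url` and `resolve_all` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and treating any 4xx/5xx status as a dead link is an assumption about what "still working" should mean.

```python
import requests


def resolve_url(session, url, timeout=10):
    """Return the final URL after redirects, or None if the link looks dead."""
    try:
        # HEAD asks for the headers only, so no response body is downloaded.
        # requests does not follow redirects for HEAD unless told to.
        response = session.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # DNS failures, timeouts, too many redirects, and so on.
        return None
    # Assumption: any 4xx/5xx status means the URL is no longer working.
    if response.status_code >= 400:
        return None
    return response.url


def resolve_all(urls):
    """Yield (original, resolved) pairs, sharing one Session for all requests."""
    # A single Session keeps a connection pool, so repeated requests to the
    # same host (for example the same URL shortener) can reuse a connection.
    with requests.Session() as session:
        for url in urls:
            yield url, resolve_url(session, url)
```

Some servers reject or mishandle HEAD requests, so it may be worth falling back to a GET for URLs that come back as `None` before discarding them.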
Some micro-optimization ideas:
- move the `hdr` dictionary definition to the module level to avoid redefining it every time `urlResolution()` is called (and, since it is a constant, use upper-case and pick a more readable variable name: `HEADERS`?)
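
For illustration, the constant could look something like this. The sketch uses Python 3's `urllib.request` in place of `urllib2`, and `build_request` is a hypothetical helper added only to show the constant in use; the header values are the ones from the question.

```python
import urllib.request

# Built once at import time instead of on every call to the resolver.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 "
                  "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "Accept-Encoding": "none",
    "Accept-Language": "en-US,en;q=0.8",
    "Connection": "keep-alive",
}


def build_request(url):
    """Create a request object that carries the shared headers."""
    return urllib.request.Request(url, headers=HEADERS)
```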
Context
StackExchange Code Review Q#156592, answer score: 2