
Expanding url from shortened url obtained from tweet

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·



Problem

I have a Twitter data set. I have extracted all the expanded URLs from the JSON and am now trying to resolve the shortened ones. I also need to check which URLs still work and keep only those.

I am parsing over 5 million URLs, and the code below is slow. Can anyone suggest how to make it faster? Is there a better way to do this?

```
import csv
import pandas as pd
from urllib2 import urlopen
import urllib2
import threading
import time

def urlResolution(url,tweetId,w):

    try:

        print "Entered Function"
        print "Original Url:",url

        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}

        #header has been added since some sites give an error otherwise
        req = urllib2.Request(url, headers=hdr)
        temp = urlopen(req)
        newUrl = temp.geturl()
        print "Resolved Url:",newUrl
        if newUrl!= 'None':
            print "in if condition"
            w.writerow([tweetId,newUrl])

    except Exception,e:
        print "Throwing exception"
        print str(e)
        return None

def urlResolver(urlFile):
    df=pd.read_csv(urlFile, delimiter="\t")

    df['Url']
    df2 = df[["Tweet ID","Url"]].copy()
    start = time.time()

    df3 = df2[df2.Url!="None"]

    list_url = []
    n=0
    w = csv.writer(open("OUTPUT_FILE.tsv", "w"), delimiter = '\t')
    w.writerow(["Tweet ID","Url"])

    maxC = 0
    while maxC error
        threads = [threading.Thread(target=urlResolution, args=(df3.iloc[n]['Url'],df3.iloc[n]['Tweet ID'],w)) for n in range(maxC,maxC+40)]

        for thread in threads:
            thread.start()
        for thread in threads:
            thre
```

Solution

A couple of things I'd try:

- Switch to the requests module and reuse a single requests.Session() so that it can reuse the same TCP connection:

  ..if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

- Use the "HEAD" HTTP method, which skips downloading the response body. With requests, HEAD does not follow redirects by default, so you may need allow_redirects=True.

- Try out the Scrapy web-scraping framework, which is asynchronous by nature and is based on the Twisted networking library. You would also move the CSV output part to an item pipeline.

- Another thing to try is the grequests library (requests on gevent).
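As a rough illustration of the first two suggestions, here is a minimal Python 3 sketch that sends HEAD requests through a shared session. The names resolve_url and resolve_all, the 10-second timeout, and the rule that any 4xx/5xx status counts as a dead link are assumptions made for this example, not part of the original code.

```python
import requests


def resolve_url(session, url, timeout=10):
    """Return the final URL after redirects, or None if the link is dead."""
    try:
        # HEAD avoids downloading the body. requests does not follow
        # redirects for HEAD unless allow_redirects=True is passed.
        response = session.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # DNS failures, timeouts, refused connections, too many redirects...
        return None
    # Assumption: any 4xx/5xx status means the URL is no longer working.
    if response.status_code >= 400:
        return None
    return response.url


def resolve_all(rows, session):
    """Yield (tweet_id, resolved_url) for each row whose URL still resolves.

    `rows` is any iterable of (tweet_id, url) pairs; `session` is shared so
    that connections to the same host can be reused.
    """
    for tweet_id, url in rows:
        resolved = resolve_url(session, url)
        if resolved is not None:
            yield tweet_id, resolved
```

In use, the session would be created once, for example with `with requests.Session() as session:`, and passed to resolve_all along with the (Tweet ID, Url) pairs. Some servers reject HEAD requests, so falling back to GET for those URLs may be worth considering.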

Some micro-optimization ideas:

- Move the hdr dictionary definition to module level to avoid redefining it every time urlResolution() is called. Since it is a constant, use upper case, and pick a more readable name, such as HEADERS.
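A minimal sketch of that micro-optimization, written for Python 3 with the standard library's urllib.request rather than urllib2. The build_request helper is a hypothetical name, and the header values are shortened placeholders rather than the full set from the question.

```python
import urllib.request

# Built once at import time instead of on every call.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # shortened placeholder
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url):
    """Build a HEAD request that reuses the shared HEADERS constant."""
    return urllib.request.Request(url, headers=HEADERS, method="HEAD")
```

The request object can then be passed to urllib.request.urlopen() as before; only the dictionary construction has moved out of the per-URL path.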

Context

StackExchange Code Review Q#156592, answer score: 2
