HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Creating nodes and edges

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
creatingnodesandedges

Problem

I'm looking for a way to optimize this piece of code in Python.

all_users holds all the Twitter users and friends_ids is an array of IDs.

g = Graph(directed=True)
all_users = users.find({}, { 'id_str' : 1, 'friends_ids' : 1}) # This is from MongoDb

names = []
edges = []
print 'Processing first 1000 batch'
for idx, user in enumerate(all_users):
    if idx % 1000 == 0 and idx != 0:
        print 'Processing ' + str(idx) + ' batch'
    names.append(str(user['id_str']))
    for friend_id in user['friends_ids']:
        edges.append((user['id_str'], str(friend_id)))
        if not str(friend_id) in names:
            names.append(str(friend_id))

print len(names)
print len(edges)


all_users has 4k records but each of the user has at least 10k of friends_ids. In order to create a graph I need a list of nodes which I have in names and list of edges edges.

The format of edges would be [(node1, node2), (node2, node3), ...], which means that node1 connects to node2 and node2 connect to node3. I'm not sure why but this script takes almost full 10 min to run. So, I'm looking for a way to optimize this.

Solution

Try this minor variation over your script, that will probably have a dramatic effect on performance:

names = set()
edges = []

for idx, user in enumerate(all_users):
    if idx and not idx % 1000:
        print('Processing {} batch',format(idx))
    user_id = str(user['id_str'])
    names.add(user_id)
    for friend_id in user['friends_ids']:
        friend_id = str(friend_id)
        edges.append((user_id, friend_id))
        names.add(friend_id)


The main modification is making names a set, not a list, Your script is checking membership of every friend_id in names, an operation that takes time linear in the size of names. In contrast, membership in a set is done in constant time.

If you need names to be a list, simply do:

names = list(names)


after the processing is completed.

Code Snippets

names = set()
edges = []

for idx, user in enumerate(all_users):
    if idx and not idx % 1000:
        print('Processing {} batch',format(idx))
    user_id = str(user['id_str'])
    names.add(user_id)
    for friend_id in user['friends_ids']:
        friend_id = str(friend_id)
        edges.append((user_id, friend_id))
        names.add(friend_id)
names = list(names)

Context

StackExchange Code Review Q#104275, answer score: 5

Revisions (0)

No revisions yet.