Parsing JSON and exploring relationships in Python

Submitted by: @import:stackexchange-codereview··

Problem

Question:


Given two users, say user1 and usern, print True if there was a chain
of tweets such that user1 tagged user2, user2 tagged user3, ...,
usern-1 tagged usern, or vice versa.


Input Format


The 1st and 2nd lines contain the two users. The 3rd line of the
input contains the name of the file holding the dataset of tweets in
JSON format.


JSON Format


{ "TweetCount" : 1, "TwitterData" : [ { "userName" : "user1",
"tweetid" : 12345, "tweet" : "@user2" } ] }


Output Format


Print "True" ( quotes are for clarity ) if the two mentioned users are
connected, and "False" otherwise.


Constraints:


1 20


Sample Input:

user1
user2
input000.in




input000.in


{ "TweetCount" : 2, "TwitterData" : [ { "userName" : "user1",
"tweetid" : 1, "tweet" : "Tweet 1 @user2" }, { "userName" : "user2",
"tweetid" : 2, "tweet" : "Tweet 2 @user4" } ] }


Sample Output:


True


Explanation:


user1 has user2 tagged in his tweet, hence True

I was wondering if you could take a look at my code and see if

  • there's a better way of solving the problem than the way I did it (i.e. should I construct a graph?)

  • I can make my code more Pythonic

```
import json

# standardize all input names to lower case
user1 = raw_input().lower()
user2 = raw_input().lower()

filename = raw_input()
user1_mentions = []
to_explore = []
reached = [] # list of users reached

with open(filename) as f:
    json = json.load(f) # JSON representation of input file

# find user1's mentions
for tweet in json['TwitterData']:
    if tweet['userName'] == user1:
        user1_mentions += ([x[1:].encode('utf-8') for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum()])
        to_explore = user1_mentions

while to_explore != []:
    temp_to_explore = [] # temp array for storing words to explore next
    for tweet in json['TwitterData']:
        if tweet['userName'] in to_explore:
            temp_to_explore += ([x[1:
```

Solution

Writing Pythonic code

Python has a coding guideline called PEP 8. It contains a lot of relevant information about how to write your code. You'll find various tools to check your code against PEP 8 and make sure you follow it properly.

Among other things:


For sequences, (strings, lists, tuples), use the fact that empty
sequences are false.

In your case, you should write while to_explore instead of while to_explore != [].
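
A small illustration of that guideline, written for Python 3. The list contents and the name `processed` are made up for the demo:

```python
to_explore = ['user2', 'user4']
processed = []

# An empty list is falsy, so the loop ends once everything has been popped.
while to_explore:
    processed.append(to_explore.pop())

print(processed)                      # ['user4', 'user2']
print(bool([]), bool(''), bool(()))   # False False False
```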

A bit of scaffolding

In order to improve your code, it is a good idea to split it into different parts. Here I have split your code into functions. I took this chance to move the part calling the functions behind an if main guard. Also, I have hardcoded some values to make testing easier. Ideally, the different functions would also be documented, but I have skipped that here.

import json

def get_input_data(from_user=False):
    if from_user:
        # standardize all input names to lower case
        user1 = raw_input().lower()
        user2 = raw_input().lower()
        with open(raw_input()) as f:
            json_data = json.load(f) # JSON representation of input file
    else:
        user1 = 'user1'
        user2 = 'user2'
        json_data = json.loads('{ "TweetCount" : 1, "TwitterData" : [ { "userName" : "user1", "tweetid" : 12345, "tweet" : "@user2" } ] }')
    data = json_data['TwitterData']
    return (user1, user2, data)

def chain_exists(user1, user2, data):
    user1_mentions = []
    to_explore = []
    reached = [] # list of users reached

    # find user1's mentions
    for tweet in data:
        if tweet['userName'] == user1:
            user1_mentions += ([x[1:].encode('utf-8') for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum()])
            to_explore = user1_mentions

    while to_explore != []:
        temp_to_explore = [] # temp array for storing words to explore next
        for tweet in data:
            if tweet['userName'] in to_explore:
                temp_to_explore += ([x[1:].encode('utf-8') for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum()])
        reached += to_explore
        to_explore = temp_to_explore

    print reached # ['user2', 'user4']
    return user2 in reached

if __name__ == '__main__':
    user1, user2, data = get_input_data()
    print(chain_exists(user1, user2, data))


It could be interesting to add more hardcoded tests.
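
A sketch of what such tests might look like. It assumes the list-based function above is ported to Python 3 (no `.encode`, no debugging `print`); the helper `mentions`, the `SAMPLE` data and `run_tests` are names invented for this example. The data is kept free of cycles on purpose.

```python
def mentions(tweet_text):
    """Return the user names tagged in a tweet, e.g. 'hi @bob' -> ['bob']."""
    return [w[1:] for w in tweet_text.split()
            if w.startswith('@') and w[1:].isalnum()]


def chain_exists(user1, user2, data):
    """Python 3 port of the list-based function above (same logic)."""
    to_explore = []
    for tweet in data:
        if tweet['userName'] == user1:
            to_explore += mentions(tweet['tweet'])
    reached = []
    while to_explore:
        next_to_explore = []
        for tweet in data:
            if tweet['userName'] in to_explore:
                next_to_explore += mentions(tweet['tweet'])
        reached += to_explore
        to_explore = next_to_explore
    return user2 in reached


# Same tweets as the sample input in the question.
SAMPLE = [
    {"userName": "user1", "tweetid": 1, "tweet": "Tweet 1 @user2"},
    {"userName": "user2", "tweetid": 2, "tweet": "Tweet 2 @user4"},
]


def run_tests():
    assert chain_exists('user1', 'user2', SAMPLE)       # direct mention
    assert chain_exists('user1', 'user4', SAMPLE)       # chain via user2
    assert not chain_exists('user1', 'user3', SAMPLE)   # never mentioned
    assert not chain_exists('user1', 'user2', [])       # no tweets at all


run_tests()
```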

A bit of logic

Every time you add something to user1_mentions, you reassign to_explore. It would make more sense to assign it once, after the loop.

# find user1's mentions
for tweet in data:
    if tweet['userName'] == user1:
        user1_mentions += ([x[1:].encode('utf-8') for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum()])
to_explore = user1_mentions # this doesn't need to be defined before this point


The right data structure

The way you build user1_mentions, it might contain the same value multiple times. This is something we want to avoid because it brings nothing but performance issues. Instead of using lists, you should use sets.

def chain_exists(user1, user2, data):
    user1_mentions = set()
    reached = [] # list of users reached

    # find user1's mentions
    for tweet in data:
        if tweet['userName'] == user1:
            user1_mentions.update([x[1:].encode('utf-8') for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum()])
    to_explore = user1_mentions

    while to_explore:
        temp_to_explore = set() # temp set for storing words to explore next
        for tweet in data:
            if tweet['userName'] in to_explore:
                temp_to_explore.update([x[1:].encode('utf-8') for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum()])
        reached += to_explore
        to_explore = temp_to_explore

    print reached # ['user2', 'user4']
    return user2 in reached
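
To see the difference in isolation, here is a minimal Python 3 sketch with made-up tweet texts. The list keeps the repeated mention, while the set stores it once:

```python
tweets = ["hello @user2", "thanks @user2 and @user3"]

as_list = []
as_set = set()
for text in tweets:
    found = [w[1:] for w in text.split()
             if w.startswith('@') and w[1:].isalnum()]
    as_list += found        # duplicates pile up
    as_set.update(found)    # duplicates are dropped

print(as_list)          # ['user2', 'user2', 'user3']
print(sorted(as_set))   # ['user2', 'user3']
```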


The very right data structure

In order to implement an efficient algorithm, you'll need to preprocess your data into something relevant. Mapping user names to the set of users they mention is likely to be required. Here's a piece of code to do so:

def make_graph_from_data(data):
    graph = {}
    for tweet in data:
        graph.setdefault(
            tweet['userName'],
            set()).update(x[1:] for x in tweet['tweet'].split() if x.startswith('@') and x[1:].isalnum())
    return graph


Once you have the data preprocessed, you won't need to worry about parsing tweets anymore.
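
As an illustration, this is roughly what the mapping looks like for the sample input from the question. The sketch is written for Python 3, and the `data` literal simply restates the sample file:

```python
def make_graph_from_data(data):
    """Map each tweeting user to the set of users they mention."""
    graph = {}
    for tweet in data:
        graph.setdefault(tweet['userName'], set()).update(
            w[1:] for w in tweet['tweet'].split()
            # isalnum must be called; the bare method object is always truthy
            if w.startswith('@') and w[1:].isalnum())
    return graph


data = [
    {"userName": "user1", "tweetid": 1, "tweet": "Tweet 1 @user2"},
    {"userName": "user2", "tweetid": 2, "tweet": "Tweet 2 @user4"},
]

graph = make_graph_from_data(data)
print(graph)                                  # {'user1': {'user2'}, 'user2': {'user4'}}
print('user2' in graph.get('user1', set()))   # True
```

Note that users who are only mentioned and never tweet (user4 here) do not get a key, so lookups should go through `graph.get`.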

A simple bug

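What if a user were to mention themselves? If that user is reachable, to_explore never becomes empty and you get stuck in the loop. The same happens with any cycle of mentions, for example user2 tagging user3 and user3 tagging user2.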
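
One common way around this is to remember which users have already been visited. The following is a minimal Python 3 sketch of a breadth-first search, not the original code; it assumes the tweets have already been turned into a mapping from each user to the set of users they mention, and the example `graph` is invented:

```python
from collections import deque


def chain_exists(graph, start, target):
    """Breadth-first search over a {user: set_of_mentioned_users} mapping."""
    seen = {start}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for mentioned in graph.get(user, ()):
            if mentioned == target:
                return True
            if mentioned not in seen:   # each user is queued at most once
                seen.add(mentioned)
                queue.append(mentioned)
    return False


graph = {
    'user1': {'user2'},
    'user2': {'user2', 'user1'},   # self-mention plus a cycle back to user1
}

print(chain_exists(graph, 'user1', 'user2'))   # True
print(chain_exists(graph, 'user1', 'user9'))   # False, and the search terminates
```

If "or vice versa" in the problem statement means the chain may run in either direction, one option would be to call the function twice with the two users swapped.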

Conclusion

You'll find various solutions to your problem in the literature, so I'll let you have a look at this: it might be a good idea to consider solutions to the shortest path problem by modelling the data as a graph. You are in the case of directed graphs with nonnegative weights. There might be some even more relevant algorithm because you care about the existence of a chain but not s

Context

StackExchange Code Review Q#69138, answer score: 2
