HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Computing the in-degree of a tweet graph

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
tweetdegreethegraphcomputing

Problem

This was my entry for a recent coding challenge for computing the in-degree of a tweet graph (the competition is over).

The requirement was to compute the in-degree of a graph made from unique hash tags within a given window of time (rolling). Any hash tag (case insensitive) is a node, and two unique hash tags present in the same tweet becomes an edge. There are no weighted edges. Any time a new tweet comes in, the graph is reexamined, and tweets older than the given window time from the latest tweet is discarded.

I would like to get feed back on my coding style, any idioms that I missed, and better ways of doing this. My full repo, including the unit tests is here. The original coding challenge description is here.

``
#!/usr/bin/env python3
"""
This module computes the rolling average vertex degree of a twitter
tweet hashtag graph.
"""
import itertools
import json
import sys
import time
import argparse
import logging
from typing import Dict, Tuple, List, Any, Optional, cast, Iterable

from heapdict import heapdict

TIME_FMT = "%a %b %d %H:%M:%S +0000 %Y"
logging.basicConfig(level=logging.WARNING, format='%(levelname)s %(message)s', stream=sys.stderr)
LOG = logging.getLogger(__name__)

class TweetGraph:
"""
Process the tweet, and keeps track of the time. This implementation uses
a priority queue (heap) containing a tuple (edge,created time) as the
backbone.
"""
def __init__(self, curtime: int, window: int) -> None:
"""
Initialize the TweetGraph
:param curtime: The starting time
:param window: The sliding window
"""
self.latest = curtime
self.edges = {} # type: Dict[Tuple[str, str], int]
self.queue = heapdict()
self.window = window

def in_window(self, ctime: int) -> bool:
"""
Is the passed in time within the window? Note that the formula is
(self.latest - ctime) >= window`
:param ctime: The time which has to be checked.
:r

Solution

Just looking at the function get_tweet:

-
The docstring says, "Parse the line into json" but actually it parses the line from JSON (and into a Python dictionary).

-
This function has multiple responsibilities: it parses a tweet, and also logs failures. This makes the code hard to re-use if you have a different logging environment or different requirements about how to handle failures. It would be better to separate these responsibilities and have the caller do the logging (or otherwise handling) of failures.

-
If you are going to log an exception and continue, it's a good idea to log a full traceback (using the traceback module) so that you have as much information as possible. At the moment, the logging in get_tweet doesn't indicate whether the failure happened in json.loads or in time.strptime.

-
The code assumes that json.loads returns a dictionary, but in fact this function can also return a number, string, Boolean, list, or None, and in these cases j.get('created_at', None) would fail with AttributeError.

-
The code parses the time and uses it to construct a Unix timestamp. But this time representation is not very convenient to use in Python. It would be better to use datetime.datetime objects throughout.

-
The name j is not very clear: presumably j is short for JSON but the value of this variable is not a JSON string, but a Python dictionary representing a tweet. So tweet would be a better name.

Context

StackExchange Code Review Q#126169, answer score: 2

Revisions (0)

No revisions yet.