patternpythonMinor
Computing the in-degree of a tweet graph
Viewed 0 times
tweetdegreethegraphcomputing
Problem
This was my entry for a recent coding challenge for computing the in-degree of a tweet graph (the competition is over).
The requirement was to compute the in-degree of a graph made from unique hash tags within a given window of time (rolling). Any hash tag (case insensitive) is a node, and two unique hash tags present in the same tweet becomes an edge. There are no weighted edges. Any time a new tweet comes in, the graph is reexamined, and tweets older than the given window time from the latest tweet is discarded.
I would like to get feed back on my coding style, any idioms that I missed, and better ways of doing this. My full repo, including the unit tests is here. The original coding challenge description is here.
``
:param ctime: The time which has to be checked.
:r
The requirement was to compute the in-degree of a graph made from unique hash tags within a given window of time (rolling). Any hash tag (case insensitive) is a node, and two unique hash tags present in the same tweet becomes an edge. There are no weighted edges. Any time a new tweet comes in, the graph is reexamined, and tweets older than the given window time from the latest tweet is discarded.
I would like to get feed back on my coding style, any idioms that I missed, and better ways of doing this. My full repo, including the unit tests is here. The original coding challenge description is here.
``
#!/usr/bin/env python3
"""
This module computes the rolling average vertex degree of a twitter
tweet hashtag graph.
"""
import itertools
import json
import sys
import time
import argparse
import logging
from typing import Dict, Tuple, List, Any, Optional, cast, Iterable
from heapdict import heapdict
TIME_FMT = "%a %b %d %H:%M:%S +0000 %Y"
logging.basicConfig(level=logging.WARNING, format='%(levelname)s %(message)s', stream=sys.stderr)
LOG = logging.getLogger(__name__)
class TweetGraph:
"""
Process the tweet, and keeps track of the time. This implementation uses
a priority queue (heap) containing a tuple (edge,created time) as the
backbone.
"""
def __init__(self, curtime: int, window: int) -> None:
"""
Initialize the TweetGraph
:param curtime: The starting time
:param window: The sliding window
"""
self.latest = curtime
self.edges = {} # type: Dict[Tuple[str, str], int]
self.queue = heapdict()
self.window = window
def in_window(self, ctime: int) -> bool:
"""
Is the passed in time within the window? Note that the formula is
(self.latest - ctime) >= window`:param ctime: The time which has to be checked.
:r
Solution
Just looking at the function
-
The docstring says, "Parse the line into json" but actually it parses the line from JSON (and into a Python dictionary).
-
This function has multiple responsibilities: it parses a tweet, and also logs failures. This makes the code hard to re-use if you have a different logging environment or different requirements about how to handle failures. It would be better to separate these responsibilities and have the caller do the logging (or otherwise handling) of failures.
-
If you are going to log an exception and continue, it's a good idea to log a full traceback (using the
-
The code assumes that
-
The code parses the time and uses it to construct a Unix timestamp. But this time representation is not very convenient to use in Python. It would be better to use
-
The name
get_tweet:-
The docstring says, "Parse the line into json" but actually it parses the line from JSON (and into a Python dictionary).
-
This function has multiple responsibilities: it parses a tweet, and also logs failures. This makes the code hard to re-use if you have a different logging environment or different requirements about how to handle failures. It would be better to separate these responsibilities and have the caller do the logging (or otherwise handling) of failures.
-
If you are going to log an exception and continue, it's a good idea to log a full traceback (using the
traceback module) so that you have as much information as possible. At the moment, the logging in get_tweet doesn't indicate whether the failure happened in json.loads or in time.strptime.-
The code assumes that
json.loads returns a dictionary, but in fact this function can also return a number, string, Boolean, list, or None, and in these cases j.get('created_at', None) would fail with AttributeError.-
The code parses the time and uses it to construct a Unix timestamp. But this time representation is not very convenient to use in Python. It would be better to use
datetime.datetime objects throughout.-
The name
j is not very clear: presumably j is short for JSON but the value of this variable is not a JSON string, but a Python dictionary representing a tweet. So tweet would be a better name.Context
StackExchange Code Review Q#126169, answer score: 2
Revisions (0)
No revisions yet.