Using namedtuple for slice of class
Problem
I am working with very large datasets in a distributed environment. In particular, I am writing some code in Python for analyzing web logs in Spark. Because of the nature of Map-Reduce computing and the size of the logs I am working with, I would like to restrict the amount of data being passed around as much as possible.
For any given analysis, I am likely to only be interested in a few fields out of the several dozen available, some of which are very large. I don't want to store the whole object in memory; I only want a pared-down version containing the fields I'm interested in.
A simple data structure like a list would be possible, but it would be convenient to be able to refer to fields by name rather than keep track of what position they've been placed in the list. A dict would allow this, but with the heavy overhead of storing key names for each log line. This led me to `namedtuple`.
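To make that trade-off concrete, here is a rough sketch comparing a named tuple with a dict holding the same three values. It assumes CPython 3; `Hit`, its field names, and the sample values are invented for illustration.

```python
import sys
from collections import namedtuple

# Hypothetical pared-down record; the field names are illustrative only.
Hit = namedtuple('Hit', ['date', 'ip', 'url'])

as_tuple = Hit('2015-01-01', '10.0.0.1', '/index.html')
as_dict = {'date': '2015-01-01', 'ip': '10.0.0.1', 'url': '/index.html'}

# Fields are reachable by name or by position.
assert as_tuple.url == as_tuple[2]

# The field names are stored once, on the class, not on each instance.
print(Hit._fields)

# Shallow container sizes only; the field values themselves are not counted.
print(sys.getsizeof(as_tuple), sys.getsizeof(as_dict))
```

On CPython the named tuple should come out as the smaller container, since each record carries only its values while the names stay on the class.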
So here's a simplified example of what I have in mind. I have four main concerns (but any other tips are welcome, as I am quite new to Python):
- Is using `namedtuple` here the best choice? (As opposed to, possibly, another object with only a few fields initialized.)
- Is this going to create a new `namedtuple` type for each line, thus creating a new list of field names each time and defeating the purpose of avoiding `dict`?
- The parsing method makes me long for an auto-increment syntax, so perhaps I'm missing a more pythonic way of doing it.
- This method does not seem to offer a clean way of including individual values from the `properties` dictionary.
The code:
```
import ast
from collections import namedtuple
from IPy import IP

class LogRecord:
    def __init__(self, line):
        self.parse(line)

    def parse(self, line):
        fields = line.split('\t')
        i = 0
        self.version = fields[i]
        i += 1
        self.date = fields[i]
        i += 1
        self.time = long(fields[i])
        i += 1
        self.ipAddress = IP(fields[i])
        i += 1
        # ... (the listing is truncated here)
```
Solution
Just a few ideas
You could use tuple unpacking to make parse more concise.
```
import ast
from collections import namedtuple

class LogRecord:
    def __init__(self, line):
        self.parse(line)

    def parse(self, line):
        fields = line.split('\t')
        # Unpack all eight tab-separated fields in a single statement.
        self.version, self.date, time, self.ipAddress, self.url, self.userAgentString, properties, self.customerId = fields
        self.time = long(time)  # long is Python 2; use int() on Python 3
        self.properties = ast.literal_eval(properties)

    def select(self, *fields):
        # Build a named tuple holding only the requested attributes.
        d = {key: getattr(self, key) for key in fields}
        Tuple = namedtuple('Tuple', fields)
        return Tuple(**d)
```
Alternatively, you could define a list of (name, function to apply) pairs and iterate through it and the list of fields at the same time with `zip` to fill a dictionary.
```
import ast
from collections import namedtuple

class LogRecord:
    def __init__(self, line):
        self.parse(line)

    def parse(self, line):
        identity = lambda x: x
        # One (field name, converter) pair per tab-separated column, in order.
        # long is Python 2; use int on Python 3.
        desc = [('version', identity), ('date', identity), ('time', long), ('ipAddress', identity), ('url', identity), ('userAgentString', identity), ('properties', ast.literal_eval), ('customerId', identity)]
        fields = line.split('\t')
        assert len(desc) == len(fields)
        self.values = {name: func(field) for (name, func), field in zip(desc, fields)}

    def select(self, *fields):
        # Build a named tuple holding only the requested values.
        d = {key: self.values[key] for key in fields}
        Tuple = namedtuple('Tuple', fields)
        return Tuple(**d)
```
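Neither version of `select` addresses the question's second concern: `namedtuple(...)` is called on every invocation, so every selected record gets a freshly built class. A minimal sketch of one way around that, assuming Python 3 and a dict of already-parsed values like `self.values`; the names `tuple_type_for` and `Selection` and the sample data are hypothetical:

```python
from collections import namedtuple
from functools import lru_cache

@lru_cache(maxsize=None)
def tuple_type_for(fields):
    """Build the namedtuple class once per distinct tuple of field names."""
    return namedtuple('Selection', fields)

def select(values, *fields):
    """Pick `fields` out of a dict of parsed values, reusing the cached class."""
    cls = tuple_type_for(fields)  # `fields` is a tuple of str, so it is hashable
    return cls(*(values[name] for name in fields))

# Two records selected with the same field list share a single class.
row1 = select({'date': '2015-01-01', 'url': '/a', 'customerId': '7'}, 'date', 'url')
row2 = select({'date': '2015-01-02', 'url': '/b', 'customerId': '8'}, 'date', 'url')
print(row1.url, type(row1) is type(row2))
```

The cache is keyed on the ordered tuple of names, so `('date', 'url')` and `('url', 'date')` produce two different classes.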
Context
StackExchange Code Review Q#75327, answer score: 4