
Using namedtuple for slice of class

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

I am working with very large datasets in a distributed environment. In particular, I am writing some code in Python for analyzing web logs in Spark. Because of the nature of Map-Reduce computing and the size of the logs I am working with, I would like to restrict the amount of data being passed around as much as possible.

For any given analysis, I am likely to only be interested in a few fields out of the several dozen available, some of which are very large. I don't want to store the whole object in memory; I only want a pared-down version containing the fields I'm interested in.

A simple data structure like a list would be possible, but it would be convenient to be able to refer to fields by name rather than keep track of what position they've been placed in the list. A dict would allow this, but with the heavy overhead of storing key names for each log line. This led me to namedtuple.
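As a rough illustration of that overhead argument, the sketch below compares the container size of a dict and a namedtuple holding the same three values. The `Slice` type and the sample values are made up for the example, and the exact byte counts are specific to the CPython build in use.

```python
import sys
from collections import namedtuple

# Hypothetical three-field slice of a log record (names are illustrative).
Slice = namedtuple('Slice', ['date', 'url', 'customerId'])

as_dict = {'date': '2015-01-01', 'url': '/index.html', 'customerId': '42'}
as_tuple = Slice(**as_dict)

# getsizeof measures only the container, not the strings it references.
dict_size = sys.getsizeof(as_dict)
tuple_size = sys.getsizeof(as_tuple)
print(dict_size, tuple_size)

# The field names are stored once on the class, not on every instance.
print(Slice._fields)
```

The per-record saving comes from the instance being a plain tuple, with the name-to-position mapping kept on the class.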

So here's a simplified example of what I have in mind. I have four main concerns (but any other tips are welcome, as I am quite new to Python):

1. Is using namedtuple here the best choice? (As opposed to, possibly, another object with only a few fields initialized.)

2. Is this going to create a new namedtuple type for each line, thus creating a new list of field names each time and defeating the purpose of avoiding dict?

3. The parsing method makes me long for an auto-increment syntax, so perhaps I'm missing a more Pythonic way of doing it.

4. This method does not seem to offer a clean way of including individual values from the properties dictionary.

The code:

```
import ast
from collections import namedtuple
from IPy import IP

class LogRecord:
    def __init__(self, line):
        self.parse(line)

    def parse(self, line):
        fields = line.split('\t')
        i = 0
        self.version = fields[i]

        i += 1
        self.date = fields[i]

        i += 1
        self.time = long(fields[i])

        i += 1
        self.ipAddress = IP(fields[i])

        # ... (the listing is cut off here in the source)
```

Solution

Just a few ideas.

You could use tuple unpacking to make parse more concise:

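```
import ast
from collections import namedtuple

class LogRecord:
    def __init__(self, line):
        self.parse(line)

    def parse(self, line):
        fields = line.split('\t')
        # Unpack all eight columns at once; a wrong column count raises ValueError.
        self.version, self.date, time, self.ipAddress, self.url, self.userAgentString, properties, self.customerId = fields
        # Convert the columns that are not plain strings.
        self.time = long(time)
        self.properties = ast.literal_eval(properties)

    def select(self, *fields):
        # Build a namedtuple holding only the requested attributes.
        d = { key: getattr(self, key) for key in fields }
        Tuple = namedtuple('Tuple', fields)
        return Tuple(**d)
```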
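On the question of whether a new namedtuple type gets created for each line: calling `namedtuple(...)` inside `select` does build a new class on every call. One possible way around that, sketched here under assumptions, is to cache the class per combination of field names. The helper `tuple_type`, the free function `select`, and the `FakeRecord` stand-in are hypothetical names introduced for this example, not part of the original code.

```python
from collections import namedtuple
from functools import lru_cache

@lru_cache(maxsize=None)
def tuple_type(fields):
    """Return one namedtuple class per distinct tuple of field names."""
    return namedtuple('Tuple', fields)

def select(record, *fields):
    # `fields` arrives as a tuple, so it is hashable and works as a cache key.
    cls = tuple_type(fields)
    return cls(*(getattr(record, name) for name in fields))

class FakeRecord:
    """Stand-in for LogRecord with made-up values."""
    date = '2015-01-01'
    url = '/index.html'
    customerId = '42'

a = select(FakeRecord(), 'date', 'url')
b = select(FakeRecord(), 'date', 'url')
print(type(a) is type(b))  # the class is built once and then reused
```

With this arrangement the list of field names exists once per distinct selection rather than once per log line.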

Alternatively, you could define a list of (name, function to apply) pairs and iterate over it together with the split fields, using zip, to fill a dictionary:

```
import ast
from collections import namedtuple

class LogRecord:
    def __init__(self, line):
        self.parse(line)

    def parse(self, line):
        identity = lambda x: x
        # One (name, converter) pair per column, in column order.
        desc = [('version', identity), ('date', identity), ('time', long), ('ipAddress', identity), ('url', identity), ('userAgentString', identity), ('properties', ast.literal_eval), ('customerId', identity)]

        fields = line.split('\t')
        assert len(desc) == len(fields)
        # Apply each converter to its column and store the result by name.
        self.values = {name: func(field) for (name, func), field in zip(desc, fields)}

    def select(self, *fields):
        # Build a namedtuple holding only the requested values.
        d = { key: self.values[key] for key in fields }
        Tuple = namedtuple('Tuple', fields)
        return Tuple(**d)
```
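To try the table-driven approach on current Python, here is a minimal standalone sketch. It makes a few assumptions that differ from the code above: `int` stands in for Python 2's `long`, a `ValueError` replaces the `assert` (assertions are skipped under `python -O`), and the sample line is invented purely for illustration.

```python
import ast
from collections import namedtuple

def identity(x):
    return x

# (name, converter) pairs in column order; int replaces Python 2's long.
DESC = [
    ('version', identity), ('date', identity), ('time', int),
    ('ipAddress', identity), ('url', identity), ('userAgentString', identity),
    ('properties', ast.literal_eval), ('customerId', identity),
]

def parse(line):
    fields = line.split('\t')
    if len(fields) != len(DESC):
        raise ValueError('expected %d fields, got %d' % (len(DESC), len(fields)))
    return {name: func(field) for (name, func), field in zip(DESC, fields)}

# Made-up sample line, for illustration only.
sample = '\t'.join([
    '1', '2015-01-01', '1420070400', '10.0.0.1', '/index.html',
    'Mozilla/5.0', "{'referrer': 'example'}", '42',
])

values = parse(sample)
Tuple = namedtuple('Tuple', ['time', 'url'])
row = Tuple(time=values['time'], url=values['url'])
print(row)
```

Keeping the column description in one list means adding or reordering a column touches a single place.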

Context

StackExchange Code Review Q#75327, answer score: 4
