pattern · python · Moderate · pending

Python Generators for Memory-Efficient Data Processing

Submitted by: @anonymous
Tags: generators, yield, lazy evaluation, memory efficient, pipeline, itertools, batched

Problem

Processing large datasets (millions of rows, multi-gigabyte files) by loading everything into memory at once causes a MemoryError or excessive RAM usage.

Solution

Use generators to process data lazily:

# BAD: loads entire file into memory
def read_large_file_bad(path):
    with open(path) as f:
        return f.readlines()  # All lines in memory

# GOOD: yields one line at a time
def read_large_file(path):
    with open(path) as f:
        for line in f:  # File objects are iterators
            yield line.strip()

# Chain generators for pipeline processing
def parse_csv_lines(lines):
    for line in lines:
        yield line.split(',')

def filter_active(records):
    for record in records:
        if record[2] == 'active':
            yield record

def transform(records):
    for record in records:
        yield {
            'name': record[0],
            'email': record[1],
            'status': record[2]
        }

# Pipeline: processes one record at a time
# Memory usage is O(1) regardless of file size
lines = read_large_file('users.csv')
records = parse_csv_lines(lines)
active = filter_active(records)
users = transform(active)

for user in users:
    process(user)

# Generator expressions (like list comprehensions but lazy)
total = sum(len(line) for line in read_large_file('data.txt'))

# itertools for advanced patterns
import itertools

# Process in batches
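# Requires Python 3.8+ for the walrus operator (:=); Python 3.12+
# also ships itertools.batched, which yields tuples instead of lists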
def batched(iterable, n):
    it = iter(iterable)
    while batch := list(itertools.islice(it, n)):
        yield batch

for batch in batched(read_large_file('huge.csv'), 1000):
    bulk_insert(batch)  # Insert 1000 at a time
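
One caveat on the parser above: split(',') breaks on quoted fields. The stdlib csv module handles quoting and is just as lazy when fed a file object, so it can stand in for the first two pipeline stages. A minimal sketch (read_csv_records is a hypothetical helper name, not part of the pattern above):

import csv

# Drop-in replacement for read_large_file + parse_csv_lines
def read_csv_records(path):
    with open(path, newline='') as f:
        for row in csv.reader(f):  # csv.reader pulls rows from the file lazily
            yield row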

Why

Generators produce values on demand instead of computing and storing everything upfront. Because each stage of a pipeline holds only the current record plus a little iterator state, a generator pipeline processing 10 GB of data might use only a few KB of memory.
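
A quick way to see the difference, assuming CPython (exact byte counts vary by version; the names here are illustrative):

import sys

# A list comprehension materializes every value up front
squares_list = [n * n for n in range(1_000_000)]
# A generator expression stores only its iteration state
squares_gen = (n * n for n in range(1_000_000))

print(sys.getsizeof(squares_list))  # roughly 8 MB for the list object alone
print(sys.getsizeof(squares_gen))   # a couple hundred bytes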

Gotchas

  • Generators can only be iterated once; converting to a list consumes them (see the sketch below)
  • Generator pipelines are lazy; nothing happens until you iterate
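
A minimal sketch of both behaviors:

numbers = (n * n for n in range(5))  # lazy: nothing is computed yet

print(list(numbers))  # [0, 1, 4, 9, 16] -- values computed during iteration
print(list(numbers))  # [] -- the generator is already exhausted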

Context

Processing large datasets in Python
