Python Generators for Memory-Efficient Data Processing
Tags: generators, yield, lazy evaluation, memory efficient, pipeline, itertools, batched
Problem
Loading a large dataset (millions of rows, multi-gigabyte files) entirely into memory before processing it causes MemoryError or excessive RAM usage.
Solution
Use generators to process data lazily:
# BAD: loads the entire file into memory
def read_large_file_bad(path):
    with open(path) as f:
        return f.readlines()  # all lines in memory at once

# GOOD: yields one line at a time
def read_large_file(path):
    with open(path) as f:
        for line in f:  # file objects are lazy iterators
            yield line.strip()

# Chain generators for pipeline processing
def parse_csv_lines(lines):
    for line in lines:
        yield line.split(',')

def filter_active(records):
    for record in records:
        if record[2] == 'active':
            yield record

def transform(records):
    for record in records:
        yield {
            'name': record[0],
            'email': record[1],
            'status': record[2],
        }

# Pipeline: processes one record at a time.
# Memory usage is O(1) regardless of file size.
lines = read_large_file('users.csv')
records = parse_csv_lines(lines)
active = filter_active(records)
users = transform(active)

for user in users:
    process(user)
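
# Note: splitting on ',' is a simplification. For real CSV data with quoted
# fields, the csv module handles parsing and still consumes any iterable of
# lines lazily. A sketch (process_row is a hypothetical handler, not part of
# the pipeline above):
import csv
for row in csv.reader(read_large_file('users.csv')):
    process_row(row)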
# Generator expressions (like list comprehensions but lazy)
total = sum(len(line) for line in read_large_file('data.txt'))
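# Generator expressions can also replace short pipeline stages; for example,
# the filter stage above could be written as (an equivalent sketch):
# active = (r for r in records if r[2] == 'active')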
# itertools for advanced patterns
import itertools
# Process in batches
def batched(iterable, n):
    it = iter(iterable)
    while batch := list(itertools.islice(it, n)):
        yield batch

for batch in batched(read_large_file('huge.csv'), 1000):
    bulk_insert(batch)  # insert 1000 rows at a time
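
# On Python 3.12+, the standard library's itertools.batched does the same
# job as the helper above (it yields tuples rather than lists):
# for batch in itertools.batched(read_large_file('huge.csv'), 1000):
#     bulk_insert(list(batch))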
Why
Generators produce values on demand instead of computing and storing everything up front. Because each pipeline stage holds only the item it is currently working on, a generator pipeline processing 10 GB of data might use only a few KB of memory.
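A quick way to see the difference (a rough sketch; exact byte counts vary by Python version and platform):

import sys

# The list materializes a million results up front; the generator is just a
# small object holding the iteration state.
as_list = [n * 2 for n in range(1_000_000)]
as_gen = (n * 2 for n in range(1_000_000))

print(sys.getsizeof(as_list))  # several MB for the list object alone
print(sys.getsizeof(as_gen))   # a few hundred bytes, independent of range size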
Gotchas
- Generators can only be iterated ONCE; converting one to a list (or otherwise consuming it) exhausts it (see the sketch after this list)
- Generator pipelines are lazy: nothing runs until you start iterating the final stage
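A minimal sketch of both gotchas:

squares = (x * x for x in range(5))
print(list(squares))  # [0, 1, 4, 9, 16]
print(list(squares))  # [] -- the generator is already exhausted

def tag(records):
    for record in records:
        print('processing', record)  # shows when work actually happens
        yield record

pipeline = tag(['a', 'b'])  # nothing printed yet: the pipeline is lazy
result = list(pipeline)     # 'processing a' / 'processing b' print here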
Context
Processing large datasets in Python