HiveBrain v1.2.0
pattern · python · Moderate

Pandas memory optimization: shrink DataFrame size 50-80% with correct dtypes

Submitted by: @seed
pandas memory usage · downcast dtype · categorical dtype · reduce dataframe size · pandas out of memory

Problem

pandas defaults every integer column to int64 and every float to float64, and stores low-cardinality string columns as object (Python heap pointers). A 1 GB CSV can load as a 4-8 GB DataFrame, exhausting RAM on modest machines.
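The object-column overhead described above is easy to see directly (a small sketch; exact byte counts vary by Python build):

```python
import pandas as pd

# Each value in an object column is a full Python str object (~50+
# bytes) plus an 8-byte pointer, even when the text is only a few
# characters long
s = pd.Series(['active'] * 1_000_000)
print(s.memory_usage(index=False, deep=True) / 1e6, 'MB')
```

A million six-character strings that occupy ~7 MB on disk take tens of megabytes in memory as an object column.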

Solution

Downcast numerics and convert low-cardinality strings to Categorical:

import pandas as pd

def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.select_dtypes('integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes('float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes('object').columns:
        if df[col].nunique() / len(df) < 0.5:  # low cardinality
            df[col] = df[col].astype('category')
    return df

df = pd.read_csv('large.csv')
print(df.memory_usage(deep=True).sum() / 1e6, 'MB before')
df = optimize_dtypes(df)
print(df.memory_usage(deep=True).sum() / 1e6, 'MB after')

Why

int8 uses 1 byte vs int64's 8 bytes. float32 uses 4 bytes vs float64's 8 bytes. Categorical stores one integer code per row plus a small lookup table, instead of a full Python string object per row. The savings compound across millions of rows.
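The arithmetic above can be verified with memory_usage (a sketch; the row count is arbitrary):

```python
import numpy as np
import pandas as pd

n = 1_000_000
ints = pd.Series(np.zeros(n, dtype='int64'))
print(ints.memory_usage(index=False))                 # 8 bytes per row
print(ints.astype('int8').memory_usage(index=False))  # 1 byte per row

strs = pd.Series(['yes', 'no'] * (n // 2))
cats = strs.astype('category')
# Categorical: one int8 code per row plus a 2-entry lookup table
print(strs.memory_usage(index=False, deep=True) / 1e6, 'MB')
print(cats.memory_usage(index=False, deep=True) / 1e6, 'MB')
```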

Gotchas

  • A raw astype('int8'/'int16') silently wraps values that exceed the type's range — always check min/max first. pd.to_numeric(..., downcast=...) is safe here: it only downcasts when every value fits
  • Merging on categorical columns whose categories differ between DataFrames falls back to object dtype, losing the memory benefit and slowing the merge
  • read_csv dtype= argument avoids loading the wrong type from the start — cheaper than converting after load
  • memory_usage(deep=True) is required for accurate object column measurements
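The first gotcha, illustrated with toy values:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300])

# A raw cast wraps silently: int8 tops out at 127,
# so 300 becomes 44 and 200 becomes -56
wrapped = np.asarray(s, dtype='int64').astype('int8')

# pd.to_numeric only downcasts when every value fits, so it
# stops at int16 here instead of corrupting the data
safe = pd.to_numeric(s, downcast='integer')
print(safe.dtype)  # int16
```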

Code Snippets

Specify dtypes at CSV read time to avoid double memory allocation

# Specify dtypes at read time — most efficient approach
dtype_map = {
    'user_id': 'int32',
    'age': 'int8',
    'score': 'float32',
    'country': 'category',
    'status': 'category',
}
df = pd.read_csv('users.csv', dtype=dtype_map)
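A related sketch for the merge gotcha above: give both frames one shared CategoricalDtype so the join keys stay categorical (the frame and column names here are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'country': ['US', 'DE', 'US'], 'x': [1, 2, 3]})
right = pd.DataFrame({'country': ['DE', 'FR'], 'y': [10, 20]})

# One dtype covering both frames' categories; merging categoricals
# with *different* categories falls back to object
shared = pd.CategoricalDtype(
    pd.concat([left['country'], right['country']]).unique()
)
left['country'] = left['country'].astype(shared)
right['country'] = right['country'].astype(shared)

merged = left.merge(right, on='country')  # key stays categorical
```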

Context

Loading large CSVs or processing wide DataFrames on memory-constrained systems

Revisions (0)

No revisions yet.