pattern · python · Moderate
Pandas memory optimization: shrink DataFrame size 50-80% with correct dtypes
pandas memory usage · downcast dtype · categorical dtype · reduce dataframe size · pandas out of memory
Problem
pandas defaults every integer column to int64 and every float column to float64, and stores string columns as object (one Python string object per cell, referenced via heap pointers), even when they hold only a handful of distinct values. A 1 GB CSV can therefore load as a 4-8 GB DataFrame, exhausting RAM on modest machines.
Solution
Downcast numerics and convert low-cardinality strings to Categorical:
import pandas as pd
def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.select_dtypes('integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes('float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes('object').columns:
        if df[col].nunique() / len(df) < 0.5:  # low cardinality
            df[col] = df[col].astype('category')
    return df
df = pd.read_csv('large.csv')
print(df.memory_usage(deep=True).sum() / 1e6, 'MB before')
df = optimize_dtypes(df)
print(df.memory_usage(deep=True).sum() / 1e6, 'MB after')
Why
int8 uses 1 byte vs int64's 8 bytes. float32 uses 4 bytes vs float64's 8 bytes. Categorical stores one integer code per row plus a small lookup table, instead of a full Python string object per row. The savings compound across millions of rows.
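To see the per-dtype difference directly, here is a quick illustrative comparison using memory_usage(deep=True); the column contents and sizes are made up for demonstration:
import numpy as np
import pandas as pd

n = 1_000_000
ints = pd.Series(np.random.randint(0, 100, size=n))             # int64 by default
strs = pd.Series(np.random.choice(['US', 'DE', 'JP'], size=n))  # object by default

print(ints.memory_usage(deep=True) / 1e6, 'MB as int64')
print(ints.astype('int8').memory_usage(deep=True) / 1e6, 'MB as int8')
print(strs.memory_usage(deep=True) / 1e6, 'MB as object')
print(strs.astype('category').memory_usage(deep=True) / 1e6, 'MB as category')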
Gotchas
- Forcing int8/int16 with astype silently overflows (wraps around) if values exceed the type's range; always check min/max first (see the sketch after this list). pd.to_numeric with downcast= only shrinks to a width that fits the data, so it avoids this
- Categorical columns carry a performance penalty in some operations such as merge: the join stays fast only when both DataFrames share identical categories, otherwise pandas coerces the key back to object and the advantage is lost
- read_csv dtype= argument avoids loading the wrong type from the start — cheaper than converting after load
- memory_usage(deep=True) is required for accurate object column measurements
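A minimal sketch of the min/max check before forcing a smaller integer type; the downcast_if_safe helper and the example values are hypothetical:
import numpy as np
import pandas as pd

def downcast_if_safe(s: pd.Series, target: str = 'int8') -> pd.Series:
    # Cast only when every value fits the target dtype's range;
    # a plain astype would wrap out-of-range values instead of warning.
    info = np.iinfo(target)
    if s.min() >= info.min and s.max() <= info.max:
        return s.astype(target)
    return s

df = pd.DataFrame({'age': [1, 42, 300]})   # 300 does not fit in int8
df['age'] = downcast_if_safe(df['age'])    # left unchanged as int64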
Code Snippets
Specify dtypes at CSV read time to avoid double memory allocation
# Specify dtypes at read time — most efficient approach
dtype_map = {
    'user_id': 'int32',
    'age': 'int8',
    'score': 'float32',
    'country': 'category',
    'status': 'category',
}
df = pd.read_csv('users.csv', dtype=dtype_map)
Context
Loading large CSVs or processing wide DataFrames on memory-constrained systems