A custom Pandas dataframe to_string method
pandas method to_string custom dataframe
Problem
Oftentimes I find myself converting `pandas.DataFrame` objects to lists of formatted row strings, so I can print the rows into, e.g., a `tkinter.Listbox`. To do this, I have been using `pandas.DataFrame.to_string`. There is a lot of nice functionality built into the method, but when the number of dataframe rows/columns gets relatively large, `to_string` becomes very slow.

Below I implement a custom `pandas.DataFrame` class with a few added methods for returning formatted row lines. I am looking to improve upon the `get_lines_fast_struct` method.
```
import pandas


class DataFrame2(pandas.DataFrame):
    def __init__(self, *args, **kwargs):
        pandas.DataFrame.__init__(self, *args, **kwargs)

    def get_lines_standard(self):
        """standard way to convert pandas dataframe
        to lines with formatted column spacing"""
        lines = self.to_string(index=False).split('\n')
        return lines

    def get_lines_fast_unstruct(self):
        """lighter version of pandas.DataFrame.to_string()
        with no special spacing format"""
        df_recs = self.to_records(index=False)
        col_titles = [' '.join(list(self))]
        # relies on Python 2, where map() returns a list
        col_data = map(lambda rec: ' '.join(map(str, rec)),
                       df_recs.tolist())
        lines = col_titles + col_data
        return lines

    def get_lines_fast_struct(self, col_space=1):
        """lighter version of pandas.DataFrame.to_string()
        with special spacing format"""
        # convert dataframe to array of records
        df_recs = self.to_records(index=False)
        # map each element to string
        str_data = map(lambda rec: map(str, rec), df_recs)
        # max string length in each column (plus col_space), as a list
        self.space = map(lambda x: len(max(x, key=len)) + col_space,
                         zip(*str_data))
        col_titles = [self._format_line(list(self))]
        col_data = [self._format_line(row) for row in str_data]
        lines = col_titles + col_data
        return lines
```
Solution
```
import pandas
np = pandas.np
```
What you are doing here is using the numpy that pandas imports, which can lead to confusion. There is an agreed standard for importing pandas and numpy:
```
import pandas as pd
import numpy as np
```
Importing `numpy` yourself does not load the module twice, as imports are cached. Because numpy is already imported when pandas is imported, your own import only costs a lookup in `sys.modules`, but you gain a lot of readability.

At the end you use `random.choice()` but you never imported `random`.

In `get_lines_standard()` you first convert the complete DataFrame to a string, then split it on the line breaks. In your example you then slice off the top 5 lines to display. The way your code works here, there is no way to show only the top 5 rows without rendering the complete DataFrame, and this applies to all 3 methods.

Just to demonstrate the difference between slicing before and after rendering (using the random data generated at the end of your code, but with 10k rows instead of 1k):
```
# both calls have the same output:
%timeit df.to_string(index=False).split('\n')[:5]
1 loops, best of 3: 1.51 s per loop
%timeit df[:5].to_string(index=False).split('\n')
100 loops, best of 3: 3.38 ms per loop
```
PS: I don't want to pep8ify you, but please don't line up your equal signs.
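The slicing comparison above can be reproduced with a small, self-contained sketch. The helper names `first_lines_render_all` and `first_lines_slice_first` and the random integer data are assumptions made for illustration; they are not part of the original code. Note that the rendered text starts with a header line, so the full render needs `n + 1` lines to cover `n` data rows, and column widths may differ between the two variants because each one only sees the rows it renders.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: 2,000 rows of random integers in four columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(2_000, 4)), columns=list("abcd"))


def first_lines_render_all(frame, n):
    """Render the whole frame, then keep the header plus the first n rows."""
    return frame.to_string(index=False).split("\n")[: n + 1]


def first_lines_slice_first(frame, n):
    """Slice the first n rows, then render only those (header included)."""
    return frame.head(n).to_string(index=False).split("\n")


slow = first_lines_render_all(df, 5)
fast = first_lines_slice_first(df, 5)

# Compare cell contents only; the padding may differ between the two renders.
same_cells = [line.split() for line in slow] == [line.split() for line in fast]
```

Slicing first keeps the rendering cost proportional to the rows that are actually displayed, which is probably what matters for a widget that shows only a few lines at a time.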
/edit:
Ok, let's focus on `get_lines_fast_struct()`. You are doing things manually for which tools already exist:
- Creating a copy of a `DataFrame` with the same values as strings can be accomplished by `str_df = self.astype(str)`
- The maximum lengths of cells per column of such a dataframe can be determined by `self.space = [str_df[c].map(len).max() for c in str_df.columns]`
- For `col_data` you use a list comprehension that just calls a method for each element, which is basically just `map()`
- In `_format_line()` you pad the strings with spaces on the left until they have length `n+1`, with `n` being the maximum column length, and you mix two styles of string formatting (old and new) to do it. `str.rjust()` does the same thing and might be faster.
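The building blocks from the list above can be tried on a tiny frame before wiring them into the class. This is only a sketch: the example data and the names `widths`, `first_row` and `padded` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame with cells of different lengths.
df = pd.DataFrame({"name": ["a", "bbb"], "value": [1, 12345]})

# Every cell becomes a string, so len() works on all of them.
str_df = df.astype(str)

# Longest cell per column (the header is not taken into account here).
widths = [int(str_df[c].map(len).max()) for c in str_df.columns]

# Right-align the first row, leaving one extra space per column.
first_row = str_df.iloc[0].tolist()
padded = [cell.rjust(width + 1) for cell, width in zip(first_row, widths)]
```

Here `widths` comes out as `[3, 5]`, and joining `padded` gives the first data row with each cell right-aligned in its column.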
With all those things in mind, the code might look like this:
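```
def get_lines_fast_struct2(self, col_space=1):
    str_df = self.astype(str)
    # widest cell per column, plus the requested padding
    self.space = [str_df[c].map(len).max() + col_space for c in str_df.columns]
    # list() keeps the concatenation below working where map() is lazy (Python 3)
    col_titles = list(map(self._format_line2, [self.columns]))
    col_data = list(map(self._format_line2, str_df.to_records(index=False)))
    return col_titles + col_data

def _format_line2(self, row_vals):
    # right-align every cell to the width computed for its column
    return "".join(cell.rjust(width) for (cell, width) in zip(row_vals, self.space))
```
Let's compare this with the original in terms of speed and equality: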
```
In [160]: %timeit df.get_lines_fast_struct()
100 loops, best of 3: 11.3 ms per loop
In [161]: %timeit df.get_lines_fast_struct2()
100 loops, best of 3: 9.78 ms per loop
In [162]: df.get_lines_fast_struct() == df.get_lines_fast_struct2()
Out[162]: True
```
Maybe there is an even better way with more `pandas` magic involved, but I am not that experienced with `pandas` yet.
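One possible direction for "more pandas magic" is to let the vectorized string accessor do the padding column by column. The sketch below is an assumption of how that could look, not a measured improvement: `format_rows` is a hypothetical standalone helper, and unlike the methods above it also takes the header length into account when choosing column widths.

```python
import pandas as pd


def format_rows(frame, col_space=1):
    """Return the header and all rows as right-aligned strings (hypothetical helper)."""
    str_df = frame.astype(str)
    header_cells = []
    padded_cols = []
    for i, name in enumerate(map(str, frame.columns)):
        cells = str_df.iloc[:, i]
        # Column width: longest of header and cells, plus the padding.
        width = max([len(name)] + cells.str.len().tolist()) + col_space
        header_cells.append(name.rjust(width))
        # Series.str.rjust pads the whole column in one call.
        padded_cols.append(cells.str.rjust(width).tolist())
    # Stitch the padded columns back together row by row.
    rows = ["".join(parts) for parts in zip(*padded_cols)]
    return ["".join(header_cells)] + rows


lines = format_rows(pd.DataFrame({"name": ["a", "bbb"], "value": [1, 12345]}))
```

Whether this is faster than the `map()`-based version depends on the shape of the data, so it is worth timing both on a realistic frame before switching.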
Context
StackExchange Code Review Q#99063, answer score: 3