Change column type in pandas
Problem
I created a DataFrame from a list of lists:
import pandas as pd

table = [
    ['a', '1.2', '4.2'],
    ['b', '70', '0.03'],
    ['x', '5', '0'],
]
df = pd.DataFrame(table)
How do I convert the columns to specific types? In this case, I want to convert columns 2 and 3 into floats.
Is there a way to specify the types while converting the list to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the dtype for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns, and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
Solution
You have four main options for converting types in pandas:
- to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
- astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
- infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
- convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.
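As a quick orientation, here is a minimal sketch of the first two options applied to the table from the question. It assumes the default integer column labels (0, 1, 2) that pd.DataFrame assigns to a list of lists, so "columns 2 and 3" are the columns labelled 1 and 2. The variable names parsed and cast are only illustrative.

```python
import pandas as pd

table = [
    ["a", "1.2", "4.2"],
    ["b", "70", "0.03"],
    ["x", "5", "0"],
]
df = pd.DataFrame(table)  # columns are labelled 0, 1, 2

# Option 1: let to_numeric parse the second and third columns.
parsed = df[[1, 2]].apply(pd.to_numeric)

# Option 2: name the target dtype explicitly with astype.
cast = df[[1, 2]].astype(float)

print(parsed.dtypes)
print(cast.dtypes)
```

Both calls return new objects, so assign the result back (for example df[[1, 2]] = parsed) to keep the converted columns.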
to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object
>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
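To make the "remember to assign" point concrete, the following sketch uses a small made-up DataFrame (the column names a, b and c are assumptions for illustration). It shows that calling to_numeric without assigning the result leaves the DataFrame untouched, and that each converted column gets its own dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["1", "2", "3"],       # whole numbers stored as strings
    "b": ["1.5", "2", "0.25"],  # decimals stored as strings
    "c": ["x", "y", "z"],       # genuinely non-numeric, left out of the conversion
})

# The return value is discarded here, so df["a"] still holds strings.
pd.to_numeric(df["a"])
dtype_before = df["a"].dtype

# Assigning the result back replaces the two columns with numeric versions.
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
print(df.dtypes)
```

Column a should come out with an integer dtype and column b with a floating-point dtype, since to_numeric picks the type per column.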
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert an entire DataFrame but don't know which of its columns can be converted reliably to a numeric type. In that case, just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
Note that errors='ignore' is deprecated as of pandas 2.2; on recent versions, catch the exception yourself instead.
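For the "hundreds of columns" case in the question, one way to get the leave-it-alone behaviour without relying on errors='ignore' is a small wrapper that catches the exception. This is a sketch, not part of pandas: the helper name to_numeric_where_possible and the sample column names are hypothetical.

```python
import pandas as pd


def to_numeric_where_possible(column):
    """Return the column parsed as numbers, or unchanged if any value fails to parse."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column


df = pd.DataFrame({
    "id": ["a", "b", "x"],        # never numeric
    "price": ["1.2", "70", "5"],  # every value parses
    "qty": ["4", "n/a", "0"],     # one bad value
})

# Columns that parse cleanly are converted; the others come back as they were.
converted = df.apply(to_numeric_where_possible)
print(converted.dtypes)

# With errors="coerce" the bad value becomes NaN instead of blocking the column.
qty_coerced = pd.to_numeric(df["qty"], errors="coerce")
print(qty_coerced)
```

The trade-off is the same as in the text above: the wrapper keeps a partly bad column as strings, while coercion converts it and marks the bad entries as missing.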
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?
to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned' or 'float'. Here's an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32

astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
```
# convert all DataFrame columns to the i
```
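Since the kinds of type listed above are easier to compare side by side, here is a minimal sketch that passes one of each to astype() through a dict mapping column names to dtypes. The DataFrame and its column names are invented for illustration. The last part shows that astype() raises on a value it cannot cast rather than coercing it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": ["1", "2", "3"],
    "ratio": ["0.5", "1.25", "4"],
    "flag": [1, 0, 1],
    "group": ["x", "y", "x"],
})

# A dict maps column names to target dtypes; columns not listed keep their dtype.
typed = df.astype({
    "count": np.int16,    # NumPy dtype
    "ratio": "float32",   # dtype given by name
    "flag": bool,         # Python type
    "group": "category",  # pandas categorical dtype
})
print(typed.dtypes)

# astype has no errors="coerce" equivalent: a value that cannot be cast raises.
cast_failed = False
try:
    pd.Series(["1", "pandas"]).astype(int)
except ValueError as exc:
    cast_failed = True
    print("astype failed:", exc)
```

If the data may contain unparseable strings, to_numeric() with errors='coerce' is usually the safer first step, followed by astype() for the exact width you want.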
Context
Stack Overflow Q#15891038, score: 2631