Sometimes, you just have to use a for-loop:
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
Answer from unutbu on Stack Overflow
There is no need for a loop: pandas can now do this directly. Pass a list of the columns you want to convert and pandas converts them all at once.
cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')
df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})
>> a b
>> 0 a c
>> 1 b d
>> 2 c e
df.dtypes
>> a object
>> b object
>> dtype: object
df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a category
>> b category
>> dtype: object
I'm working with sentiment data. I'm trying to find the most pythonic way to do a transformation like so:
Original Dataframe
| Index | SENTIMENT | CONFIDENCE |
|---|---|---|
| 0 | Positive | .99 |
| 1 | Negative | .98 |
| 2 | Positive | .9 |
| 3 | Neutral | .8 |
Converted to New Dataframe
| Index | Positive | Negative | Neutral |
|---|---|---|---|
| 0 | .99 | NaN | NaN |
| 1 | NaN | .98 | NaN |
| 2 | .9 | NaN | NaN |
| 3 | NaN | NaN | .8 |
I've been doing this with nested loops forever, and I just know that there's some one-line or two-line solution.
I appreciate the help.
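One possible approach to the reshaping asked about above, offered as a sketch rather than an answer taken from the thread: because each row carries exactly one sentiment label, `DataFrame.pivot` can spread CONFIDENCE across one column per label while keeping the existing index. The variable names `df` and `wide` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "SENTIMENT": ["Positive", "Negative", "Positive", "Neutral"],
    "CONFIDENCE": [0.99, 0.98, 0.9, 0.8],
})

# One column per distinct SENTIMENT value; cells with no matching label become NaN.
wide = df.pivot(columns="SENTIMENT", values="CONFIDENCE")

# pivot sorts the new columns alphabetically, so restore the order from the question.
wide = wide[["Positive", "Negative", "Neutral"]]
print(wide)
```

This relies on the row index being unique; if the same index value could appear with the same label twice, `pivot` would raise and an aggregating tool such as `pivot_table` would be the likelier fit.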
This was just fixed in master and will be in pandas 0.17.0 (see the linked issue).
In [7]: df = DataFrame({'A' : list('aabbcd'), 'B' : list('ffghhe')})
In [8]: df
Out[8]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [9]: df.dtypes
Out[9]:
A object
B object
dtype: object
In [10]: df.apply(lambda x: x.astype('category'))
Out[10]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [11]: df.apply(lambda x: x.astype('category')).dtypes
Out[11]:
A category
B category
dtype: object
Note that since pandas 0.23.0 you no longer need apply to convert multiple columns to categorical data types. You can simply use df[to_convert].astype('category') instead (where to_convert is the set of columns defined in the question).
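A small sketch of that vectorised conversion on a toy frame. The frame, the extra column `n`, and the `to_convert` list are invented for illustration. One detail worth noting is that `astype` returns a new object, so the result has to be assigned back for the change to stick.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabbcd"), "B": list("ffghhe"), "n": range(6)})
to_convert = ["A", "B"]  # hypothetical list of columns to convert

# astype does not modify df in place, so assign the converted columns back.
df[to_convert] = df[to_convert].astype("category")

# Alternative form: a dict maps column name -> target dtype; other columns are untouched.
df2 = pd.DataFrame({"A": list("aabbcd"), "n": range(6)}).astype({"A": "category"})

print(df.dtypes)
print(df2.dtypes)
```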
If your categorical columns are currently character/object you can use something like this to do each one:
char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
for c in char_cols:
    df[c] = pd.factorize(df[c])[0]
If you need to be able to get back to the categories I'd create a dictionary to save the encoding; something like:
char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index
label_mapping = {}
for c in char_cols:
    df[c], label_mapping[c] = pd.factorize(df[c])
Using Julien's mcve will output:
In [3]: print(df)
Out[3]:
a b c d
0 0 0 0 0.155463
1 1 1 1 0.496427
2 0 0 2 0.168625
3 2 0 1 0.209681
4 0 2 1 0.661857
In [4]: print(label_mapping)
Out[4]:
{'a': Index(['Var2', 'Var3', 'Var1'], dtype='object'),
'b': Index(['Var2', 'Var1', 'Var3'], dtype='object'),
'c': Index(['Var3', 'Var2', 'Var1'], dtype='object')}
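To illustrate how the saved mapping can be used to get back to the original labels, here is a minimal sketch on a made-up frame. The column list is spelled out by hand instead of being inferred from dtypes, and `decoded` is a hypothetical name. It assumes there are no missing values, since `pd.factorize` encodes those as -1.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["u", "u", "v"], "d": [0.1, 0.2, 0.3]})
char_cols = ["a", "b"]  # listed explicitly for this sketch

# Encode each column as integer codes and remember the unique labels.
label_mapping = {}
for c in char_cols:
    df[c], label_mapping[c] = pd.factorize(df[c])

# Decode: each integer code is a position in the saved array of unique labels.
decoded = df.copy()
for c, uniques in label_mapping.items():
    decoded[c] = uniques.to_numpy()[decoded[c].to_numpy()]

print(decoded)
```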
First, let's create an MCVE (minimal, complete, verifiable example) to play with:
import pandas as pd
import numpy as np
In [1]: categorical_array = np.random.choice(['Var1', 'Var2', 'Var3'],
            size=(5, 3), p=[0.25, 0.5, 0.25])
        # Name the columns 'a', 'b', 'c', ... (chr(97) is 'a')
        df = pd.DataFrame(categorical_array,
            columns=[chr(97 + i) for i in range(categorical_array.shape[1])])
        # Add another column that isn't categorical but float
        df['d'] = np.random.rand(len(df))
        print(df)
Out[1]:
a b c d
0 Var3 Var3 Var3 0.953153
1 Var1 Var2 Var1 0.924896
2 Var2 Var2 Var2 0.273205
3 Var2 Var1 Var3 0.459676
4 Var2 Var1 Var1 0.114358
Now we can use pd.get_dummies to encode the first three columns.
Note that I'm using the drop_first parameter because N-1 dummies are sufficient to fully describe N possibilities (e.g. if a_Var2 and a_Var3 are both 0, then it's a_Var1).
Also, I'm explicitly specifying the columns, but I don't have to: by default get_dummies encodes the columns with dtype object or category (more below).
In [2]: df_encoded = pd.get_dummies(df, columns=['a','b', 'c'], drop_first=True)
print(df_encoded)
Out[2]:
d a_Var2 a_Var3 b_Var2 b_Var3 c_Var2 c_Var3
0 0.953153 0 1 0 1 0 1
1 0.924896 0 0 1 0 0 0
2 0.273205 1 0 1 0 1 0
3 0.459676 1 0 0 0 0 1
4 0.114358 1 0 0 0 0 0
In your specific application, you'll either have to provide a list of the columns that are categorical or infer which columns are categorical.
In the best case, your dataframe already has these columns with dtype=category and you can pass columns=df.columns[df.dtypes == 'category'] to get_dummies.
Otherwise, I suggest setting the dtype of all other columns as appropriate (hint: pd.to_numeric, pd.to_datetime, etc.). The columns left with an object dtype should then be your categorical columns.
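A hedged sketch of the best-case scenario on a made-up three-row frame. It uses `select_dtypes` to pick out the category columns, which is an alternative to the boolean-mask expression quoted above and is my substitution, not the answer's own code.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Categorical(["Var3", "Var1", "Var2"]),
    "b": pd.Categorical(["Var3", "Var2", "Var2"]),
    "d": [0.95, 0.92, 0.27],
})

# Collect the names of the columns whose dtype is category.
cat_cols = list(df.select_dtypes(include="category").columns)

# Encode only those columns; drop_first drops the first category of each.
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)
print(df_encoded)
```

Depending on the pandas version, the dummy columns may print as 0/1 integers or as True/False booleans.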
The pd.get_dummies parameter columns defaults as follows:
columns : list-like, default None
    Column names in the DataFrame to be encoded.
    If `columns` is None then all the columns with
    `object` or `category` dtype will be converted.
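A quick way to see this default in action, sketched on an invented frame with two string columns and one float column: leaving `columns` out should produce the same dummy columns as naming the string columns explicitly, with the numeric column passed through unchanged. The names `default` and `explicit` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["Var3", "Var1", "Var2"],
    "b": ["Var3", "Var2", "Var2"],
    "d": [0.95, 0.92, 0.27],
})

# columns=None (the default): string-like and category columns are encoded.
default = pd.get_dummies(df)

# Naming the same columns explicitly is expected to give the same layout.
explicit = pd.get_dummies(df, columns=["a", "b"])

print(list(default.columns))
```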