First, to convert a categorical column to its numerical codes, you can do this more easily with dataframe['c'].cat.codes.
Further, you can automatically select all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply the above operation to multiple, automatically selected columns.
First, make an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then use select_dtypes to select the categorical columns and apply .cat.codes to each of them to get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
Note:
- NaN becomes -1.
- This method is fast because the relationship between code and category is readily available and does not need to be computed.
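To make the two notes concrete, here is a minimal sketch (the series `s` and the names `mapping` and `restored` are made up for illustration) showing that a missing value is encoded as -1 and that the code-to-category mapping can be read straight off `.cat.categories`:

```python
import numpy as np
import pandas as pd

# Illustrative categorical series with one missing value
s = pd.Series(['a', 'b', np.nan, 'a'], dtype='category')

codes = s.cat.codes                           # the missing value is encoded as -1
mapping = dict(enumerate(s.cat.categories))   # code -> label, e.g. 0 -> 'a'

# Map codes back to labels; -1 has no entry in the mapping, so it becomes NaN again
restored = codes.map(mapping)
```

Keeping `mapping` around is one way to translate the codes back to labels later.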
This works for me:
pandas.factorize( ['B', 'C', 'D', 'B'] )[0]
Output:
array([0, 1, 2, 0])
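The `[0]` above discards the second return value of `pd.factorize`, which holds the unique labels. A small sketch, assuming you also want to recover the labels later (the input is wrapped in a Series here, and the variable names are illustrative):

```python
import pandas as pd

# factorize returns a pair: integer codes and the unique values they refer to
codes, uniques = pd.factorize(pd.Series(['B', 'C', 'D', 'B']))

# Indexing the uniques with the codes recovers the original labels
labels = uniques[codes]
```

Here `codes` holds 0, 1, 2, 0 and `uniques` holds 'B', 'C', 'D' in order of first appearance.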
How to Transform Categorical Data to Numerical Data Using Pandas
I am writing a python program that uses logistic regression to predict an outcome based on survey data from a csv. However, I'm running into the issue that some survey data is non-numerical. I need to:
- transform categorical data to numerical data, without knowing which columns are categorical or how many categories per column there are ahead of time
- be able to map the numerical data onto the category labels later
Any suggestions on how to approach this? I sincerely appreciate any thoughts!
Example data:
| weight | systolic blood pressure | has diabetes? |
|---|---|---|
| 155 | 119 | no |
| 210 | 131 | yes |
| 301 | 143 | yes |
Example output:
| weight | systolic blood pressure | has diabetes? |
|---|---|---|
| 155 | 119 | 0 |
| 210 | 131 | 1 |
| 301 | 143 | 1 |
diabetes_dict = {
0: "no",
1: "yes"
}
First, change the type of the column:
df.cc = pd.Categorical(df.cc)
Now the data look similar but are stored categorically. To capture the category codes:
df['code'] = df.cc.cat.codes
Now you have:
cc temp code
0 US 37.0 2
1 CA 12.0 1
2 US 35.0 2
3 AU 20.0 0
If you don't want to modify your DataFrame and simply want the codes (.cat is used to access categorical methods):
df.cc.astype('category').cat.codes
Or use the categorical column as an index:
df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
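The snippets above assume an existing df with cc and temp columns. A self-contained sketch, with the data reconstructed from the output table shown earlier, might look like this:

```python
import pandas as pd

# Data reconstructed from the output table in this answer
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU'],
                   'temp': [37.0, 12.0, 35.0, 20.0]})

df['cc'] = pd.Categorical(df['cc'])   # categories are sorted: AU, CA, US
df['code'] = df['cc'].cat.codes       # AU -> 0, CA -> 1, US -> 2

# Alternative: keep temp as the only column and index it by category
df2 = pd.DataFrame(df['temp'])
df2.index = pd.CategoricalIndex(df['cc'])
```

Because pd.Categorical sorts the categories, the codes follow alphabetical order rather than order of appearance.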
If you only wish to transform your series into integer identifiers, you can use pd.factorize.
Note that this solution, unlike pd.Categorical, does not sort alphabetically by default, so the first country encountered is assigned 0. If you wish to start from 1, you can add a constant:
df['code'] = pd.factorize(df['cc'])[0] + 1
print(df)
cc temp code
0 US 37.0 1
1 CA 12.0 2
2 US 35.0 1
3 AU 20.0 3
If you wish to sort alphabetically, specify sort=True:
df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1
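For the survey question above, where the non-numeric columns are not known ahead of time, one possible approach is to combine select_dtypes with pd.factorize and keep a dictionary per column for mapping back. This is only a sketch: the `mappings` name is made up, and the data is the example table from the question.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'weight': [155, 210, 301],
    'systolic blood pressure': [119, 131, 143],
    'has diabetes?': ['no', 'yes', 'yes'],
})

# Select the non-numeric columns without naming them in advance
non_numeric = df.select_dtypes(exclude='number').columns

mappings = {}  # hypothetical name: column -> {code: label}
for col in non_numeric:
    codes, uniques = pd.factorize(df[col], sort=True)
    df[col] = codes                           # replace labels with integer codes
    mappings[col] = dict(enumerate(uniques))  # remember how to map codes back
```

With sort=True the codes follow alphabetical order, so 'no' maps to 0 and 'yes' to 1, which matches the diabetes_dict in the question. Missing values, if any, would be coded as -1 and have no entry in the mapping.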