You can convert most of the columns by just calling convert_objects:
CopyIn [36]:
df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[36]:
Date object
WD int64
Manpower float64
2nd object
CTR object
2ndU float64
T1 int64
T2 int64
T3 int64
T4 float64
dtype: object
For column '2nd' and 'CTR' we can call the vectorised str methods to replace the thousands separator and remove the '%' sign and then astype to convert:
CopyIn [39]:
df['2nd'] = df['2nd'].str.replace(',','').astype(int)
df['CTR'] = df['CTR'].str.replace('%','').astype(np.float64)
df.dtypes
Out[39]:
Date object
WD int64
Manpower float64
2nd int32
CTR float64
2ndU float64
T1 int64
T2 int64
T3 int64
T4 object
dtype: object
In [40]:
df.head()
Out[40]:
Date WD Manpower 2nd CTR 2ndU T1 T2 T3 T4
0 2013/4/6 6 NaN 2645 5.27 0.29 407 533 454 368
1 2013/4/7 7 NaN 2118 5.89 0.31 257 659 583 369
2 2013/4/13 6 NaN 2470 5.38 0.29 354 531 473 383
3 2013/4/14 7 NaN 2033 6.77 0.37 396 748 681 458
4 2013/4/20 6 NaN 2690 5.38 0.29 361 528 541 381
Or you can do the string handling operations above without the call to astype and then call convert_objects to convert everything in one go.
UPDATE
Since version 0.17.0 convert_objects is deprecated and there isn't a top-level function to do this so you need to do:
df.apply(lambda col:pd.to_numeric(col, errors='coerce'))
See the docs and this related question: pandas: to_numeric for multiple columns
Answer from EdChum on Stack OverflowYou can convert most of the columns by just calling convert_objects:
CopyIn [36]:
df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[36]:
Date object
WD int64
Manpower float64
2nd object
CTR object
2ndU float64
T1 int64
T2 int64
T3 int64
T4 float64
dtype: object
For column '2nd' and 'CTR' we can call the vectorised str methods to replace the thousands separator and remove the '%' sign and then astype to convert:
CopyIn [39]:
df['2nd'] = df['2nd'].str.replace(',','').astype(int)
df['CTR'] = df['CTR'].str.replace('%','').astype(np.float64)
df.dtypes
Out[39]:
Date object
WD int64
Manpower float64
2nd int32
CTR float64
2ndU float64
T1 int64
T2 int64
T3 int64
T4 object
dtype: object
In [40]:
df.head()
Out[40]:
Date WD Manpower 2nd CTR 2ndU T1 T2 T3 T4
0 2013/4/6 6 NaN 2645 5.27 0.29 407 533 454 368
1 2013/4/7 7 NaN 2118 5.89 0.31 257 659 583 369
2 2013/4/13 6 NaN 2470 5.38 0.29 354 531 473 383
3 2013/4/14 7 NaN 2033 6.77 0.37 396 748 681 458
4 2013/4/20 6 NaN 2690 5.38 0.29 361 528 541 381
Or you can do the string handling operations above without the call to astype and then call convert_objects to convert everything in one go.
UPDATE
Since version 0.17.0 convert_objects is deprecated and there isn't a top-level function to do this so you need to do:
df.apply(lambda col:pd.to_numeric(col, errors='coerce'))
See the docs and this related question: pandas: to_numeric for multiple columns
convert_objects is deprecated.
For pandas >= 0.17.0, use pd.to_numeric
Copydf["2nd"] = pd.to_numeric(df["2nd"])
How do I convert a pandas column to float?
How do I convert all columns in a pandas DataFrame to float?
How do I convert an object column to float in pandas?
- If the dataframe (say
df) wholly consists offloat64dtypes, you can do:
df = df.astype('float32')
- Only if some columns are
float64, then you'd have to select those columns and change their dtype:
# Select columns with 'float64' dtype
float64_cols = list(df.select_dtypes(include='float64'))
# The same code again calling the columns
df[float64_cols] = df[float64_cols].astype('float32')
Try this:
df[df.select_dtypes(np.float64).columns] = df.select_dtypes(np.float64).astype(np.float32)
I'm creating a pandas dataframe using US census data for a project I'm working on, and I'm having a problem converting the datatype of my columns to a numeric value. I've tried a couple different ways and keep running into the same issue.
dp03_emp = pd.read_csv('DP03_employment.csv',header=1).iloc[1:, :]
dp03_cleaned = dp03_emp[['Percent!!EMPLOYMENT STATUS!!Percent Unemployed',
'Estimate!!INCOME AND BENEFITS (IN 2011 INFLATION-ADJUSTED DOLLARS)!!Median household income (dollars)',
'Estimate!!INCOME AND BENEFITS (IN 2011 INFLATION-ADJUSTED DOLLARS)!!Mean household income (dollars)',
'Percent!!PERCENTAGE OF FAMILIES AND PEOPLE WHOSE INCOME IN THE PAST 12 MONTHS IS BELOW THE POVERTY LEVEL!!All people',
'Geographic Area Name',
'Survey Year']]
columns = ['Percent Unemployed','Median Household Income','Mean Household Income', 'Percent below poverty line']
dp03_cleaned[columns]=pd.to_numeric(columns,errors='coerce')I know that using the 'coerce' argument will assign everything that is not recognized as a number either null or NaN, which is what happens to every value.
dp03_cleaned.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 20110 entries, 1 to 20110 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Percent Unemployed 0 non-null float64 1 Median Household Income 0 non-null float64 2 Mean Household Income 0 non-null float64 3 Percent below poverty line 0 non-null float64 4 Geographic Area Name 20109 non-null object 5 Survey Year 20109 non-null float64 dtypes: float64(5), object(1) memory usage: 942.8+ KB
I've also tried changing the datatype at the point of import in the read_csv dtype option, but it gives an error as well.
Sorry if this is a simple question, I've spent a fair bit of time trying to Google it and haven't had any luck. Thanks in advance for the help.