- If the dataframe (say df) consists wholly of float64 dtypes, you can do:
df = df.astype('float32')
- If only some columns are float64, you'd have to select those columns and change their dtype:
# Select columns with 'float64' dtype
float64_cols = list(df.select_dtypes(include='float64'))
# Convert just those columns to float32
df[float64_cols] = df[float64_cols].astype('float32')
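To see the effect of the selective conversion above, here is a small sketch (the column names and frame size are made up for illustration); the float64 column is downcast while the integer column is left alone, and the frame's memory footprint shrinks:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: one float64 column, one integer column
df = pd.DataFrame({'x': np.random.randn(1000), 'n': np.arange(1000)})

before = df.memory_usage(deep=True).sum()

# Downcast only the float64 columns, leaving 'n' untouched
float64_cols = list(df.select_dtypes(include='float64'))
df[float64_cols] = df[float64_cols].astype('float32')

after = df.memory_usage(deep=True).sum()
print(df['x'].dtype)     # float32
print(after < before)    # True: the float column now takes half the bytes
```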
(The answer above is from Shiva Govindaswamy on Stack Overflow.)
Try this:
df[df.select_dtypes(np.float64).columns] = df.select_dtypes(np.float64).astype(np.float32)
I think this does what you want:
pd.read_csv('Filename.csv').dropna().astype(np.float32)
To keep rows that only have some NaN values, do this:
pd.read_csv('Filename.csv').dropna(how='all').astype(np.float32)
To replace each NaN with a number instead of dropping rows, do this:
pd.read_csv('Filename.csv').fillna(1e6).astype(np.float32)
(I replaced NaN with 1,000,000 just as an example.)
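The three NaN-handling options above behave quite differently, which a small sketch makes concrete (the CSV contents here are invented for illustration):

```python
import numpy as np
import pandas as pd
from io import StringIO

# Hypothetical CSV: one complete row, one fully-empty row, one partial row
csv_text = "a,b\n1.0,2.0\n,\n3.0,\n"

full = pd.read_csv(StringIO(csv_text))
print(len(full))                    # 3 rows read

# how='any' (the default) drops every row containing a NaN
print(len(full.dropna()))           # 1

# how='all' drops only the fully-empty row, keeping the partial one
print(len(full.dropna(how='all')))  # 2

# fillna keeps every row and replaces each NaN with a sentinel value
filled = full.fillna(1e6).astype(np.float32)
print(filled['b'].iloc[2])          # 1000000.0
```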
You can also specify the dtype when you read the csv file:
dtype : Type name or dict of column -> type Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
pd.read_csv(my_file, dtype={col: np.float32 for col in ['col_1', 'col_2']})
Example:
df_out = pd.DataFrame(np.random.random([5,5]), columns=list('ABCDE'))
df_out.iat[1,0] = np.nan
df_out.iat[2,1] = np.nan
df_out.to_csv('my_file.csv')
df = pd.read_csv('my_file.csv', dtype={col: np.float32 for col in list('ABCDE')})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
Unnamed: 0 5 non-null int64
A 4 non-null float32
B 4 non-null float32
C 5 non-null float32
D 5 non-null float32
E 5 non-null float32
dtypes: float32(5), int64(1)
memory usage: 180.0 bytes
>>> df.dropna(axis=0, how='any')
Unnamed: 0 A B C D E
0 0 0.176224 0.943918 0.322430 0.759862 0.028605
3 3 0.723643 0.105813 0.884290 0.589643 0.913065
4 4 0.654378 0.400152 0.763818 0.416423 0.847861
The problem is that you do not do any type conversion of the numpy array. You calculate a float32 value and assign it as an entry into a float64 numpy array, and numpy then converts it back to float64.
Try something like this:
a = np.zeros(4,dtype="float64")
print a.dtype
print type(a[0])
a = np.float32(a)
print a.dtype
print type(a[0])
The output (tested with python 2.7)
float64
<type 'numpy.float64'>
float32
<type 'numpy.float32'>
In your case, a is the array tree.tree_.threshold.
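A minimal sketch of the pitfall described above, in current Python: assigning a float32 scalar into a float64 array silently upcasts it back, whereas converting the whole array keeps the float32 dtype.

```python
import numpy as np

a = np.zeros(4, dtype='float64')
x = np.float32(0.1)       # a genuine float32 scalar
a[0] = x                  # stored into a float64 slot...
print(type(a[0]))         # <class 'numpy.float64'> -- upcast back

# Converting the whole array preserves the float32 dtype
b = np.float32(a)
print(b.dtype)            # float32
print(type(b[0]))         # <class 'numpy.float32'>
```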
Actually, I tried hard but was not able to do this, as 'sklearn.tree._tree.Tree' objects are not writable.
This was causing a precision issue while generating a PMML file, so I raised a bug there, and they provided an updated solution that avoids converting it to float64 internally.
For more info, you can follow this link: Precision Issue
Use numpy.float32:
In [320]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.random.randn(10)})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
a 10 non-null float64
dtypes: float64(1)
memory usage: 160.0 bytes
In [323]:
df['a'].astype(np.float32)
Out[323]:
0 0.966618
1 -0.331942
2 0.906349
3 -0.089582
4 -0.722004
5 0.668103
6 0.230314
7 -1.707631
8 1.806862
9 1.783765
Name: a, dtype: float32
You can see that the dtype is now float32.
There is now a simpler solution than the accepted answer, without needing to import numpy:
.astype('float32')
Examples:
df['store'] = pd.DataFrame(data).astype('float32')
df['rating'] = (df['rating']/2).astype('float32')
You can convert most of the columns by just calling convert_objects:
In [36]:
df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[36]:
Date object
WD int64
Manpower float64
2nd object
CTR object
2ndU float64
T1 int64
T2 int64
T3 int64
T4 float64
dtype: object
For columns '2nd' and 'CTR', we can call the vectorised str methods to replace the thousands separator and remove the '%' sign, and then call astype to convert:
In [39]:
df['2nd'] = df['2nd'].str.replace(',','').astype(int)
df['CTR'] = df['CTR'].str.replace('%','').astype(np.float64)
df.dtypes
Out[39]:
Date object
WD int64
Manpower float64
2nd int32
CTR float64
2ndU float64
T1 int64
T2 int64
T3 int64
T4 object
dtype: object
In [40]:
df.head()
Out[40]:
Date WD Manpower 2nd CTR 2ndU T1 T2 T3 T4
0 2013/4/6 6 NaN 2645 5.27 0.29 407 533 454 368
1 2013/4/7 7 NaN 2118 5.89 0.31 257 659 583 369
2 2013/4/13 6 NaN 2470 5.38 0.29 354 531 473 383
3 2013/4/14 7 NaN 2033 6.77 0.37 396 748 681 458
4 2013/4/20 6 NaN 2690 5.38 0.29 361 528 541 381
Or you can do the string handling operations above without the call to astype and then call convert_objects to convert everything in one go.
UPDATE
Since version 0.17.0, convert_objects is deprecated and there isn't a top-level function to do this, so you need to do:
df.apply(lambda col: pd.to_numeric(col, errors='coerce'))
See the docs and this related question: pandas: to_numeric for multiple columns
convert_objects is deprecated.
For pandas >= 0.17.0, use pd.to_numeric
df["2nd"] = pd.to_numeric(df["2nd"])