Comparison with astype(int)
Tentatively convert your column to int and test with np.array_equal:
np.array_equal(df.v, df.v.astype(int))
True
float.is_integer
You can use this Python float method in conjunction with apply:
df.v.apply(float.is_integer).all()
True
Or, using Python's all with a generator expression, for memory efficiency:
all(x.is_integer() for x in df.v)
True
Answer from coldspeed95 on Stack Overflow
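Put together, both checks look like this (the sample df below is illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical column of whole-number floats
df = pd.DataFrame({'v': [1.0, 2.0, 3.0]})

# comparison with astype(int)
print(np.array_equal(df.v, df.v.astype(int)))  # True

# float.is_integer works because np.float64 subclasses Python's float
print(df.v.apply(float.is_integer).all())  # True
```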
Here's a simpler, and probably faster, approach:
(df[col] % 1 == 0).all()
To ignore nulls:
(df[col].fillna(-9999) % 1 == 0).all()
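A runnable sketch of the modulo check, using a hypothetical column v that contains a null:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': [1.0, 2.0, np.nan]})
col = 'v'

# NaN % 1 is NaN, and NaN == 0 is False, so nulls fail the plain check
print((df[col] % 1 == 0).all())                 # False

# filling with a whole-number sentinel lets nulls pass (-9999 % 1 == 0)
print((df[col].fillna(-9999) % 1 == 0).all())   # True
```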
You can use pd.to_numeric to try to convert the strings to numeric values, then check whether each element in the converted columns is an instance of float using .applymap() with isinstance(x, float). Finally, use .any() to check whether any value in a column is a float:
df.apply(pd.to_numeric, errors="ignore").applymap(lambda x: isinstance(x, float), na_action='ignore').any()
Result:
column1 True
column2 False
dtype: bool
A True value for column1 means it contains at least one element of type float.
This solution can also check each individual element by removing the .any() at the end:
df.apply(pd.to_numeric, errors="ignore").applymap(lambda x: isinstance(x, float), na_action='ignore')
Result:
column1 column2
0 True False
1 True False
2 True False
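Note that errors="ignore" and DataFrame.applymap have been deprecated in recent pandas releases; a version-safe sketch of the same idea might look like this (the helper name to_numeric_or_keep is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'column1': ['12.44', '56.78', '45.87'],
                   'column2': ['cat', 'dog', 'horse']})

def to_numeric_or_keep(s):
    # try numeric conversion; fall back to the original column on failure
    try:
        return pd.to_numeric(s)
    except (ValueError, TypeError):
        return s

converted = df.apply(to_numeric_or_keep)
# Series.map is stable across pandas versions, unlike DataFrame.applymap
mask = converted.apply(lambda s: s.map(lambda x: isinstance(x, float)))
print(mask.any())  # column1 True, column2 False
```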
You can try to convert column by column and catch errors along the way. Columns that won't convert are left unchanged because the exception is raised before the column is reassigned.
import pandas as pd

df = pd.DataFrame({"column1": ["12.44", "56.78", "45.87"],
                   "column2": ["cat", "dog", "horse"],
                   "column3": ["1", "2", "3"]})

for colname in df.columns:
    try:
        df[colname] = df[colname].astype('float')
        print(f"{colname} converted")
    except ValueError:
        print(f"{colname} failed")

print(df.dtypes)
This is more concise:
# select the float columns
df_num = df.select_dtypes(include=['float'])
# select non-numeric columns
df_obj = df.select_dtypes(exclude=['number'])
(np.float was removed from NumPy in 1.24; the string aliases 'float' and 'number' work across versions.)
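A quick usage sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

# string dtype aliases work across NumPy/pandas versions
float_cols = df.select_dtypes(include=['float'])
non_numeric_cols = df.select_dtypes(exclude=['number'])

print(float_cols.columns.tolist())        # ['b']
print(non_numeric_cols.columns.tolist())  # ['c']
```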
You can see what the dtype is for all the columns using the dtypes attribute:
In [11]: df = pd.DataFrame([[1, 'a', 2.]])
In [12]: df
Out[12]:
0 1 2
0 1 a 2
In [13]: df.dtypes
Out[13]:
0 int64
1 object
2 float64
dtype: object
In [14]: df.dtypes == object
Out[14]:
0 False
1 True
2 False
dtype: bool
To access the object columns:
In [15]: df.loc[:, df.dtypes == object]
Out[15]:
1
0 a
I think it's most explicit to use (I'm not sure that inplace would work here):
In [16]: df.loc[:, df.dtypes == object] = df.loc[:, df.dtypes == object].fillna('')
Saying that, I recommend you use NaN for missing data.
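A small sketch of the object-column fillna, with a hypothetical frame mixing numeric and text nulls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1.0, np.nan], 'txt': ['a', np.nan]})

# fill missing values only in the object-dtype columns, leaving numeric NaNs alone
df.loc[:, df.dtypes == object] = df.loc[:, df.dtypes == object].fillna('')
print(df)
```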
One way to do this is to take each value in the column modulo 1: if the result is 0, the value is a whole number, and if every value in the column gives 0, the column can be treated as integer. With that, you can do the following to convert the column to integers.
Check if dataframe contains only float & ints
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                   'col2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'col3': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1]})
print(df.info())

for col in df.columns:
    if all(df[col] % 1 == 0):
        df[col] = df[col].astype(int)

print(df.info())
Sample df is:
col1 col2 col3
0 1 1.0 1.1
1 2 2.0 2.1
2 3 3.0 3.1
3 4 4.0 4.1
4 5 5.0 5.1
5 6 6.0 6.1
Info gives us:
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 6 non-null int64
1 col2 6 non-null float64
2 col3 6 non-null float64
After the check and conversion, the dtypes are:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 6 non-null int64
1 col2 6 non-null int64
2 col3 6 non-null float64
Check if dataframe contains varying dtypes
Since you want to test this only for floats and ints, you can add the condition before you check for mod of 1.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                   'col2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'col3': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                   'col4': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col5': [True, True, True, False, False, False],
                   'col6': pd.date_range('2021.01.01', '2021.01.06').tolist()})
print(df)
print(df.info())

for col in df.columns:
    if df[col].dtype.kind in 'if' and all(df[col] % 1 == 0):
        df[col] = df[col].astype(int)

print(df.info())
dtype.kind can be used to check the column's type: 'i' for int, 'f' for float, 'b' for bool, 'O' for object and 'M' for datetime. More details are available in the numpy.dtype.kind documentation.
The output of this will be:
col1 col2 col3 col4 col5 col6
0 1 1.0 1.1 a True 2021-01-01
1 2 2.0 2.1 b True 2021-01-02
2 3 3.0 3.1 c True 2021-01-03
3 4 4.0 4.1 d False 2021-01-04
4 5 5.0 5.1 e False 2021-01-05
5 6 6.0 6.1 f False 2021-01-06
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 6 non-null int64
1 col2 6 non-null float64
2 col3 6 non-null float64
3 col4 6 non-null object
4 col5 6 non-null bool
5 col6 6 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 374.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 6 non-null int64
1 col2 6 non-null int64
2 col3 6 non-null float64
3 col4 6 non-null object
4 col5 6 non-null bool
5 col6 6 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 374.0+ bytes
None
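To see the kind codes directly, here's a tiny sketch (the column names are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'ints': [1], 'floats': [1.0], 'bools': [True],
                   'objs': ['x'], 'dates': pd.to_datetime(['2021-01-01'])})

# one kind code per column: i = int, f = float, b = bool, O = object, M = datetime
kinds = {col: df[col].dtype.kind for col in df.columns}
print(kinds)
```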
A quick way is to iterate through the dataframe's columns, check whether each is float, compare it to its integer version, and finally concatenate:
from pandas.api.types import is_float_dtype

columns = [col.astype(int)
           if is_float_dtype(col) and col.eq(col.astype(int)).all()
           else col
           for _, col in df.items()]
pd.concat(columns, axis=1)
no col_1 col_2 col_3
0 1 1 0 5.4
1 2 0 1 1.0
There may be a better way though. So, let's wait for more answers from the community.
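For reference, a self-contained sketch of that approach with made-up data:

```python
import pandas as pd
from pandas.api.types import is_float_dtype

df = pd.DataFrame({'no': [1, 2], 'col_1': [1.0, 0.0], 'col_3': [5.4, 1.0]})

# downcast float columns whose values are all whole numbers; leave the rest as-is
columns = [col.astype(int)
           if is_float_dtype(col) and col.eq(col.astype(int)).all()
           else col
           for _, col in df.items()]
out = pd.concat(columns, axis=1)
print(out.dtypes)
```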
Dumping into numpy gives a bit more speed:
arr = df.to_numpy()
filters = df.columns[np.any(arr != arr.astype(int), axis=0)]
df.astype({col: int for col in df if col not in filters})
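A runnable sketch of the numpy variant (the sample data is invented; note it assumes all columns are numeric, since the whole array is cast to int at once):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [1.5, 2.5]})

arr = df.to_numpy()
# columns where truncation to int changes any value must stay float
filters = df.columns[np.any(arr != arr.astype(int), axis=0)]
out = df.astype({col: int for col in df if col not in filters})
print(out.dtypes)
```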
iteritems() returns a tuple like ((123, '1.07'), 1.07), but you want to loop over each value. Just remove .iteritems() and iterate over the column directly, and it will work like a charm:
df['feature2'] = [1.07, 2.08, 'ab', 3.04, 'cde']

for item in df.feature2:
    if isinstance(item, float):
        print('yes')
    else:
        print('no')
Here is your output:
yes
yes
no
yes
no
I think there are two things you need to consider here:
- Methods for dict vs DataFrame
- The difference between dtype (array-scalar types) and type (built-in Python types); reference: https://numpy.org/devdocs/reference/arrays.dtypes.html
Point 1:
.iteritems() / .items() are methods for dictionaries, whereas if you're dealing with dtypes (and judging by the data you've provided), you're likely working with a DataFrame, in which case you don't need .iteritems() to loop through each value. Side note: .iteritems() has been phased out and replaced by .items() (see discussion: When should iteritems() be used instead of items()?)
Point 2:
When using NumPy or pandas, the data types of values imported into a DataFrame are called dtypes. These need to be differentiated from their counterparts in Python, which Python refers to simply as type. The table under the "Pandas Data Types" heading maps each dtype to its Python type (ref: https://pbpython.com/pandas_dtypes.html)
Now, in response to your question, this bit of code should solve your issue:
import pandas as pd

columns = ['feature1', 'feature2', 'feature3']
data = [[123, 1.07, 1],
        [231, 2.08, 3],
        [122, 'ab', 4],
        [111, 3.04, 6],
        [555, 'cde', 8]]
df = pd.DataFrame(data, columns=columns)

for value in df.feature2:
    if isinstance(value, float):
        print('yes')
    else:
        print('no')
I'm working on a classifier based on a logistic regression model. There's a lot to it, but I don't really know how to use Python (this is a class project, and I haven't had any prior Python experience). I'm trying to normalize my data and am using example code from Keras.
My normalization is failing due to an error that reads this:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
So my questions are- how can I see which columns in my dataframe are datatype int?
How can I convert an int to a float? (I assume this will fix my problem.)
I've tried to print using my_dataframe.dtype but that hasn't worked.
Thanks in advance for the help, and sorry for the basic question. Python syntax has me totally lost. I'm having a really hard time understanding the Python/TF/Keras/NumPy docs, and I'm running out of time and can essentially only work on this project in my free time.
Due to imprecise float comparison, you can OR your comparison with np.isclose. isclose takes relative and absolute tolerance parameters, so the following should work:
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(df['actual_credit'], df['min_required_credit'])
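A small reproduction sketch (the 1e-12 offset is artificial, added just to trigger the float-comparison issue):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'actual_credit': [0.3, 0.5, 0.4],
                   'min_required_credit': [0.4, 0.2, 0.4 + 1e-12]})

# ge() alone misses the nearly-equal pair in the last row; OR-ing with isclose catches it
df['result'] = (df['actual_credit'].ge(df['min_required_credit'])
                | np.isclose(df['actual_credit'], df['min_required_credit']))
print(df['result'].tolist())  # [False, True, True]
```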
@EdChum's answer works great, but using the pandas.DataFrame.round function is another clean option that works well without the use of numpy.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df['result'] = df['actual_credit'].round(1) >= df['min_required_credit'].round(1)
print(df)
actual_credit min_required_credit result
0 0.3 0.400 False
1 0.5 0.200 True
2 0.4 0.401 True
3 0.2 0.300 False
You might consider using round() to edit your dataframe more permanently, depending on whether you need that precision. In this example, the OP suggests the extra digits are probably just noise causing confusion.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df = df.round(1)
df['result'] = df['actual_credit'] >= df['min_required_credit']
print(df)
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
Try:
frame[pd.to_numeric(frame.event, errors='coerce').notnull()]
Or even:
frame.query("event != 'None'")
Outputs:
state year event
0 Ohio 2000 1.5
2 Texas 2000 3.6
3 Washington 2000 2.4
5 Nevada 2000 3.2
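A self-contained sketch of the coerce-then-filter approach (the sample frame approximates the one in the question):

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Iowa', 'Texas'],
                      'year': [2000, 2000, 2000],
                      'event': [1.5, 'None', 3.6]})

# coerce turns non-numeric entries into NaN, so notnull() keeps only the numeric rows
numeric_rows = frame[pd.to_numeric(frame.event, errors='coerce').notnull()]
print(numeric_rows.state.tolist())  # ['Ohio', 'Texas']
```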
You can create a subset of the dataframe by checking which values are float in the 'event' column. The reset index is to renumber the index rows.
frame[frame['event'].apply(lambda x: isinstance(x,float))].reset_index(drop=True)