Correct me if I am wrong:
- Performance: the pandas methods are highly optimized for operating on pandas objects (C-level speed optimization).
- Handling NaN values: NaN values are skipped by default (skipna=True), so on their own they never make .any() return True.
- Axis specification: you can specify the axis along which to perform the reduction.
- Different behavior: any(df) checks the truthiness of the column labels themselves, not the individual values within the DataFrame.
- Output is still a Series (or DataFrame).
- It ensures more consistency when working within the pandas framework.
These do two completely different things, so you cannot compare them directly. This is easily verifiable:
In [2]: import numpy as np, pandas as pd
In [3]: df = pd.DataFrame(data=np.random.randint(0,2, size=(10,3)), columns=('a','b','c'))
In [4]: df
Out[4]:
a b c
0 0 1 0
1 1 1 1
2 1 1 0
3 1 1 1
4 0 1 1
5 0 0 0
6 1 0 1
7 0 0 0
8 1 1 0
9 1 0 1
In [5]: df.any()
Out[5]:
a True
b True
c True
dtype: bool
In [6]: any(df)
Out[6]: True
pandas.DataFrame.any is a method that performs an "or" reduction along some axis (by default, axis 0), which results in a pandas.Series object. In contrast, the built-in any takes an iterable and performs the reduction over the items that iterable yields. The result is always a bool object. When you iterate over a pandas DataFrame, you iterate over the column labels, not the values. So for the above df, the operation any(df) is equivalent to:
In [8]: list(df)
Out[8]: ['a', 'b', 'c']
In [9]: any(['a', 'b', 'c'])
Out[9]: True
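To make the point independent of the random data above, here is a minimal sketch with a hypothetical all-zero frame (the name zeros is only for illustration). No cell is truthy, yet the built-in still returns True because it only sees the non-empty label strings:

```python
import pandas as pd

# Every value is zero, so no cell in the frame is truthy.
zeros = pd.DataFrame({"a": [0, 0], "b": [0, 0]})

builtin_result = any(zeros)   # iterates over the labels "a" and "b"
column_result = zeros.any()   # reduces over the values in each column

print(list(zeros))             # ['a', 'b']
print(builtin_result)          # True
print(column_result.tolist())  # [False, False]
```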
Again, note that you can choose the axis for the .any method, as with most methods in pandas:
In [10]: df.any(axis=1)
Out[10]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 False
8 True
9 True
dtype: bool
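The same idea on a small fixed frame, as a sketch (the name small is illustrative; axis=None is assumed to be available, which is the case in reasonably recent pandas versions):

```python
import pandas as pd

small = pd.DataFrame({"a": [0, 1, 0], "b": [0, 0, 0]})

per_column = small.any(axis=0)  # one result per column (the default)
per_row = small.any(axis=1)     # one result per row
overall = small.any(axis=None)  # a single result for the whole frame

print(per_column.tolist())  # [True, False]
print(per_row.tolist())     # [False, True, False]
print(overall)              # True
```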
Note, if you worked with a pd.Series, which iterates over the values in the series, the operation would be (almost) the same:
In [12]: any(df['a'])
Out[12]: True
In [13]: all(df['a'])
Out[13]: False
In [14]: df['a'].any()
Out[14]: True
In [15]: df['a'].all()
Out[15]: False
The one exception is NaN handling: in vanilla Python, float('nan') is treated as truthy, whereas the pandas methods skip NaN values by default (skipna=True).
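A short sketch of that NaN difference, using a hypothetical two-element Series with one NaN and one zero:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0.0])

builtin_result = any(s)              # NaN is truthy for the built-in
skipped_result = s.any()             # NaN is skipped, and 0.0 is falsy
kept_result = s.any(skipna=False)    # NaN is kept and counts as truthy

print(builtin_result)  # True
print(skipped_result)  # False
print(kept_result)     # True
```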
However, you should use the pandas methods for pandas data structures, because they are heavily optimized.
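If you want to check the speed difference on your own machine, a rough timing sketch could look like this. The Series size and repeat count are arbitrary choices, and the numbers will vary between machines and pandas versions, so no expected output is shown:

```python
import timeit

import numpy as np
import pandas as pd

# All False, so neither approach can stop early on a truthy value.
s = pd.Series(np.zeros(100_000, dtype=bool))

builtin_time = timeit.timeit(lambda: any(s), number=5)  # Python-level loop
method_time = timeit.timeit(lambda: s.any(), number=5)  # pandas reduction

print(f"built-in any: {builtin_time:.4f}s, Series.any: {method_time:.4f}s")
```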
Pandas recommends using the Series methods any() and all() rather than the Python built-in functions.
I don't quite understand the source of the strange output you have (I get True in both cases in Python 2.7 and Pandas 0.17.0). But try the following, it should work. This uses Series.any() and Series.all() methods.
import pandas as pd
df = pd.DataFrame()
df['x'] = [1,2,3]
df['y'] = [3,4,5]
print((df['x'] < df['y']).all()) # more pythonic way of
print((df['x'] < df['y']).any()) # doing the same thing
This should print:
True
True
To compare two pd.DataFrame objects for both content and structure equality you can use:
import pandas as pd
def are_df_equal(df: pd.DataFrame, df2: pd.DataFrame) -> bool:
    return df.equals(df2) and (df.all() == df2.all()).all()
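For context, a small sketch of how DataFrame.equals itself behaves on a few hypothetical frames (the names left, same, ints and floats are only for illustration). It treats NaNs in the same position as equal, unlike an element-wise ==, and it is sensitive to dtype:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan]})
same = pd.DataFrame({"x": [1.0, np.nan]})
ints = pd.DataFrame({"x": [1, 2]})
floats = pd.DataFrame({"x": [1.0, 2.0]})

nan_equals = left.equals(same)               # NaNs in the same position match
nan_elementwise = (left == same).all().all() # NaN != NaN element-wise
dtype_equals = ints.equals(floats)           # same values, different dtypes

print(nan_equals)       # True
print(nan_elementwise)  # False
print(dtype_equals)     # False
```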