As pointed out above, it gives "Down" arbitrarily, but not randomly. On the same machine with the same Pandas version, running the above code should always yield the same result (although it's not guaranteed by the docs, see comments below).
Let's reproduce what's happening.
Given this series:
abc = pd.Series(list("abcdefghijklmnoppqq"))
The value_counts implementation boils down to this:
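import numpy as np
# pandas._libs is private, internal API; names in it can change between pandas versions.
import pandas._libs.hashtable as htable
keys, counts = htable.value_count_object(np.asarray(abc), True)
result = pd.Series(counts, index=keys)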
result:
g 1
e 1
f 1
h 1
o 1
d 1
b 1
q 2
j 1
k 1
i 1
p 2
n 1
l 1
c 1
m 1
a 1
dtype: int64
The order of the result is given by the implementation of the hash table. It is the same for every call.
For more details about the hashing, you could look into the implementation of value_count_object, which calls build_count_table_object, which in turn uses the khash implementation.
After computing the table, the value_counts implementation sorts the results with quicksort. This sort is not stable, and with this specially constructed example it reorders "p" and "q":
result.sort_values(ascending=False)
q 2
p 2
a 1
e 1
f 1
h 1
o 1
d 1
b 1
j 1
m 1
k 1
i 1
n 1
l 1
c 1
g 1
dtype: int64
Thus there are potentially two factors for the ordering: first the hashing, and second the non-stable sort.
The displayed top value is then just the first entry of the sorted list, in this case, "q".
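To make the role of stability concrete, here is a minimal NumPy sketch. It is not the pandas code path: the `counts` array is only a stand-in that mirrors the counts in the hash-table order shown above (positions 7 and 11 hold the two tied entries), and the point is just that `kind="stable"` is the variant that promises to keep ties in input order.

```python
import numpy as np

# Stand-in for the counts in hash-table order: positions 7 and 11 are tied at 2.
counts = np.array([1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1])

# Negate so that an ascending argsort yields descending counts.
stable_order = np.argsort(-counts, kind="stable")

# A stable sort keeps ties in input order, so position 7 precedes position 11.
first_two = stable_order[:2].tolist()
print(first_two)

# kind="quicksort" makes no such promise: the result is still correctly sorted,
# but which of the tied positions comes first is an implementation detail.
quick_order = np.argsort(-counts, kind="quicksort")
print(sorted(quick_order[:2].tolist()))
```

With a stable sort, the tie between the two most frequent entries is decided by the hash-table order alone; with quicksort, it additionally depends on the sort's internals.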
On my machine, quicksort becomes non-stable at 17 entries, which is why I chose the example above.
We can test the non-stable sort with this direct comparison:
pd.Series(list("abcdefghijklmnoppqq")).describe().top
'q'
pd.Series(list( "ppqq")).describe().top
'p'
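If a reproducible answer is needed regardless of hashing and sort stability, one option is to break ties explicitly. The following is a minimal sketch; `deterministic_top` is a hypothetical helper name, not part of pandas, and choosing the smallest label among the tied values is an arbitrary convention picked for illustration.

```python
import pandas as pd


def deterministic_top(values: pd.Series):
    """Return (label, count) of the most frequent value; ties go to the smallest label."""
    counts = values.value_counts()
    max_count = counts.max()
    # Collect every label that reaches the maximum count, then pick the smallest.
    tied_labels = counts[counts == max_count].index.tolist()
    return min(tied_labels), int(max_count)


long_result = deterministic_top(pd.Series(list("abcdefghijklmnoppqq")))
short_result = deterministic_top(pd.Series(list("ppqq")))
print(long_result, short_result)
```

Both series now report the same label, because the result no longer depends on the order in which value_counts returns tied entries.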
Answer from w-m on Stack Overflow
As of pandas v0.15.0, pass include = 'all', as in DataFrame.describe(include = 'all'), to get a summary of all the columns when the dataframe has mixed column types. The default behavior is to only provide a summary for the numerical columns.
Example:
In[1]:
df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df.describe(include = 'all')
Out[1]:
$a $b
count 5 5.000000
unique 4 NaN
top a NaN
freq 2 NaN
mean NaN 2.000000
std NaN 1.581139
min NaN 0.000000
25% NaN 1.000000
50% NaN 2.000000
75% NaN 3.000000
max NaN 4.000000
The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.
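As a small illustration of that NaN pattern, the sketch below rebuilds the same frame and drops the NaN entries per column, which leaves only the statistics that are defined for that column. The variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
summary = df.describe(include='all')

# dropna() keeps only the statistics that apply to each column.
string_stats = summary['$a'].dropna()
numeric_stats = summary['$b'].dropna()

print(list(string_stats.index))   # statistics defined for the string column
print(list(numeric_stats.index))  # statistics defined for the numerical column
```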
Summarizing only numerical or object columns
- To call describe() on just the numerical columns, use describe(include = [np.number]).
- To call describe() on just the objects (strings), use describe(include = ['O']).
In[2]:
df.describe(include = [np.number])
Out[2]:
$b
count 5.000000
mean 2.000000
std 1.581139
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 4.000000
In[3]:
df.describe(include = ['O'])
Out[3]:
$a
count 5
unique 4
top a
freq 2
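The same selection can also be written with the `exclude` parameter of describe(), which is handy when it is easier to say what to leave out. A minimal sketch on the same frame, with variable names that are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})

# Keep only the numerical columns.
numeric_summary = df.describe(include=[np.number])

# Drop the numerical columns, leaving the string column.
non_numeric_summary = df.describe(exclude=[np.number])

print(numeric_summary.columns.tolist())
print(non_numeric_summary.columns.tolist())
```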
Setting pd.options.display.max_columns = DATA.shape[1] will work.
Here DATA is a 2D matrix, and the code above will display the stats vertically.
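A sketch of two related options, under the assumption that the goal is to see the statistics for every column of a wide frame: `pd.option_context` applies the display setting only temporarily, and transposing the describe() output gives one row per column. The frame `wide` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 30 numerical columns named c0 .. c29.
wide = pd.DataFrame(np.arange(60).reshape(2, 30),
                    columns=[f"c{i}" for i in range(30)])

default_limit = pd.get_option("display.max_columns")

# Raise the column limit only inside this block; it is restored afterwards.
with pd.option_context("display.max_columns", wide.shape[1]):
    limit_inside = pd.get_option("display.max_columns")
    print(wide.describe())

limit_after = pd.get_option("display.max_columns")

# Alternatively, transpose so that each column of `wide` becomes one row.
vertical = wide.describe().T
print(vertical)
```

Using the context manager avoids changing the global display settings for the rest of the session.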