Brave Search

How pandas describe() - top works when multiple elements have highest count?

stackoverflow.com › questions › 56443324 › how-pandas-describe-top-works-when-multiple-elements-have-highest-count

As pointed out above, it gives "Down" arbitrarily, but not randomly. On the same machine with the same Pandas version, running the above code should always yield the same result (although it's not guaranteed by the docs, see comments below).

Let's reproduce what's happening.

Given this series:

import pandas as pd
abc = pd.Series(list("abcdefghijklmnoppqq"))

The value_counts implementation boils down to this:

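import numpy as np
import pandas._libs.hashtable as htable
# value_count_object is a private pandas function; its name and signature may differ between versions
keys, counts = htable.value_count_object(np.asarray(abc), True)
result = pd.Series(counts, index=keys)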

result:

g    1
e    1
f    1
h    1
o    1
d    1
b    1
q    2
j    1
k    1
i    1
p    2
n    1
l    1
c    1
m    1
a    1
dtype: int64

The order of the result is given by the implementation of the hash table. It is the same for every call.

For more details about the hashing, you could look into the implementation of value_count_object, which calls build_count_table_object, which in turn uses the khash implementation.

After computing the table, the value_counts implementation sorts the results with quicksort. This sort is not stable, and with this specially constructed example it reorders "p" and "q":

result.sort_values(ascending=False)

q    2
p    2
a    1
e    1
f    1
h    1
o    1
d    1
b    1
j    1
m    1
k    1
i    1
n    1
l    1
c    1
g    1
dtype: int64

Thus there are potentially two factors for the ordering: first the hashing, and second the non-stable sort.
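
For contrast, a stable sort would leave tied entries in the order the hash table produced them. A minimal sketch, assuming the table order printed above and calling NumPy's argsort directly instead of the pandas internals (the keys and counts arrays are copied by hand for illustration):

```python
import numpy as np

# Keys and counts in the hash-table order printed above (copied by hand).
keys = np.array(list("gefhodbqjkipnlcma"))
counts = np.array([1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1])

# Sorting the negated counts with a stable algorithm gives descending order
# while leaving tied entries in their original relative order.
order = np.argsort(-counts, kind="stable")
sorted_keys = keys[order]

print(sorted_keys[:2])  # the two entries with count 2, in table order
```

With a stable sort, the only remaining source of arbitrariness is the hash-table order itself.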

The displayed top value is then just the first entry of the sorted list, in this case, "q".

On my machine, quicksort becomes non-stable at 17 entries, which is why I chose the example above.

We can test the non-stable sort with this direct comparison:

pd.Series(list("abcdefghijklmnoppqq")).describe().top
'q'

pd.Series(list(               "ppqq")).describe().top
'p'
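
If a reproducible answer matters, the tie can be resolved explicitly instead of relying on sort internals. A minimal sketch, assuming that "the smallest of the tied values" is an acceptable rule; tied_top_values is a hypothetical helper, not part of pandas:

```python
import pandas as pd

def tied_top_values(series):
    """Return every value that shares the highest count, sorted."""
    counts = series.value_counts()
    tied = counts[counts == counts.max()]
    return sorted(tied.index)

abc = pd.Series(list("abcdefghijklmnoppqq"))
candidates = tied_top_values(abc)
print(candidates)      # all values tied for the highest count
print(candidates[0])   # deterministic pick: the smallest tied value
```

Because the tied values are sorted before one is picked, the result no longer depends on the series length or on the hash-table order.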

Answer from w-m on Stack Overflow

Pandas

pandas.pydata.org › docs › reference › api › pandas.DataFrame.describe.html

pandas.DataFrame.describe — pandas 3.0.1 documentation

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Pandas

pandas.pydata.org › pandas-docs › stable › reference › api › pandas.DataFrame.describe.html

pandas.DataFrame.describe — pandas 3.0.2 documentation

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Machine Learning Plus

machinelearningplus.com › blog › pandas describe

Pandas Describe - machinelearningplus

March 8, 2022 -

# create a datetime series
series = pd.date_range(start='27/05/2021', periods=len(df))
# adding dates series to dataframe
df['dates'] = series
# describe function on dates
df.dates.describe()
...

count                       5
unique                      5
top       2021-05-28 00:00:00
freq                        1
first     2021-05-27 00:00:00
last      2021-05-31 00:00:00
Name: dates, dtype: object

You can make pandas recognize date-time values as numeric using datetime_is_numeric.

Medium

medium.com › @heyamit10 › understanding-pandas-describe-9048cb198aa4

Understanding pandas.describe(). I understand that learning data science… | by Hey Amit | Medium

March 6, 2025 -

import pandas as pd

# Sample DataFrame with non-numeric data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR']
}
df = pd.DataFrame(data)

# Describing non-numerical (object) data
print(df.describe(include='object'))

Output:

         Name Department
count       5          5
unique      5          3
top     Alice         HR
freq        1          2

W3Schools

w3schools.com › python › pandas › ref_df_describe.asp

Pandas DataFrame describe() Method

import pandas as pd

data = [[10, 18, 11], [13, 15, 8], [9, 20, 3]]
df = pd.DataFrame(data)
print(df.describe())

Stack Overflow

stackoverflow.com › questions › 53982252 › how-do-i-extract-the-top-value-from-pandas-dataframe-describe

How do I extract the top value from pandas.DataFrame.describe()? - Stack Overflow

Top answer

1 of 1

As with any series, you can access a value by label via the dot notation or the square brackets __getitem__ notation.

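In this case, for a single column (a Series) it's simply df['col'].describe().top or df['col'].describe()['top'], where 'col' stands for your column name. When describe() is called on the whole DataFrame, the result is itself a DataFrame and top is a row label, so select it with df.describe(include='all').loc['top'].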
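
The sketch below, on a made-up frame (the column names dept and age are illustrative only), reads top once from a single column's summary and once from a whole-frame summary, where top is a row label:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["HR", "IT", "IT"],
    "age": [25, 30, 41],
})

# Series.describe() returns a Series, so 'top' is a label in its index.
series_top = df["dept"].describe()["top"]

# DataFrame.describe() returns a DataFrame with one column per input column;
# 'top' is then a row label, selected with .loc.
frame_top = df.describe(include="all").loc["top"]

print(series_top)
print(frame_top["dept"])
```

For the numeric column, the top entry of the whole-frame summary is NaN, since top is only computed for non-numeric data.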

GeeksforGeeks

geeksforgeeks.org › python-pandas-dataframe-describe-method

Pandas DataFrame describe() Method - GeeksforGeeks

June 12, 2025 - The describe() method in Pandas generates descriptive statistics of DataFrame columns which provides key metrics like mean, standard deviation, percentiles and more. It works with numeric data by default but can also handle categorical data ...

datagy

datagy.io › home › pandas tutorials › data analysis in pandas › pandas describe: descriptive statistics on your dataframe

Pandas Describe: Descriptive Statistics on Your Dataframe • datagy

December 15, 2022 - We can see now that all columns are included in the describe method’s output. We can see that this actually includes different metrics, such as unique and top. In Pandas version 1.1, a new argument was introduced.

Note.nkmk.me

note.nkmk.me › home › python › pandas

pandas: Get summary statistics for each column with describe() | note.nkmk.me

January 20, 2024 - top: Mode · freq: Frequency of the mode · mean: Arithmetic mean · std: Sample standard deviation · min: Minimum Value · max: Maximum Value · 50%: Median (50th percentile) 25%, 75%: 25th and 75th percentiles · Specify percentiles to calculate in describe(): percentiles · For datetime64[ns] type · The pandas version used in this article is as follows.

Stack Overflow

stackoverflow.com › questions › 24524104 › pandas-describe-is-not-returning-summary-of-all-columns

python - Pandas 'describe' is not returning summary of all columns - Stack Overflow

Top answer

1 of 7

109

As of pandas v0.15.0, use the parameter DataFrame.describe(include = 'all') to get a summary of all the columns when the dataframe has mixed column types. The default behavior is to only provide a summary for the numerical columns.

Example:

In[1]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df.describe(include = 'all')

Out[1]:

        $a    $b
count   5   5.000000
unique  4   NaN
top     a   NaN
freq    2   NaN
mean    NaN 2.000000
std     NaN 1.581139
min     NaN 0.000000
25%     NaN 1.000000
50%     NaN 2.000000
75%     NaN 3.000000
max     NaN 4.000000

The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.

Summarizing only numerical or object columns

To call describe() on just the numerical columns, use describe(include = [np.number]).

To call describe() on just the objects (strings), use describe(include = ['O']).

In[2]:

df.describe(include = [np.number])

Out[2]:

         $b
count   5.000000
mean    2.000000
std     1.581139
min     0.000000
25%     1.000000
50%     2.000000
75%     3.000000
max     4.000000

In[3]:

df.describe(include = ['O'])

Out[3]:

    $a
count   5
unique  4
top     a
freq    2

2 of 7

pd.options.display.max_columns = DATA.shape[1] will work.

Here DATA is a 2D matrix, and the above code will display stats vertically.

w3resource

w3resource.com › pandas › dataframe › dataframe-describe.php

Pandas DataFrame: describe() function - w3resource

DataFrame.describe(self, percentiles=None, include=None, exclude=None) ... Returns: Series or DataFrame Summary statistics of the Series or Dataframe provided. ... Download the Pandas DataFrame Notebooks from here.

Spark By {Examples}

sparkbyexamples.com › home › pandas › pandas dataframe describe() method

Pandas DataFrame describe() Method - Spark By {Examples}

July 29, 2024 - Using describe(include='all') provides summary statistics for all columns, including count, unique values, the most frequent value (top), and its frequency (freq) for categorical data.

Snowflake Documentation

docs.snowflake.com › en › developer-guide › snowpark › reference › python › 1.8.0 › modin › pandas_api › modin.pandas.DataFrame.describe

modin.pandas.DataFrame.describe | Snowflake Documentation

Generate descriptive statistics for columns in the dataset · For non-numeric columns, computes count (# of non-null items), unique (# of unique items), top (the mode; the element at the lowest position if multiple), and freq (# of times the mode appears) for each column
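
One possible reading of "the element at the lowest position if multiple" is that ties go to the value that appears earliest in the column. A minimal pure-Python sketch of that rule follows; both the reading and the helper name top_by_first_position are assumptions for illustration, not taken from the Snowflake or pandas documentation:

```python
from collections import Counter

def top_by_first_position(values):
    """Return (top, freq), breaking ties by earliest position in the input."""
    counts = Counter(values)
    freq = max(counts.values())
    # Scan in input order; the first value whose total count equals the
    # maximum is the one at the lowest position.
    for value in values:
        if counts[value] == freq:
            return value, freq

print(top_by_first_position(list("abcdefghijklmnoppqq")))
print(top_by_first_position(list("qqpp")))
```

The sketch assumes a non-empty input; an empty list would make max() raise a ValueError.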

Apache

spark.apache.org › docs › latest › api › python › reference › pyspark.pandas › api › pyspark.pandas.DataFrame.describe.html

pyspark.pandas.DataFrame.describe — PySpark 4.1.1 documentation

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items. ... Describing a numeric Series.

Javatpoint

javatpoint.com › pandas-dataframe-describe

Pandas DataFrame.describe() - javatpoint

Pandas DataFrame.describe() with What is Python Pandas, Reading Multiple Files, Null values, Multiple index, Application, Application Basics, Resampling, Plotting the data, Moving windows functions, Series, Read the file, Data operations, Filter Data etc.

Statology

statology.org › home › pandas: how to use describe() for categorical variables

Pandas: How to Use describe() for Categorical Variables

March 8, 2023 - By default, the describe() function in pandas calculates descriptive statistics for all numeric variables in a DataFrame. However, you can use the following methods to calculate descriptive statistics for categorical variables as well: Method 1: Calculate Descriptive Statistics for Categorical Variables ... This method will calculate count, unique, top and freq for each categorical variable in a DataFrame.

Sharp Sight

sharpsight.ai › blog › pandas-describe

Pandas Describe, Explained - Sharp Sight

February 6, 2024 - We’re telling Pandas describe to do this with the code include = ['category']. Notice that the output is similar to the output for string variables (which we saw in example 4). The output includes the count, the number of unique values, the most frequent value (i.e., the ‘top’ value), and the frequency of the most frequent value.

Stack Overflow

stackoverflow.com › questions › 54885821 › what-are-is-the-use-of-top-function-in-describeinclude-all-in-python

pandas - What are is the use of 'top' function in describe(include='all') in python? - Stack Overflow

Top answer

1 of 1

what is the purpose of top function in output, and how it will work?

If you execute:

df.Name.value_counts()

You will see each value in the Name column along with its count. top gives the most frequent of the categorical values.

Example:

import pandas as pd

d = {'Name': pd.Series(['Tom', 'Steve', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78,
                          2.98, 4.80, 4.10, 3.65])}
# Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))

        Name        Age     Rating
count      12  12.000000  12.000000
unique     11        NaN        NaN
top     Steve        NaN        NaN
freq        2        NaN        NaN
mean      NaN  31.833333   3.743333
std       NaN   9.232682   0.661628
min       NaN  23.000000   2.560000
25%       NaN  25.000000   3.230000
50%       NaN  29.500000   3.790000
75%       NaN  35.500000   4.132500
max       NaN  51.000000   4.800000

print(df.Name.value_counts())

Steve     2
Ricky     1
Tom       1
Andres    1
Jack      1
Smith     1
Lee       1
Betina    1
Vin       1
Gasper    1
David     1

Since the count for Steve in the Name column is the highest, Steve is reported as top.
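
The relationship between value_counts() and the top and freq rows can be checked directly. A minimal sketch that reuses the names from the example above; it only holds this cleanly because Steve is the single most frequent value, so no tie-breaking is involved:

```python
import pandas as pd

names = pd.Series(["Tom", "Steve", "Ricky", "Vin", "Steve", "Smith", "Jack",
                   "Lee", "David", "Gasper", "Betina", "Andres"])

counts = names.value_counts()   # sorted by count, highest first
summary = names.describe()

print(counts.index[0], counts.iloc[0])    # most frequent name and its count
print(summary["top"], summary["freq"])    # the same pair as reported by describe()
```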

Statology

statology.org › home › how to use describe() function in pandas (with examples)

How to Use describe() Function in Pandas (With Examples)

August 9, 2021 - Note: If there are missing values in any columns, pandas will automatically exclude these values when calculating the descriptive statistics. To calculate descriptive statistics for every column in the DataFrame, we can use the include='all' argument:

#generate descriptive statistics for all columns
df.describe(include='all')

        team     points   assists   rebounds
count      8   8.000000   8.00000   8.000000
unique     3        NaN       NaN        NaN
top        B        NaN       NaN        NaN
freq       3        NaN       NaN        NaN
mean     NaN  20.250000   7.75000   8.375000
std      NaN   6.158618   2.54951   2.559994
min      NaN  12.000000   4.00000   5.000000
25%      NaN  14.750000   6.50000   6.000000
50%      NaN  21.000000   8.00000   8.500000
75%      NaN  25.000000   9.00000  10.250000
max      NaN  29.000000  12.00000  12.000000