When should I (not) want to use pandas apply() in my code?


`apply`, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

"If apply is so bad, then why is it in the API?"

DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.

Some of the things apply can do:

Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.

...Among others. For more information, see Row or Column-wise Function Application in the documentation.
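As a rough illustration of a few items from the list above, here is a minimal sketch on the toy DataFrame used later in this answer. The helper `scale` and its `factor`/`offset` parameters are made up for the example and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# Hypothetical user-defined function, used only for illustration.
def scale(col, factor, offset=0):
    return col * factor + offset

# Column-wise (axis=0, the default); extra arguments are forwarded to scale().
scaled = df.apply(scale, args=(2,), offset=1)

# Row-wise (axis=1): each row is passed to the function as a Series.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

# result_type="broadcast" keeps the original shape, repeating each row's sum.
row_sums = df.apply(lambda row: row.sum(), axis=1, result_type="broadcast")

print(scaled, row_range, row_sums, sep="\n\n")
```

Note that `args` must be a tuple, while keyword arguments such as `offset=1` are simply passed through to the function.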

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.
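To make the "iteratively applies your function" point concrete, this small sketch counts how often a row-wise function is actually invoked. The `calls` list and the `add_row` function are illustrative names, not pandas API.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

calls = []  # records every invocation of the Python-level function

def add_row(row):
    calls.append(row.name)  # row.name is the index label of the current row
    return row["A"] + row["B"]

looped = df.apply(add_row, axis=1)  # at least one Python call per row
vectorized = df["A"] + df["B"]      # one call that operates on whole columns

print(len(calls), looped.tolist(), vectorized.tolist())
```

The two results are identical; the difference is that the first one pays the Python function-call cost (plus the cost of building a Series for every row) once per row.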

There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.

pandas 2.2 update: `apply` now supports `engine='numba'`

More info in the release notes as well as GH54666

Choose between the python (default) engine or the numba engine in apply.

The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs:

nopython (compile the function in nopython mode)

nogil (release the GIL inside the JIT compiled function)

parallel (try to apply the function in parallel over the DataFrame)

Note: Due to limitations within numba/how pandas interfaces with numba, you should only use this if raw=True

Let's address the next question.

"How and when should I make my code apply-free?"

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

Numeric Data

If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply for a simple addition operation.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance-wise, there's no comparison: the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it's still twice as slow.

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, seek out vectorized alternatives if possible.
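Vectorized alternatives also exist for simple conditional logic, which is a frequent reason people reach for a row-wise apply. A minimal sketch, assuming a made-up threshold of 3 and made-up labels "big"/"small":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# Row-wise apply: a Python-level branch evaluated once per row.
labels_apply = df.apply(lambda row: "big" if row["A"] > 3 else "small", axis=1)

# Vectorized equivalent: the comparison and the selection run over whole arrays.
labels_where = pd.Series(np.where(df["A"] > 3, "big", "small"), index=df.index)

print(labels_apply.tolist(), labels_where.tolist())
```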

String/Regex

Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name                       Title  Value
0  mickey                  wonderland     20
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.

With apply, this would be done as follows:

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool
 
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
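One possible shape for such a custom function is sketched below. The name `name_in_title` and the choice to treat missing or non-string values as "no match" are assumptions made for this example; adapt the fallback to your data.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: returns False instead of raising on NaN/None/non-strings.
def name_in_title(name, title):
    if not isinstance(name, str) or not isinstance(title, str):
        return False
    return name.lower() in title.lower()

df = pd.DataFrame({
    "Name": ["mickey", "donald", np.nan, "minnie"],
    "Title": ["wonderland", "welcome to donald's castle", "untitled", None],
})

# Same list-comprehension pattern as above, with the guarded helper.
mask = [name_in_title(n, t) for n, t in zip(df["Name"], df["Title"])]
filtered = df[mask]

print(filtered)
```

Only the "donald" row survives here: the row with a missing Name and the row with a missing Title are both treated as non-matches rather than raising an AttributeError.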

For more information on when list comprehensions should be considered a good option, see my writeup: "Are for-loops in pandas really bad? When should I care?".

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime).

Read more at the docs.
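A quick way to convince yourself that the two spellings give the same values is sketched here, using a made-up two-element Series of date strings:

```python
import pandas as pd

s = pd.Series(["2018-12-31", "2019-01-02"])

vectorized = pd.to_datetime(s)   # parses the whole column in one call
looped = s.apply(pd.to_datetime) # calls pd.to_datetime once per element

# Compare element by element; both hold the same timestamps.
same_values = list(vectorized) == list(looped)
print(same_values, vectorized.dt.year.tolist())
```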

A Common Pitfall: Exploding Columns of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
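One detail worth keeping in mind: `s.tolist()` discards the index, so if the Series has a non-default index you will probably want to pass it back to the constructor. A small sketch, where the index labels and the column names "first"/"second" are invented for the example:

```python
import pandas as pd

s = pd.Series([[1, 2], [3, 4], [5, 6]], index=["a", "b", "c"])

# Rebuild the frame from plain Python lists, re-attaching the original index.
expanded = pd.DataFrame(s.tolist(), index=s.index, columns=["first", "second"])

print(expanded)
```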

Lastly,

"Are there any situations where apply is good?"

Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames

What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object

This is an admissible case for apply:

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))

versus

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')

And so on...
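For a concrete, runnable variant of the same comparison, here is a sketch on a made-up two-column frame. The column names, the data, and the pattern "a" are assumptions chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": ["apple", "bob", "cat"], "y": ["dog", "ant", "eel"]})

# String operation on every column: apply versus an explicit concat.
u_apply = df.apply(lambda col: col.str.contains("a"))
u_concat = pd.concat([df[c].str.contains("a") for c in df], axis=1)

# Conversion to category on every column: apply versus an explicit loop.
v_apply = df.apply(lambda col: col.astype("category"))
v_loop = df.copy()
for c in df:
    v_loop[c] = df[c].astype("category")

print(u_apply.equals(u_concat), v_apply.dtypes.tolist(), v_loop.dtypes.tolist())
```

Since the function runs only once per column here, the apply overhead is paid a handful of times rather than once per row, which is why the convenience is usually worth it.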

Converting Series to `str`: `astype` versus `apply`

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.

The graph was plotted using the perfplot library.

import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is integer type.

`GroupBy` operations with chained transformations

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy followed by two primitive operations, such as a "lagged cumsum" (a cumulative sum followed by a shift within each group):

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

You'd need two successive groupby calls here:

df.groupby('A').B.cumsum().groupby(df.A).shift()
 
0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a single call.

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
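If you want to check that the two spellings agree on your own data, a sketch like the following may help. Passing `group_keys=False` is an addition of mine rather than part of the snippets above; on recent pandas versions it keeps the original row index on the apply result so the two Series line up.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabcccddee"), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})

# Two successive groupby calls, using built-in GroupBy methods only.
two_calls = df.groupby("A")["B"].cumsum().groupby(df["A"]).shift()

# A single groupby call, with the chained logic inside apply.
one_call = (
    df.groupby("A", group_keys=False)["B"]
      .apply(lambda x: x.cumsum().shift())
      .sort_index()
)

# NaN never equals NaN, so fill the gaps with a sentinel before comparing.
agree = two_calls.fillna(-1).tolist() == one_call.fillna(-1).tolist()
print(agree)
```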

Other Caveats

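Aside from the caveats mentioned above, it is also worth mentioning that, on older pandas versions, apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation. (DataFrame.apply was changed in pandas 1.1 to evaluate the first row/column only once, so the output below shows the older behaviour.)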

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed in 0.25; see here for more information).

Answer from coldspeed95 on Stack Overflow

Not all `apply`s are alike

The chart below suggests when to consider apply¹. Green means possibly efficient; red means avoid.

Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).

If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
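To show what raw=True changes in practice, here is a minimal sketch on a small made-up frame. With raw=True the function receives a plain NumPy array for each row, so columns must be addressed by position rather than by label.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# Default: each row arrives as a Series, so label-based access works.
as_series = df.apply(lambda row: row["B"] - row["A"], axis=1)

# raw=True: each row arrives as an ndarray, so use positions (0 is A, 1 is B).
as_arrays = df.apply(lambda row: row[1] - row[0], axis=1, raw=True)

# For something this simple, the vectorised form is still the one to prefer.
vectorised = df["B"] - df["A"]

print(as_series.tolist(), as_arrays.tolist(), vectorised.tolist())
```

Skipping the per-row Series construction is where the saving comes from; the trade-off is that index alignment and label access are no longer available inside the function.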

`GroupBy.apply`: generally favoured

Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
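A weighted mean per group is one example of an aggregation with no single native GroupBy method. The sketch below uses invented column names ("g", "value", "weight") and a hypothetical helper `weighted_mean`; the work inside the helper is delegated to NumPy, so only the loop over groups happens in Python.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "g": list("aabb"),
    "value": [1.0, 3.0, 10.0, 20.0],
    "weight": [1, 3, 1, 1],
})

# Hypothetical helper: called once per group, vectorised inside.
def weighted_mean(group):
    return np.average(group["value"], weights=group["weight"])

# Select the needed columns explicitly so the grouping column is not passed in.
result = df.groupby("g")[["value", "weight"]].apply(weighted_mean)

print(result)
```

With two groups the function is called only a couple of times, so the cost is dominated by the vectorised arithmetic rather than by apply itself.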

`pd.DataFrame.apply` column-wise: a mixed bag

pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms

¹ There are exceptions, but these are usually marginal or uncommon. A couple of examples:

df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.
