I suspect what you are looking for is the new df.expanding(..., method='table') in the upcoming pandas 1.3 (see "Other enhancements" in the release notes; note that method='table' is only implemented when engine='numba' is specified in the aggregation call).
In the meantime, you can do it "by hand", using a loop (sorry):
xy = df.values
df['c1 c2 c3'.split()] = np.stack([
    func2(*xy[:n].T) if n >= 3 else np.empty(3) * np.nan
    for n in range(xy.shape[0])
])
Example:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 2).round(2),
columns=['Input', 'Response'])
# the code above, then
>>> df
Input Response c1 c2 c3
0 0.55 0.72 NaN NaN NaN
1 0.60 0.54 NaN NaN NaN
2 0.42 0.65 NaN NaN NaN
3 0.44 0.89 -22.991453 22.840171 -4.887179
4 0.96 0.38 -29.759096 29.213620 -6.298277
5 0.79 0.53 0.454036 -1.369701 1.272156
6 0.57 0.93 0.122450 -0.874260 1.113586
7 0.07 0.09 -1.010312 0.623331 0.696287
8 0.02 0.83 -2.687387 2.995143 -0.079214
9 0.78 0.87 -1.425030 1.294210 0.442684
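The original func2 from the question is not shown, so the values above cannot be reproduced exactly. For a fully self-contained run, here is a hypothetical stand-in: it fits a quadratic through the points seen so far, which needs at least three rows and thus matches the n >= 3 guard above.

```python
import numpy as np
import pandas as pd

def func2(x, y):
    # hypothetical stand-in for the question's func2:
    # fit y = c1*x**2 + c2*x + c3 and return the three coefficients
    return np.polyfit(x, y, deg=2)

np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 2).round(2),
                  columns=['Input', 'Response'])

xy = df.values
df['c1 c2 c3'.split()] = np.stack([
    func2(*xy[:n].T) if n >= 3 else np.empty(3) * np.nan
    for n in range(xy.shape[0])
])
```

Each row i gets the result computed on the rows before it (xy[:i]), so the first three rows stay NaN.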
Answer from Pierre D on Stack Overflow ("Pandas' expanding with apply function on multiple columns").
Related Stack Overflow threads: "Apply pandas function to column to create multiple new columns?", "(Pandas) apply a function to a pd.Series to create two new columns in the pd.DataFrame", "Expand pandas DataFrame column into multiple rows".
I usually do this using zip:
>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
>>> def powers(x):
...     return x, x**2, x**3, x**4, x**5, x**6
>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
...     zip(*df['num'].map(powers))
>>> df
num p1 p2 p3 p4 p5 p6
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
2 2 2 4 8 16 32 64
3 3 3 9 27 81 243 729
4 4 4 16 64 256 1024 4096
5 5 5 25 125 625 3125 15625
6 6 6 36 216 1296 7776 46656
7 7 7 49 343 2401 16807 117649
8 8 8 64 512 4096 32768 262144
9 9 9 81 729 6561 59049 531441
In 2020, I use apply() with the argument result_type='expand':
applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
fn() should return a dict; its keys will be the new column names.
Alternatively you can do a one-liner by also specifying the column names:
df[["col1", "col2", ...]] = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
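A minimal, self-contained illustration of the dict-returning pattern (the function fn, the text column, and the resulting column names are made up for the example):

```python
import pandas as pd

def fn(text):
    # hypothetical fn: return a dict; its keys become the new column names
    return {'n_chars': len(text), 'n_words': len(text.split())}

df = pd.DataFrame({'text': ['hello world', 'pandas']})
applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
print(df)
```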
As a toy example, I have a function that takes a string, and returns the string without spaces + a list of words (separated by spaces) in the string:
def split_string(text):
    list_of_words = text.split(' ')
    melted_text = ''.join(list_of_words)
    return list_of_words, melted_text
And I want to be able to apply this function to a pd.Series (a column of all string values) and return the list_of_words and melted_text columns:
df = pd.DataFrame(data={'example_text': [
    'hello! I love my dog!',
    'hi! I like my cat!',
    'greetings! I hate my goldfish!']})
I then try to apply the function on the example_text column:
df[['list_of_words','melted_text']] = df['example_text'].apply(split_string)
but receive this error:
ValueError: Columns must be same length as key
Any idea what I'm doing wrong?
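The error arises because Series.apply on a tuple-returning function produces a single Series of tuples, so there is only one column to assign to two names. One fix, reusing the zip pattern from the earlier answer (a sketch with the question's own split_string):

```python
import pandas as pd

def split_string(text):
    list_of_words = text.split(' ')
    melted_text = ''.join(list_of_words)
    return list_of_words, melted_text

df = pd.DataFrame(data={'example_text': [
    'hello! I love my dog!',
    'hi! I like my cat!',
    'greetings! I hate my goldfish!']})

# unpack the Series of tuples into two columns instead of assigning it directly
df['list_of_words'], df['melted_text'] = zip(*df['example_text'].apply(split_string))
```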
You could use df.itertuples to iterate through each row, and use a list comprehension to reshape the data into the desired form:
import pandas as pd
df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})
result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
print(result)
yields
0 1
0 1 John
1 3 John
2 5 John
3 7 John
4 2 Eric
5 4 Eric
Divakar's solution, using_repeat, is fastest:
In [48]: %timeit using_repeat(df)
1000 loops, best of 3: 834 µs per loop
In [5]: %timeit using_itertuples(df)
100 loops, best of 3: 3.43 ms per loop
In [7]: %timeit using_apply(df)
1 loop, best of 3: 379 ms per loop
In [8]: %timeit using_append(df)
1 loop, best of 3: 3.59 s per loop
Here is the setup used for the above benchmark:
import numpy as np
import pandas as pd
N = 10**3
df = pd.DataFrame({"name": np.random.choice(list('ABCD'), size=N),
                   "days": [np.random.randint(10, size=np.random.randint(5))
                            for i in range(N)]})

def using_itertuples(df):
    return pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])

def using_repeat(df):
    lens = [len(item) for item in df['days']]
    return pd.DataFrame({"name": np.repeat(df['name'].values, lens),
                         "days": np.concatenate(df['days'].values)})

def using_apply(df):
    return (df.apply(lambda x: pd.Series(x.days), axis=1)
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame('day')
            .join(df['name']))

def using_append(df):
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df2 = pd.DataFrame(columns=df.columns)
    for i, r in df.iterrows():
        for e in r.days:
            new_r = r.copy()
            new_r.days = e
            df2 = df2.append(new_r)
    return df2
New since pandas 0.25 you can use the function explode()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
import pandas as pd
df = pd.DataFrame({"name": "John",
                   "days": [[1, 3, 5, 7]]})
print(df.explode('days'))
prints
name days
0 John 1
0 John 3
0 John 5
0 John 7
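Note that explode repeats the original index (all 0 above). If a fresh 0..n-1 index is wanted, reset_index handles it (pandas >= 1.1 also accepts ignore_index=True directly on explode):

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})
# explode one list element per row, then rebuild a clean RangeIndex
result = df.explode('days').reset_index(drop=True)
print(result)
```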
This addresses the problem, not the general issue of passing multiple columns: I would use groupby and cummax, and then see whether we hit a new value. For example:
grouped = df.groupby("id")["value"]
cummax = grouped.cummax()
cummax_is_new_value = cummax != cummax.groupby(df.id).shift()
df["new_max"] = cummax_is_new_value.astype(int)
gives me
>>> df
id value new_max
0 0 1 1
1 0 3 1
2 0 2 0
3 0 5 1
4 0 4 0
5 1 4 1
6 1 3 0
7 1 2 0
8 1 1 0
9 1 5 1
10 2 1 1
11 2 1 0
12 2 0 0
13 2 1 0
14 3 1 1
Originally I was only checking whether the value was the same as the previous value, but that failed on cases like [1, 0, 1], where the second 1 is both equal to the cumulative maximum and not the same as the previous value. This way we're always working with the grouped cumulative values, and so we really are only picking up the new cumulative values by group.
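The failure mode described above can be demonstrated on a single group with values [1, 0, 1] (a sketch; the "naive" formulation is my reconstruction of the previous-value check, contrasted with the cumulative-maximum check):

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 0, 0], "value": [1, 0, 1]})
grouped = df.groupby("id")["value"]
cummax = grouped.cummax()

# naive: a row is a "new max" if it equals the running max and differs
# from the previous value -- wrongly flags the second 1
naive = ((df["value"] == cummax) & (df["value"] != grouped.shift())).astype(int)

# correct: a row is a "new max" exactly when the running max itself changes
correct = (cummax != cummax.groupby(df["id"]).shift()).astype(int)

print(naive.tolist())    # the second 1 is wrongly flagged as a new max
print(correct.tolist())
```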
It's been a long time since I worked with apply (at least a couple of releases ago), so my recollection may be bad, or things may have changed. However, as I remember it, the grouped data is passed automatically as the first argument.
The temptation when passing your own function to apply is to do this:
def user_func(df, arg1, arg2):
    return whatever_you_like

DF = pd.DataFrame(your_data)
DF.groupby('col1').apply(user_func(arg1, arg2))
but this is not the correct syntax. In fact the correct syntax for the last line is
DF.groupby('col1').apply(user_func, arg1, arg2)
Whether expanding_apply works in the same way I do not know and this may all be totally out of date, but might be worth a shot.
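A runnable sketch of that calling convention (user_func, the column names, and the sample data are made up; extra positional arguments to GroupBy.apply are forwarded to the function after the group itself):

```python
import pandas as pd

def user_func(g, arg1, arg2):
    # g is the group's DataFrame, passed automatically as the first argument
    return g['val'].sum() * arg1 + arg2

DF = pd.DataFrame({'col1': ['a', 'a', 'b'], 'val': [1, 2, 4]})
result = DF.groupby('col1').apply(user_func, 10, 5)
print(result)
```

Keyword arguments work the same way: DF.groupby('col1').apply(user_func, arg1=10, arg2=5).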
A possible solution is to make the expanding part of the function, and use GroupBy.apply:
def foo1(_df):
    return _df['x1'].expanding().max() * _df['x2'].expanding().apply(lambda x: x[-1], raw=True)
df['foo_result'] = df.groupby('group').apply(foo1).reset_index(level=0, drop=True)
print (df)
group time x1 x2 foo_result
0 A 1 10 1 10.0
3 B 1 100 2 200.0
1 A 2 40 2 80.0
4 B 2 200 0 0.0
2 A 3 30 1 40.0
5 B 3 300 3 900.0
This is not a direct solution to the problem of applying a dataframe function to an expanding dataframe, but it achieves the same functionality.
Applying a dataframe function on an expanding window is apparently not possible (at least not for pandas version 0.23.0; EDITED - and also not 1.3.0), as one can see by plugging a print statement into the function.
Running df.groupby('group').expanding().apply(lambda x: bool(print(x)), raw=False) on the given DataFrame (where the bool around the print is just there to get a valid return value) returns:
0 1.0
dtype: float64
0 1.0
1 2.0
dtype: float64
0 1.0
1 2.0
2 3.0
dtype: float64
0 10.0
dtype: float64
0 10.0
1 40.0
dtype: float64
0 10.0
1 40.0
2 30.0
dtype: float64
(and so on - and also returns a dataframe with '0.0' in each cell, of course).
This shows that the expanding window works on a column-by-column basis (we see that first the expanding time series is printed, then x1, and so on), and does not really work on a dataframe - so a dataframe function can't be applied to it.
So, to get the obtained functionality, one would have to put the expanding inside the dataframe function, like in the accepted answer.
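That "expanding inside the function" idea can also be written as a small generic helper (a sketch; the helper name and min_periods handling are my own, and func is any function taking a DataFrame slice and returning a scalar):

```python
import numpy as np
import pandas as pd

def expanding_df_apply(df, func, min_periods=1):
    # call func on each expanding window of whole rows:
    # df.iloc[:1], df.iloc[:2], ..., df.iloc[:len(df)]
    values = [func(df.iloc[:i]) if i >= min_periods else np.nan
              for i in range(1, len(df) + 1)]
    return pd.Series(values, index=df.index)

df = pd.DataFrame({'x1': [10, 40, 30], 'x2': [1, 2, 1]})
# e.g. running max of x1 times the latest x2, as in the accepted answer's foo1
res = expanding_df_apply(df, lambda w: w['x1'].max() * w['x2'].iloc[-1])
print(res.tolist())
```

Because each window is a real DataFrame, the function sees all columns at once, unlike expanding().apply, which works column by column.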