On checking, I found my Polars version:

pl.__version__

0.17.3

https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby.html

I need to do:

df.groupby("a").agg(pl.col("b").sum())  # there is no underscore in groupby

# output
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ c   ┆ 3   │
│ b   ┆ 5   │
└─────┴─────┘

and the documentation says:

Deprecated since version 0.19.0: This method has been renamed to DataFrame.group_by().

Here is the new documentation for Polars 0.19:

https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html#polars-dataframe-group-by

Answer from Talha Tayyab on Stack Overflow
python - How to use group_by and apply a custom function with Polars? - Stack Overflow
python - Sample from each group in polars dataframe? - Stack Overflow
Top answer
1 of 2
35

Polars has the pl.corr() function, which supports method="spearman".

If you want to use a custom function you could do it like this:

Custom function on multiple columns/expressions

import polars as pl
from typing import List
from scipy import stats

df = pl.DataFrame({
    "g": [1, 1, 1, 2, 2, 2, 5],
    "a": [2, 4, 5, 190, 1, 4, 1],
    "b": [1, 3, 2, 1, 43, 3, 1]
})

def get_score(args: List[pl.Series]) -> pl.Series:
    return pl.Series([stats.spearmanr(args[0], args[1]).correlation], dtype=pl.Float64)

(df.group_by("g", maintain_order=True)
 .agg(
    pl.map_groups(
        exprs=["a", "b"], 
        function=get_score).alias("corr")
 ))

Polars provided function

(df.group_by("g", maintain_order=True)
 .agg(
     pl.corr("a", "b", method="spearman").alias("corr")
 ))

Both output:

shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ g   ┆ corr β”‚
β”‚ --- ┆ ---  β”‚
β”‚ i64 ┆ f64  β”‚
β•žβ•β•β•β•β•β•ͺ══════║
β”‚ 1   ┆ 0.5  β”‚
β”‚ 2   ┆ -1.0 β”‚
β”‚ 5   ┆ NaN  β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜

Custom function on a single column/expression

We can also apply custom functions to a single expression, via .map_elements

Below is an example of how we can square a column with a custom function and with normal Polars expressions. The expression syntax should always be preferred, as it's a lot faster.

(df.group_by("g")
 .agg(
     pl.col("a").map_elements(lambda group: group**2).alias("squared1"),
     (pl.col("a")**2).alias("squared2")
 ))
2 of 2
2

This seems to be a gap in the Polars API relative to pandas. While pandas can run grouped operations with arbitrary functions and return the result as a DataFrame with the group keys attached, .map_groups() receives no information about the groups, so that information is lost.

Here's an approach using a pl.DataFrame namespace:

import polars as pl
from collections.abc import Callable
from scipy.stats import spearmanr
 
df = pl.DataFrame({
    "era": [1, 1, 1, 2, 2, 2, 5],
    "prediction": [2, 4, 5, 190, 1, 4, 1],
    "target": [1, 3, 2, 1, 43, 3, 1]
})

def with_group_keys(fun: Callable[[pl.DataFrame], pl.DataFrame], by: list[str]):
    def wrapped(g: pl.DataFrame) -> pl.DataFrame:
        keys = g.select(by).row(0, named=True)
        res = fun(g)
        if not isinstance(res, pl.DataFrame):
            raise TypeError("fun(g) must return a Polars DataFrame")
        if res.height != 1:
            raise ValueError("fun(g) must return exactly one row per group")
        return pl.DataFrame({k: [keys[k]] for k in by}).hstack(res)
    return wrapped

@pl.api.register_dataframe_namespace("groups")
class EraPLNamespace:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def map(self, by: list[str], fun: Callable[[pl.DataFrame], pl.DataFrame]) -> pl.DataFrame:
        return self._df.group_by(*by).map_groups(with_group_keys(fun, by))

def get_score(g: pl.DataFrame) -> pl.DataFrame:
    return pl.DataFrame({"corr": [spearmanr(g["prediction"], g["target"]).correlation]})

# usage
out = df.groups.map(["era"], get_score)

out
shape: (3, 2)
┌─────┬──────┐
│ era ┆ corr │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 2   ┆ -1.0 │
│ 1   ┆ 0.5  │
│ 5   ┆ NaN  │
└─────┴──────┘

Of course, a more direct answer to the specific question would be the following, but I assume OP might have been interested in the answer to a more general question.

correlations = df.group_by("era").agg(
    pl.corr("prediction", "target", method="spearman").alias("corr")
)
Top answer
1 of 4
16

Let's start with some dummy data:

n = 100
seed = 0

df = pl.DataFrame({
    "groups": (pl.int_range(n, eager=True) % 5).shuffle(seed=seed),
    "values": pl.int_range(n, eager=True).shuffle(seed=seed)
})
shape: (100, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ groups ┆ values β”‚
β”‚ ---    ┆ ---    β”‚
β”‚ i64    ┆ i64    β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ════════║
β”‚ 0      ┆ 55     β”‚
β”‚ 0      ┆ 40     β”‚
β”‚ 2      ┆ 57     β”‚
β”‚ 4      ┆ 99     β”‚
β”‚ 4      ┆ 4      β”‚
β”‚ …      ┆ …      β”‚
β”‚ 0      ┆ 90     β”‚
β”‚ 2      ┆ 87     β”‚
β”‚ 1      ┆ 96     β”‚
β”‚ 3      ┆ 43     β”‚
β”‚ 4      ┆ 44     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This gives us 5 groups of 100 / 5 = 20 elements each. Let's verify that:

df.group_by("groups").agg(pl.len())
shape: (5, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ groups ┆ len β”‚
β”‚ ---    ┆ --- β”‚
β”‚ i64    ┆ u32 β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ═════║
β”‚ 0      ┆ 20  β”‚
β”‚ 4      ┆ 20  β”‚
β”‚ 2      ┆ 20  β”‚
β”‚ 3      ┆ 20  β”‚
β”‚ 1      ┆ 20  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜

Sample our data

Now we are going to use a window function to take a sample of our data.

df.filter(
    pl.int_range(pl.len()).shuffle().over("groups") < 10
)
shape: (50, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ groups ┆ values β”‚
β”‚ ---    ┆ ---    β”‚
β”‚ i64    ┆ i64    β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ════════║
β”‚ 0      ┆ 55     β”‚
β”‚ 2      ┆ 57     β”‚
β”‚ 4      ┆ 99     β”‚
β”‚ 4      ┆ 4      β”‚
β”‚ 1      ┆ 81     β”‚
β”‚ …      ┆ …      β”‚
β”‚ 2      ┆ 22     β”‚
β”‚ 1      ┆ 76     β”‚
β”‚ 3      ┆ 98     β”‚
β”‚ 0      ┆ 90     β”‚
β”‚ 4      ┆ 44     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

For every group in over("groups"), the pl.int_range(pl.len()) expression creates a row index. We then shuffle that range so that we take a sample rather than a slice, and keep only the index values lower than 10. This produces a boolean mask that we can pass to the filter method.

2 of 4
5

This worked better for me:

sampled_df = pl.concat(
    df.sample(fraction=0.001) for df in 
    df.partition_by(["column"], include_key=True)
)

The problem with .agg(pl.col("column").sample(2)) was that it seemed to select different values for each column. What I needed was randomly selected rows.

Top answer
1 of 2
2

There is a dedicated .rolling() method to perform the group_by/rolling operation.

You can then perform your calculations inside the .agg() context.

lookback_period = 5

window = dict(
   by = ("ticker", "timeframe"),               # group_by these columns
   index_column = pl.int_range(0, pl.count()), # a "row count" to use as the index
   period = f"{lookback_period}i"              # window "size"
)

df.rolling(**window).agg(
   pl.when(pl.count() == lookback_period)
     .then(
        (pl.col("close-LDPM")
         / (pl.col("close-LDPM").cum_count().reverse() + 1)).sum()
     )
)

shape: (21, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ticker ┆ timeframe ┆ int ┆ close-LDPM β”‚
β”‚ ---    ┆ ---       ┆ --- ┆ ---        β”‚
β”‚ str    ┆ str       ┆ i64 ┆ f64        β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ═══════════β•ͺ═════β•ͺ════════════║
β”‚ ERIC   ┆ 1 W       ┆ 0   ┆ null       β”‚
β”‚ ERIC   ┆ 1 W       ┆ 1   ┆ null       β”‚
β”‚ ERIC   ┆ 1 W       ┆ 2   ┆ null       β”‚
β”‚ ERIC   ┆ 1 W       ┆ 3   ┆ null       β”‚
β”‚ ERIC   ┆ 1 W       ┆ 4   ┆ 26.295667  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 5   ┆ 27.193     β”‚
β”‚ ERIC   ┆ 1 W       ┆ 6   ┆ 27.647833  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 7   ┆ 25.616167  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 8   ┆ 24.800667  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 9   ┆ 22.096333  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 10  ┆ 20.864333  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 11  ┆ 20.517     β”‚
β”‚ ERIC   ┆ 1 W       ┆ 12  ┆ 20.660667  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 13  ┆ 20.894167  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 14  ┆ 21.4575    β”‚
β”‚ ERIC   ┆ 1 W       ┆ 15  ┆ 20.6175    β”‚
β”‚ ERIC   ┆ 1 W       ┆ 16  ┆ 20.2265    β”‚
β”‚ ERIC   ┆ 1 W       ┆ 17  ┆ 19.372     β”‚
β”‚ ERIC   ┆ 1 W       ┆ 18  ┆ 18.587833  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 19  ┆ 17.988833  β”‚
β”‚ ERIC   ┆ 1 W       ┆ 20  ┆ 17.861     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Notes

The when/then condition is used to null out the smaller windows.

The reverse cum_count is one way to emulate the range() behaviour in your example.

df.rolling(**window).agg(
   value = pl.col("close-LDPM"),
   weight = pl.col("close-LDPM").cum_count().reverse() + 1
)
shape: (21, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ticker ┆ timeframe ┆ int ┆ value                               ┆ weight          β”‚
β”‚ ---    ┆ ---       ┆ --- ┆ ---                                 ┆ ---             β”‚
β”‚ str    ┆ str       ┆ i64 ┆ list[f64]                           ┆ list[u32]       β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ═══════════β•ͺ═════β•ͺ═════════════════════════════════════β•ͺ═════════════════║
β”‚ ERIC   ┆ 1 W       ┆ 0   ┆ [10.87]                             ┆ [1]             β”‚
β”‚ ERIC   ┆ 1 W       ┆ 1   ┆ [10.87, 11.04]                      ┆ [2, 1]          β”‚
β”‚ ERIC   ┆ 1 W       ┆ 2   ┆ [10.87, 11.04, 11.36]               ┆ [3, 2, 1]       β”‚
β”‚ ERIC   ┆ 1 W       ┆ 3   ┆ [10.87, 11.04, 11.36, 11.01]        ┆ [4, 3, 2, 1]    β”‚
β”‚ ERIC   ┆ 1 W       ┆ 4   ┆ [10.87, 11.04, 11.36, 11.01, 12.07] ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 5   ┆ [11.04, 11.36, 11.01, 12.07, 12.44] ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 6   ┆ [11.36, 11.01, 12.07, 12.44, 12.38] ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 7   ┆ [11.01, 12.07, 12.44, 12.38, 10.06] ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 8   ┆ [12.07, 12.44, 12.38, 10.06, 10.12] ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 9   ┆ [12.44, 12.38, 10.06, 10.12, 8.1]   ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 10  ┆ [12.38, 10.06, 10.12, 8.1, 8.45]    ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 11  ┆ [10.06, 10.12, 8.1, 8.45, 9.05]     ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 12  ┆ [10.12, 8.1, 8.45, 9.05, 9.27]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 13  ┆ [8.1, 8.45, 9.05, 9.27, 9.51]       ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 14  ┆ [8.45, 9.05, 9.27, 9.51, 9.66]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 15  ┆ [9.05, 9.27, 9.51, 9.66, 8.49]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 16  ┆ [9.27, 9.51, 9.66, 8.49, 8.53]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 17  ┆ [9.51, 9.66, 8.49, 8.53, 7.96]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 18  ┆ [9.66, 8.49, 8.53, 7.96, 7.71]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 19  ┆ [8.49, 8.53, 7.96, 7.71, 7.65]      ┆ [5, 4, 3, 2, 1] β”‚
β”‚ ERIC   ┆ 1 W       ┆ 20  ┆ [8.53, 7.96, 7.71, 7.65, 7.77]      ┆ [5, 4, 3, 2, 1] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Multiple columns

Assuming all columns follow a similar naming pattern we can:

  • select all close- columns by regex to process them all together.

  • use .name.map to extract the final part of the column name and add the _w suffix.

  • use regex again to select the newly created _w columns.

weighted_sums = (
   df.with_columns(pl.col("close-LDPM").reverse().alias("close-ABCD")) # add dummy column
     .rolling(**window).agg(
        pl.when(pl.count() == lookback_period)
          .then(
           (pl.col("^close-.+$") # select all `close-` columns
            / (pl.col("^close-.+$").cum_count().reverse() + 1)).sum()
        )
        .name.map(lambda col: col.rsplit('-', 1)[1] + "_w") # extract everything after last `-` and add `_w` suffix
   )
   .select("^.+_w$") # select all `_w` columns
)
shape: (21, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LDPM_w    ┆ ABCD_w    β”‚
β”‚ ---       ┆ ---       β”‚
β”‚ f64       ┆ f64       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════════║
β”‚ null      ┆ null      β”‚
β”‚ null      ┆ null      β”‚
β”‚ null      ┆ null      β”‚
β”‚ null      ┆ null      β”‚
β”‚ 26.295667 ┆ 18.5465   β”‚
β”‚ 27.193    ┆ 18.865833 β”‚
β”‚ 27.647833 ┆ 20.280333 β”‚
β”‚ 25.616167 ┆ 20.8945   β”‚
β”‚ 24.800667 ┆ 21.0735   β”‚
β”‚ 22.096333 ┆ 20.968    β”‚
β”‚ 20.864333 ┆ 20.3745   β”‚
β”‚ 20.517    ┆ 19.561167 β”‚
β”‚ 20.660667 ┆ 21.103167 β”‚
β”‚ 20.894167 ┆ 21.7425   β”‚
β”‚ 21.4575   ┆ 24.498333 β”‚
β”‚ 20.6175   ┆ 26.133333 β”‚
β”‚ 20.2265   ┆ 26.955667 β”‚
β”‚ 19.372    ┆ 26.298667 β”‚
β”‚ 18.587833 ┆ 26.474333 β”‚
β”‚ 17.988833 ┆ 25.8955   β”‚
β”‚ 17.861    ┆ 25.343167 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Add result back to original dataframe

In this case (with a "row count" index) the order is guaranteed, so we can simply use .with_columns to add the result.

df = df.with_columns(weighted_sums)

Otherwise, you would need a join: https://stackoverflow.com/a/77489932

2 of 2
1

Final code, with the invaluable help from @jqurious:

df_data = df_data.with_columns([pl.col("close-LDPM").alias("LDPS_w")])

window = dict(
    by = ("ticker", "timeframe"),               # group_by these columns
    index_column = pl.int_range(0, pl.count()), # a "row count" to use as the index
    period = f"{lookback_period}i"              # window "size"
)

weighted_sums = df_data.rolling(**window).agg(
    pl.when(pl.count() == lookback_period)
    .then(
        (pl.col("LDPS_w")
         / (pl.col("LDPS_w").cum_count().reverse() + 1)).sum()
    )
)

df_data = df_data.with_columns(weighted_sums)
df_data = df_data.drop(["int"])

I duplicated the "close-LDPM" column that I wanted to run the weighted average on, so I keep both the original column and the new one.

Thanks again @jqurious

Top answer
1 of 2
8

pl.Expr.apply was deprecated in favour of pl.Expr.map_elements in Polars 0.19.0, and removed entirely in Polars 1.0.0.

You can adapt your code to the new version as follows.

df.with_columns(
    pl.col("AH_PROC_REALIZADO")
    .map_elements(get_procedure_description, return_dtype=pl.String)
    .alias("proced_descr")
)
2 of 2
3

If you really want to apply a Python function, you can use map_elements(). However, using native Polars expressions is always preferable.

In your case, I'd suggest looking at replace() or replace_strict().

If you just want to look up values of the AH_PROC_REALIZADO column directly, you can use a simple replace_strict():

df = pl.DataFrame({
    "AH_PROC_REALIZADO": ["30408", "410010065", "410010111", "XXXX"]
})

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ AH_PROC_REALIZADO β”‚
β”‚ ---               β”‚
β”‚ str               β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 30408             β”‚
β”‚ 410010065         β”‚
β”‚ 410010111         β”‚
β”‚ XXXX              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

df.with_columns(
    pl.col("AH_PROC_REALIZADO")
    .replace_strict(proceds, default=None)
    .alias("proced_descr")
)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ AH_PROC_REALIZADO ┆ proced_descr                   β”‚
β”‚ ---               ┆ ---                            β”‚
β”‚ str               ┆ str                            β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════════════════════║
β”‚ 30408             ┆ QUIMIOTERAPIA                  β”‚
β”‚ 410010065         ┆ MASTECTOMIA SIMPLES            β”‚
β”‚ 410010111         ┆ SETORECTOMIA / QUADRANTECTOMIA β”‚
β”‚ XXXX              ┆ null                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The problem with your use case, as far as I understand it, is that you want to search by prefix of the strings in the AH_PROC_REALIZADO column. In that case you could adjust the solution to:

  • itertools.groupby() to transform the proceds dictionary into a dictionary of dictionaries whose top-level keys are the key lengths.
  • replace_strict() to look up the procedure description for each prefix length.
  • coalesce() to combine the results into the final column.
from itertools import groupby

# itertools.groupby only groups consecutive items, so sort by key length first
items = sorted(proceds.items(), key=lambda x: len(x[0]))
mappings = {k: dict(g) for k, g in groupby(items, lambda x: len(x[0]))}

df = pl.DataFrame({
    "AH_PROC_REALIZADO": ["30408_____", "410010065_____", "410010111____", "XXXX"]
})

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ AH_PROC_REALIZADO β”‚
β”‚ ---               β”‚
β”‚ str               β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 30408_____        β”‚
β”‚ 410010065_____    β”‚
β”‚ 410010111____     β”‚
β”‚ XXXX              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

df.with_columns(
    pl.coalesce(
        pl.col("AH_PROC_REALIZADO").str.head(k).replace_strict(m, default=None) for k, m in mappings.items()
    )
    .alias("proced_descr")
)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ AH_PROC_REALIZADO ┆ proced_descr                   β”‚
β”‚ ---               ┆ ---                            β”‚
β”‚ str               ┆ str                            β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════════════════════║
β”‚ 30408_____        ┆ QUIMIOTERAPIA                  β”‚
β”‚ 410010065_____    ┆ MASTECTOMIA SIMPLES            β”‚
β”‚ 410010111____     ┆ SETORECTOMIA / QUADRANTECTOMIA β”‚
β”‚ XXXX              ┆ null                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
November 11, 2024 - df[df['sales'] > df.groupby('id')['sales'].transform('mean')].groupby('id')['views'].max() It's not as bad as the apply solution above, but it still looks overly complicated and requires two group-bys. ... Realistically, few users would come up with it (most would jump straight to apply), but for completeness, we present it: ... The Polars API lets us pass expressions to GroupBy.agg.