On checking, I found my Polars version:
pl.__version__
0.17.3
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby.html
I need to do:
df.groupby("a").agg(pl.col("b").sum()) # there is no underscore in groupby
#output
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ "a" ┆ 2   │
│ "c" ┆ 3   │
│ "b" ┆ 5   │
└─────┴─────┘
and the documentation says:
Deprecated since version 0.19.0: This method has been renamed to
DataFrame.group_by().
This is the new documentation for Polars version 0.19:
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by.html#polars-dataframe-group-by
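If your code has to run on both sides of the rename, a small compatibility shim is one option. This is a minimal sketch, not an official Polars API; the getattr fallback is my own workaround:

import polars as pl

df = pl.DataFrame({"a": ["a", "c", "b", "b"], "b": [2, 3, 1, 4]})

# Use group_by on Polars >= 0.19, fall back to groupby on older versions.
group_by = getattr(df, "group_by", None) or df.groupby
print(group_by("a").agg(pl.col("b").sum()))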
Answer from Talha Tayyab on Stack Overflow.

Original question: I keep receiving this error: AttributeError: 'DataFrame' object has no attribute 'groupby'. When I ask ChatGPT for guidance, the only response I get is that Polars is out of date, which it definitely isn't. I need to perform these operations for a university assignment, so any guidance would be appreciated, thanks!
Polars has the pl.corr() function, which supports method="spearman".
If you want to use a custom function you could do it like this:
Custom function on multiple columns/expressions
import polars as pl
from typing import List
from scipy import stats

df = pl.DataFrame({
    "g": [1, 1, 1, 2, 2, 2, 5],
    "a": [2, 4, 5, 190, 1, 4, 1],
    "b": [1, 3, 2, 1, 43, 3, 1]
})

def get_score(args: List[pl.Series]) -> pl.Series:
    return pl.Series([stats.spearmanr(args[0], args[1]).correlation], dtype=pl.Float64)

(df.group_by("g", maintain_order=True)
   .agg(
       pl.map_groups(
           exprs=["a", "b"],
           function=get_score
       ).alias("corr")
   ))
Polars-provided function
(df.group_by("g", maintain_order=True)
   .agg(
       pl.corr("a", "b", method="spearman").alias("corr")
   ))
Both output:
shape: (3, 2)
┌─────┬──────┐
│ g   ┆ corr │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ -1.0 │
│ 5   ┆ NaN  │
└─────┴──────┘
Custom function on a single column/expression
We can also apply custom functions on single expressions, via .map_elements.
Below is an example of how we can square a column with a custom function and with normal Polars expressions. The expression syntax should always be preferred, as it's a lot faster.
(df.group_by("g")
   .agg(
       pl.col("a").map_elements(lambda group: group**2).alias("squared1"),
       (pl.col("a")**2).alias("squared2")
   ))
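To get a feel for the difference yourself, here is a rough timing sketch (the frame size is an arbitrary assumption; numbers vary by machine and Polars version):

import timeit
import polars as pl

big = pl.DataFrame({"g": [1, 2, 3, 4] * 250_000, "a": range(1_000_000)})

# Custom Python function per group vs. the native expression.
t_udf = timeit.timeit(
    lambda: big.group_by("g").agg(pl.col("a").map_elements(lambda s: s**2)),
    number=3,
)
t_expr = timeit.timeit(
    lambda: big.group_by("g").agg((pl.col("a") ** 2).alias("squared")),
    number=3,
)
print(f"map_elements: {t_udf:.3f}s  expression: {t_expr:.3f}s")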
This seems to be a gap in the Polars API relative to pandas. While pandas can run grouped operations with arbitrary functions and return the result as a DataFrame that includes the group keys, .map_groups() receives no information about the groups, so the keys are lost.
Here's an approach using a pl.DataFrame namespace:
import polars as pl
from collections.abc import Callable
from scipy.stats import spearmanr

df = pl.DataFrame({
    "era": [1, 1, 1, 2, 2, 2, 5],
    "prediction": [2, 4, 5, 190, 1, 4, 1],
    "target": [1, 3, 2, 1, 43, 3, 1]
})

def with_group_keys(fun: Callable[[pl.DataFrame], pl.DataFrame], by: list[str]):
    def wrapped(g: pl.DataFrame) -> pl.DataFrame:
        keys = g.select(by).row(0, named=True)
        res = fun(g)
        if not isinstance(res, pl.DataFrame):
            raise TypeError("fun(g) must return a Polars DataFrame")
        if res.height != 1:
            raise ValueError("fun(g) must return exactly one row per group")
        return pl.DataFrame({k: [keys[k]] for k in by}).hstack(res)
    return wrapped

@pl.api.register_dataframe_namespace("groups")
class EraPLNamespace:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def map(self, by: list[str], fun: Callable[[pl.DataFrame], pl.DataFrame]) -> pl.DataFrame:
        return self._df.group_by(*by).map_groups(with_group_keys(fun, by))

def get_score(g: pl.DataFrame) -> pl.DataFrame:
    return pl.DataFrame({"corr": [spearmanr(g["prediction"], g["target"]).correlation]})

# usage
out = df.groups.map(["era"], get_score)
out
| era | corr |
|---|---|
| 2 | -1.0 |
| 1 | 0.5 |
| 5 | NaN |
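Note that group_by does not guarantee group order, which is why the eras come out as 2, 1, 5 here. If you need a stable order, sorting afterwards is the simplest fix (or the namespace could pass maintain_order=True to group_by):

out = df.groups.map(["era"], get_score).sort("era")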
Of course, a more direct answer to the specific question would be the following, but I assume OP might have been interested in the answer to a more general question.
correlations = df.group_by("era").agg(
pl.corr("prediction", "target", method="spearman").alias("corr")
)
Let's start with some dummy data:
n = 100
seed = 0
df = pl.DataFrame({
    "groups": (pl.int_range(n, eager=True) % 5).shuffle(seed=seed),
    "values": pl.int_range(n, eager=True).shuffle(seed=seed)
})
shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 55     │
│ 0      ┆ 40     │
│ 2      ┆ 57     │
│ 4      ┆ 99     │
│ 4      ┆ 4      │
│ …      ┆ …      │
│ 0      ┆ 90     │
│ 2      ┆ 87     │
│ 1      ┆ 96     │
│ 3      ┆ 43     │
│ 4      ┆ 44     │
└────────┴────────┘
This gives us 100 / 5 = 5 groups of 20 elements each. Let's verify that:
df.group_by("groups").agg(pl.len())
shape: (5, 2)
┌────────┬─────┐
│ groups ┆ len │
│ ---    ┆ --- │
│ i64    ┆ u32 │
╞════════╪═════╡
│ 0      ┆ 20  │
│ 4      ┆ 20  │
│ 2      ┆ 20  │
│ 3      ┆ 20  │
│ 1      ┆ 20  │
└────────┴─────┘
Sample our data
Now we are going to use a window function to take a sample of our data.
df.filter(
    pl.int_range(pl.len()).shuffle().over("groups") < 10
)
shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 55     │
│ 2      ┆ 57     │
│ 4      ┆ 99     │
│ 4      ┆ 4      │
│ 1      ┆ 81     │
│ …      ┆ …      │
│ 2      ┆ 22     │
│ 1      ┆ 76     │
│ 3      ┆ 98     │
│ 0      ┆ 90     │
│ 4      ┆ 44     │
└────────┴────────┘
For every group in over("groups"), the pl.int_range(pl.len()) expression creates a row index. We then shuffle that range so that we take a sample and not a slice. Finally, we keep only the index values lower than 10. This creates a boolean mask that we can pass to the filter method.
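Note that shuffle() is unseeded here, so each run draws a different sample; passing a seed (the same seed parameter used on shuffle earlier) makes it reproducible:

# Reproducible variant: fix the shuffle seed.
df.filter(
    pl.int_range(pl.len()).shuffle(seed=0).over("groups") < 10
)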
This worked better for me:
sampled_df = pl.concat(
    df.sample(fraction=0.001)
    for df in df.partition_by(["column"], include_key=True)
)
The problem with .agg(pl.col("column").sample(2)) was that it seemed to select different values for each column. What I needed was randomly selected rows.
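If you want whole rows at a fraction per group but without materializing partitions, the window trick from the previous answer can be adapted. A sketch, assuming the same "column" group key and the fraction from the example above:

sampled_df = df.filter(
    pl.int_range(pl.len()).shuffle().over("column")
    < pl.len().over("column") * 0.001
)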
There is a dedicated .rolling() method to perform the group_by/rolling operation.
You can then perform your calculations inside the .agg() context.
lookback_period = 5

window = dict(
    by = ("ticker", "timeframe"),               # group_by these columns
    index_column = pl.int_range(0, pl.count()), # a "row count" to use as the index
    period = f"{lookback_period}i"              # window "size"
)

df.rolling(**window).agg(
    pl.when(pl.count() == lookback_period)
    .then(
        (pl.col("close-LDPM")
         / (pl.col("close-LDPM").cum_count().reverse() + 1)).sum()
    )
)
shape: (21, 4)
┌────────┬───────────┬─────┬────────────┐
│ ticker ┆ timeframe ┆ int ┆ close-LDPM │
│ ---    ┆ ---       ┆ --- ┆ ---        │
│ str    ┆ str       ┆ i64 ┆ f64        │
╞════════╪═══════════╪═════╪════════════╡
│ ERIC   ┆ 1 W       ┆ 0   ┆ null       │
│ ERIC   ┆ 1 W       ┆ 1   ┆ null       │
│ ERIC   ┆ 1 W       ┆ 2   ┆ null       │
│ ERIC   ┆ 1 W       ┆ 3   ┆ null       │
│ ERIC   ┆ 1 W       ┆ 4   ┆ 26.295667  │
│ ERIC   ┆ 1 W       ┆ 5   ┆ 27.193     │
│ ERIC   ┆ 1 W       ┆ 6   ┆ 27.647833  │
│ ERIC   ┆ 1 W       ┆ 7   ┆ 25.616167  │
│ ERIC   ┆ 1 W       ┆ 8   ┆ 24.800667  │
│ ERIC   ┆ 1 W       ┆ 9   ┆ 22.096333  │
│ ERIC   ┆ 1 W       ┆ 10  ┆ 20.864333  │
│ ERIC   ┆ 1 W       ┆ 11  ┆ 20.517     │
│ ERIC   ┆ 1 W       ┆ 12  ┆ 20.660667  │
│ ERIC   ┆ 1 W       ┆ 13  ┆ 20.894167  │
│ ERIC   ┆ 1 W       ┆ 14  ┆ 21.4575    │
│ ERIC   ┆ 1 W       ┆ 15  ┆ 20.6175    │
│ ERIC   ┆ 1 W       ┆ 16  ┆ 20.2265    │
│ ERIC   ┆ 1 W       ┆ 17  ┆ 19.372     │
│ ERIC   ┆ 1 W       ┆ 18  ┆ 18.587833  │
│ ERIC   ┆ 1 W       ┆ 19  ┆ 17.988833  │
│ ERIC   ┆ 1 W       ┆ 20  ┆ 17.861     │
└────────┴───────────┴─────┴────────────┘
Notes
The when/then condition is used to null out the smaller windows.
The reverse cum_count is one way to emulate the range() behaviour in your example.
df.rolling(**window).agg(
    value = pl.col("close-LDPM"),
    weight = pl.col("close-LDPM").cum_count().reverse() + 1
)
shape: (21, 5)
┌────────┬───────────┬─────┬─────────────────────────────────────┬─────────────────┐
│ ticker ┆ timeframe ┆ int ┆ value                               ┆ weight          │
│ ---    ┆ ---       ┆ --- ┆ ---                                 ┆ ---             │
│ str    ┆ str       ┆ i64 ┆ list[f64]                           ┆ list[u32]       │
╞════════╪═══════════╪═════╪═════════════════════════════════════╪═════════════════╡
│ ERIC   ┆ 1 W       ┆ 0   ┆ [10.87]                             ┆ [1]             │
│ ERIC   ┆ 1 W       ┆ 1   ┆ [10.87, 11.04]                      ┆ [2, 1]          │
│ ERIC   ┆ 1 W       ┆ 2   ┆ [10.87, 11.04, 11.36]               ┆ [3, 2, 1]       │
│ ERIC   ┆ 1 W       ┆ 3   ┆ [10.87, 11.04, 11.36, 11.01]        ┆ [4, 3, 2, 1]    │
│ ERIC   ┆ 1 W       ┆ 4   ┆ [10.87, 11.04, 11.36, 11.01, 12.07] ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 5   ┆ [11.04, 11.36, 11.01, 12.07, 12.44] ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 6   ┆ [11.36, 11.01, 12.07, 12.44, 12.38] ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 7   ┆ [11.01, 12.07, 12.44, 12.38, 10.06] ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 8   ┆ [12.07, 12.44, 12.38, 10.06, 10.12] ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 9   ┆ [12.44, 12.38, 10.06, 10.12, 8.1]   ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 10  ┆ [12.38, 10.06, 10.12, 8.1, 8.45]    ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 11  ┆ [10.06, 10.12, 8.1, 8.45, 9.05]     ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 12  ┆ [10.12, 8.1, 8.45, 9.05, 9.27]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 13  ┆ [8.1, 8.45, 9.05, 9.27, 9.51]       ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 14  ┆ [8.45, 9.05, 9.27, 9.51, 9.66]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 15  ┆ [9.05, 9.27, 9.51, 9.66, 8.49]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 16  ┆ [9.27, 9.51, 9.66, 8.49, 8.53]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 17  ┆ [9.51, 9.66, 8.49, 8.53, 7.96]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 18  ┆ [9.66, 8.49, 8.53, 7.96, 7.71]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 19  ┆ [8.49, 8.53, 7.96, 7.71, 7.65]      ┆ [5, 4, 3, 2, 1] │
│ ERIC   ┆ 1 W       ┆ 20  ┆ [8.53, 7.96, 7.71, 7.65, 7.77]      ┆ [5, 4, 3, 2, 1] │
└────────┴───────────┴─────┴─────────────────────────────────────┴─────────────────┘
Multiple columns
Assuming all columns follow a similar naming pattern, we can:
- select all close- columns by regex to process them all together.
- use .name.map to extract the final part of the column name and add the _w suffix.
- use regex again to select the newly created _w columns.
weighted_sums = (
    df.with_columns(pl.col("close-LDPM").reverse().alias("close-ABCD"))  # add dummy column
      .rolling(**window).agg(
          pl.when(pl.count() == lookback_period)
          .then(
              (pl.col("^close-.+$")  # select all `close-` columns
               / (pl.col("^close-.+$").cum_count().reverse() + 1)).sum()
          )
          .name.map(lambda col: col.rsplit('-', 1)[1] + "_w")  # extract everything after last `-` and add `_w` suffix
      )
      .select("^.+_w$")  # select all `_w` columns
)
shape: (21, 2)
┌───────────┬───────────┐
│ LDPM_w    ┆ ABCD_w    │
│ ---       ┆ ---       │
│ f64       ┆ f64       │
╞═══════════╪═══════════╡
│ null      ┆ null      │
│ null      ┆ null      │
│ null      ┆ null      │
│ null      ┆ null      │
│ 26.295667 ┆ 18.5465   │
│ 27.193    ┆ 18.865833 │
│ 27.647833 ┆ 20.280333 │
│ 25.616167 ┆ 20.8945   │
│ 24.800667 ┆ 21.0735   │
│ 22.096333 ┆ 20.968    │
│ 20.864333 ┆ 20.3745   │
│ 20.517    ┆ 19.561167 │
│ 20.660667 ┆ 21.103167 │
│ 20.894167 ┆ 21.7425   │
│ 21.4575   ┆ 24.498333 │
│ 20.6175   ┆ 26.133333 │
│ 20.2265   ┆ 26.955667 │
│ 19.372    ┆ 26.298667 │
│ 18.587833 ┆ 26.474333 │
│ 17.988833 ┆ 25.8955   │
│ 17.861    ┆ 25.343167 │
└───────────┴───────────┘
Add result back to original dataframe
In this case (with a "row count" index) the order is guaranteed, so we can simply use .with_columns to add the result.
df = df.with_columns(weighted_sums)
Otherwise, you would need a join: https://stackoverflow.com/a/77489932
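A minimal sketch of that join, assuming weighted_sums still carries the group keys and the "int" index column (i.e. the final .select("^.+_w$") step is skipped):

# Recreate the "int" index on the original frame, then join on keys + index.
df = (
    df.with_columns(pl.int_range(0, pl.count()).alias("int"))
      .join(weighted_sums, on=["ticker", "timeframe", "int"], how="left")
      .drop("int")
)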
Final code, with the invaluable help from @jqurious:
df_data = df_data.with_columns([pl.col("close-LDPM").alias("LDPS_w")])

lookback_period = 5  # as defined earlier

window = dict(
    by = ("ticker", "timeframe"),               # group_by these columns
    index_column = pl.int_range(0, pl.count()), # a "row count" to use as the index
    period = f"{lookback_period}i"              # window "size"
)

weighted_sums = df_data.rolling(**window).agg(
    pl.when(pl.count() == lookback_period)
    .then(
        (pl.col("LDPS_w")
         / (pl.col("LDPS_w").cum_count().reverse() + 1)).sum()
    )
)

df_data = df_data.with_columns(weighted_sums)
df_data = df_data.drop(["int"])
I duplicated the column "close-LDPM" that I wanted to run the weighted average on, so I got to keep the original column and the new one.
Thanks again @jqurious
pl.Expr.apply was deprecated in favour of pl.Expr.map_elements in Polars release 0.19.0, and was subsequently removed in the Polars 1.0.0 release.
You can adapt your code to the new version as follows.
df.with_columns(
    pl.col("AH_PROC_REALIZADO")
    .map_elements(get_procedure_description, return_dtype=pl.String)
    .alias("proced_descr")
)
If you really want to apply a Python function then you can use map_elements(). However, using native Polars expressions is always preferable.
In your case I'd suggest looking at replace() or replace_strict().
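The proceds mapping itself comes from the question and is not shown here; a hypothetical stand-in consistent with the outputs below would be:

# Hypothetical stand-in, reconstructed from the expected output.
proceds = {
    "30408": "QUIMIOTERAPIA",
    "410010065": "MASTECTOMIA SIMPLES",
    "410010111": "SETORECTOMIA / QUADRANTECTOMIA",
}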
If you just want to search by the AH_PROC_REALIZADO column, you could use a simple replace_strict():
df = pl.DataFrame({
    "AH_PROC_REALIZADO": ["30408", "410010065", "410010111", "XXXX"]
})
┌───────────────────┐
│ AH_PROC_REALIZADO │
│ ---               │
│ str               │
╞═══════════════════╡
│ 30408             │
│ 410010065         │
│ 410010111         │
│ XXXX              │
└───────────────────┘
df.with_columns(
    pl.col("AH_PROC_REALIZADO")
    .replace_strict(proceds, default=None)
    .alias("proced_descr")
)
┌───────────────────┬────────────────────────────────┐
│ AH_PROC_REALIZADO ┆ proced_descr                   │
│ ---               ┆ ---                            │
│ str               ┆ str                            │
╞═══════════════════╪════════════════════════════════╡
│ 30408             ┆ QUIMIOTERAPIA                  │
│ 410010065         ┆ MASTECTOMIA SIMPLES            │
│ 410010111         ┆ SETORECTOMIA / QUADRANTECTOMIA │
│ XXXX              ┆ null                           │
└───────────────────┴────────────────────────────────┘
The problem with your use case is that, as far as I understand, you want to search by prefixes of the strings in the AH_PROC_REALIZADO column. In that case you could adjust the solution to use:
- itertools.groupby() to transform the proceds dictionary into a dictionary of dictionaries whose top-level keys are the key lengths.
- replace_strict() to search for the procedure description.
- coalesce() to combine the results into the final column.
from itertools import groupby

# itertools.groupby only groups consecutive items, so sort by key length first
items = sorted(proceds.items(), key=lambda x: len(x[0]))
mappings = {k: dict(g) for k, g in groupby(items, lambda x: len(x[0]))}
df = pl.DataFrame({
    "AH_PROC_REALIZADO": ["30408_____", "410010065_____", "410010111____", "XXXX"]
})
┌───────────────────┐
│ AH_PROC_REALIZADO │
│ ---               │
│ str               │
╞═══════════════════╡
│ 30408_____        │
│ 410010065_____    │
│ 410010111____     │
│ XXXX              │
└───────────────────┘
df.with_columns(
    pl.coalesce(
        pl.col("AH_PROC_REALIZADO").str.head(k).replace_strict(m, default=None)
        for k, m in mappings.items()
    )
    .alias("proced_descr")
)
┌───────────────────┬────────────────────────────────┐
│ AH_PROC_REALIZADO ┆ proced_descr                   │
│ ---               ┆ ---                            │
│ str               ┆ str                            │
╞═══════════════════╪════════════════════════════════╡
│ 30408_____        ┆ QUIMIOTERAPIA                  │
│ 410010065_____    ┆ MASTECTOMIA SIMPLES            │
│ 410010111____     ┆ SETORECTOMIA / QUADRANTECTOMIA │
│ XXXX              ┆ null                           │
└───────────────────┴────────────────────────────────┘