Polars
docs.pola.rs › api › python › version › 0.18 › reference › expressions › api › polars.Expr.apply.html
polars.Expr.apply - Polars documentation
In a selection context, the function is applied by row. >>> df.with_columns( ... pl.col("a").apply(lambda x: x * 2).alias("a_times_2"), ...
TypeThePipe
typethepipe.com › vizs-and-tips › python-polars-suggest-efficient-expressions-lambda-function
Polars new feature. Suggest more efficient Polars method for apply lambda functions | TypeThePipe
July 20, 2023 - import polars as pl df = pl.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie'], 'offensive_skill': [5, 30, 85], 'defensive_skill': [92, 30, 10] }) df.with_columns( pl.col("defensive_skill").apply(lambda x: x/3) )
Discussions

python - creating a new column in polars applying a function to a column - Stack Overflow
Using Polars with Python and being thrown the following exception: AttributeError: 'Expr' object has no attribute 'apply'
stackoverflow.com
python - polars apply a lambda with list comprehension like pandas: Any other better way? - Stack Overflow
2 Is it possible to reference another ... using a lambda? 2 How to create a cumulated list of a column's elements · 5 polars dropna equivalent using a subset of columns with threshold · 6 How to multiply each element in a list with a value in a different column? 8 Python Polars: how to convert a list of dictionaries to polars dataframe without using pandas · 1 Python Polars - add element to columns of lists which has value equal to a function of the ...
stackoverflow.com
December 21, 2022
python - Improving polars statement that adds a column applying a lambda function on each row - Stack Overflow
The idea is to use Polars Expressions instead of applying custom Python functions/lambdas.
stackoverflow.com
python - Apply function to all columns of a Polars-DataFrame - Stack Overflow
I know how to apply a function to all columns present in a Pandas-DataFrame. However, I have not figured out yet how to achieve this when using a Polars-DataFrame. I checked the section from the Po...
stackoverflow.com
Towards Data Science
towardsdatascience.com › home › latest › manipulating values in polars dataframes
Manipulating Values in Polars DataFrames | Towards Data Science
January 29, 2025 - This means that the apply() function, when applied to a dataframe, sends the values of each row as a tuple to the receiving function. This is useful for some use cases. For example, say you need to perform an integer division of all the numbers ...
Polars
docs.pola.rs › api › python › version › 0.18 › reference › dataframe › api › polars.DataFrame.apply.html
polars.DataFrame.apply - Polars documentation
If your function is expensive and you don't want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. With suitable data you may achieve order-of-magnitude speedups (or more). ...
>>> df.apply(lambda t: (t[0] * 2, t[1] * 3))
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 2        ┆ -3       │
│ 4        ┆ 15       │
│ 6        ┆ 24       │
└──────────┴──────────┘
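A minimal sketch of the @lru_cache tip quoted above. It uses a plain Python list in place of a column so it runs without Polars; `expensive_double` and `call_count` are hypothetical names, and the assumption is that a cached function passed to apply/map_elements would be spared repeated calls in the same way.

```python
from functools import lru_cache

call_count = 0  # hypothetical counter, only here to show how often the body runs


@lru_cache(maxsize=None)
def expensive_double(x: int) -> int:
    """Stand-in for an expensive function; results are cached per distinct input."""
    global call_count
    call_count += 1
    return x * 2


# Repeated inputs: the function body should only run once per distinct value.
values = [1, 2, 3, 1, 2, 1]
results = [expensive_double(v) for v in values]

print(results)     # doubled values, in input order
print(call_count)  # number of real (uncached) calls
```

The cache only helps when inputs repeat and are hashable; for mostly unique values it adds overhead instead.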
Polars
docs.pola.rs › api › python › dev › reference › expressions › api › polars.Expr.map_elements.html
polars.Expr.map_elements - Polars documentation
Polars may call the function with arbitrary input data. Examples · >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["a", "b", "c", "c"], ... } ... ) The function is applied to each element of column 'a': >>> df.with_columns( ... pl.col("a") ... .map_elements(lambda x: x * 2, return_dtype=pl.self_dtype()) ...
Rho Signal
rhosignal.com › posts › polars-aws-lambda
AWS Lambda with Polars | Rho Signal
November 14, 2024 - Then you can create a lambda function that uses your image as a container. See this AWS tutorial for more details on these steps. There's a lot more to say about optimising Polars and AWS Lambda. For example, you can use Polars to read and write from S3 in lazy mode and this allows Polars to apply query optimisations.
Top answer
1 of 2
4

Your first example works because a Series has a multiplication method.

For example if you do

func(pl.Series([1,2,3,4,5]))

then you get back a series of the original multiplied by 2.

Your func2 is just an anonymous function. To use map_batches, your function needs to operate on the entire column and return something like a Series.

For instance:

import polars as pl
from lxml import etree as ET

def func2_series(xml_strings):
    # map_batches passes the whole column as a Series, so loop over it here
    ret_list = []
    for xml_string in xml_strings:
        root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
        text_list = []
        for elem in root.iter():
            text = elem.text.strip() if elem.text else ''
            text_list.append(text)
        ret_list.append(text_list)
    # return one list of strings per input row
    return pl.Series(ret_list)

followed by

df.with_columns(pl.col("B").map_batches(func2_series).alias('new_col2'))

will work.

Alternatively if you have

def func2(xml_string):
    root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
    text_list = []
    for elem in root.iter():
        text = elem.text.strip() if elem.text else ''
        text_list.append(text)
    return text_list

then you can use map_elements and polars will do the looping for you.

dfpl.with_columns(pl.col("B").map_elements(func2))

By the way, you don't need a lambda if the function you're passing already accepts the argument as it is. In other words, where you have .map_batches(lambda x: func2(x)) you can just write .map_batches(func2). The lambda only comes into play if you need to transform the parameters.

2 of 2
2

As stated in the comments, use .map_elements instead of .map_batches. Also, if you only want a list of strings, I recommend BeautifulSoup's .stripped_strings:

import polars as pl
from bs4 import BeautifulSoup

# create a sample dataframe
df = pl.DataFrame({
    'A': [1, 2, 3],
    'B': ['<p>some text</p><p>bla</p>', '<p>some text<p><p>foo</p>', '<p>some text<p>']
})

def func(mystring):
    return mystring*2

def func2(xml_string):
    soup = BeautifulSoup(xml_string, 'html.parser')
    return list(soup.stripped_strings)

# apply each function element-wise and add the results as new columns
df = df.with_columns(pl.col("A").map_elements(func).alias('new_col'))
df = df.with_columns(pl.col("B").map_elements(func2).alias('new_col2'))

print(df)

Prints:

shape: (3, 4)
┌─────┬────────────────────────────┬─────────┬──────────────────────┐
│ A   ┆ B                          ┆ new_col ┆ new_col2             │
│ --- ┆ ---                        ┆ ---     ┆ ---                  │
│ i64 ┆ str                        ┆ i64     ┆ list[str]            │
╞═════╪════════════════════════════╪═════════╪══════════════════════╡
│ 1   ┆ <p>some text</p><p>bla</p> ┆ 2       ┆ ["some text", "bla"] │
│ 2   ┆ <p>some text<p><p>foo</p>  ┆ 4       ┆ ["some text", "foo"] │
│ 3   ┆ <p>some text<p>            ┆ 6       ┆ ["some text"]        │
└─────┴────────────────────────────┴─────────┴──────────────────────┘
Top answer
1 of 1
4

The functionality is available natively in Polars via the .str namespace.

.str.split() doesn't support regex.

But similar behaviour can be achieved with .extract_all() and .replace_all()

df = pl.DataFrame({"content": ["o neHItw oHIIIIIth ree", "fo urHIIfi veHIIIIs ix"]})

pattern2 = r"HI+"
pattern3 = r"\s"

replacement = ""
df.with_columns(
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .alias("sentences")
)
shape: (2, 2)
┌────────────────────────┬────────────────────────────────────┐
│ content                ┆ sentences                          │
│ ---                    ┆ ---                                │
│ str                    ┆ list[str]                          │
╞════════════════════════╪════════════════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["o neHI", "tw oHIIIII", "th ree"] │
│ fo urHIIfi veHIIIIs ix ┆ ["fo urHII", "fi veHIIII", "s ix"] │
└────────────────────────┴────────────────────────────────────┘

list.eval() could then be used to process the list and "extract" the desired result.

df.with_columns(
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .list.eval(
        pl.element().str.replace_all(pattern2, "")
                    .str.replace_all(pattern3, replacement)
     )
     .alias("normal_text")
)
shape: (2, 2)
┌────────────────────────┬─────────────────────────┐
│ content                ┆ normal_text             │
│ ---                    ┆ ---                     │
│ str                    ┆ list[str]               │
╞════════════════════════╪═════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["one", "two", "three"] │
│ fo urHIIfi veHIIIIs ix ┆ ["four", "five", "six"] │
└────────────────────────┴─────────────────────────┘

Performance

A basic comparison of both approaches.

N = 2000
df = pl.DataFrame({
   "content": [
      "o neHItw oHIIIIIth ree" * N, 
      "fo urHIIfi veHIIIIs ix" * N] * N
})
Name                   Time
.str + .list.eval()    8.28s
.map_elements()        29.9s
Polars
docs.pola.rs › api › python › version › 0.19 › reference › series › api › polars.Series.apply.html
polars.Series.apply - Polars documentation
return_dtype: PolarsDataType | None = None, *, skip_nulls: bool = True, ) → Self Apply a custom/user-defined function (UDF) over elements in this Series. Deprecated since version 0.19.0: This method has been renamed to Series.map_elements(). Parameters: function · Custom function or lambda.
Polars
docs.pola.rs › polars-cloud › integrations › lambda
AWS Lambda - Polars user guide
The code for the lambda function can be boiled down to the following (pseudo-code):
import boto3
import polars as pl
import polars_cloud as pc

client = boto3.client("secretsmanager")
# authenticate to polars cloud with the secrets created above
pc.authenticate(
    client_id=client.get_secret_value(SecretId="<SECRET>").get("SecretString"),
    client_secret=client.get_secret_value(SecretId="<SECRET>").get("SecretString"),
)
# define the compute context
cc = pc.ComputeContext(cpus=2, memory=4)
# submit the query
pl.scan_csv(...).remote(cc).sink_parquet(...)
Polars
docs.pola.rs › user-guide › expressions › user-defined-python-functions
User-defined Python functions - Polars user guide
Polars expressions are quite powerful and flexible, so there is much less need for custom Python functions compared to other libraries. Still, you may need to pass an expression's state to a third party library or apply your black box function to data in Polars.
Top answer
1 of 1
2

The idea is to use Polars Expressions instead of applying custom Python functions/lambdas.

It looks like you're trying to count when ref and another column have the same value?

df.select(pl.exclude("ref") == pl.col("ref"))
shape: (3, 2)
┌───────┬───────┐
│ v1    ┆ v2    │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ true  ┆ true  │
│ false ┆ false │
│ false ┆ true  │
└───────┴───────┘

.sum_horizontal() can be used to get a "count" of the true values on each row.

df.with_columns(count = pl.sum_horizontal(pl.exclude("ref") == pl.col("ref")))
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ ref ┆ v1  ┆ v2  ┆ count │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ u32   │
╞═════╪═════╪═════╪═══════╡
│ -1  ┆ -1  ┆ -1  ┆ 2     │
│ 2   ┆ 5   ┆ 5   ┆ 0     │
│ 8   ┆ 0   ┆ 8   ┆ 1     │
└─────┴─────┴─────┴───────┘
Confessions of a Data Guy
confessionsofadataguy.com › home › polars vs pandas. inside an aws lambda.
Polars vs Pandas. Inside an AWS Lambda. - Confessions of a Data Guy
July 22, 2023 - Remember, I just want to read a bucket of s3 files as easily as possible and do some simple work on a Lambda … I want it to be as easy as it would be with Spark! Firstly, because of the file-by-file iteration we had to do in Pandas, I had my hopes extremely high that Polars in conjunction with pyarrow might be able to simply read a folder.
Medium
medium.com › @kasperjuunge › 20-pandas-operations-translated-to-polars-4b9daba154f5
20 Pandas Operations Translated to Polars | by Kasper Junge | Medium
January 11, 2024 -
Polars: pl.concat([df1, df2])
Pandas: df['A'].apply(lambda x: x*2) Polars: df.with_column(pl.col('A').apply(lambda x: x*2))
Pandas: df.dropna() Polars: df.drop_nulls()
Pandas: df.fillna(value) Polars: df.fill_none(value)
Pandas: df.rename(columns={'A': 'X'}) Polars: df.rename({'A': 'X'})
Pandas: df['A'].unique() Polars: df['A'].unique()
Pandas: df.info() Polars: df.describe()
Pandas: df[df['A'] > 1] Polars: df.filter(pl.col('A') > 1)
Pandas: df.agg({'A': ['sum', 'min'], 'B': ['max', 'mean']}) Polars: df.agg([pl.sum('A'), pl.min('A'), pl.max('B'), pl.mean('B')])
Pandas: df['A'].astype('float') Polars: df.with_column(pl.col('A').cast(pl.Float64))
Pandas: df1.merge(df2, on='key').merge(df3, on='key') Polars: df1.join(df2, on='key').join(df3, on='key')
Pandas Vs Polars
Top answer
1 of 1
6

Use replace_strict:

import polars as pl

data1 = {"a": [1, 2, 3, 4], "b1": [11, 12, 13, 14], "c1": [31, 32, 33, 34]}
df1_pl = pl.DataFrame(data1)
print(df1_pl)
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday']

# map each value of "a" (1 to 4) to the weekday name at that position
print(df1_pl.with_columns(
    weekday=pl.col('a').replace_strict({idx: val for idx, val in enumerate(weekday, start=1)})
))
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b1  ┆ c1  │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 11  ┆ 31  │
│ 2   ┆ 12  ┆ 32  │
│ 3   ┆ 13  ┆ 33  │
│ 4   ┆ 14  ┆ 34  │
└─────┴─────┴─────┘
shape: (4, 4)
┌─────┬─────┬─────┬───────────┐
│ a   ┆ b1  ┆ c1  ┆ weekday   │
│ --- ┆ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64 ┆ str       │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 11  ┆ 31  ┆ Monday    │
│ 2   ┆ 12  ┆ 32  ┆ Tuesday   │
│ 3   ┆ 13  ┆ 33  ┆ Wednesday │
│ 4   ┆ 14  ┆ 34  ┆ Thursday  │
└─────┴─────┴─────┴───────────┘
Reddit
reddit.com › r/rust › polars: computing a new column from multiple columns - there must be a better way
r/rust on Reddit: Polars: Computing a new column from multiple columns - there must be a better way
May 4, 2023

I recently decided to use Polars in my side-project and stumbled upon a surprisingly challenging task: computing a new column from two other variables using a function (not Polars expressions) for the computation. I read the data from a CSV file and want to derive more variables from it, and I thought Polars would be a good tool for that.

Because I needed working code rather than good code, I wrote the solution below, but there must be a better way! Or maybe Polars is not meant for such use-cases and I should use some other crate? If so, please tell me which one.

fn add_col3(df: LazyFrame) -> Result<LazyFrame> {
    let mut col3 = vec![];

    let data = df.clone().collect()?.to_ndarray::<Float64Type>()?;

    for row in data.rows() {
        // I'm aware I could do this loop
        // more efficiently
        let a = row[1]; 
        let b = row[2];

        let c = complex_computation(a, b)?;

        col3.push(c);
    }

    let col3 = Series::new("Value3", col3);
    let df = df.with_column(col3.lit());

    Ok(df)
}

fn complex_computation(a: f64, b: f64) -> Result<f64> {
    ...
    
    Ok(c)
}

Having to clone, collect and convert to ndarray seems very inefficient to me and not really idiomatic. But I'm rather clueless about how I could do this better, and most questions online discuss the Python API of Polars.

In Python I would need to use .struct() and .apply() with a lambda to do that computation. But in Rust .struct() does not seem to exist and .apply() is for one column only.

Has anyone attempted this before and come up with a better solution?