python - creating a new column in polars applying a function to a column - Stack Overflow
python - polars apply a lambda with list comprehension like pandas: Any other better way? - Stack Overflow
python - Improving polars statement that adds a column applying a lambda function on each row - Stack Overflow
python - Apply function to all columns of a Polars-DataFrame - Stack Overflow
Your first example works because a Series has a multiplication method.
For example if you do
func(pl.Series([1,2,3,4,5]))
then you get back a series of the original multiplied by 2.
Your func2 is just an anonymous function. To use map_batches, your function needs to operate on the entire column and return something like a Series.
For instance:
from lxml import etree as ET
import polars as pl

def func2_series(xml_strings):
    # Receives the whole column and returns one list of strings per row.
    ret_list = []
    for xml_string in xml_strings:
        root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
        text_list = []
        for elem in root.iter():
            text = elem.text.strip() if elem.text else ''
            text_list.append(text)
        ret_list.append(text_list)
    return pl.Series(ret_list)
followed by
df.with_columns(pl.col("B").map_batches(func2_series).alias('new_col2'))
will work.
Alternatively if you have
def func2(xml_string):
    root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
    text_list = []
    for elem in root.iter():
        text = elem.text.strip() if elem.text else ''
        text_list.append(text)
    return text_list
then you can use map_elements and polars will do the looping for you.
df.with_columns(pl.col("B").map_elements(func2))
btw, you don't need to use a lambda if the function you're passing accepts the exact x that you have. In other words where you have .map_batches(lambda x: func2(x)) you can just do .map_batches(func2). The lambda comes into play if you need to transform the parameters.
As stated in the comments, use .map_elements instead of .map_batches. Also, if you only want a list of strings, I recommend using BeautifulSoup's .stripped_strings:
import polars as pl
from bs4 import BeautifulSoup

# create a sample dataframe
df = pl.DataFrame({
    'A': [1, 2, 3],
    'B': ['<p>some text</p><p>bla</p>', '<p>some text<p><p>foo</p>', '<p>some text<p>']
})

def func(mystring):
    return mystring*2

def func2(xml_string):
    soup = BeautifulSoup(xml_string, 'html.parser')
    return list(soup.stripped_strings)

# apply each function element-wise and add the results as new columns
df = df.with_columns((pl.col("A").map_elements(lambda x: func(x)).alias('new_col')))
df = df.with_columns((pl.col("B").map_elements(lambda x: func2(x)).alias('new_col2')))
print(df)
Prints:
shape: (3, 4)
┌─────┬────────────────────────────┬─────────┬──────────────────────┐
│ A   ┆ B                          ┆ new_col ┆ new_col2             │
│ --- ┆ ---                        ┆ ---     ┆ ---                  │
│ i64 ┆ str                        ┆ i64     ┆ list[str]            │
╞═════╪════════════════════════════╪═════════╪══════════════════════╡
│ 1   ┆ <p>some text</p><p>bla</p> ┆ 2       ┆ ["some text", "bla"] │
│ 2   ┆ <p>some text<p><p>foo</p>  ┆ 4       ┆ ["some text", "foo"] │
│ 3   ┆ <p>some text<p>            ┆ 6       ┆ ["some text"]        │
└─────┴────────────────────────────┴─────────┴──────────────────────┘
I recently decided to use Polars in my side project and stumbled upon a surprisingly challenging task: computing a new column from two other columns using a plain function (not Polars expressions). I read the data from a CSV file and want to derive more variables from it, and I thought Polars would be a good tool for that.
Because I needed working code rather than good code, I wrote the solution below, but there must be a better way! Or maybe Polars is not meant for such use cases and I should use some other crate? If so, please tell me which one.
fn add_col3(df: LazyFrame) -> Result<LazyFrame> {
    let mut col3 = vec![];
    let data = df.clone().collect()?.to_ndarray::<Float64Type>()?;
    for row in data.rows() {
        // I'm aware I could do this loop
        // more efficiently
        let a = row[1];
        let b = row[2];
        let c = complex_computation(a, b)?;
        col3.push(c);
    }
    let col3 = Series::new("Value3", col3);
    let df = df.with_column(col3.lit());
    Ok(df)
}

fn complex_computation(a: f64, b: f64) -> Result<f64> {
    ...
    Ok(c)
}
Having to clone, collect and convert to ndarray seems very inefficient to me and not really idiomatic. But I'm rather clueless as to how I could do this better, and most questions online discuss the Python API of Polars.
In Python I would use .struct() and .apply() with a lambda for this computation. But in Rust .struct() does not seem to exist, and .apply() works on one column only.
Has anyone attempted this before and come up with a better solution?