'DataFrame' object has no attribute 'withColumn'

stackoverflow.com › questions › 56988316 › dataframe-object-has-no-attribute-withcolumn

You mixed up a pandas DataFrame and a Spark DataFrame.

The issue is that a pandas DataFrame doesn't have the Spark method withColumn.

Answer from Ani Menon on Stack Overflow
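
A minimal sketch of that distinction, assuming only pandas is installed. The frame and the column names (age, age2, age3) are made up for illustration. It shows that a pandas DataFrame has no withColumn, and two common pandas ways to add a derived column instead.

```python
import pandas as pd

# Hypothetical data, for illustration only.
pdf = pd.DataFrame({"age": [2, 5], "name": ["Alice", "Bob"]})

# Confirm what kind of object you are holding before calling Spark methods.
print(type(pdf).__module__)          # a pandas module path, not a pyspark one
print(hasattr(pdf, "withColumn"))    # False

# pandas equivalents of adding a derived column:
with_age2 = pdf.assign(age2=pdf["age"] + 2)   # returns a new frame
pdf["age3"] = pdf["age"] + 3                  # modifies pdf in place
```

assign leaves the original frame untouched, which is the closer analogue to Spark's withColumn; bracket assignment mutates the frame.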

2 of 3

I figured it out. Thanks for the help.

def res(df):
    if df['data_type_x'] == df['data_type_y']:
        return 'no change'
    elif pd.isnull(df['data_type_x']):
        return 'new attribute'
    elif pd.isnull(df['data_type_y']):
        return 'deleted attribute'
    elif df['data_type_x'] != df['data_type_y'] and not pd.isnull(df['data_type_x']) and not pd.isnull(df['data_type_y']):
        return 'datatype change'

pd_merge['result'] = pd_merge.apply(res, axis = 1)
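
For larger frames, the same labelling could probably be done without a row-wise apply. The sketch below is one possible vectorized variant using numpy.select; the small pd_merge frame is a made-up stand-in for the asker's merged data, and only the column names data_type_x, data_type_y and result come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd_merge: an outer merge of two schema tables.
pd_merge = pd.DataFrame({
    "attribute": ["a", "b", "c", "d"],
    "data_type_x": ["int", None, "str", "int"],
    "data_type_y": ["int", "str", None, "float"],
})

x, y = pd_merge["data_type_x"], pd_merge["data_type_y"]

# Conditions are evaluated in order, mirroring the if/elif chain in res().
conditions = [x == y, x.isna(), y.isna()]
labels = ["no change", "new attribute", "deleted attribute"]
pd_merge["result"] = np.select(conditions, labels, default="datatype change")

print(pd_merge)
```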

Itsourcecode

itsourcecode.com › home › attributeerror: ‘dataframe’ object has no attribute ‘withcolumn’

attributeerror: 'dataframe' object has no attribute 'withcolumn' |Fixed

March 31, 2023 - The error message “‘DataFrame’ object has no attribute ‘withColumn’” occurs when you are trying to add a new column to a Pandas DataFrame using the withColumn() method.

Top answer

1 of 5

"sklearn.datasets" is a scikit package, where it contains a method load_iris().

By default, load_iris() returns an object that holds data, target and other members. To get the actual values, you have to read the data and target members themselves.

'iris.csv', on the other hand, holds the features and the target together.

FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.

from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)

2 of 5

The Iris Dataset from Sklearn is in Sklearn's Bunch format:

print(type(iris))
print(iris.keys())

output:

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

So, that's why you can access it as:

x=iris.data
y=iris.target

But when you read the CSV file as a DataFrame, as you mentioned:

iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()

output is:

    2   3
0   petal_length    petal_width
1   1.4 0.2
2   1.4 0.2
3   1.3 0.2
4   1.5 0.2

Here the column names are 2 and 3, and the real header row has become the first row of data.

First of all you should read the CSV file as:

df = pd.read_csv('iris.csv')

You should not pass header=None, because your CSV file already includes the column names, i.e. the header row.

So, now what you can do is something like this:

X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'

or if you want to use the column names then:

X = df[['petal_length', 'petal_width']]
y = df['species']

Also, if you want to convert labels from string to numerical format use sklearn LabelEncoder

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
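
The header=None pitfall described in this answer can be reproduced without the real file. This is a minimal sketch, assuming a tiny made-up CSV held in memory in place of iris.csv; the column names are illustrative.

```python
import io

import pandas as pd

# Tiny made-up CSV with a header row, standing in for iris.csv.
csv_text = "sepal_length,petal_length,species\n5.1,1.4,setosa\n4.9,1.4,setosa\n"

no_header = pd.read_csv(io.StringIO(csv_text), header=None)
with_header = pd.read_csv(io.StringIO(csv_text))

print(list(no_header.columns))    # integer positions instead of names
print(list(with_header.columns))  # names taken from the first row
print(len(no_header), len(with_header))  # header=None keeps the header as a data row
```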

Stack Overflow

stackoverflow.com › questions › 76139983 › dataframe-object-has-no-attribute-withcolumn-getting-this-error

'DataFrame' object has no attribute 'withColumn' Getting this error - Stack Overflow

April 30, 2023 -

from pyspark.sql import functions as f
from fuzzywuzzy import fuzz
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession, DataFrame

def matchstring(s1, s2):
    return fuzz.token_sort_ratio(s1, s2)

MatchUDF = f.udf(matchstring, StringType())
spark = SparkSession.builder.appName("test").getOrCreate()
df_merged = ps.merge(df_Sale_KR,df_Dist_Mast, on='Distributor_ID', how='left')
df_similarity_score = df_merged.withColumn("similarity_score", MatchUDF(f.col("source"), f.col("target")))
df_similarity_score.show()

...

I can't currently test this so may be a bit off, but the pandas merge function doesn't return a PySpark DataFrame; it returns a pandas DataFrame, which, as far as I know, does not implement withColumn.
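
If the merged data is meant to stay in pandas, the score column can be added without withColumn or a UDF. The sketch below is a hedged illustration under several assumptions: the two small tables are invented, and the scorer uses difflib.SequenceMatcher from the standard library as a stand-in, so its numbers will differ from fuzzywuzzy's token_sort_ratio.

```python
from difflib import SequenceMatcher

import pandas as pd


def match_string(s1: str, s2: str) -> int:
    # Stand-in scorer from the standard library; NOT fuzzywuzzy's token_sort_ratio.
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


# Hypothetical tables; only Distributor_ID, source and target come from the question.
df_sale = pd.DataFrame({"Distributor_ID": [1, 2], "source": ["Acme Corp", "Globex"]})
df_dist = pd.DataFrame({"Distributor_ID": [1, 2], "target": ["Acme Corp", "Initech"]})

df_merged = pd.merge(df_sale, df_dist, on="Distributor_ID", how="left")

# pandas way to add the column: plain assignment, no withColumn.
df_merged["similarity_score"] = [
    match_string(s, t) for s, t in zip(df_merged["source"], df_merged["target"])
]
print(df_merged)
```

To keep working in Spark instead, the pandas result would have to be converted to a Spark DataFrame first, or the merge done with Spark's own join.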

Apache

spark.apache.org › docs › latest › api › python › reference › pyspark.sql › api › pyspark.sql.DataFrame.withColumn.html

pyspark.sql.DataFrame.withColumn — PySpark 4.1.2 documentation

To avoid this, use select() with multiple columns at once. ...

>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
>>> df.withColumn('age2', df.age + 2).show()
+---+-----+----+
|age| name|age2|
+---+-----+----+
|  2|Alice|   4|
|  5|  Bob|   7|
+---+-----+----+

Plain English

python.plainenglish.io › how-to-fix-attributeerror-in-python-6cea86059a27

How to Fix AttributeError in Python? | by JOKEN VILLANUEVA | Python in Plain English

January 27, 2025 - The error message “‘DataFrame’ object has no attribute ‘withColumn’” occurs when you are trying to add a new column to a Pandas DataFrame using the withColumn() method.

reddit.com › r/learnpython › "'dataframe' object has no attribute" issue

r/learnpython on Reddit: "'DataFrame' object has no attribute" Issue

October 30, 2020 -

I am in university and am taking a special topics class regarding AI. I have zero knowledge about Python, how it works, or what anything means.

A project for the class involves manipulating Bayesian networks to predict how many and which individuals die upon the sinking of a ship. This is the code I am supposed to manipulate:

##EDIT VARIABLES TO THE VARIABLES OF INTEREST
train_var = train.loc[:,['Survived','Sex']]  
test_var = test.loc[:,['Sex']]  
BayesNet = BayesianModel([('Sex','Survived')])

I am supposed to add another variable, 'Pclass,' to the mix, paying attention to the order for causation. I have added that variable to every line of this code in every way imaginable and consistently get an error from this line:

predictions = pandas.DataFrame({'PassengerId': test.PassengerId,'Survived': hypothesis.Survived.tolist()})
predictions

For example, the error I get for this version of the code:

train_var = train.loc[:,['Survived','Pclass','Sex']]  
test_var = test.loc[:,['Pclass']]  
BayesNet = BayesianModel([('Sex','Pclass','Survived')])

is this:

AttributeError                            Traceback (most recent call last)
<ipython-input-98-16d9eb9451f7> in <module>
----> 1 predictions = pandas.DataFrame({'PassengerId': test.PassengerId,'Survived': hypothesis.Survived.tolist()})
      2 predictions

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5137             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5138                 return self[name]
-> 5139             return object.__getattribute__(self, name)
   5140 
   5141     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'Survived'

Honestly, I have no idea wtf any of this means. I have tried googling this issue and have come up with nothing.

Any help would be greatly appreciated. I know it's a lot.

Top answer

1 of 2

Double check if there's a space in the column name. 'Survived ' vs 'Survived' It happens more often than you'd think especially with CSV data source.

2 of 2

It's an issue with how you're calling the data and if it's actually there.

train.loc[:,['Survived','Sex']]

tells me that there's a DataFrame (which is from pandas, hence the error) called train and this line is trying to access parts of that dataframe (it's just a type of an array). Specifically, it's trying to access columns named Survived and Sex.

Similarly, this line tells me there's another dataframe (df) known as test with a column named Sex, and the line accesses that data.

test.loc[:,['Sex']]

The error code also informs me of some things

predictions = pandas.DataFrame({'PassengerId': test.PassengerId,'Survived': hypothesis.Survived.tolist()})

There's another df called predictions, built from a dict, which is trying to access information from another df called hypothesis. The attribute it's trying to access in the second key of the dict is

hypothesis.Survived.tolist()

which is a way of calling a column from that df. That is, when the predictions line is executed, it's trying to pull all the values from the Survived column of the hypothesis df.

The error is that the df doesn't actually have a column named Survived. So either there's missing data, or you're calling it wrong, or there's a missing reference.

Without knowing more about your code and your question, I can't really extrapolate much more.
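
One way to narrow this down is to list the columns the frame really has before using dot access. This is a minimal sketch; the helper name missing_columns and the small hypothesis frame are hypothetical and stand in for the asker's data.

```python
import pandas as pd


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that df does not contain."""
    return [name for name in required if name not in df.columns]


# Hypothetical stand-in for the `hypothesis` frame from the question.
hypothesis = pd.DataFrame({"Pclass": [1, 3], "Sex": ["female", "male"]})

print(list(hypothesis.columns))
print(missing_columns(hypothesis, ["Survived", "Sex"]))
```

If the expected name shows up in the missing list, the AttributeError comes from the data, not from pandas itself.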

GitHub

github.com › dask › dask › issues › 8624

AttributeError: 'DataFrame' object has no attribute 'name'; Various stack overflow / github suggested fixes not working · Issue #8624 · dask/dask

January 26, 2022 -

{
  File "[redacted]/pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'name'. Did you mean: 'rename'?
}

Author dask

Databricks Community

community.databricks.com › t5 › data-engineering › attributeerror-dataframe-object-has-no-attribute-rename › td-p › 28109

Solved: AttributeError: 'DataFrame' object has no attribut... - Databricks Community - 28109

January 2, 2024 - https://stackoverflow.com/questions/38134643/data-frame-object-has-no-attribute ... If df_boston is a DataFrame, but you still face issues, try an alternative syntax: df_boston = df_boston.rename(columns={'zn': 'Zoning'}).

Stack Overflow

stackoverflow.com › q › 46832357

apache spark - pyspark: DataFrame.withColumn() sometimes requires assignment to a new DataFrame with a different name - Stack Overflow

df_new = df.withColumn('AMOUNT', df.AMOUNT*lit(-1)) => works! When I use other methods or UDFs, it doesn't seem to exhibit the same weirdness. I can just assign the DataFrame back to itself.

reddit.com › r/learnpython › attributeerror: 'dataframe' object has no attribute 'data'

r/learnpython on Reddit: AttributeError: 'DataFrame' object has no attribute 'data'

September 29, 2021 -

wine = pd.read_csv("combined.csv", header=0).iloc[:-1]
df = pd.DataFrame(wine)
df
dataset = pd.DataFrame(df.data, columns =df.feature_names)
dataset['target']=df.target
dataset

ERROR:

<ipython-input-27-64122078da92> in <module>
----> 1 dataset = pd.DataFrame(df.data, columns =df.feature_names)
      2 dataset['target']=df.target
      3 dataset

D:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'data'

I'm trying to set up a target to proceed with my Multi Linear Regression Project, but I can't even do that. I've already downloaded the CSV file and have it uploaded on a Jupyter Notebook. What am I doing wrong?

Top answer

1 of 3

2 of 3

Is there a column in your file called data? You already have a df from the first line. Maybe you just want to rename the columns: df.rename(columns={})
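
The .data, .feature_names and .target attributes belong to scikit-learn's Bunch objects, not to a DataFrame read from a CSV. A hedged sketch of how features and target could be split from a plain DataFrame follows; the rows are made up, and treating "quality" as the target column is an assumption, since the post does not show the file's columns.

```python
import pandas as pd

# Made-up rows standing in for combined.csv; "quality" as the target is an assumption.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.1],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

target_column = "quality"
X = df.drop(columns=[target_column])   # feature columns
y = df[target_column]                  # target Series

print(list(X.columns))
print(y.tolist())
```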

Stack Overflow

stackoverflow.com › questions › 38134643 › how-to-resolve-attributeerror-dataframe-object-has-no-attribute

python - How to resolve AttributeError: 'DataFrame' object has no attribute - Stack Overflow

Top answer

1 of 7

Check your DataFrame with data.columns

It should print something like this

Index([u'regiment', u'company',  u'name',u'postTestScore'], dtype='object')

Check for hidden whitespace. Then you can rename with

data = data.rename(columns={'Number ': 'Number'})

2 of 7

I think the column name that contains "Number" is something like " Number" or "Number ". I'm assuming you might have a residual space in the column name. Please run print("<{}>".format(data.columns[1])) and see what you get. If it's something like < Number>, it can be fixed with:

data.columns = data.columns.str.strip()

See pandas.Series.str.strip

In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, is caused because . notation has been used to reference a nonexistent column name or pandas method.

pandas methods are accessed with a dot (.). pandas columns can also be accessed with a dot (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).

data.columns = data.columns.str.strip() is a quick way to remove leading and trailing spaces from all column names. Otherwise verify the column or attribute is correctly spelled.
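
Dot access has a second pitfall beyond stray spaces: a column whose name matches an existing DataFrame method, or contains a space, cannot be reached with a dot. A small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2], "unit price": [3.0, 4.5]})

# Dot access finds the DataFrame.count method first, not the column.
print(callable(df.count))        # True

# Bracket access always refers to the column.
print(df["count"].tolist())
print(df["unit price"].tolist())
```

Bracket access is therefore the safer default whenever column names come from an external file.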

Stack Exchange

gis.stackexchange.com › questions › 317928 › equivalent-method-for-withcolumn-for-geodataframe

python - Equivalent method for .withcolumn() for geodataframe - Geographic Information Systems Stack Exchange

Top answer

1 of 1

To add a new column into a (geo)pandas.(Geo)DataFrame, you should use the .assign method. The column names are passed as keyword arguments, and the values can be scalars, sequences, or callable functions and methods that accept the dataframe in its current state as the first argument.

So your code becomes:

shpfile = "/dbfs/FileStore/tables/gda_000a11a_e.shp"
gda_GDF = (
    GeoDataFrame.from_file(shpfile)
        .assign(exists=lambda gdf: gdf.contains(other_geom))
)
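
The same assign pattern works on a plain pandas DataFrame, which makes it easy to try without geopandas or a shapefile. A minimal sketch with made-up columns, showing that a callable passed to assign receives the frame in its current state:

```python
import pandas as pd

df = pd.DataFrame({"width": [2, 3], "height": [5, 4]})

result = (
    df
    .assign(area=lambda d: d["width"] * d["height"])   # callable sees the current frame
    .assign(is_large=lambda d: d["area"] > 10)          # can use the column added above
)

print(result)
```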

YouTube

youtube.com › watch

How to fix AttributeError: 'DataFrame' object has no attribute 'columns' whe... in Python - YouTube

Hello, Dedicated Coders! 🖥️💡We're excited to share with you our newest video, "How to solve AttributeError: 'DataFrame' object has no attribute 'columns' ...

Published May 5, 2024

Stack Overflow

stackoverflow.com › questions › 50686616 › dataframe-object-has-no-attribute-apply-when-trying-to-apply-lambda-to-cre

python - "'DataFrame' object has no attribute 'apply'" when trying to apply lambda to create new column - Stack Overflow

Top answer

1 of 2

The syntax you are using is for a pandas DataFrame. To achieve this for a spark DataFrame, you should use the withColumn() method. This works great for a wide range of well defined DataFrame functions, but it's a little more complicated for user defined mapping functions.

General Case

In order to define a udf, you need to specify the output data type. For instance, if you wanted to apply a function my_func that returned a string, you could create a udf as follows:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType
my_udf = f.udf(my_func, StringType())

Then you can use my_udf to create a new column like:

df = df.withColumn('new_column', my_udf(f.col("some_column_name")))

Another option is to use select:

df = df.select("*", my_udf(f.col("some_column_name")).alias("new_column"))

Specific Problem

Using a udf

In your specific case, you want to use a dictionary to translate the values of your DataFrame.

Here is a way to define a udf for this purpose:

from pyspark.sql.types import IntegerType
some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())

Notice that I used dict.get() because you want your udf to be robust to bad inputs.

df = df.withColumn('new_column', some_map_udf(f.col("some_column_name")))

Using DataFrame functions

Sometimes using a udf is unavoidable, but whenever possible, using DataFrame functions is usually preferred.

Here is one option to do the same thing without using the udf.

The trick is to iterate over the items in some_map to create a list of pyspark.sql.functions.when() functions.

some_map_func = [f.when(f.col("some_column_name") == k, v) for k, v in some_map.items()]
print(some_map_func)
#[Column<CASE WHEN (some_column_name = a) THEN 0 END>,
# Column<CASE WHEN (some_column_name = c) THEN 1 END>,
# Column<CASE WHEN (some_column_name = b) THEN 1 END>]

Now you can use pyspark.sql.functions.coalesce() inside of a select:

df = df.select("*", f.coalesce(*some_map_func).alias("new_column"))

This works because when() returns null by default if the condition is not met, and coalesce() will pick the first non-null value it encounters. Since the keys of the map are unique, at most one column will be non-null.

2 of 2

You have a Spark DataFrame, not a pandas DataFrame. To add a new column to the Spark DataFrame:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
df = df.withColumn('new_column', F.udf(some_map.get, IntegerType())(F.col('some_column_name')))
df.show()

Polars

docs.pola.rs › py-polars › html › reference › dataframe › api › polars.DataFrame.with_columns.html

polars.DataFrame.with_columns — Polars documentation

Add columns to this DataFrame. Added columns will replace existing columns with the same name. ... Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

Kaggle

kaggle.com › general › 108926

AttributeError: 'DataFrame' object has no attribute 'dtype' ...

Saturn Cloud

saturncloud.io › blog › solving-the-dataframe-object-has-no-attribute-name-error-in-pandas

Solving the 'DataFrame Object Has No Attribute 'name' Error in Pandas | Saturn Cloud Blog

July 10, 2023 - Running this code will result in an AttributeError: 'DataFrame' object has no attribute 'name'. This is because a DataFrame as a whole does not have a 'name' attribute.
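
A short sketch of what that looks like in practice, using a made-up one-column frame. It also shows one possible place to keep a label for the whole frame, the attrs dictionary, which the pandas documentation describes as experimental.

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.5]})

print(hasattr(df, "name"))     # False: no such attribute and no column called "name"
print(df["price"].name)        # a Series carries the name of its column

# One place to keep a label for the whole frame is the attrs dict.
df.attrs["name"] = "prices"
print(df.attrs["name"])
```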

Stack Overflow

stackoverflow.com › questions › 46169022 › dataframe-object-has-no-attribute-col-name

python - 'DataFrame' object has no attribute 'col_name' - Stack Overflow

Top answer

1 of 3

I think read_table uses a tab as its default separator, so you need to pass the sep parameter:

x = pd.read_table('path to csv', sep=',')

Or use read_csv, whose default separator is a comma, so sep can be omitted:

x = pd.read_csv('path to csv')
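
The effect of the wrong separator is easy to see on an in-memory file. This is a minimal sketch with made-up data: with the default tab separator, read_table treats each comma-separated line as a single field, so the expected column names never exist.

```python
import io

import pandas as pd

csv_text = "col1,col2\n1,2\n3,4\n"

tab_default = pd.read_table(io.StringIO(csv_text))           # sep defaults to a tab
comma_sep = pd.read_table(io.StringIO(csv_text), sep=",")
via_read_csv = pd.read_csv(io.StringIO(csv_text))

print(list(tab_default.columns))   # a single column holding the whole header line
print(list(comma_sep.columns))
print(list(via_read_csv.columns))
```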

2 of 3

Try to strip the potential whitespaces around the column name with this:

x.columns = [col.strip() for col in x.columns.tolist()]

Or as suggested in the documentation here and highlighted in @jezrael's answer:

x.columns = x.columns.str.strip()

Then, you will be able to access columns with x.col1..x.coln. Also be aware that column names are case sensitive.

Example:

>>> import pandas as pd 
>>> df = pd.DataFrame([[1,2],[3,4]], columns=[' col1', 'col2 '])
>>> df
    col1  col2 
0      1      2
1      3      4
>>> df.col1
Traceback (most recent call last):
...    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'col1'
>>> df.col2 
Traceback (most recent call last):
...    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'col2'
>>> df.columns = [col.strip() for col in df.columns.tolist()]
>>> df.col1
0    1
1    3
Name: col1, dtype: int64
>>> df.col2 
0    2
1    4
Name: col2, dtype: int64
>>>

Stack Overflow

stackoverflow.com › questions › 57363618 › pyspark-dataframe-object-has-no-attribute-get-object-id

python - pyspark 'DataFrame' object has no attribute '_get_object_id' - Stack Overflow

Top answer

1 of 2

You can't reference a second spark DataFrame inside a function, unless you're using a join. IIUC, you can do the following to achieve your desired result.

Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean, when

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+

2 of 2

I think you are using Scala API, in which you use (). In PySpark, use [] instead.