You can't reference a second Spark DataFrame inside a function unless you're using a join. IIUC, you can do the following to achieve your desired result.

Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition:

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean, when

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+
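
To make the logic concrete, here is a plain-Python sketch of what the window expression computes: a per-id mean over the non-null values, used only to fill the nulls. The sample rows are hypothetical (chosen to reproduce the means shown above), and this only illustrates the idea; it is not how Spark executes it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows (id, col1); None marks a missing value.
rows = [
    (1, 12.0), (1, None), (1, 14.0), (1, 10.0),
    (3, 300.0), (3, None),
    (2, 22.0), (2, 20.0), (2, None),
]

# Collect the non-null values per id, analogous to partitionBy("id").
values_by_id = defaultdict(list)
for id_, value in rows:
    if value is not None:
        values_by_id[id_].append(value)

# Average only the non-null values in each group.
mean_by_id = {id_: mean(values) for id_, values in values_by_id.items()}

# Replace each null with its group's mean; other values stay unchanged.
# .get() leaves None in place for a group that has no non-null values.
filled = [
    (id_, mean_by_id.get(id_) if value is None else value)
    for id_, value in rows
]
```

The dictionary lookup plays the role of the join against means; the window version gets the same result without building the second DataFrame.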
Answer from pault on Stack Overflow
Cumulative Sum
cumsum.wordpress.com › 2020 › 10 › 10 › pyspark-attributeerror-dataframe-object-has-no-attribute-_get_object_id
[pyspark] AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’
October 10, 2020 - AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’ · The reason being that isin expects actual local values or collections but df2.select('id') returns a data frame.
Incorta Community
community.incorta.com › t5 › data-schemas-knowledgebase › issue-with-converting-a-pandas-dataframe-to-a-spark-dataframe › ta-p › 5279
Issue with converting a Pandas DataFrame to a Spar... - Incorta Community
November 15, 2023 - Symptoms You received the error when trying to convert a Pandas DataFrame to Spark DataFrame in a PySpark MV. Here is the error.- INC_03070101: Transformation error Error 'DataFrame' object has no attribute 'iteritems' AttributeError : 'DataFrame' object has no attribute 'iteritems' Diagnosis ...
Cloudera Community
community.cloudera.com › t5 › Support-Questions › Pyspark-issue-AttributeError-DataFrame-object-has-no › m-p › 78093
Pyspark issue AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'
January 2, 2024 - As the error message states, the object, either a DataFrame or List does not have the saveAsTextFile() method. result.write.save() or result.toJavaRDD.saveAsTextFile() should do the work, or you can refer to DataFrame or RDD api: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter · https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD ... To save a DataFrame as a text file in PySpark, you need to convert it to an RDD first, or use DataFrame writer functions.
AWS re:Post
repost.aws › questions › QUvWrsRjenSrqHLJqLpy4DWg › attributeerror-dataframe-object-has-no-attribute-get-object-id
AttributeError: 'DataFrame' object has no attribute '_get_object_id' | AWS re:Post
October 11, 2018 - AttributeError: 'DataFrame' object has no attribute '_get_object_id' when I run the script. I'm pretty confident the error is occurring during this line: datasink = glueContext.write_dynamic_frame.from_catalog(frame = source_dynamic_frame, database = target_database, table_name = target_table_name, transformation_ctx = "datasink") but I can't decipher what it's trying to tell me. Can anyone please help me out or point me in the right direction? Thanks! %pyspark import sys from pyspark.context import SparkContext from pyspark.sql.functions import lit, current_timestamp from pyspark.sql.window i
Spark By {Examples}
sparkbyexamples.com › home › hbase › attributeerror: ‘dataframe’ object has no attribute ‘map’ in pyspark
AttributeError: 'DataFrame' object has no attribute 'map' in PySpark - Spark By {Examples}
April 3, 2021 - Problem: In PySpark I am getting error AttributeError: 'DataFrame' object has no attribute 'map' when I use map() transformation on DataFrame.
Apache
spark.apache.org › docs › latest › api › python › reference › pyspark.pandas › frame.html
DataFrame — PySpark 4.1.1 documentation - Apache Spark
DataFrame.spark provides features that does not exist in pandas but in Spark. These can be accessed by DataFrame.spark.<function/property>. DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the form DataFrame.plot.<kind>.
Stack Overflow
stackoverflow.com › questions › 72442900 › dataframe-py-in-getattr-attributeerror-dataframe-object-has-no-attribut
pyspark - dataframe.py", in __getattr__ AttributeError: 'DataFrame' object has no attribute 'index' - Stack Overflow
May 31, 2022 - ERROR as below File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1643, in getattr AttributeError: 'DataFrame' object has no attribute 'index'
Top answer
1 of 4
2

As a workaround, downgrade to pandas v1.5

%pip install --upgrade pandas==1.5

The answers provided so far worked until April 3, 2023.

As of April 4, with pandas 2.0.0, you are not able to convert a Pandas DataFrame to a Spark DataFrame using the command:

spark.createDataFrame(df)

Using the above command leads to the error mentioned in the question:

AttributeError: 'DataFrame' object has no attribute 'iteritems'

The iteritems function seems to have been removed in pandas 2.0.0. From the changelog of pandas 2.0.0:

Removed deprecated Series.iteritems(), DataFrame.iteritems(), use obj.items instead

However, the Spark code that converts a pandas DataFrame to a Spark DataFrame still uses iteritems:

/databricks/spark/python/pyspark/sql/pandas/conversion.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    308                     warnings.warn(msg)
    309                     raise
--> 310         data = self._convert_from_pandas(data, schema, timezone)
    311         return self._create_dataframe(data, schema, samplingRatio, verifySchema)
    312 

/databricks/spark/python/pyspark/sql/pandas/conversion.py in _convert_from_pandas(self, pdf, schema, timezone)
    340                             pdf[field.name] = s
    341             else:
--> 342                 for column, series in pdf.iteritems():
    343                     s = _check_series_convert_timestamps_tz_local(series, timezone)
    344                     if s is not series:

Looks like we will have to wait for a fix to use Pandas 2.0.0.
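
If downgrading is not an option, a stopgap that is often used is to alias the removed method back onto DataFrame before calling spark.createDataFrame. The sketch below is an assumption-laden workaround rather than an official fix: it monkey-patches pandas, and the sample DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
pdf = pd.DataFrame({"id": [1, 2], "col1": [10.0, 20.0]})

# DataFrame.items() is the replacement named in the pandas 2.0.0 changelog
# and is available in pandas 1.x as well.
column_names = [name for name, _series in pdf.items()]

# Stopgap: restore the old name only when it is missing (pandas >= 2.0),
# so code that still calls iteritems() keeps working.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

patched_names = [name for name, _series in pdf.iteritems()]
```

Because the patch is guarded by hasattr, it does nothing on pandas 1.x, where iteritems still exists.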

2 of 4
2

You just need to call the display function with the Pandas DataFrame as its argument, rather than trying to call it as a member of the Pandas DataFrame class.

display(pdf)

Or you can simply write the name of the variable that holds the Pandas DataFrame; it will then be printed using pandas' built-in representation.

import pyspark.sql.functions as F

pdf = spark.range(10).withColumn("rnd", F.rand()).toPandas()

Top answer
1 of 5
2

sklearn.datasets is a scikit-learn module that contains the function load_iris().

By default, load_iris() returns an object that holds data, target, and other members. To get the actual values, you have to read the data and target attributes themselves.

The 'iris.csv' file, by contrast, holds the features and the target together.

FYI: if you set return_X_y=True in load_iris(), you get the features and target directly.

from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)
2 of 5
1

The Iris Dataset from Sklearn is in Sklearn's Bunch format:

print(type(iris))
print(iris.keys())

output:

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

So, that's why you can access it as:

x=iris.data
y=iris.target
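
Attribute access works there because a Bunch is a dict whose keys are also exposed as attributes. A minimal sketch of that idea follows; MiniBunch is a hypothetical stand-in written for illustration, not sklearn's actual class, and the values are made up.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for a Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; fall back to the keys.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


iris_like = MiniBunch(data=[[5.1, 3.5, 1.4, 0.2]], target=[0])

x = iris_like.data      # same object as iris_like["data"]
y = iris_like.target    # same object as iris_like["target"]
```

A pandas DataFrame has no data or target keys, so the same attribute lookups fail on it.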

But when you read the CSV file into a DataFrame the way you described:

iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()

output is:

    2   3
0   petal_length    petal_width
1   1.4 0.2
2   1.4 0.2
3   1.3 0.2
4   1.5 0.2

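Here the column names are the integers 2 and 3, and the real header ('petal_length', 'petal_width') has been read as the first row of data, because header=None tells pandas that the file has no header row.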

First of all you should read the CSV file as:

df = pd.read_csv('iris.csv')

You should not pass header=None, because your CSV file includes the column names, i.e. the headers.

So, now what you can do is something like this:

X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'

or if you want to use the column names then:

X = df[['petal_length', 'petal_width']]
y = df['species']

Also, if you want to convert labels from string to numerical format use sklearn LabelEncoder

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
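
For intuition, the mapping LabelEncoder produces can be sketched in plain Python: the distinct labels are sorted and each one is replaced by its position in that sorted list. The label values below are hypothetical, and this is only an illustration of the result, not sklearn's implementation.

```python
# Hypothetical string labels, as they might appear in the 'species' column.
y_labels = ["setosa", "versicolor", "setosa", "virginica"]

# Sorted distinct labels; each label's index becomes its numeric code.
classes = sorted(set(y_labels))
label_to_index = {label: index for index, label in enumerate(classes)}

# Encode every label with its numeric code.
y_encoded = [label_to_index[label] for label in y_labels]
```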
Databricks Community
community.databricks.com › databricks community › data engineering › attributeerror: 'dataframe' object has no attribute 'dropduplicateswithinwatermark'
AttributeError: 'DataFrame' object has no attribut... - Databricks Community - 61132
February 19, 2024 - Hello, I have some trouble deduplicating rows on the "id" column, with the method "dropDuplicatesWithinWatermark" in a pipeline. When I run this pipeline, I get the error message: "AttributeError: 'DataFrame' object has no attribute 'dropDuplicatesWithinWatermark'" Here is part of the code: @dl...
Stack Overflow
stackoverflow.com › questions › 37039341 › attributeerror-dataframe-object-has-no-attribute-get-on-vectorassembler-spa
python - AttributeError: 'DataFrame' object has no attribute 'get' on VectorAssembler spark ML - Stack Overflow
May 8, 2017 - Traceback (most recent call last): File "/tmp/zeppelin_pyspark.py", line 164, in <module> intp.setStatementsFinished(output.get(), False) File "/home/zeppelin/zeppelin-0.5.5-incubating-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/dataframe.py", line 749, in __getattr__ "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) AttributeError: 'DataFrame' object has no attribute 'get'
GitHub
github.com › microsoft › FLAML › issues › 625
AttributeError: 'DataFrame' object has no attribute 'copy' · Issue #625 · microsoft/FLAML
July 2, 2022 - I m using autoML(FLAML) with Spark on large data. The error image is given below train = spark.read.parquet("./train.parquet") test = spark.read.parquet("./test.parquet") input_cols = [c for c in train.columns if c != 'target'] vectorAss...
Author   microsoft
Apache
spark.apache.org › docs › latest › api › python › reference › pyspark.sql › api › pyspark.sql.Column.getItem.html
pyspark.sql.Column.getItem — PySpark 4.1.2 documentation
>>> df = spark.createDataFrame([([1, 2], {"key": "value"})], ["l", "d"]) >>> df.select(df.l.getItem(0), df.d.getItem("key")).show() +----+------+ |l[0]|d[key]| +----+------+ | 1| value| +----+------+