dataframe' object has no attribute 'getitem' pyspark

pyspark 'DataFrame' object has no attribute '_get_object_id'

stackoverflow.com › questions › 57363618 › pyspark-dataframe-object-has-no-attribute-get-object-id

You can't reference a second spark DataFrame inside a function, unless you're using a join. IIUC, you can do the following to achieve your desired result.

Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+

Answer from pault on Stack Overflow

Stack Overflow

stackoverflow.com › questions › 57363618 › pyspark-dataframe-object-has-no-attribute-get-object-id

python - pyspark 'DataFrame' object has no attribute '_get_object_id' - Stack Overflow

Top answer

1 of 2

You can't reference a second spark DataFrame inside a function, unless you're using a join. IIUC, you can do the following to achieve your desired result.

Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+

2 of 2

-5

I think you are using Scala API, in which you use (). In PySpark, use [] instead.

Cumulative Sum

cumsum.wordpress.com › 2020 › 10 › 10 › pyspark-attributeerror-dataframe-object-has-no-attribute-_get_object_id

[pyspark] AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’

October 10, 2020 - AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’ · The reason being that isin expects actual local values or collections but df2.select('id') returns a data frame.

Stack Overflow

stackoverflow.com › questions › 67267858 › dataframe-object-has-no-attribute-get

pyspark - DataFrame' object has no attribute 'get - Stack Overflow

Top answer

1 of 1

Seaborn expects pandas dataframes and not spark dataframes. Change your code to:

Copyg1= sns.scatterplot( x= "events", y= "ways" , data = x2.toPandas())

and it should work.

Incorta Community

community.incorta.com › t5 › data-schemas-knowledgebase › issue-with-converting-a-pandas-dataframe-to-a-spark-dataframe › ta-p › 5279

Issue with converting a Pandas DataFrame to a Spar... - Incorta Community

November 15, 2023 - Symptoms You received the error when trying to convert a Pandas DataFrame to Spark DataFrame in a PySpark MV. Here is the error.- INC_03070101: Transformation error Error 'DataFrame' object has no attribute 'iteritems' AttributeError : 'DataFrame' object has no attribute 'iteritems' Diagnosis ...

Ggbaker

ggbaker.ca › 732 › content › spark-df.html

Spark DataFrames Concepts

There are many Spark DataFrame functions that can be used in column expressions, as well as basic Python operators that are overloaded to imply a column operation. There are many places you have to refer to a column, and there are three different ways to do it. These are equivalent: data.groupby(data['lname']) # as a getitem on the DF data.groupby('lname') # by column name only data.groupby(data.lname) # as a property on the DF

Stack Overflow

stackoverflow.com › questions › 71446185 › pyspark-attributeerror-dataframe-object-has-no-attribute-cast

apache spark sql - pyspark AttributeError: 'DataFrame' object has no attribute 'cast' - Stack Overflow

Top answer

1 of 3

A short, clean, scalable solution
Change some columns, leave the rest untouched

import pyspark.sql.functions as F

# That's not part of the solution, just a creation of a sample dataframe
# df = spark.createDataFrame([(10, 1,2,3,4),(20, 5,6,7,8)],'Id int, Revenue int ,GROSS_PROFIT int ,Net_Income int ,Enterprise_Value int')

cols_to_cast = ["Revenue" ,"GROSS_PROFIT" ,"Net_Income" ,"Enterprise_Value"]
df = df.select([F.col(c).cast('double') if c in cols_to_cast else c for c in df.columns])

df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- GROSS_PROFIT: double (nullable = true)
 |-- Net_Income: double (nullable = true)
 |-- Enterprise_Value: double (nullable = true)

2 of 3

If this helps

df = spark.createDataFrame([(1, 0),
(2, 1),
(3  ,1),
(4, 1),
(5, 0),
(6  ,0),
(7, 1),
(8  ,1),
(9  ,1),
(10,    1),
(11,    0),
(12,    0)],
('Time' ,'Tag1'))

df = df.withColumn('a', col('Time').cast('integer')).withColumn('a1', col('Tag1').cast('double'))
df.printSchema()
df.show()

Stack Overflow

stackoverflow.com › questions › 75926636 › databricks-issue-while-creating-spark-data-frame-from-pandas

python - Databricks: Issue while creating spark data frame from pandas - Stack Overflow

Top answer

1 of 5

It's related to the Databricks Runtime (DBR) version used - the Spark versions in up to DBR 12.2 rely on .iteritems function to construct a Spark DataFrame from Pandas DataFrame. This issue was fixed in the Spark 3.4 that is available as DBR 13.x.

If you can't upgrade to DBR 13.x, then you need to downgrade the Pandas to latest 1.x version (1.5.3 right now) by using %pip install -U pandas==1.5.3 command in your notebook. Although it's just better to use Pandas version shipped with your DBR - it was tested for compatibility with other packages in DBR.

2 of 5

I couldn't change package versions, but it looks like this was a name change only.

So I did

df.iteritems = df.items

and spark.createDataFrame(df) works now.

Sure, it's ugly, and it will break my notebook when I move to a cluster with a new DBR, but it works for now.

EDIT: AyoubH's answer is better because you only have to do it once. With the code above, you have to modify every data frame you display.

Cloudera Community

community.cloudera.com › t5 › Support-Questions › Pyspark-issue-AttributeError-DataFrame-object-has-no › m-p › 78093

Pyspark issue AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'

January 2, 2024 - As the error message states, the object, either a DataFrame or List does not have the saveAsTextFile() method. result.write.save() or result.toJavaRDD.saveAsTextFile() shoud do the work, or you can refer to DataFrame or RDD api: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter · https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD ... To save a DataFrame as a text file in PySpark, you need to convert it to an RDD first, or use DataFrame writer functions.

AWS re:Post

repost.aws › questions › QUvWrsRjenSrqHLJqLpy4DWg › attributeerror-dataframe-object-has-no-attribute-get-object-id

AttributeError: 'DataFrame' object has no attribute '_get_object_id' | AWS re:Post

October 11, 2018 - AttributeError: 'DataFrame' object has no attribute '_get_object_id' when I run the script. I'm pretty confident the error is occurring during this line: datasink = glueContext.write_dynamic_frame.from_catalog(frame = source_dynamic_frame, database = target_database, table_name = target_table_name, transformation_ctx = "datasink") but I can't decipher what it's trying to tell me. Can anyone please help me out or point me in the right direction? Thanks! %pyspark import sys from pyspark.context import SparkContext from pyspark.sql.functions import lit, current_timestamp from pyspark.sql.window i

Find elsewhere

Google Bing Mojeek

Spark By {Examples}

sparkbyexamples.com › home › hbase › attributeerror: ‘dataframe’ object has no attribute ‘map’ in pyspark

AttributeError: 'DataFrame' object has no attribute 'map' in PySpark - Spark By {Examples}

April 3, 2021 - Problem: In PySpark I am getting error AttributeError: 'DataFrame' object has no attribute 'map' when I use map() transformation on DataFrame.

Apache

spark.apache.org › docs › latest › api › python › reference › pyspark.pandas › frame.html

DataFrame — PySpark 4.1.1 documentation - Apache Spark

DataFrame.spark provides features that does not exist in pandas but in Spark. These can be accessed by DataFrame.spark.<function/property>. DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the form DataFrame.plot.<kind>.

Stack Overflow

stackoverflow.com › questions › 72442900 › dataframe-py-in-getattr-attributeerror-dataframe-object-has-no-attribut

pyspark - dataframe.py", in __getattr__ AttributeError: 'DataFrame' object has no attribute 'index' - Stack Overflow

May 31, 2022 - ERROR as below File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1643, in getattr AttributeError: 'DataFrame' object has no attribute 'index'

Stack Overflow

stackoverflow.com › questions › 75922697 › anyone-know-how-to-display-a-pandas-dataframe-in-databricks › 75927530

python - Anyone know how to display a pandas dataframe in Databricks? - Stack Overflow

Top answer

1 of 4

As a workaround, downgrade to pandas v1.5

%pip install --upgrade pandas==1.5

The answers provided till now used to work prior to 3rd April 2023.

As of April 4, with pandas 2.0.0, you are not able to convert a Pandas DataFrame to a Spark DataFrame using the command:

spark.createDataFrame(df)

Using the above command leads to the error mentioned in the question:

AttributeError: 'DataFrame' object has no attribute 'iteritems'

The iteritems function seems to have been removed in pandas 2.0.0. From the changelog of pandas 2.0.0:

Removed deprecated Series.iteritems(), DataFrame.iteritems(), use obj.items instead

While the code written in spark to convert pandas dataframe to a spark dataframe still uses iteritems

/databricks/spark/python/pyspark/sql/pandas/conversion.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    308                     warnings.warn(msg)
    309                     raise
--> 310         data = self._convert_from_pandas(data, schema, timezone)
    311         return self._create_dataframe(data, schema, samplingRatio, verifySchema)
    312 

/databricks/spark/python/pyspark/sql/pandas/conversion.py in _convert_from_pandas(self, pdf, schema, timezone)
    340                             pdf[field.name] = s
    341             else:
--> 342                 for column, series in pdf.iteritems():
    343                     s = _check_series_convert_timestamps_tz_local(series, timezone)
    344                     if s is not series:

Looks like we will have to wait for a fix to use Pandas 2.0.0.

2 of 4

You just need to use display function passing Pandas DataFrame as the argument - not try to call it as a member of the Pandas DataFrame class.

display(pdf)

Or you can simply specify variable name with Pandas DataFrame object - then it will be printed using Panda's built-in representation

import pyspark.sql.functions as F

pdf = spark.range(10).withColumn("rnd", F.rand()).toPandas()

Stack Overflow

stackoverflow.com › questions › 68550053 › pyspark-attributeerror-dataframe-object-has-no-attribute-values

pandas - PySpark : AttributeError: 'DataFrame' object has no attribute 'values' - Stack Overflow

Top answer

1 of 1

The syntax is valid with Pandas DataFrames but that attribute doesn't exist for the PySpark created DataFrames. You can check out this link for the documentation.

Usually, the collect() method or the .rdd attribute would help you with these tasks.

You can use the following snippet to produce the desired result:

http_path = sdf.rdd.map(lambda row: row['http_path'].split('?'))
api_param_df = pd.DataFrame([[row[0], np.nan] if len(row) == 1 else row for row in http_path.collect()], columns=["api", "param"])
sdf = pd.concat([sdf.toPandas()['raw'], api_param_df], axis=1)

Note that I removed the comments to make it more readable and I've also substituted the regex with a simple split.

Stack Exchange

datascience.stackexchange.com › questions › 37435 › i-got-the-following-error-dataframe-object-has-no-attribute-data

python - I got the following error : 'DataFrame' object has no attribute 'data' - Data Science Stack Exchange

Top answer

1 of 5

"sklearn.datasets" is a scikit package, where it contains a method load_iris().

load_iris(), by default return an object which holds data, target and other members in it. In order to get actual values you have to read the data and target content itself.

Whereas 'iris.csv', holds feature and target together.

FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.

from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)

2 of 5

The Iris Dataset from Sklearn is in Sklearn's Bunch format:

print(type(iris))
print(iris.keys())

output:

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

So, that's why you can access it as:

x=iris.data
y=iris.target

But when you read the CSV file as DataFrame as mentioned by you:

iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()

output is:

    2   3
0   petal_length    petal_width
1   1.4 0.2
2   1.4 0.2
3   1.3 0.2
4   1.5 0.2

Here the column names are '1' and '2'.

First of all you should read the CSV file as:

df = pd.read_csv('iris.csv')

you should not include header=None as your csv file includes the column names i.e. the headers.

So, now what you can do is something like this:

X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'

or if you want to use the column names then:

X = df[['petal_length', 'petal_width']]
y = df.iloc['species']

Also, if you want to convert labels from string to numerical format use sklearn LabelEncoder

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)

Databricks Community

community.databricks.com › databricks community › data engineering › attributeerror: 'dataframe' object has no attribute 'dropduplicateswithinwatermark'

AttributeError: 'DataFrame' object has no attribut... - Databricks Community - 61132

February 19, 2024 - Hello, I have some trouble deduplicating rows on the "id" column, with the method "dropDuplicatesWithinWatermark" in a pipeline. When I run this pipeline, I get the error message: "AttributeError: 'DataFrame' object has no attribute 'dropDuplicatesWithinWatermark'" Here is part of the code: @dl...

Stack Overflow

stackoverflow.com › questions › 37039341 › attributeerror-dataframe-object-has-no-attribute-get-on-vectorassembler-spa

python - AttributeError: 'DataFrame' object has no attribute 'get' on VectorAssembler spark ML - Stack Overflow

May 8, 2017 - Traceback (most recent call last): File "/tmp/zeppelin_pyspark.py", line 164, in <module> intp.setStatementsFinished(output.get(), False) File "/home/zeppelin/zeppelin-0.5.5-incubating-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/dataframe.py", line 749, in __getattr__ "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) AttributeError: 'DataFrame' object has no attribute 'get'

GitHub

github.com › microsoft › FLAML › issues › 625

AttributeError: 'DataFrame' object has no attribute 'copy' · Issue #625 · microsoft/FLAML

July 2, 2022 - I m using autoML(FLAML) with Spark on large data. The error image is given below train = spark.read.parquet("./train.parquet") test = spark.read.parquet("./test.parquet") input_cols = [c for c in train.columns if c != 'target'] vectorAss...

Author microsoft

Stack Overflow

stackoverflow.com › questions › 51813517 › dataframe-object-has-no-attribute-col

apache spark - DataFrame object has no attribute 'col' - Stack Overflow

Top answer

1 of 5

The book you're referring to describes Scala / Java API. In PySpark use []

df["count"]

2 of 5

The book combines the Scala and PySpark API's.

In Scala / Java API, df.col("column_name") or df.apply("column_name") return the Column.

Whereas in pyspark use the below to get the column from DF.

df.colName
df["colName"]

Stack Overflow

stackoverflow.com › questions › 54347570 › attribute-error-pyspark-dataframe-error › 54347883

apache spark sql - Attribute error? Pyspark dataframe() error - Stack Overflow

Top answer

1 of 1

It means that you have an object of type Dataframe, and you are trying to call the attribute dataframe which does not exist.