attributeerror: 'dataframe' object has no attribute pyspark

PySpark : AttributeError: 'DataFrame' object has no attribute 'values'

stackoverflow.com › questions › 68550053 › pyspark-attributeerror-dataframe-object-has-no-attribute-values

The syntax is valid with Pandas DataFrames but that attribute doesn't exist for the PySpark created DataFrames. You can check out this link for the documentation.

Usually, the collect() method or the .rdd attribute would help you with these tasks.

You can use the following snippet to produce the desired result:

import numpy as np
import pandas as pd
# Split each http_path on '?' into [api] or [api, param]; pad a missing param with NaN
http_path = sdf.rdd.map(lambda row: row['http_path'].split('?'))
api_param_df = pd.DataFrame([[row[0], np.nan] if len(row) == 1 else row for row in http_path.collect()], columns=["api", "param"])
sdf = pd.concat([sdf.toPandas()['raw'], api_param_df], axis=1)

Note that I removed the comments to make it more readable and I've also substituted the regex with a simple split.
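The per-row logic in the snippet above can also be pulled into a small plain-Python helper, which is easier to check without a cluster. This is only a sketch, not the answerer's code: `split_http_path` is a hypothetical name, and it uses `str.partition` on the assumption that a path containing more than one `?` should still yield exactly two fields.

```python
def split_http_path(path):
    """Split 'api?param' into (api, param); param is None when there is no '?'."""
    api, sep, param = path.partition("?")
    # partition() splits on the first '?' only, so the result always has two fields.
    return (api, param if sep else None)


# Hypothetical usage on a PySpark DataFrame named sdf with an 'http_path' column:
#   pairs = sdf.rdd.map(lambda row: split_http_path(row["http_path"])).collect()
```

Because the helper always returns two fields, the list it produces can be passed to a two-column constructor without the length check.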

Answer from Ali Mirferdos on Stack Overflow

Cloudera Community

community.cloudera.com › t5 › Support-Questions › Pyspark-issue-AttributeError-DataFrame-object-has-no › m-p › 78093

Pyspark issue AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'

January 2, 2024 - As the error message states, the object, either a DataFrame or a List, does not have the saveAsTextFile() method. result.write.save() or result.toJavaRDD.saveAsTextFile() should do the work (in PySpark, the underlying RDD is reached through result.rdd), or you can refer to the DataFrame or RDD API: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org....

Cumulative Sum

cumsum.wordpress.com › 2020 › 10 › 10 › pyspark-attributeerror-dataframe-object-has-no-attribute-_get_object_id

[pyspark] AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’

October 10, 2020 - AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’ · The reason being that isin expects actual local values or collections but df2.select('id') returns a data frame.
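A minimal sketch of the fix this implies: bring the ids to the driver as a plain Python list before handing them to isin. `collect_column` is a hypothetical helper, and `df1` / `df2` in the comments are assumed DataFrames that both have an `id` column.

```python
def collect_column(df, name):
    """Return the values of one column as a plain Python list on the driver."""
    # select() narrows to one column; collect() brings the Rows to the driver.
    return [row[name] for row in df.select(name).collect()]


# Hypothetical usage:
#   ids = collect_column(df2, "id")
#   filtered = df1.filter(df1["id"].isin(ids))
# For a large df2, a join avoids collecting everything to the driver:
#   filtered = df1.join(df2.select("id"), on="id", how="left_semi")
```

Collecting is only reasonable when the id list is small enough to fit in driver memory; otherwise the join variant in the comment is usually the safer choice.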

Stack Overflow

stackoverflow.com › questions › 57363618 › pyspark-dataframe-object-has-no-attribute-get-object-id

python - pyspark 'DataFrame' object has no attribute '_get_object_id' - Stack Overflow

Top answer

1 of 2

You can't reference a second spark DataFrame inside a function, unless you're using a join. IIUC, you can do the following to achieve your desired result.

Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+

2 of 2

-5

I think you are using Scala API, in which you use (). In PySpark, use [] instead.

Stack Overflow

stackoverflow.com › questions › 71446185 › pyspark-attributeerror-dataframe-object-has-no-attribute-cast

apache spark sql - pyspark AttributeError: 'DataFrame' object has no attribute 'cast' - Stack Overflow

Top answer

1 of 3

A short, clean, scalable solution
Change some columns, leave the rest untouched

import pyspark.sql.functions as F

# That's not part of the solution, just a creation of a sample dataframe
# df = spark.createDataFrame([(10, 1,2,3,4),(20, 5,6,7,8)],'Id int, Revenue int ,GROSS_PROFIT int ,Net_Income int ,Enterprise_Value int')

cols_to_cast = ["Revenue" ,"GROSS_PROFIT" ,"Net_Income" ,"Enterprise_Value"]
df = df.select([F.col(c).cast('double') if c in cols_to_cast else c for c in df.columns])

df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- GROSS_PROFIT: double (nullable = true)
 |-- Net_Income: double (nullable = true)
 |-- Enterprise_Value: double (nullable = true)

2 of 3

If this helps

from pyspark.sql.functions import col

df = spark.createDataFrame(
    [(1, 0), (2, 1), (3, 1), (4, 1), (5, 0), (6, 0),
     (7, 1), (8, 1), (9, 1), (10, 1), (11, 0), (12, 0)],
    ('Time', 'Tag1'))

# cast() is a Column method, so call it on col(...), not on the DataFrame
df = df.withColumn('a', col('Time').cast('integer')).withColumn('a1', col('Tag1').cast('double'))
df.printSchema()
df.show()

AWS re:Post

repost.aws › questions › QUvWrsRjenSrqHLJqLpy4DWg › attributeerror-dataframe-object-has-no-attribute-get-object-id

AttributeError: 'DataFrame' object has no attribute '_get_object_id' | AWS re:Post

October 11, 2018 - AttributeError: 'DataFrame' object has no attribute '_get_object_id' when I run the script. I'm pretty confident the error is occurring during this line: datasink = glueContext.write_dynamic_frame.from_catalog(frame = source_dynamic_frame, database = target_database, table_name = target_table_name, transformation_ctx = "datasink") but I can't decipher what it's trying to tell me. Can anyone please help me out or point me in the right direction? Thanks! %pyspark import sys from pyspark.context import SparkContext from pyspark.sql.functions import lit, current_timestamp from pyspark.sql.window i

Spark By {Examples}

sparkbyexamples.com › home › hbase › attributeerror: ‘dataframe’ object has no attribute ‘map’ in pyspark

AttributeError: 'DataFrame' object has no attribute 'map' in PySpark - Spark By {Examples}

April 3, 2021 - Problem: In PySpark I am getting error AttributeError: 'DataFrame' object has no attribute 'map' when I use map() transformation on DataFrame.
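A commonly suggested workaround is to go through the DataFrame's underlying RDD, which does have map(), and then rebuild a DataFrame from the result. The sketch below assumes that approach; `map_rows` is a hypothetical helper name, and `toDF` on an RDD assumes an active SparkSession.

```python
def map_rows(df, fn, column_names):
    """Apply fn to every Row via the underlying RDD, then rebuild a DataFrame."""
    # df.rdd exposes the rows as an RDD, which supports map();
    # toDF() turns the mapped records back into a DataFrame with the given names.
    return df.rdd.map(fn).toDF(column_names)


# Hypothetical usage, assuming df has columns 'id' and 'name':
#   upper = map_rows(df, lambda row: (row["id"], row["name"].upper()), ["id", "name"])
```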

Incorta Community

community.incorta.com › t5 › data-schemas-knowledgebase › issue-with-converting-a-pandas-dataframe-to-a-spark-dataframe › ta-p › 5279

Issue with converting a Pandas DataFrame to a Spar... - Incorta Community

November 15, 2023 - Symptoms You received the error when trying to convert a Pandas DataFrame to Spark DataFrame in a PySpark MV. Here is the error.- INC_03070101: Transformation error Error 'DataFrame' object has no attribute 'iteritems' AttributeError : 'DataFrame' object has no attribute 'iteritems' Diagnosis Since...

Stack Overflow

stackoverflow.com › questions › 38594784 › pyspark-attributeerror-dataframe-object-has-no-attribute-todf › 38622723

pyspark AttributeError: 'DataFrame' object has no attribute 'toDF' - Stack Overflow

Top answer

1 of 2

I figured it out. Looks like it has to do with our spark version. It worked with 1.6

2 of 2

If you are working with Spark 1.6, use this code to convert an RDD into a DataFrame:

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd)

If you want to name the columns, map each record to a Row first:

df = sqlContext.createDataFrame(rdd.map(lambda p: Row(ip=p[0], time=p[1], zone=p[2])))

ip, time, and zone are the column names in this example.

Cloudera Community

community.cloudera.com › t5 › Support-Questions › Pyspark-issue-AttributeError-DataFrame-object-has-no › m-p › 381546

Re: Pyspark issue AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'

Stack Overflow

stackoverflow.com › questions › 51813517 › dataframe-object-has-no-attribute-col

apache spark - DataFrame object has no attribute 'col' - Stack Overflow

Top answer

1 of 5

The book you're referring to describes Scala / Java API. In PySpark use []

df["count"]

2 of 5

The book combines the Scala and PySpark APIs.

In the Scala / Java API, df.col("column_name") or df.apply("column_name") returns the Column.

In PySpark, use either of the following to get the column from the DataFrame:

df.colName
df["colName"]

Stack Overflow

stackoverflow.com › questions › 50686616 › dataframe-object-has-no-attribute-apply-when-trying-to-apply-lambda-to-cre

python - "'DataFrame' object has no attribute 'apply'" when trying to apply lambda to create new column - Stack Overflow

Top answer

1 of 2

The syntax you are using is for a pandas DataFrame. To achieve this for a spark DataFrame, you should use the withColumn() method. This works great for a wide range of well defined DataFrame functions, but it's a little more complicated for user defined mapping functions.

General Case

In order to define a udf, you need to specify the output data type. For instance, if you wanted to apply a function my_func that returned a string, you could create a udf as follows:

import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType, StringType
my_udf = f.udf(my_func, StringType())

Then you can use my_udf to create a new column like:

df = df.withColumn('new_column', my_udf(f.col("some_column_name")))

Another option is to use select:

df = df.select("*", my_udf(f.col("some_column_name")).alias("new_column"))

Specific Problem

Using a udf

In your specific case, you want to use a dictionary to translate the values of your DataFrame.

Here is a way to define a udf for this purpose:

some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())

Notice that I used dict.get() because you want your udf to be robust to bad inputs.

df = df.withColumn('new_column', some_map_udf(f.col("some_column_name")))

Using DataFrame functions

Sometimes using a udf is unavoidable, but whenever possible, using DataFrame functions is usually preferred.

Here is one option to do the same thing without using the udf.

The trick is to iterate over the items in some_map to create a list of pyspark.sql.functions.when() functions.

some_map_func = [f.when(f.col("some_column_name") == k, v) for k, v in some_map.items()]
print(some_map_func)
#[Column<CASE WHEN (some_column_name = a) THEN 0 END>,
# Column<CASE WHEN (some_column_name = c) THEN 1 END>,
# Column<CASE WHEN (some_column_name = b) THEN 1 END>]

Now you can use pyspark.sql.functions.coalesce() inside of a select:

df = df.select("*", f.coalesce(*some_map_func).alias("new_column"))

This works because when() returns null by default if the condition is not met, and coalesce() will pick the first non-null value it encounters. Since the keys of the map are unique, at most one column will be non-null.
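The null behaviour described above can be modelled in plain Python, which may make the reasoning easier to follow. This is only an analogy, not Spark code: `py_when`, `py_coalesce`, and `translate` are hypothetical stand-ins, with None playing the role of null.

```python
def py_when(condition, value):
    """Model of when() without otherwise(): value if the condition holds, else None."""
    return value if condition else None


def py_coalesce(*values):
    """Model of coalesce(): the first argument that is not None, else None."""
    return next((v for v in values if v is not None), None)


def translate(x, some_map):
    # One py_when() per dictionary key; since keys are unique, at most one is non-None.
    return py_coalesce(*[py_when(x == k, v) for k, v in some_map.items()])
```

With the map from the printed output above (a to 0, c to 1, b to 1), a value that matches no key falls through every branch and comes back as None, which corresponds to a null in the new column.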

2 of 2

You have a Spark DataFrame, not a pandas DataFrame. To add a new column to the Spark DataFrame:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
df = df.withColumn('new_column', F.udf(some_map.get, IntegerType())(F.col("some_column_name")))
df.show()

Stack Overflow

stackoverflow.com › questions › 73334906 › attributeerror-dataframe-object-has-no-attribute-dtype-error-in-pyspark

python - AttributeError: 'DataFrame' object has no attribute 'dtype' error in pyspark - Stack Overflow

Top answer

1 of 1

I faced the same problem, in my case it was because I had duplicate column names after the join.

I see you have report_date and marketplaceid in both dataframes. For each duplicated pair, you need to either drop one or both, or rename one of them.
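One way to spot the clashing names before joining is to compare the two column lists. The helper below is a sketch under that assumption; `overlapping_columns` is a hypothetical name, and `left` / `right` in the comments stand for the two DataFrames being joined.

```python
def overlapping_columns(left_columns, right_columns, join_keys=()):
    """Names present on both sides of a join, excluding the listed join keys."""
    right = set(right_columns)
    return [c for c in left_columns if c in right and c not in join_keys]


# Hypothetical usage, assuming a join on an 'id' column passed by name:
#   for c in overlapping_columns(left.columns, right.columns, join_keys=["id"]):
#       right = right.withColumnRenamed(c, c + "_right")
#   joined = left.join(right, on="id")
```

Join keys are excluded only because joining with `on=` and a column name keeps a single copy of that key; if the join uses an expression instead, leave `join_keys` empty so the keys are reported too.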

Stack Overflow

stackoverflow.com › questions › 54347570 › attribute-error-pyspark-dataframe-error

apache spark sql - Attribute error? Pyspark dataframe() error - Stack Overflow

Top answer

1 of 1

It means that you have an object of type DataFrame, and you are trying to access an attribute named dataframe, which does not exist on it.

JetBrains

intellij-support.jetbrains.com › hc › en-us › community › posts › 360003244439-Error-viewing-pyspark-DataFrame

Error viewing pyspark DataFrame – IDEs Support (IntelliJ Platform) | JetBrains

Traceback (most recent call last): ... File "C:\anaconda\envs\py36ml\lib\site-packages\pyspark\sql\dataframe.py", line 1300, in __getattr__ "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) AttributeError: 'DataFrame' object has no attribute 'axe...
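The traceback shows the message being raised from `__getattr__` in dataframe.py. As far as I understand, PySpark uses that hook to let `df.some_column` resolve to a column, and raises this AttributeError when the name is not a column either. The class below is a toy model of that behaviour, not PySpark's real implementation; `ToyDataFrame` and its string "columns" are made up for illustration.

```python
class ToyDataFrame:
    """Toy model of attribute lookup on a DataFrame; not PySpark's real class."""

    def __init__(self, columns):
        self.columns = list(columns)

    def count(self):
        # A real method: normal attribute lookup finds it before __getattr__ runs.
        return 0

    def __getitem__(self, name):
        if name not in self.columns:
            raise KeyError(name)
        return "Column<%s>" % name

    def __getattr__(self, name):
        # Called only when normal lookup fails; fall back to the column names.
        if name not in self.columns:
            raise AttributeError(
                "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)
            )
        return self[name]
```

In this model a column called `count` is hidden behind the `count` method when accessed as an attribute, which is one reason bracket access such as `df["count"]` is the safer habit.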

Stack Overflow

stackoverflow.com › questions › 75926636 › databricks-issue-while-creating-spark-data-frame-from-pandas

python - Databricks: Issue while creating spark data frame from pandas - Stack Overflow

Top answer

1 of 5

It's related to the Databricks Runtime (DBR) version used: the Spark versions shipped with DBR up to 12.2 rely on the .iteritems function to construct a Spark DataFrame from a Pandas DataFrame, and Pandas 2.0 removed that function. This issue was fixed in Spark 3.4, which is available as DBR 13.x.

If you can't upgrade to DBR 13.x, then you need to downgrade Pandas to the latest 1.x version (1.5.3 right now) by running %pip install -U pandas==1.5.3 in your notebook. That said, it's better to use the Pandas version shipped with your DBR, since it was tested for compatibility with the other packages in DBR.

2 of 5

I couldn't change package versions, but it looks like this was a name change only.

So I did

df.iteritems = df.items

and spark.createDataFrame(df) works now.

Sure, it's ugly, and it will break my notebook when I move to a cluster with a new DBR, but it works for now.

EDIT: AyoubH's answer is better because you only have to do it once. With the code above, you have to modify every data frame you display.
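One way to make the rename a do-once step is to patch the class rather than each data frame. The sketch below is an assumption about how that could look, not a quote from any answer; `alias_missing_method` is a hypothetical helper, and it only adds the alias when the old name is actually missing.

```python
def alias_missing_method(cls, old_name, new_name):
    """Give cls an attribute old_name pointing at new_name, if old_name is absent."""
    if not hasattr(cls, old_name):
        setattr(cls, old_name, getattr(cls, new_name))
        return True   # alias added
    return False      # old_name already exists; nothing to do


# Hypothetical usage, run once near the top of the notebook:
#   import pandas as pd
#   alias_missing_method(pd.DataFrame, "iteritems", "items")
```

Guarding with hasattr means the call does nothing on a cluster whose Pandas still ships iteritems, so it should not need removing after an upgrade.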

Hail Discussion

discuss.hail.is › help [0.1]

AttributeError: 'DataFrame' object has no attribute 'to_spark' - Help [0.1] - Hail Discussion

July 22, 2018 - I am trying to convert a Hail table to a pandas dataframe: kk2 = hl.Table.to_pandas(table1) # convert to pandas I am not sure why I am getting this error: --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in 1 kk2 = ...

Stack Overflow

stackoverflow.com › questions › 72010538 › attributeerror-dataframe-object-has-no-attribute-select

AttributeError: 'DataFrame' object has no attribute 'select' - Stack Overflow

Top answer

1 of 1

Convert the pandas DataFrame to a Spark DataFrame so that you can call select on it:

df = spark.createDataFrame(data)
df.select("box").show()

Databricks Community

community.databricks.com › t5 › data-engineering › attributeerror-dataframe-object-has-no-attribute › td-p › 61132

AttributeError: 'DataFrame' object has no attribut... - Databricks Community - 61132

February 19, 2024 - Hello, I have some trouble deduplicating rows on the "id" column, with the method "dropDuplicatesWithinWatermark" in a pipeline. When I run this pipeline, I get the error message: "AttributeError: 'DataFrame' object has no attribute 'dropDuplicatesWithinWatermark'" Here is part of the code: @dl...

JetBrains

youtrack.jetbrains.com › issue › PY-37227

PyCharm debugger is confusing PySpark DataFrame with ...

April 27, 2021