Pandas API on Spark is supported from Spark 3.2.0 (https://issues.apache.org/jira/browse/SPARK-34849), but the method you are trying to call on the DataFrame was implemented later and introduced in version 3.3.0.

It was introduced with ticket https://issues.apache.org/jira/browse/SPARK-37337 in commit https://github.com/apache/spark/commit/bc7d55fc1046a55df61fdb380629699e9959fcc6.

That commit changes the naming and the deprecation/undeprecation of the methods. From the PR description:

### What changes were proposed in this pull request?
The PR is proposed to:

- Undeprecate (Spark)DataFrame.to_koalas

- Deprecate (Spark)DataFrame.to_pandas_on_spark and introduce (Spark)DataFrame.pandas_api instead.

### Why are the changes needed?
Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and inconvenient to call.
With the proposal of the PR, we may improve the user experience and make APIs more developer-friendly.

### Does this PR introduce _any_ user-facing change?
Yes.

(Spark)DataFrame.pandas_api is introduced.
(Spark)DataFrame.to_pandas_on_spark is deprecated.
(Spark)DataFrame.to_koalas is undeprecated.

For Spark 3.2.1 you can check these:

type(df.to_koalas())

/databricks/spark/python/pyspark/sql/dataframe.py:2964: FutureWarning: DataFrame.to_koalas is deprecated. Use DataFrame.to_pandas_on_spark instead.
  warnings.warn(
Out[5]: pyspark.pandas.frame.DataFrame

or this one:

type(df.to_pandas_on_spark())

Out[6]: pyspark.pandas.frame.DataFrame
Answer from Paweł Tajs on Stack Overflow
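
As a side note: on Spark 3.3.0 and later the renamed method is available directly. A minimal sketch, assuming an active SparkSession named spark:

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
psdf = df.pandas_api()  # replaces to_pandas_on_spark / to_koalas
type(psdf)
# pyspark.pandas.frame.DataFrame
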
Top answer (1 of 2, score 7)

The syntax you are using is for a pandas DataFrame. To achieve this with a Spark DataFrame, you should use the withColumn() method. This works well for a wide range of well-defined DataFrame functions, but it is a little more complicated for user-defined mapping functions.

General Case

In order to define a udf, you need to specify the output data type. For instance, if you wanted to apply a function my_func that returned a string, you could create a udf as follows:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType

my_udf = f.udf(my_func, StringType())

Then you can use my_udf to create a new column like:

df = df.withColumn('new_column', my_udf(f.col("some_column_name")))

Another option is to use select:

df = df.select("*", my_udf(f.col("some_column_name")).alias("new_column"))

Specific Problem

Using a udf

In your specific case, you want to use a dictionary to translate the values of your DataFrame.

Here is a way to define a udf for this purpose:

from pyspark.sql.types import IntegerType

some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())

Notice that I used dict.get() because you want your udf to be robust to bad inputs.

df = df.withColumn('new_column', some_map_udf(f.col("some_column_name")))
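
For a concrete run, here is the same udf with a sample some_map (values chosen to match the CASE WHEN output shown further below); keys missing from the map come back as null:

some_map = {'a': 0, 'c': 1, 'b': 1}  # example mapping, assumed for illustration
some_map_udf = f.udf(lambda x: some_map.get(x, None), IntegerType())

df = spark.createDataFrame([('a',), ('b',), ('x',)], ['some_column_name'])
df.withColumn('new_column', some_map_udf(f.col('some_column_name'))).show()
# 'x' is not a key of some_map, so dict.get() returns None, i.e. null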

Using DataFrame functions

Sometimes using a udf is unavoidable, but whenever possible, using DataFrame functions is usually preferred.

Here is one option to do the same thing without using the udf.

The trick is to iterate over the items in some_map to create a list of pyspark.sql.functions.when() functions.

some_map_func = [f.when(f.col("some_column_name") == k, v) for k, v in some_map.items()]
print(some_map_func)
#[Column<CASE WHEN (some_column_name = a) THEN 0 END>,
# Column<CASE WHEN (some_column_name = c) THEN 1 END>,
# Column<CASE WHEN (some_column_name = b) THEN 1 END>]

Now you can use pyspark.sql.functions.coalesce() inside of a select:

df = df.select("*", f.coalesce(*some_map_func).alias("new_column"))

This works because when() returns null by default if the condition is not met, and coalesce() will pick the first non-null value it encounters. Since the keys of the map are unique, at most one column will be non-null.
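
A quick check with the same sample data as above:

df = spark.createDataFrame([('a',), ('b',), ('x',)], ['some_column_name'])
df.select('*', f.coalesce(*some_map_func).alias('new_column')).show()
# +----------------+----------+
# |some_column_name|new_column|
# +----------------+----------+
# |               a|         0|
# |               b|         1|
# |               x|      null|
# +----------------+----------+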

Answer 2 of 2 (score 1)

You have a Spark DataFrame, not a pandas DataFrame. To add a new column to the Spark DataFrame:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

df = df.withColumn('new_column', F.udf(some_map.get, IntegerType())(F.col('some_column_name')))
df.show()
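
Not from either answer, but worth noting: a udf-free alternative for literal dictionaries is pyspark.sql.functions.create_map, which builds a map column and looks values up directly (missing keys become null):

from itertools import chain
import pyspark.sql.functions as F

# flatten {'a': 0, ...} into lit('a'), lit(0), ... key/value pairs
mapping_expr = F.create_map([F.lit(x) for x in chain(*some_map.items())])
df = df.withColumn('new_column', mapping_expr[F.col('some_column_name')])
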
Top answer (1 of 5, score 2)

"sklearn.datasets" is a scikit package, where it contains a method load_iris().

load_iris(), by default return an object which holds data, target and other members in it. In order to get actual values you have to read the data and target content itself.

Whereas 'iris.csv', holds feature and target together.

FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.

from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)
Answer 2 of 5 (score 1)

The Iris Dataset from Sklearn is in Sklearn's Bunch format:

from sklearn import datasets

iris = datasets.load_iris()  # load the Bunch object
print(type(iris))
print(iris.keys())

output:

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

So, that's why you can access it as:

x=iris.data
y=iris.target
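
If you prefer to work with the Bunch as a pandas DataFrame, a common construction (the column name 'species' is my choice here) is:

import pandas as pd

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target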

But when you read the CSV file into a DataFrame the way you did:

iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()

output is:

    2             3
0   petal_length  petal_width
1   1.4           0.2
2   1.4           0.2
3   1.3           0.2
4   1.5           0.2

Here the column names are '2' and '3', and the real headers ('petal_length' and 'petal_width') have ended up in the first data row.

First of all you should read the CSV file as:

df = pd.read_csv('iris.csv')

You should not pass header=None, since your CSV file already includes the column names, i.e. the headers.

So, now what you can do is something like this:

X = df.iloc[:, [2, 3]]  # columns 2 and 3, i.e. 'petal_length' and 'petal_width'
y = df.iloc[:, 4]       # the label column, i.e. 'species'

or if you want to use the column names then:

X = df[['petal_length', 'petal_width']]
y = df['species']

Also, if you want to convert the labels from strings to a numerical format, use sklearn's LabelEncoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
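
Putting the CSV route together end to end (file name as in the question):

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('iris.csv')
X = df[['petal_length', 'petal_width']]
le = preprocessing.LabelEncoder()
y = le.fit_transform(df['species'])  # e.g. array([0, 0, 1, ..., 2])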