sqlContext is missing; it needs to be created. The following code works:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(["ind", "state"])

a.show()
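
For reference, a.show() prints the DataFrame built above as a plain-text table:

+---+-----+
|ind|state|
+---+-----+
|  1|    a|
|  2|    b|
|  3|    c|
|  4|    d|
|  5|    e|
+---+-----+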

Edit:

In Spark 2.0+, the same can be achieved with a SparkSession:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()

a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ["ind", "state"])
a.show()
Answer from Akavall on Stack Overflow
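
A note on the Spark 2.0+ version: the SparkSession wraps a SparkContext, so RDD-based code still works without creating one yourself. A minimal sketch:

sc = spark.sparkContext  # the underlying SparkContext created by getOrCreate()
b = sc.parallelize([[1, "a"], [2, "b"]]).toDF(["ind", "state"])
b.show()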

Top answer (1 of 5):

"sklearn.datasets" is a scikit package, where it contains a method load_iris().

By default, load_iris() returns a Bunch object that holds the data, the target, and other members. To get the actual values, you have to read its data and target attributes.

The file 'iris.csv', by contrast, holds the features and the target together.

FYI: if you set return_X_y=True in load_iris(), you get the features and the target directly:

from sklearn import datasets
data, target = datasets.load_iris(return_X_y=True)
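
As a quick sanity check (the iris dataset has 150 samples and 4 features):

print(data.shape)    # (150, 4)
print(target.shape)  # (150,)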

Answer 2 of 5:

The Iris Dataset from Sklearn is in Sklearn's Bunch format:

print(type(iris))
print(iris.keys())

output:

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

So, that's why you can access it as:

x=iris.data
y=iris.target
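
If you also want those Bunch contents as a DataFrame, a minimal sketch using the Bunch's own fields:

import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)  # feature columns
iris_df['target'] = iris.target                                # numeric labels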

But when you read the CSV file into a DataFrame the way you did:

iris = pd.read_csv('iris.csv', header=None).iloc[:, 2:4]
iris.head()

output is:

              2            3
0  petal_length  petal_width
1           1.4          0.2
2           1.4          0.2
3           1.3          0.2
4           1.5          0.2

Here the column labels are 2 and 3, and the real header row ('petal_length', 'petal_width') has ended up as the first row of data.

First of all, you should read the CSV file as:

df = pd.read_csv('iris.csv')

You should not pass header=None, because your CSV file includes the column names, i.e. the headers.

So, now what you can do is something like this:

X = df.iloc[:, [2, 3]]  # columns 2 and 3, i.e. 'petal_length' and 'petal_width'
y = df.iloc[:, 4]       # the label column, i.e. 'species'

Or, if you want to use the column names:

X = df[['petal_length', 'petal_width']]
y = df['species']

Also, if you want to convert the labels from strings to numbers, use sklearn's LabelEncoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
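
Assuming the usual three species labels in the CSV, LabelEncoder sorts the classes and maps them to 0, 1 and 2:

print(le.classes_)  # e.g. ['setosa' 'versicolor' 'virginica']
print(y[:5])        # e.g. [0 0 0 0 0]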