sqlContext is missing; it needs to be created. The following code works:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark import sql
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
a.show()
Edit:
In Spark 2.0, the above can be achieved with:
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()
a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
a.show()
If a is an RDD, you can convert it to a DataFrame directly (provided a SparkSession already exists):
a_df = a.toDF()
type(a_df)
Initialize a SparkSession by passing in the SparkContext.
Example:
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
conf = SparkConf().setMaster("local").setAppName("Dataframe_examples")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
def parsedLine(line):
    fields = line.split(',')
    movieId = fields[0]
    movieName = fields[1]
    genres = fields[2]
    return movieId, movieName, genres
movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
# or, equivalently, using spark.sparkContext:
movies = spark.sparkContext.textFile("file:///home/ajit/ml-25m/movies.csv")
parsedLines = movies.map(parsedLine)
print(parsedLines.count())
dataFrame = parsedLines.toDF(["movieId", "movieName", "genres"])
dataFrame.printSchema()
Alternatively, use the SparkSession to turn the RDD into a DataFrame as follows:
movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
parsedLines = movies.map(parsedLine)
print(parsedLines.count())
spark = SparkSession.builder.getOrCreate()
dataFrame = spark.createDataFrame(parsedLines, ["movieId", "movieName", "genres"])
dataFrame.printSchema()
Or create the SparkSession first and take the SparkContext from it:
spark = SparkSession.builder.master("local").appName("Dataframe_examples").getOrCreate()
sc = spark.sparkContext
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that with spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased to spark_df.rdd.map(); since Spark 2.0, you must explicitly call .rdd first.
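For example, a minimal sketch (the toy ind/state DataFrame mirrors the one created earlier in this thread):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
spark_df = spark.createDataFrame([(1, "a"), (2, "b")], ["ind", "state"])

# DataFrame has no .map in Spark 2.x; drop down to the underlying RDD of Rows
mapped = spark_df.rdd.map(lambda row: (row.ind * 10, row.state.upper()))
print(mapped.collect())
## [(10, 'A'), (20, 'B')]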
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:
Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and PySpark transformations become less efficient.
What should you do instead?
Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
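For instance, a per-row transformation that might tempt you toward map can usually be written with select or selectExpr instead (a small sketch with a made-up name column):
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper

spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame([("James",), ("Michael",)], ["name"])

# two equivalent ways to stay inside the DataFrame API
df.select(upper(df["name"]).alias("upper_name")).show()
df.selectExpr("upper(name) AS upper_name").show()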
Another example is using explode instead of flatMap (which existed on RDDs):
from pyspark.sql.functions import explode
df.select("name", explode("knownLanguages")).show(truncate=False)
Result:
+-------+------+
|name |col |
+-------+------+
|James |Java |
|James |Scala |
|Michael|Spark |
|Michael|Java |
|Michael|null |
|Robert |CSharp|
|Robert | |
+-------+------+
You can also use withColumn or a UDF, depending on the use case, or another option from the DataFrame API.
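As a sketch of that last point (the shout helper is hypothetical; prefer built-in functions over Python UDFs when one exists):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame([("James",), ("Michael",)], ["name"])

# withColumn adds a derived column; the UDF runs arbitrary Python per row
shout = udf(lambda s: s.upper() + "!", StringType())
df.withColumn("greeting", shout(col("name"))).show()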
Check your DataFrame with data.columns.
It should print something like this:
Index([u'regiment', u'company', u'name', u'postTestScore'], dtype='object')
Check for hidden white spaces. Then you can rename the column with:
data = data.rename(columns={'Number ': 'Number'})
I suspect the column name that contains "Number" is actually something like " Number" or "Number ", i.e. it has a residual space. Run print("<{}>".format(data.columns[1])) and see what you get. If it prints something like < Number>, it can be fixed with:
data.columns = data.columns.str.strip()
See pandas.Series.str.strip
In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, is caused because . notation has been used to reference a nonexistent column name or pandas method.
pandas methods are accessed with a dot (.). pandas columns can also be accessed with a dot (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).
data.columns = data.columns.str.strip() is a quick way to remove leading and trailing spaces from all column names. Otherwise, verify that the column or attribute name is spelled correctly.
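A quick end-to-end sketch of the diagnosis and fix (the ' Number ' column is invented for illustration):
import pandas as pd

data = pd.DataFrame({' Number ': [1, 2], 'name': ['a', 'b']})
print(data.columns.tolist())             # [' Number ', 'name'] -- hidden spaces
data.columns = data.columns.str.strip()  # remove leading/trailing whitespace
print(data.columns.tolist())             # ['Number', 'name']
print(data.Number)                       # attribute access now works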
The toDF method is a monkey patch applied inside the SparkSession constructor (the SQLContext constructor in 1.x), so to be able to use it you have to create a SQLContext (or SparkSession) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False
spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True
rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## | a| 1|
## +---+---+
Not to mention you need a SQLContext or SparkSession to work with DataFrames in the first place.
Make sure you have a SparkSession as well:
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local", "first app")
spark = SparkSession(sc)
"sklearn.datasets" is a scikit package, where it contains a method load_iris().
load_iris(), by default return an object which holds data, target and other members in it. In order to get actual values you have to read the data and target content itself.
Whereas 'iris.csv', holds feature and target together.
FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.
from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)
The Iris Dataset from Sklearn is in Sklearn's Bunch format:
print(type(iris))
print(iris.keys())
output:
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
So, that's why you can access it as:
x=iris.data
y=iris.target
But when you read the CSV file as DataFrame as mentioned by you:
iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()
output is:
              2            3
0  petal_length  petal_width
1           1.4          0.2
2           1.4          0.2
3           1.3          0.2
4           1.5          0.2
Here the column names are '2' and '3'.
First of all you should read the CSV file as:
df = pd.read_csv('iris.csv')
You should not include header=None, since your CSV file includes the column names, i.e. the headers.
So, now what you can do is something like this:
X = df.iloc[:, [2, 3]]  # columns 2 and 3, i.e. 'petal_length' and 'petal_width'
y = df.iloc[:, 4]       # the label column, i.e. 'species'
or if you want to use the column names then:
X = df[['petal_length', 'petal_width']]
y = df['species']
Also, if you want to convert the labels from string to numerical format, use sklearn's LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
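Continuing from the snippet above, le.classes_ records the string-to-integer mapping (the class at index i encodes as i), and inverse_transform recovers the original labels:
print(le.classes_)                      # e.g. ['setosa' 'versicolor' 'virginica']
print(le.inverse_transform([0, 1, 2]))  # back to the original strings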