'dataframe' object has no attribute 'printschema' pyspark

printSchema() not working for dataframe created from pandas (using Python)

stackoverflow.com › questions › 75791256 › printschema-not-working-for-dataframe-created-from-pandas-using-python

You can use df.info() (or df.dtypes) to get the schema of a pandas DataFrame; printSchema() exists only on Spark DataFrames.
Yes, there is a difference between a pandas DataFrame and a Spark DataFrame. There is even a pandas-on-Spark DataFrame.

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
https://spark.apache.org/docs/3.2.1/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.spark.frame.html

Answer from MC10 on Stack Overflow
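
As a rough illustration of the answer above: a pandas DataFrame exposes its schema through dtypes and info(), while printSchema() is only defined on Spark DataFrames. The sample data and the variable name pdf are made up for this sketch, and the Spark conversion appears only as a comment because it assumes an active SparkSession named spark.

```python
import io
import pandas as pd

# Hypothetical sample data, only for illustration.
pdf = pd.DataFrame({"age": [14, 23, 16], "name": ["Tom", "Alice", "Bob"]})

# pandas has no printSchema(); calling it would raise AttributeError.
has_print_schema = hasattr(pdf, "printSchema")
print(has_print_schema)

# pandas equivalents for inspecting the schema.
column_types = {name: str(dtype) for name, dtype in pdf.dtypes.items()}
print(column_types)

buffer = io.StringIO()
pdf.info(buf=buffer)  # info() writes a summary to the buffer and returns None
print(buffer.getvalue())

# Assuming an active SparkSession called `spark`, converting first would
# give an object that does have printSchema():
#   spark.createDataFrame(pdf).printSchema()
```

If the error appears on an object you expected to be a Spark DataFrame, printing type(df) is a quick way to see which library it actually came from.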

Stack Overflow

stackoverflow.com › questions › 38134643 › how-to-resolve-attributeerror-dataframe-object-has-no-attribute

python - How to resolve AttributeError: 'DataFrame' object has no attribute - Stack Overflow

Top answer

1 of 7

Check your DataFrame with data.columns

It should print something like this

Index([u'regiment', u'company',  u'name',u'postTestScore'], dtype='object')

Check for hidden whitespace. Then you can rename the column with:

data = data.rename(columns={'Number ': 'Number'})

2 of 7

I think the column name that contains "Number" is something like " Number" or "Number ". I'm assuming you might have a residual space in the column name. Please run print("<{}>".format(data.columns[1])) and see what you get. If it's something like < Number>, it can be fixed with:

data.columns = data.columns.str.strip()

See pandas.Series.str.strip

In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, occurs because dot notation was used to reference a nonexistent column name or pandas method.

pandas methods are accessed with dot notation. pandas columns can also be accessed with dot notation (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).

data.columns = data.columns.str.strip() is a quick way to remove leading and trailing spaces from all column names. Otherwise, verify that the column or attribute name is spelled correctly.
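
A small sketch of the whitespace problem described above, using a made-up frame whose second column is named "Number " with a trailing space. The data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical frame whose second column name has a trailing space.
data = pd.DataFrame({"regiment": ["Nighthawks"], "Number ": [4]})

# repr() makes the stray whitespace visible.
print([repr(c) for c in data.columns])

try:
    data.Number  # attribute access fails: the real name is "Number "
    lookup_failed = False
except AttributeError:
    lookup_failed = True

# Strip leading and trailing whitespace from every column name.
data.columns = data.columns.str.strip()
cleaned_columns = list(data.columns)
number_values = data["Number"].tolist()
print(cleaned_columns, number_values)
```

Bracket access (data["Number"]) is the safer habit in general, since it also works for names that contain spaces or clash with DataFrame methods.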

Stack Overflow

stackoverflow.com › questions › 39671564 › print-out-types-of-data-frame-columns-in-spark

pyspark - Print out types of data frame columns in Spark - Stack Overflow

Top answer

1 of 2

df.printSchema() prints the DataFrame schema in an easy-to-follow format.

2 of 2

Try:

>>> for name, dtype in df.dtypes:
...     print(name, dtype)

>>> df.schema

Databricks Community

community.databricks.com › t5 › data-engineering › attributeerror-dataframe-object-has-no-attribute › td-p › 61132

AttributeError: 'DataFrame' object has no attribut... - Databricks Community - 61132

February 19, 2024 - Hello, I have some trouble deduplicating rows on the "id" column, with the method "dropDuplicatesWithinWatermark" in a pipeline. When I run this pipeline, I get the error message: "AttributeError: 'DataFrame' object has no attribute 'dropDuplicatesWithinWatermark'" Here is part of the code: @dl...

Apache

spark.apache.org › docs › latest › api › python › reference › pyspark.sql › api › pyspark.sql.DataFrame.printSchema.html

pyspark.sql.DataFrame.printSchema — PySpark 4.1.1 documentation

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

Cumulative Sum

cumsum.wordpress.com › 2020 › 10 › 10 › pyspark-attributeerror-dataframe-object-has-no-attribute-_get_object_id

[pyspark] AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’

October 10, 2020 - AttributeError: ‘DataFrame’ object has no attribute ‘_get_object_id’ · The reason being that isin expects actual local values or collections but df2.select('id') returns a data frame.

Cloudera Community

community.cloudera.com › t5 › Support-Questions › Pyspark-issue-AttributeError-DataFrame-object-has-no › td-p › 78093

Pyspark issue AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'

January 2, 2024 - #%% import findspark ... example8.saveAsTextFile("juyfd") main() ... As the error message states, the object, either a DataFrame or List does not have the saveAsTextFile() method....

Spark By {Examples}

sparkbyexamples.com › home › hbase › attributeerror: ‘dataframe’ object has no attribute ‘map’ in pyspark

AttributeError: 'DataFrame' object has no attribute 'map' in PySpark - Spark By {Examples}

March 27, 2024

data = [('James',3000),('Anna',4001),('Robert',6200)]
df = spark.createDataFrame(data,["name","salary"])
df.show()

# convert the DataFrame to an RDD
rdd=df.rdd
print(rdd.collect())

# apply the map() transformation
rdd2=df.rdd.map(lambda x: [x[0],x[1]*20/100])
print(rdd2.collect())

# convert the RDD back to a DataFrame
df2=rdd2.toDF(["name","bonus"])
df2.show()

Stack Exchange

datascience.stackexchange.com › questions › 37435 › i-got-the-following-error-dataframe-object-has-no-attribute-data

python - I got the following error : 'DataFrame' object has no attribute 'data' - Data Science Stack Exchange

Top answer

1 of 5

"sklearn.datasets" is a scikit package, where it contains a method load_iris().

load_iris(), by default return an object which holds data, target and other members in it. In order to get actual values you have to read the data and target content itself.

Whereas 'iris.csv', holds feature and target together.

FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.

from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)

2 of 5

The Iris Dataset from Sklearn is in Sklearn's Bunch format:

print(type(iris))
print(iris.keys())

output:

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

So, that's why you can access it as:

x=iris.data
y=iris.target

But when you read the CSV file as DataFrame as mentioned by you:

iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()

output is:

    2   3
0   petal_length    petal_width
1   1.4 0.2
2   1.4 0.2
3   1.3 0.2
4   1.5 0.2

Here the column names are the integers 2 and 3, and the real header row has become the first data row.

First of all you should read the CSV file as:

df = pd.read_csv('iris.csv')

You should not pass header=None, because your CSV file already includes the column names, i.e. the headers.

So, now what you can do is something like this:

X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'

or if you want to use the column names then:

X = df[['petal_length', 'petal_width']]
y = df['species']

Also, if you want to convert the labels from string to numerical format, use sklearn's LabelEncoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
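
To make the header=None point above concrete, here is a minimal sketch that reads a tiny in-memory CSV both ways. The three-column sample text is a made-up stand-in for iris.csv, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for iris.csv, header line included.
csv_text = "sepal_length,petal_length,species\n5.1,1.4,setosa\n4.9,1.3,setosa\n"

# header=None treats the header line as data and numbers the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns), len(no_header))

# The default uses the first line as the column names.
with_header = pd.read_csv(io.StringIO(csv_text))
print(list(with_header.columns), len(with_header))

# Select by name with []; iloc only accepts integer positions.
X = with_header[["sepal_length", "petal_length"]]
y = with_header["species"]
```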

Hail Discussion

discuss.hail.is › help [0.1]

AttributeError: 'DataFrame' object has no attribute 'to_spark' - Help [0.1] - Hail Discussion

July 22, 2018 - I am trying to convert a Hail table to a pandas dataframe:

kk2 = hl.Table.to_pandas(table1) # convert to pandas

I am not sure why I am getting this error:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
1 kk2 = hl.Table.to_pandas(table1) # convert to pandas
/home/hail/hail.zip/hail/typecheck/check.py in wrapper(*args, **kwargs)
545 ...

Stack Overflow

stackoverflow.com › questions › 57363618 › pyspark-dataframe-object-has-no-attribute-get-object-id

python - pyspark 'DataFrame' object has no attribute '_get_object_id' - Stack Overflow

Top answer

1 of 2

You can't reference a second spark DataFrame inside a function, unless you're using a join. IIUC, you can do the following to achieve your desired result.

Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition

from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            (df["col1"].isNull()), 
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+

But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df.withColumn(
    "col1",
    when(
        col("col1").isNull(), 
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+

2 of 2

I think you are using Scala API, in which you use (). In PySpark, use [] instead.

Incorta Community

community.incorta.com › t5 › data-schemas-knowledgebase › issue-with-converting-a-pandas-dataframe-to-a-spark-dataframe › ta-p › 5279

Issue with converting a Pandas DataFrame to a Spar... - Incorta Community

November 15, 2023 - Symptoms You received the error when trying to convert a Pandas DataFrame to Spark DataFrame in a PySpark MV. Here is the error.- INC_03070101: Transformation error Error 'DataFrame' object has no attribute 'iteritems' AttributeError : 'DataFrame' object has no attribute 'iteritems' Diagnosis Since...

AWS re:Post

repost.aws › questions › QUvWrsRjenSrqHLJqLpy4DWg › attributeerror-dataframe-object-has-no-attribute-get-object-id

AttributeError: 'DataFrame' object has no attribute '_get_object_id' | AWS re:Post

October 11, 2018

%pyspark
import sys
from pyspark.context import SparkContext
from pyspark.sql.functions import lit, current_timestamp
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType, StructType, StructField, LongType
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

def dfZipWithIndex (df, offset=1, colName="RID"):
    '''
    Enumerates dataframe rows in native order, like rdd.ZipWithIndex(), but on a dataframe and preserves a schema
    :param df:

Stack Overflow

stackoverflow.com › questions › 61343134 › describe-vs-printschema-methods-on-dataframe

apache spark - describe vs printSchema methods on DataFrame - Stack Overflow

>>> df2.describe()
DataFrame[summary: string, name: string, course: string, score: string]
>>> df2.describe
<bound method DataFrame.describe of DataFrame[name: string, course: string, score: int]>
...
>>> df2.printSchema()
root
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- score: integer (nullable = true)

Stack Overflow

stackoverflow.com › questions › 71446185 › pyspark-attributeerror-dataframe-object-has-no-attribute-cast

apache spark sql - pyspark AttributeError: 'DataFrame' object has no attribute 'cast' - Stack Overflow

Top answer

1 of 3

A short, clean, scalable solution
Change some columns, leave the rest untouched

import pyspark.sql.functions as F

# That's not part of the solution, just a creation of a sample dataframe
# df = spark.createDataFrame([(10, 1,2,3,4),(20, 5,6,7,8)],'Id int, Revenue int ,GROSS_PROFIT int ,Net_Income int ,Enterprise_Value int')

cols_to_cast = ["Revenue" ,"GROSS_PROFIT" ,"Net_Income" ,"Enterprise_Value"]
df = df.select([F.col(c).cast('double') if c in cols_to_cast else c for c in df.columns])

df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- GROSS_PROFIT: double (nullable = true)
 |-- Net_Income: double (nullable = true)
 |-- Enterprise_Value: double (nullable = true)

2 of 3

If this helps

from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 0),
(2, 1),
(3  ,1),
(4, 1),
(5, 0),
(6  ,0),
(7, 1),
(8  ,1),
(9  ,1),
(10,    1),
(11,    0),
(12,    0)],
('Time' ,'Tag1'))

df = df.withColumn('a', col('Time').cast('integer')).withColumn('a1', col('Tag1').cast('double'))
df.printSchema()
df.show()

GitHub

github.com › microsoft › FLAML › issues › 625

AttributeError: 'DataFrame' object has no attribute 'copy' · Issue #625 · microsoft/FLAML

July 2, 2022 - I m using autoML(FLAML) with Spark on large data. The error image is given below train = spark.read.parquet("./train.parquet") test = spark.read.parquet("./test.parquet") input_cols = [c for c in train.columns if c != 'target'] vectorAss...

Author Shafi2016

Spark By {Examples}

sparkbyexamples.com › home › pyspark › pyspark printschema() example

PySpark printSchema() Example - Spark By {Examples}

May 17, 2024 - The printSchema() method in PySpark is a very helpful function used to display the schema of a DataFrame in a readable hierarchy format. This method

Cumulative Sum

cumsum.wordpress.com › 2020 › 09 › 26 › pyspark-attributeerror-nonetype-object-has-no-attribute

[pyspark] AttributeError: ‘NoneType’ object has no attribute

February 25, 2021 - In pyspark, however, it's pretty common for a beginner to make the following mistake, i.e. assign a data frame to a variable after calling show method on it, and then try to use it somewhere else…
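
The mistake described above can be sketched without Spark. FakeFrame below is a hypothetical stand-in written for this example, not a PySpark class; it only mimics the fact that DataFrame.show() prints its output and returns None.

```python
class FakeFrame:
    """Hypothetical stand-in for a Spark DataFrame; not a real PySpark class."""

    def __init__(self, rows):
        self.rows = rows

    def show(self):
        # Like DataFrame.show(), this prints and implicitly returns None.
        for row in self.rows:
            print(row)

    def count(self):
        return len(self.rows)


frame = FakeFrame([("Tom", 14), ("Alice", 23)])

# Mistake: the variable holds the return value of show(), which is None.
result = frame.show()
try:
    result.count()
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Fix: keep the frame in the variable and call show() separately.
kept = frame
kept.show()
row_count = kept.count()
```

With a real Spark DataFrame the same pattern applies: assign the DataFrame first, then call show() on a separate line.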

Stack Overflow

stackoverflow.com › questions › 74899640 › attributeerror-dataframewriter-object-has-no-attribute-schema

pyspark - AttributeError: 'DataFrameWriter' object has no attribute 'schema' - Stack Overflow

Top answer

1 of 1

As you may have already guessed, you can fix the code by removing .schema(my_schema), as shown below:

my_spark_df.write.format("delta").save(my_path)

I think you are confused about where the schema applies. The schema is specified when the DataFrame is created (for example, from a dummy Seq or RDD). DataFrameWriter has no option to provide a schema; it uses the schema of the DataFrame on which the writer API is called.

You could take your initial DataFrame, alter its schema as shown below, and use this intermediate DataFrame for the write API call:

 df.withColumn("new_column_name", col("old_column_name").cast("new_datatype"))  # col comes from pyspark.sql.functions

Stack Overflow

stackoverflow.com › questions › 51813517 › dataframe-object-has-no-attribute-col

apache spark - DataFrame object has no attribute 'col' - Stack Overflow

Top answer

1 of 5

The book you're referring to describes Scala / Java API. In PySpark use []

df["count"]

2 of 5

The book combines the Scala and PySpark APIs.

In the Scala / Java API, df.col("column_name") or df.apply("column_name") returns the Column.

In PySpark, use either of the following to get the column from the DataFrame:

df.colName
df["colName"]