SparkSession is not a replacement for SparkContext but an equivalent of SQLContext. Just use it the same way you used to use SQLContext:
spark.createDataFrame(...)
and if you ever have to access SparkContext, use the sparkContext attribute:
spark.sparkContext
so if you need SQLContext for backwards compatibility you can:
SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
Whenever we try to create a DataFrame from a backward-compatible object such as an RDD, or from a DataFrame created by a SparkSession, we need to make the SQLContext aware of the session and context.
For example, if I create an RDD:
from pyspark.sql import SparkSession

ss = SparkSession.builder.appName("vivek").master("local").config("k1", "vi").getOrCreate()
rdd = ss.sparkContext.parallelize([('Alex', 21), ('Bob', 44)])
But if we wish to create a DataFrame from this RDD, we first need to wrap the session:
from pyspark.sql import SQLContext
sq = SQLContext(sparkContext=ss.sparkContext, sparkSession=ss)
Only then can we use the SQLContext with RDDs or with DataFrames created by other APIs (e.g. pandas).
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
df = sq.createDataFrame(rdd, schema)
df.collect()
Check your DataFrame with data.columns. It should print something like this:
Index([u'regiment', u'company', u'name', u'postTestScore'], dtype='object')
Check for hidden white spaces. Then you can rename with
data = data.rename(columns={'Number ': 'Number'})
I think the column name that contains "Number" is actually something like " Number" or "Number ", i.e. it has a residual space. Please run print("<{}>".format(data.columns[1])) and see what you get. If it's something like < Number>, it can be fixed with:
data.columns = data.columns.str.strip()
See pandas.Series.str.strip
In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, occurs because dot notation has been used to reference a nonexistent column name or pandas method.
pandas methods are accessed with a dot. pandas columns can also be accessed with a dot (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).
data.columns = data.columns.str.strip() is a quick way to remove leading and trailing spaces from all column names at once. Otherwise, verify that the column or attribute name is correctly spelled.
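To make the hidden-whitespace failure mode concrete, here is a minimal sketch (the column names are made up for illustration):

```python
import pandas as pd

# A frame whose second column has a trailing space in its name,
# the kind of thing a CSV header can smuggle in unnoticed.
data = pd.DataFrame({"name": ["a", "b"], "Number ": [1, 2]})

# data.Number would raise AttributeError: 'DataFrame' object has no
# attribute 'Number', because the real column is "Number " (with a space).
print("<{}>".format(data.columns[1]))    # the brackets reveal the space

data.columns = data.columns.str.strip()  # remove leading/trailing spaces
print(data.Number.tolist())              # now dot access works
```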
Mariusz's answer didn't really help me. So if you, like me, found this because it's the only result on Google and you're new to pyspark (and Spark in general), here's what worked for me.
In my case I was getting that error because I was trying to execute pyspark code before the pyspark environment had been set up.
Making sure that pyspark was available and set up before doing calls dependent on pyspark.sql.functions fixed the issue for me.
The error message says that on line 27 of the udf you are calling some pyspark sql function. It is the line with abs(), so I suppose that somewhere above you call from pyspark.sql.functions import * and it overrides Python's built-in abs() function.
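The shadowing mechanism can be reproduced without Spark at all. The function below is a hand-made stand-in for pyspark.sql.functions.abs (the real one takes and returns a Column, not a number); the usual fix is to import with a prefix, e.g. import pyspark.sql.functions as f:

```python
import builtins

# Stand-in for what `from pyspark.sql.functions import *` drags into scope:
# that module also exports a name `abs`, which expects a Column argument.
def abs(col):
    return "Column<abs({})>".format(col)

# Plain Python code calling abs(-3) now hits the Spark-style function,
# not the numeric built-in - exactly the reported class of error.
shadowed = abs(-3)
original = builtins.abs(-3)   # the built-in is still reachable via builtins

print(shadowed)   # Column<abs(-3)>
print(original)   # 3
```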
The issue has occurred due to:
df = emp_data.filter((f.col("POSTAL") == 2148) | (f.col("POSTAL") == 2125)).show(5)
Adding the .show(5) at the end changes the type of the object from a pyspark DataFrame to NoneType.
Therefore when you use
df_new = df.select(f.split(f.col("NAME"), ',')).show(3)
you get the error AttributeError: 'NoneType' object has no attribute 'select'.
A better way to do this would be to use:
df = emp_data.filter((f.col("POSTAL") == 2148) | (f.col("POSTAL") == 2125))
df.show(5)
You can also use display(df) for a styled display.
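The returns-None trap can be demonstrated without a Spark cluster. The class below is a made-up stand-in that mimics only the two methods involved (it is not the real pyspark API):

```python
class FakeDataFrame:
    # Stand-in for pyspark.sql.DataFrame: transformations return a
    # DataFrame, while actions like show() print and return None.
    def filter(self, *_):
        return self

    def show(self, n=20):
        print("+--- pretend table, first {} rows ---+".format(n))
        return None

# Chaining the action onto the assignment loses the DataFrame:
gone = FakeDataFrame().filter("POSTAL == 2148").show(5)
print(type(gone))            # <class 'NoneType'>

# Keeping assignment and action separate preserves it:
df = FakeDataFrame().filter("POSTAL == 2148")
df.show(5)
print(type(df).__name__)     # FakeDataFrame
```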
I had this code:
dt = df_sales.withColumn("Flag", lit(var)).display()
Problem:
.display() (or .show()) is an action that displays your DataFrame in the notebook/console, but it returns None. So when you assign this to dt, it's no longer a DataFrame; it's NoneType. And when you try to use dt.select() or anything else on it later, it throws the 'NoneType' error.
Correct way to handle this:
Keep the DataFrame assignment and display separate:
dt = df_sales.withColumn("Flag", lit(var))
display(dt)  # or dt.display()
If print(type(dt)) says <class 'NoneType'>, chances are you called .show(), .display(), or a similar action on the assignment line.
It's related to the Databricks Runtime (DBR) version used: the Spark versions in up to DBR 12.2 rely on the .iteritems function to construct a Spark DataFrame from a pandas DataFrame. This issue was fixed in Spark 3.4, which is available as DBR 13.x.
If you can't upgrade to DBR 13.x, then you need to downgrade pandas to the latest 1.x version (1.5.3 right now) by using the %pip install -U pandas==1.5.3 command in your notebook. Although it's better to just use the pandas version shipped with your DBR, since it was tested for compatibility with the other packages in the DBR.
I couldn't change package versions, but it looks like this was a name change only.
So I did
df.iteritems = df.items
and spark.createDataFrame(df) works now.
Sure, it's ugly, and it will break my notebook when I move to a cluster with a new DBR, but it works for now.
EDIT: AyoubH's answer is better because you only have to do it once. With the code above, you have to modify every data frame you display.
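For reference, the per-DataFrame alias can be sanity-checked with pandas alone, no Spark needed (on pandas 1.x, where iteritems still exists, the guard simply skips the patch):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# pandas 2.0 removed DataFrame.iteritems (renamed to .items);
# alias it back on this one object so callers that still use the
# old name, like Spark's conversion path on DBR <= 12.2, find it.
if not hasattr(df, "iteritems"):
    df.iteritems = df.items

cols = [name for name, _ in df.iteritems()]
print(cols)   # ['a', 'b']
```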