You can't reference a second Spark DataFrame inside a function unless you're using a join. IIUC, you can do the following to achieve your desired result.
Suppose that means is the following:
#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#| 1| 12.0|
#| 3| 300.0|
#| 2| 21.0|
#+---+---------+
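For context (not part of the original answer), a df and means pair that reproduces the tables in this answer could be built roughly like this; the sample values are made up to match the output shown, and an active SparkSession named spark is assumed:

from pyspark.sql.functions import avg

# hypothetical sample data; the nulls and per-id averages line up with the tables in this answer
df = spark.createDataFrame(
    [(1, 12.0), (1, None), (1, 14.0), (1, 10.0),
     (3, 300.0), (3, None),
     (2, None), (2, 22.0), (2, 20.0)],
    ["id", "col1"]
)

# per-id average of col1 (avg ignores nulls), which yields the means table above
means = df.groupBy("id").agg(avg("col1"))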
Join df and means on the id column, then apply your when condition
from pyspark.sql.functions import when
df.join(means, on="id")\
.withColumn(
"col1",
when(
(df["col1"].isNull()),
means["avg(col1)"]
).otherwise(df["col1"])
)\
.select(*df.columns)\
.show()
#+---+-----+
#| id| col1|
#+---+-----+
#| 1| 12.0|
#| 1| 12.0|
#| 1| 14.0|
#| 1| 10.0|
#| 3|300.0|
#| 3|300.0|
#| 2| 21.0|
#| 2| 22.0|
#| 2| 20.0|
#+---+-----+
But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:
from pyspark.sql import Window
from pyspark.sql.functions import col, mean, when

df.withColumn(
    "col1",
    when(
        col("col1").isNull(),
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#| 1| 12.0|
#| 1| 10.0|
#| 1| 12.0|
#| 1| 14.0|
#| 3|300.0|
#| 3|300.0|
#| 2| 22.0|
#| 2| 20.0|
#| 2| 21.0|
#+---+-----+
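As a side note not in pault's answer: since mean ignores nulls, the same Window-based fill can be written a bit more compactly with coalesce; a sketch under the same assumptions as above:

from pyspark.sql import Window
from pyspark.sql.functions import coalesce, col, mean

# fill nulls in col1 with the per-id mean; equivalent to the when/otherwise version
df.withColumn(
    "col1",
    coalesce(col("col1"), mean("col1").over(Window.partitionBy("id")))
).show()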
Answer from pault on Stack Overflow
import pandas as pd
from pandasql import sqldf

class ABC:
    def __init__(self):
        self.details = {'Name': ['Ankit', 'Aishwarya', 'Shaurya', 'Shivangi'],
                        'Age': [23, 21, 22, 21]}
        self.df = pd.DataFrame(self.details)

objAbc = ABC()

for i, rowObject in sqldf("select distinct Name from objAbc.df").iterrows():
    print(rowObject.Name)
AttributeError                            Traceback (most recent call last)
<ipython-input-21-aaba2874542d> in <module>
      8 objAbc=ABC()
      9
---> 10 for i,rowObject in sqldf("select distinct Name from objAbc.df").iterrows():
     11     print(rowObject.Name)
     12

~\anaconda3\lib\site-packages\pandasql\sqldf.py in sqldf(query, env, db_uri)
    154     >>> sqldf("select avg(x) from df;", locals())
    155     """
--> 156     return PandaSQL(db_uri)(query, env)

~\anaconda3\lib\site-packages\pandasql\sqldf.py in __call__(self, query, env)
     56                 continue
     57             self.loaded_tables.add(table_name)
---> 58             write_table(env[table_name], table_name, conn)
     59
     60         try:

~\anaconda3\lib\site-packages\pandasql\sqldf.py in write_table(df, tablename, conn)
    119         message='The provided table name \'%s\' is not found exactly as such in the database' % tablename)
    120     to_sql(df, name=tablename, con=conn,
--> 121            index=not any(name is None for name in df.index.names))  # load index into db if all levels are named
    122
    123

AttributeError: 'ABC' object has no attribute 'index'
I want to iterate through the DataFrame defined in class ABC using pandasql's sqldf function, but I am getting this error:
'ABC' object has no attribute 'index'
I don't understand why the object.attribute notation is not working inside sqldf.
PS: Per the requirements, it's mandatory to use sqldf from pandasql.
Please help.
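Not an official answer, but the traceback suggests that sqldf parses objAbc out of "objAbc.df" as a table name and then tries to load the ABC instance itself (which has no .index attribute) into SQLite. A minimal workaround sketch, assuming you are allowed to bind the attribute to a plain local name before querying:

import pandas as pd
from pandasql import sqldf

class ABC:
    def __init__(self):
        self.details = {'Name': ['Ankit', 'Aishwarya', 'Shaurya', 'Shivangi'],
                        'Age': [23, 21, 22, 21]}
        self.df = pd.DataFrame(self.details)

objAbc = ABC()
df = objAbc.df   # plain local name that sqldf can find and write to SQLite

for i, rowObject in sqldf("select distinct Name from df", locals()).iterrows():
    print(rowObject.Name)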
I think you are using the Scala API, in which you use (). In PySpark, use [] instead.
Check your DataFrame with data.columns.
It should print something like this:
Index([u'regiment', u'company', u'name', u'postTestScore'], dtype='object')
Check for hidden white spaces. Then you can rename with:
data = data.rename(columns={'Number ': 'Number'})
I think the column name that contains "Number" is actually something like " Number" or "Number ", i.e. there is a residual space in the column name. Please run print("<{}>".format(data.columns[1])) and see what you get. If it's something like < Number>, it can be fixed with:
data.columns = data.columns.str.strip()
See pandas.Series.str.strip
In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, is caused by using . notation to reference a nonexistent column name or pandas method.
pandas methods are accessed with a dot. pandas columns can also be accessed with a dot (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).
data.columns = data.columns.str.strip() is a fast way to remove leading and trailing spaces from all column names. Otherwise, verify that the column or attribute is spelled correctly.
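A small illustration of the whitespace problem, using a made-up frame whose second column name has a trailing space:

import pandas as pd

data = pd.DataFrame({"Name": ["a", "b"], "Number ": [1, 2]})

print("<{}>".format(data.columns[1]))    # prints <Number > and exposes the hidden space
data.columns = data.columns.str.strip()  # strip leading/trailing spaces from every column name
print(data.Number)                       # dot access now works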
The book you're referring to describes the Scala / Java API. In PySpark, use []:
df["count"]
The book combines the Scala and PySpark APIs.
In the Scala / Java API, df.col("column_name") or df.apply("column_name") returns the Column.
In PySpark, use one of the following to get a column from a DataFrame:
df.colName
df["colName"]
I figured it out. It looks like it has to do with our Spark version; it worked with 1.6.
If you are working with Spark version 1.6, use this code to convert an RDD into a DataFrame:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd)
If you want to assign column names to the rows, use this:
df = rdd.map(lambda p: Row(ip=p[0], time=p[1], zone=p[2]))
ip, time, and zone are the column names in this example.
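Putting the two steps together, a minimal end-to-end sketch for Spark 1.6 (it assumes an existing SparkContext named sc and an rdd of (ip, time, zone) tuples):

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
rows = rdd.map(lambda p: Row(ip=p[0], time=p[1], zone=p[2]))  # attach field names to each record
df = sqlContext.createDataFrame(rows)                         # build the DataFrame from the Row RDD
df.show()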
"sklearn.datasets" is a scikit package, where it contains a method load_iris().
load_iris(), by default return an object which holds data, target and other members in it. In order to get actual values you have to read the data and target content itself.
Whereas 'iris.csv', holds feature and target together.
FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.
from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)
The Iris Dataset from Sklearn is in Sklearn's Bunch format:
print(type(iris))
print(iris.keys())
output:
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
So, that's why you can access it as:
x=iris.data
y=iris.target
But when you read the CSV file into a DataFrame the way you did:
iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()
output is:
2 3
0 petal_length petal_width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
Here the column labels are just the integers 2 and 3, and the real header row ('petal_length', 'petal_width') has ended up as the first data row.
First of all you should read the CSV file as:
df = pd.read_csv('iris.csv')
You should not include header=None, because your CSV file includes the column names, i.e. the headers.
So, now what you can do is something like this:
X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'
Or, if you want to use the column names:
X = df[['petal_length', 'petal_width']]
y = df['species']
Also, if you want to convert the labels from strings to numerical format, use sklearn's LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
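Putting it together, an end-to-end sketch assuming iris.csv has a header row with petal_length, petal_width, and species columns:

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('iris.csv')
X = df[['petal_length', 'petal_width']]                        # feature columns
y = preprocessing.LabelEncoder().fit_transform(df['species'])  # species strings -> 0/1/2
print(X.shape, y[:5])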
I came across this question when I was dealing with a PySpark DataFrame. If you're also using a PySpark DataFrame, you can convert it to a pandas DataFrame with the toPandas() method.
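A minimal sketch of that conversion, assuming an active SparkSession named spark and a result small enough to fit on the driver:

spark_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
pandas_df = spark_df.toPandas()   # collects all rows to the driver as a pandas DataFrame
print(pandas_df.dtypes)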
loc was introduced in pandas 0.11, so you'll need to upgrade your pandas to follow the 10 Minutes to pandas introduction.
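To check whether your installation is recent enough, something like this works (the upgrade command is shown as a comment):

import pandas as pd

print(pd.__version__)          # .loc / .iloc require pandas >= 0.11
# pip install --upgrade pandas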