A short, clean, scalable solution
Change some columns, leave the rest untouched
import pyspark.sql.functions as F
# That's not part of the solution, just the creation of a sample DataFrame
# df = spark.createDataFrame([(10, 1,2,3,4),(20, 5,6,7,8)],'Id int, Revenue int, GROSS_PROFIT int, Net_Income int, Enterprise_Value int')
cols_to_cast = ["Revenue", "GROSS_PROFIT", "Net_Income", "Enterprise_Value"]
df = df.select([F.col(c).cast('double') if c in cols_to_cast else c for c in df.columns])
df.printSchema()
root
|-- Id: integer (nullable = true)
|-- Revenue: double (nullable = true)
|-- GROSS_PROFIT: double (nullable = true)
|-- Net_Income: double (nullable = true)
|-- Enterprise_Value: double (nullable = true)
Answer from David דודו Markovitz on Stack Overflow.
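For comparison, the same cast can also be done column by column with withColumn; a minimal sketch, assuming df and cols_to_cast are defined as above:
import pyspark.sql.functions as F

# cast each listed column in place; all other columns are left untouched
for c in cols_to_cast:
    df = df.withColumn(c, F.col(c).cast('double'))
df.printSchema()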
If this helps:
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 0),
                            (2, 1),
                            (3, 1),
                            (4, 1),
                            (5, 0),
                            (6, 0),
                            (7, 1),
                            (8, 1),
                            (9, 1),
                            (10, 1),
                            (11, 0),
                            (12, 0)],
                           ('Time', 'Tag1'))
df = df.withColumn('a', col('Time').cast('integer')).withColumn('a1', col('Tag1').cast('double'))
df.printSchema()
df.show()
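If the goal is to change the types of the existing columns in place rather than add new ones, the same casts can overwrite Time and Tag1 directly; a minimal sketch under that assumption, reusing the col import above:
# overwrite the original columns instead of adding 'a' and 'a1'
df = df.withColumn('Time', col('Time').cast('integer')) \
       .withColumn('Tag1', col('Tag1').cast('double'))
df.printSchema()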
I am in university and am taking a special topics class regarding AI. I have zero knowledge about Python, how it works, or what anything means.
A project for the class involves manipulating Bayesian networks to predict how many and which individuals die upon the sinking of a ship. This is the code I am supposed to manipulate:
##EDIT VARIABLES TO THE VARIABLES OF INTEREST
train_var = train.loc[:,['Survived','Sex']]
test_var = test.loc[:,['Sex']]
BayesNet = BayesianModel([('Sex','Survived')])
I am supposed to add another variable, 'Pclass', to the mix, paying attention to the order for causation. I have added that variable to every line of this code in every way imaginable and consistently get an error from this line:
predictions = pandas.DataFrame({'PassengerId': test.PassengerId,'Survived': hypothesis.Survived.tolist()})
predictions
For example, the error I get for this version of the code:
train_var = train.loc[:,['Survived','Pclass','Sex']]
test_var = test.loc[:,['Pclass']]
BayesNet = BayesianModel([('Sex','Pclass','Survived')])
is this:
AttributeError Traceback (most recent call last)
<ipython-input-98-16d9eb9451f7> in <module>
----> 1 predictions = pandas.DataFrame({'PassengerId': test.PassengerId,'Survived': hypothesis.Survived.tolist()})
2 predictions
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'Survived'
Honestly, I have no idea wtf any of this means. I have tried googling this issue and have come up with nothing.
Any help would be greatly appreciated. I know it's a lot.
Double-check whether there's a space in the column name ('Survived ' vs 'Survived'). It happens more often than you'd think, especially with a CSV data source.
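A quick way to check for, and strip, stray whitespace in the column names (a minimal sketch, assuming train is the pandas DataFrame loaded from the CSV):
print(train.columns.tolist())              # look for names like 'Survived '
train.columns = train.columns.str.strip()  # drop leading/trailing whitespace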
It's an issue with how you're referring to the data and whether it's actually there.
train.loc[:,['Survived','Sex']]
tells me that there's a DataFrame (which is from pandas, hence the error) called train and this line is trying to access parts of that dataframe (it's just a type of an array). Specifically, it's trying to access columns named Survived and Sex.
Similarly, this line tells me there's another dataframe (df) called test with a column named Sex, and this accesses that data.
test.loc[:,['Sex']]
The error message also tells me a few things:
predictions = pandas.DataFrame({'PassengerId': test.PassengerId,'Survived': hypothesis.Survived.tolist()})
There's another df called predictions, built from a dict, which is trying to access information from another df called hypothesis. The attribute it's trying to access in the second key of the dict is
hypothesis.Survived.tolist()
which is a way of calling a column from that df. That is, when the predictions line is executed, it's trying to pull all the values from the Survived column of the hypothesis df.
The error is that the df doesn't actually have a column named Survived. So either there's missing data, or you're calling it wrong, or there's a missing reference.
Without knowing more about your code and your question, I can't really extrapolate much more.
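Before building predictions, it can help to look at what hypothesis actually contains (a minimal sketch; the traceback above already tells us it is a pandas DataFrame):
print(type(hypothesis))
print(hypothesis.columns.tolist())  # check whether a 'Survived' column really exists
print(hypothesis.head())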
"sklearn.datasets" is a scikit package, where it contains a method load_iris().
load_iris(), by default return an object which holds data, target and other members in it. In order to get actual values you have to read the data and target content itself.
Whereas 'iris.csv', holds feature and target together.
FYI: If you set return_X_y as True in load_iris(), then you will directly get features and target.
from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)
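As a quick sanity check on what comes back (iris has 150 samples and 4 features, so the shapes should look like this):
print(data.shape)    # (150, 4) -> the feature matrix
print(target.shape)  # (150,)   -> the class labels 0, 1, 2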
The Iris Dataset from Sklearn is in Sklearn's Bunch format:
print(type(iris))
print(iris.keys())
output:
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
So, that's why you can access it as:
x=iris.data
y=iris.target
But when you read the CSV file into a DataFrame the way you mentioned:
iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()
output is:
2 3
0 petal_length petal_width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
Here the column names are just the integer positions 2 and 3, and the real header row ('petal_length', 'petal_width') has been read in as the first row of data.
First of all you should read the CSV file as:
df = pd.read_csv('iris.csv')
You should not include header=None, because your CSV file includes the column names, i.e. the headers.
So, now what you can do is something like this:
X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'
or if you want to use the column names then:
X = df[['petal_length', 'petal_width']]
y = df['species']  # label column by name
Also, if you want to convert labels from string to numerical format, use sklearn's LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
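A small usage sketch of the encoder on iris-style species strings (the example labels here are illustrative):
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
labels = ['setosa', 'versicolor', 'virginica', 'setosa']   # illustrative labels
encoded = le.fit_transform(labels)
print(encoded)                        # e.g. [0 1 2 0]
print(le.classes_)                    # the original string labels, sorted
print(le.inverse_transform(encoded))  # map the numbers back to strings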
The timestamp column doesn't exist yet when you try to refer to it. You can either use pyspark.sql.functions.col to refer to it dynamically, without specifying which DataFrame object the column belongs to:
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(F.col("unix_timestamp") > hours_36)
Or without creating the intermediate column:
df.filter(df.unix_timestamp.cast("timestamp") > hours_36)
The API Doc tells me that you can also use a String notation for filtering: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
import pyspark.sql.functions as F
df = (df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp"))
        .filter("unix_timestamp > %s" % hours_36))
Maybe it's not as efficient, though.
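Putting the F.col variant together as a self-contained sketch (the column name unix_timestamp comes from the question; the cutoff literal below is a made-up stand-in for hours_36):
import pyspark.sql.functions as F

# hypothetical cutoff standing in for hours_36
cutoff = F.lit("2021-01-01 00:00:00").cast("timestamp")

filtered = df.filter(F.col("unix_timestamp").cast("timestamp") > cutoff)
filtered.show()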
The last line in the code below gives an AttributeError (see description in the title). The code below is immediately followed by a loop. Any ideas how I can properly refer to the 'Date' column, which is the first column in the CSV file? PS: I tried d0 = p('Date') and that doesn't give me anything either. Help, please.
import numpy as np
import pandas as pd
import scipy as sp
import statsmodels.api as sm
from datetime import date
ticker = 'IBM'
begdate= date(2012,1,1)
enddate= date(2016,12,31)
p = pd.read_csv('/Users/myname/Downloads/IBM_M.csv',
index_col=0,
parse_dates=["Date"])
print(p.head())
#calculate log returns
p['log_ret'] = np.log(p['Adj Close']) - np.log(p['Adj Close'].shift(1))
logret = p['log_ret']
print(logret.head())
ddate=[]
d0=p.date
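Since the CSV was read with index_col=0 and parse_dates=["Date"], the Date column has most likely become the DataFrame's index rather than an ordinary column, which is why p.date (and p.Date) fails; a small sketch of how one might reach those dates under that assumption:
# 'Date' was consumed as the parsed index, so go through p.index
print(p.index[:5])        # the parsed dates
ddate = p.index.tolist()  # as a plain Python list, if needed for the loop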