pd.read_csv() already returns a DataFrame, and a DataFrame does not have a .to_dataframe() method.
You can check the type of your variable ds with print(type(ds)); you will see that it is a pandas DataFrame.
As I understand it, you are loading loanapp_c.csv into ds with this code:
ds = pd.read_csv('desktop/python ML/loanapp_c.csv')
Here ds is a DataFrame object. You are calling to_dataframe() on an object that is already a DataFrame.
Removing the line dataset = ds.to_dataframe() from your code should solve the error.
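A minimal sketch of the point above. The inline CSV text and its column names (loan_id, amount) are hypothetical stand-ins for loanapp_c.csv, used only so the snippet runs on its own:

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for loanapp_c.csv
csv_text = "loan_id,amount\n1,1000\n2,2500\n"

# read_csv returns a DataFrame directly
ds = pd.read_csv(io.StringIO(csv_text))
print(type(ds))

# A DataFrame has no to_dataframe attribute, hence the AttributeError
print(hasattr(ds, "to_dataframe"))

# No conversion is needed; use ds as it is
dataset = ds
```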
Check your DataFrame's column names with data.columns.
It should print something like this:
Index([u'regiment', u'company', u'name',u'postTestScore'], dtype='object')
Check for hidden whitespace. Then you can rename the column with:
data = data.rename(columns={'Number ': 'Number'})
I think the column name that contains "Number" is actually something like " Number" or "Number ", with a residual space in it. Please run print("<{}>".format(data.columns[1])) and see what you get. If it's something like < Number>, it can be fixed with:
data.columns = data.columns.str.strip()
See pandas.Series.str.strip
In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, occurs because dot notation has been used to reference a nonexistent column name or pandas method.
pandas methods are accessed with dot notation. Columns can also be accessed with dot notation (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).
data.columns = data.columns.str.strip() is a quick way to remove leading and trailing spaces from all column names. Otherwise, verify that the column or attribute is spelled correctly.
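The whitespace problem can be reproduced with a small sketch. The frame and its column names here are made up for illustration:

```python
import pandas as pd

# Hypothetical frame whose second column name carries a trailing space
data = pd.DataFrame({"name": ["a", "b"], "Number ": [1, 2]})

# repr() makes the hidden space visible
print(repr(data.columns[1]))

# Dot access fails because the real name is "Number " and not "Number"
try:
    data.Number
except AttributeError as exc:
    print(exc)

# Strip leading and trailing whitespace from every column name
data.columns = data.columns.str.strip()
print(data["Number"].tolist())
```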
When you put .show() at the end, the result is no longer a PySpark DataFrame: .show() prints the rows and returns None.
Remove it and it should work.
tx_ecommerce =tx_df.filter(tx_df["POS_Cardholder_Presence"]=="ECommerce")
tx_ecommerce.toPandas()
You can do this to read a parquet file with pandas:
import pandas as pd
txt = pd.read_parquet("/data/file.parquet")
txt_ecommerce = txt.loc[txt.POS_Cardholder_Presence =="ECommerce"]
Hey guys, I am learning how to convert a dictionary to a data frame. I have a nested dictionary called user_dict like this:
File of dictionary in pickle format
[{'1000003': {'car': 0.0, 'car_passenger': 0.0, 'pt': 0.0, 'walk': 0.0, 'bike': 0.0}}, {'1000007': {'car': 0.0, 'car_passenger': 0.0, 'pt': 856.0786277323101, 'walk': 2546.869189662443, 'bike': 0.0}},
{'1000008': {'car': 0.0, 'car_passenger': 34189.569164682835, 'pt': 0.0, 'walk': 0.0, 'bike': 0.0}},
{'1000009': {'car': 0.0, 'car_passenger': 0.0, 'pt': 0.0, 'walk': 0.0, 'bike': 9847.472668350396}}]
I want to convert the dict to a data frame like this:
         car  car_passenger       pt      walk      bike
1000003  0.0          0.0      0.0       0.0       0.0
1000007  0.0          0.0      856.078   2546.869  0.0
1000008  0.0          34189.569  0.0     0.0       0.0
1000009  0.0          0.0      0.0       0.0       9847.472
I tried to convert it with from_dict:
df = pd.DataFrame.from_dict(user_dict, orient='index')
df
But I got this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-44-2ef0fc236180> in <module>
----> 1 df =pd.DataFrame.from_dict(user_dict,orient='index')
2 df
/Library/Python/3.7/site-packages/pandas/core/frame.py in from_dict(cls, data, orient, dtype, columns)
1361 if len(data) > 0:
1362 # TODO speed up Series case
-> 1363 if isinstance(list(data.values())[0], (Series, dict)):
1364 data = _from_nested_dict(data)
1365 else:
AttributeError: 'list' object has no attribute 'values'
I do not know how to fix it. Can anyone help me or explain how to fix it?
Any help is appreciated.
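The traceback suggests that user_dict is a list of single-key dicts and not one dict, which is why from_dict fails on data.values(). One possible fix, offered as a sketch and not the only approach, is to merge the entries into a single dict first. Only the first two entries from the question are reproduced here, and the name merged is introduced for illustration:

```python
import pandas as pd

# First two entries from the question; user_dict is a list, not a dict
user_dict = [
    {'1000003': {'car': 0.0, 'car_passenger': 0.0, 'pt': 0.0, 'walk': 0.0, 'bike': 0.0}},
    {'1000007': {'car': 0.0, 'car_passenger': 0.0, 'pt': 856.0786277323101,
                 'walk': 2546.869189662443, 'bike': 0.0}},
]

# Merge the single-key dicts into one dict keyed by user id
merged = {}
for entry in user_dict:
    merged.update(entry)

# Now from_dict receives a real dict: outer keys become the row index
df = pd.DataFrame.from_dict(merged, orient='index')
print(df)
```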
"sklearn.datasets" is a scikit package, where it contains a method load_iris().
By default, load_iris() returns an object that holds data, target and other members. To get the actual values you have to read the data and target attributes themselves.
By contrast, 'iris.csv' holds the features and the target together.
FYI: if you set return_X_y=True in load_iris(), you get the features and target directly.
from sklearn import datasets
data,target = datasets.load_iris(return_X_y=True)
The Iris Dataset from Sklearn is in Sklearn's Bunch format:
print(type(iris))
print(iris.keys())
output:
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
So, that's why you can access it as:
x=iris.data
y=iris.target
But when you read the CSV file as a DataFrame, as you did:
iris = pd.read_csv('iris.csv',header=None).iloc[:,2:4]
iris.head()
output is:
              2            3
0  petal_length  petal_width
1           1.4          0.2
2           1.4          0.2
3           1.3          0.2
4           1.5          0.2
Here the column names are 2 and 3, and the real header row has ended up as the first row of data.
First of all you should read the CSV file as:
df = pd.read_csv('iris.csv')
You should not include header=None, because your CSV file already contains the column names, i.e. the headers.
So, now what you can do is something like this:
X = df.iloc[:, [2, 3]] # Will give you columns 2 and 3 i.e 'petal_length' and 'petal_width'
y = df.iloc[:, 4] # Label column i.e 'species'
or if you want to use the column names then:
X = df[['petal_length', 'petal_width']]
y = df['species']
Also, if you want to convert the labels from strings to numbers, use sklearn's LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
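The effect of header=None and the two selection styles can be checked on a tiny example. The inline rows are hypothetical stand-ins for iris.csv, and the names sepal_length and sepal_width for the first two columns are assumed:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for iris.csv (first two column names assumed)
csv_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)

# Default: the first line is used as the header
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: columns get integer labels and the header line becomes data
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))
print(no_header.iloc[0, 2])

# Selecting by name and by position gives the same features
X = with_header[['petal_length', 'petal_width']]
X_pos = with_header.iloc[:, [2, 3]]
y = with_header['species']
```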
The code presented here doesn't show this discrepancy, but I sometimes hit this error when I write dataframe in all lower case.
Switching to the correct capitalization, pd.DataFrame(), fixes the problem.
Please check if:
a) you've named a file 'pandas.py' in the same directory as your script, or
b) another variable called 'pd' is used in your program.
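A quick way to run these checks, as a rough sketch: print where pandas was imported from and confirm which spelling of the class exists.

```python
import pandas as pd

# If this path points at your own pandas.py, that file is shadowing the library
print(pd.__file__)

# The class is spelled DataFrame; the all-lower-case name does not exist
print(hasattr(pd, "DataFrame"))
print(hasattr(pd, "dataframe"))
```

If pd has been rebound to something else earlier in the program, print(type(pd)) will show that it is no longer a module.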