map is a method that you can call on a pandas.Series object. This method doesn't exist on pandas.DataFrame objects.
df['new'] = df['old'].map(d)
In your code ^^^ df['old'] is returning a pandas.DataFrame object for some reason.
- As @jezrael points out, this could be due to having more than one old column in the DataFrame.
Or perhaps your code isn't quite the same as the example you have given.
Either way, the error occurs because you are calling map() on a pandas.DataFrame object.
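A quick way to confirm the diagnosis is to check what the column selection actually returns; a minimal sketch, assuming a duplicated old column and a mapping dict d as in the question:

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=['old', 'old'])  # duplicated column name
d = {1: 'A'}  # example mapping

print(type(df['old']))  # <class 'pandas.core.frame.DataFrame'>, not a Series
# df['old'].map(d) would therefore raise AttributeError: 'DataFrame' object has no attribute 'map'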
The main problem is that after selecting the old column you get a DataFrame instead of a Series, so map, which is implemented only for Series, fails.
There must be a duplicated old column, so selecting it returns all old columns in a DataFrame:
import pandas as pd

df = pd.DataFrame([[1,3,8],[4,5,3]], columns=['old','old','col'])
print (df)
old old col
0 1 3 8
1 4 5 3
print(df['old'])
old old
0 1 3
1 4 5
# don't use dict as the variable name, because it shadows the built-in dict type
d = {1:'A', 4:'D'}
df['new'] = df['old'].map(d)
print (df)
AttributeError: 'DataFrame' object has no attribute 'map'
A possible solution is to deduplicate these columns:
s = df.columns.to_series()
new = s.groupby(s).cumcount().astype(str).radd('_').replace('_0','')
df.columns += new
print (df)
old old_1 col
0 1 3 8
1 4 5 3
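If you don't need the duplicated data at all, a shorter alternative (a hedged sketch, keeping only the first occurrence of each column name) is:

df = df.loc[:, ~df.columns.duplicated()]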
Another possible problem is a MultiIndex in the columns; test for it by:
mux = pd.MultiIndex.from_arrays([['old','old','col'],['a','b','c']])
df = pd.DataFrame([[1,3,8],[4,5,3]], columns=mux)
print (df)
old col
a b c
0 1 3 8
1 4 5 3
print (df.columns)
MultiIndex(levels=[['col', 'old'], ['a', 'b', 'c']],
codes=[[1, 1, 0], [0, 1, 2]])
And one solution is to flatten the MultiIndex:
#python 3.6+
df.columns = [f'{a}_{b}' for a, b in df.columns]
#python below 3.6
#df.columns = ['{}_{}'.format(a,b) for a, b in df.columns]
print (df)
old_a old_b col_c
0 1 3 8
1 4 5 3
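After flattening, the single-level column is a Series again, so map works; a minimal continuation, reusing the mapping d from above:

df['new_d'] = df['old_a'].map(d)
print (df['new_d'])
0    A
1    D
Name: new_d, dtype: object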
Another solution is to map using the MultiIndex tuple and assign to a new tuple:
df[('new', 'd')] = df[('old', 'a')].map(d)
print (df)
old col new
a b c d
0 1 3 8 A
1 4 5 3 D
print (df.columns)
MultiIndex(levels=[['col', 'old', 'new'], ['a', 'b', 'c', 'd']],
codes=[[1, 1, 0, 2], [0, 1, 2, 3]])
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:
Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and PySpark transformations become less efficient.
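For reference, a minimal PySpark sketch of dropping to the RDD (the DataFrame and column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("James", 3000), ("Anna", 4000)], ["name", "salary"])

# DataFrame has no .map in Spark 2.0+, so go through the RDD explicitly
doubled = df.rdd.map(lambda row: (row["name"], row["salary"] * 2))
print(doubled.collect())  # [('James', 6000), ('Anna', 8000)]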
What should you do instead?
Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
Another example is using explode instead of flatMap (which existed on RDDs):
df.select($"name",explode($"knownLanguages"))
.show(false)
Result:
+-------+------+
|name |col |
+-------+------+
|James |Java |
|James |Scala |
|Michael|Spark |
|Michael|Java |
|Michael|null |
|Robert |CSharp|
|Robert | |
+-------+------+
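The snippet above is Scala; a rough PySpark equivalent, assuming a DataFrame with a name column and an array column knownLanguages, would be:

from pyspark.sql.functions import explode

df.select("name", explode("knownLanguages")).show(truncate=False)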
You can also use withColumn or UDF, depending on the use-case, or another option in the DataFrame API.
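For instance, a per-row transformation that might tempt you toward df.rdd.map can usually stay in the DataFrame API with withColumn (a hedged sketch with made-up column names):

from pyspark.sql import functions as F

# stays in the DataFrame API, so the optimizer keeps pruning and pushdown
df = df.withColumn("name_upper", F.upper(F.col("name")))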
You get that result because when you apply() the extractGrammar() function to your DataFrame, it passes each row of the DataFrame to the function. Then when you access the ['POS Tag'] column, it is not returning that entire Series, but rather the contents of that POS Tag cell for that row, which is a list. Lists do not have a map method. If you are trying to count the occurrences of the second element of each tuple in the POS Tag column, you could try the following:
tag_count_data = Counter([x[1] for x in email['POS Tag']])
This will give you a Counter of the second elements of the tags for that individual row.
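To make the shape of the data concrete, a tiny illustration with made-up tags:

from collections import Counter

pos_tags = [('I', 'PRP'), ('run', 'VBP'), ('fast', 'RB'), ('often', 'RB')]
print(Counter(x[1] for x in pos_tags))  # Counter({'RB': 2, 'PRP': 1, 'VBP': 1})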
In order to get the DataFrame with the tags that I'd posted in the question, and based on the kind guidance of LiamFiddler, I later on proceeded with:
- turning the Counter object into a dict using dict()
- turning the dict into a Series,
- setting column values to be the column names, based on this answer,
- and then selecting the tags that I need for my DataFrame.
from collections import Counter
import pandas as pd

def extractGrammar(email):
    # updated: calculate the tags I need
    tag_count_data = Counter([x[1] for x in email['POS_Tag']])
    # convert the Counter object to a dict
    tag_count_dict = dict(tag_count_data)
    # turn the dict into a Series, then into a one-row DataFrame
    email_tag = pd.DataFrame(pd.Series(tag_count_dict).fillna(0).rename_axis('Tag'))
    email_tag = email_tag.reset_index()
    # use set_index to make the Tag column values the column names
    email_tag = email_tag.set_index("Tag").T.reset_index(drop=True).rename_axis(None, axis=1)
    # select the tags that I need
    pos_columns = ['PRP','MD','JJ','JJR','JJS','RB','RBR','RBS','NN','NNS','VB','VBS','VBG','VBN','VBP','VBZ']
    for pos in pos_columns:
        if pos not in email_tag.columns:
            email_tag[pos] = 0
    email_tag = email_tag[pos_columns]
    return email_tag
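A hypothetical usage sketch, assuming a DataFrame emails whose POS_Tag column holds lists of (word, tag) tuples:

# build one row of tag counts per email and stack them into a single frame
tag_counts = pd.concat(
    (extractGrammar(row) for _, row in emails.iterrows()),
    ignore_index=True,
)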
The upper and lower DataFrames have two columns called Date, and you are extracting both by using upper['Date'].
Solution: rename at least one of the columns to something different from Date, and then apply your function to each column separately.
See https://stackoverflow.com/a/54608016/6646710 for further details.
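A minimal sketch of that rename, assuming the second duplicate should become Date_2015:

cols = upper.columns.tolist()
# rename only the second occurrence of 'Date'
cols[cols.index('Date', cols.index('Date') + 1)] = 'Date_2015'
upper.columns = cols

# each column is now a Series and can be converted on its own
upper['Date'] = pd.to_datetime(upper['Date'])
upper['Date_2015'] = pd.to_datetime(upper['Date_2015'])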
- Python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
- Then, overlay a scatter of the 2015 data for any points (highs and lows) for which the ten year record (2005-2014) record high or record low was broken in 2015.
Remove leap year dates (i.e. 29th February).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", None, "display.max_columns", None)
data = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
newdata = data[(data['Date'] >= '2005-01-01') & (data['Date'] <= '2014-12-12')]
datamax = newdata[newdata['Element']=='TMAX']
datamin = newdata[newdata['Element']=='TMIN']
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
datamax["day_of_year"] = datamax["Date"].dt.dayofyear
datamax = datamax.groupby('day_of_year').max()
datamin["day_of_year"] = datamin["Date"].dt.dayofyear
datamin = datamin.groupby('day_of_year').min()
datamax = datamax.reset_index()
datamin = datamin.reset_index()
datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d')
datamax['Date'] = datamax['Date'].dt.strftime('%Y-%m-%d')
datamax = datamax[~datamax['Date'].str.contains("02-29")]
datamin = datamin[~datamin['Date'].str.contains("02-29")]
breakoutdata = data[(data['Date'] > '2014-12-31')]
datamax2015 = breakoutdata[breakoutdata['Element']=='TMAX']
datamin2015 = breakoutdata[breakoutdata['Element']=='TMIN']
datamax2015['Date'] = pd.to_datetime(datamax2015['Date'])
datamin2015['Date'] = pd.to_datetime(datamin2015['Date'])
datamax2015["day_of_year"] = datamax2015["Date"].dt.dayofyear
datamax2015 = datamax2015.groupby('day_of_year').max()
datamin2015["day_of_year"] = datamin2015["Date"].dt.dayofyear
datamin2015 = datamin2015.groupby('day_of_year').min()
datamax2015 = datamax2015.reset_index()
datamin2015 = datamin2015.reset_index()
datamin2015['Date'] = datamin2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015['Date'] = datamax2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015 = datamax2015[~datamax2015['Date'].str.contains("02-29")]
datamin2015 = datamin2015[~datamin2015['Date'].str.contains("02-29")]
dataminappend = datamin2015.join(datamin, on="day_of_year", rsuffix="_new")
lower = dataminappend.loc[dataminappend["Data_Value_new"] > dataminappend["Data_Value"]]
datamaxappend = datamax2015.join(datamax, on="day_of_year", rsuffix="_new")
upper = datamaxappend.loc[datamaxappend["Data_Value_new"] < datamaxappend["Data_Value"]]
upper['Date'] = pd.to_datetime(upper['Date'])
lower['Date'] = pd.to_datetime(lower['Date'])
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
ax = plt.gca()
plt.plot(datamax['day_of_year'], datamax['Data_Value'], color='red')
plt.plot(datamin['day_of_year'], datamin['Data_Value'], color='blue')
plt.scatter(upper['day_of_year'], upper['Data_Value'], color='purple')
plt.scatter(lower['day_of_year'], lower['Data_Value'], color='cyan')
plt.ylabel("Temperature (degrees C)", color='navy')
plt.xlabel("Day of the year", color='navy', labelpad=15)
plt.title('Record high and low temperatures by day between 2005-2014', alpha=1.0, color='brown', y=1.08)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35), fancybox=False, labels=['Record high','Record low'])
plt.xticks(rotation=30)
plt.fill_between(range(len(datamax['Date'])), datamax['Data_Value'], datamin['Data_Value'], color='yellow', alpha=0.8)
plt.show()

I have converted the 'Date' column to a string using datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d').
I have then converted this back to 'datetime' format using upper['Date'] = pd.to_datetime(upper['Date'])
I then used 'day of year' as the x-value.

I had the same problem happening on some code that was working perfectly fine, after migrating to the latest PyCharm version.
I assume you are using the latest PyCharm version (2019.2). I don't have an explanation for why this causes the issue, but installing the older PyCharm 2019.1.4 fixed the problem for me.
Agreed, this runs on PyCharm 2019.2 with no problem. Put a breakpoint somewhere, debug, and the error will happen.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({'A': ['asdf', 'fdsa'], 'B': [1, 2]})
sdf = spark.createDataFrame(pdf)
print(pdf)
sdf.show()
value_counts is a Series method rather than a DataFrame method (and you are trying to use it on a DataFrame, clean). You need to perform this on a specific column:
clean[column_name].value_counts()
It doesn't usually make sense to perform value_counts on a DataFrame, though I suppose you could apply it to every entry by flattening the underlying values array:
pd.value_counts(df.values.flatten())
To get the count of non-null entries in every column of a DataFrame, it's just df.count().
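A small illustration of the difference, with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2], 'b': [2, 3, None]})

print(df['a'].value_counts())                # frequencies within one column
print(pd.value_counts(df.values.flatten()))  # frequencies across the whole frame
print(df.count())                            # non-null entries per column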
Mapping from one column to another, such as below, works fine; however, the requirements have changed and I now need to map two columns to the summary table, and am getting the error 'DataFrame' object has no attribute 'map'. I'm sure it is something simple like a bracket or parenthesis out of place, but right now I'm not quite sure.
score['#_%_to_Total'] = (score['Total_#_Genuine'] / score['mop_'].map(summary.set_index(['mop_'])['count_Not Fraud'])) * 100

# Below is the line of code giving the AttributeError
score['#_%_to_Total'] = (score['Total_#_Genuine'] / score[['merchant_merchantid_','mop_']].map(summary.set_index(['merchant_merchantid_','mop_'])['count_Not Fraud'])) * 100

AttributeError: 'DataFrame' object has no attribute 'map'
Check your DataFrame with data.columns
It should print something like this
Index([u'regiment', u'company', u'name',u'postTestScore'], dtype='object')
Check for hidden white spaces. Then you can rename with
data = data.rename(columns={'Number ': 'Number'})
I think the column name that contains "Number" is something like " Number" or "Number ". I'm assuming you might have a residual space in the column name. Please run print("<{}>".format(data.columns[1])) and see what you get. If it's something like < Number>, it can be fixed with:
data.columns = data.columns.str.strip()
See pandas.Series.str.strip
In general, AttributeError: 'DataFrame' object has no attribute '...', where ... is some column name, is caused because . notation has been used to reference a nonexistent column name or pandas method.
pandas methods are accessed with a dot. pandas columns can also be accessed with a dot (e.g. data.col) or with brackets (e.g. data['col'] or data[['col1', 'col2']]).
data.columns = data.columns.str.strip() is a fast way to remove leading and trailing spaces from all column names. Otherwise, verify that the column or attribute is correctly spelled.
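A tiny repro of the whitespace case, with a made-up column name:

import pandas as pd

data = pd.DataFrame({'Number ': [1, 2]})  # note the trailing space
# data.Number here raises AttributeError: 'DataFrame' object has no attribute 'Number'
data.columns = data.columns.str.strip()
print(data.Number)  # works after stripping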
The function pd.read_csv() already returns a DataFrame, and that kind of object does not support calling .to_dataframe().
You can check the type of your variable ds using print(type(ds)); you will see that it is a pandas DataFrame type.
From what I understand, you are loading loanapp_c.csv into ds using this code:
ds = pd.read_csv('desktop/python ML/loanapp_c.csv')
ds over here is a DataFrame object. What you are doing is calling to_dataframe on an object which is already a DataFrame.
Removing dataset = ds.to_dataframe() from your code should solve the error.
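A minimal sketch of the fix (the path is taken from the question):

import pandas as pd

ds = pd.read_csv('desktop/python ML/loanapp_c.csv')
print(type(ds))  # <class 'pandas.core.frame.DataFrame'>

# use ds directly; no ds.to_dataframe() call is needed
dataset = ds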