As @DSM points out, you can do this more directly using the vectorised string methods:
df['Date'].str[-4:].astype(int)
Or using extract (assuming there is only one set of digits of length 4 somewhere in each string):
df['Date'].str.extract(r'(?P<year>\d{4})').astype(int)
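As a minimal sketch of the two vectorised approaches above, assuming a hypothetical frame whose Date strings end in a four-digit year (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: date strings that end in a 4-digit year
df = pd.DataFrame({'Date': ['11/8/2019', '2/20/2019', '9/22/2017']})

# Slice the last four characters of every string, then cast to int
years_by_slice = df['Date'].str[-4:].astype(int)

# Extract the first run of four digits; expand=False returns a Series
# instead of a one-column DataFrame
years_by_regex = df['Date'].str.extract(r'(\d{4})', expand=False).astype(int)

print(years_by_slice.tolist())
print(years_by_regex.tolist())
```

Both give the same years here; the regex version only depends on where the digits sit in the string, not on them being at the end.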
A slightly more flexible alternative is to use apply (or, equivalently, map):
df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
# converts the last 4 characters of the string to an integer
The lambda function takes each value in the Date column and converts it to a year.
You could (and perhaps should) write this more verbosely as:
def convert_to_year(date_in_some_format):
    date_as_string = str(date_in_some_format)  # cast to string
    year_as_string = date_as_string[-4:]  # last four characters
    return int(year_as_string)

df['Date'] = df['Date'].apply(convert_to_year)
Perhaps 'Year' is a better name for this column...
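A short sketch of that renaming idea, assuming made-up sample dates: writing the result to a new Year column leaves the original Date strings untouched.

```python
import pandas as pd

def convert_to_year(date_in_some_format):
    """Return the last four characters of the value as an int."""
    return int(str(date_in_some_format)[-4:])

# Hypothetical sample data
df = pd.DataFrame({'Date': ['6/28/2016', '6/12/2015', '6/13/2014']})

# Store the result in a new 'Year' column instead of overwriting 'Date'
df['Year'] = df['Date'].apply(convert_to_year)

# Series.map gives the same result for an element-wise function
same_as_map = df['Year'].equals(df['Date'].map(convert_to_year))

print(df)
```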
Answer from Andy Hayden on Stack Overflow
You can transform a column using apply.
Define a clean function that removes dollar signs, commas and spaces, then converts the value to a float.
def clean(x):
    x = x.replace("$", "").replace(",", "").replace(" ", "")
    return float(x)
Next, call it on your column like this.
data['Revenue'] = data['Revenue'].apply(clean)
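For illustration, here is the same clean function run on a few made-up revenue strings, next to a vectorised alternative that is not part of the original answer (it uses Series.str.replace with a regex):

```python
import pandas as pd

# Hypothetical revenue strings
data = pd.DataFrame({'Revenue': ['$1,200.50', '$ 300', '$45,000']})

def clean(x):
    # Drop dollar signs, commas and spaces, then convert to float
    return float(x.replace("$", "").replace(",", "").replace(" ", ""))

cleaned_apply = data['Revenue'].apply(clean)

# Vectorised alternative: strip the same characters with one regex
cleaned_vectorised = (
    data['Revenue'].str.replace(r'[$, ]', '', regex=True).astype(float)
)

print(cleaned_apply.tolist())
```

Both versions should agree; the vectorised one avoids calling a Python function per row.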
First of all, if you need to sort by date, I would suggest either using the YYYYMMDD string representation of dates (e.g. 20191108 for the first record) or using actual datetime data types. The American notation is confusing and does not sort correctly as a string.
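A small sketch of why the notation matters, using a few dates in the M/D/YYYY style (the variable names are just for illustration):

```python
import pandas as pd

dates = pd.Series(['11/8/2019', '2/20/2019', '9/22/2017', '6/28/2016'])

# Sorting the raw strings compares them character by character,
# so the result is not chronological
as_strings = dates.sort_values().tolist()

# Parsing to datetimes makes the sort chronological
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
chronological = parsed.sort_values()

# YYYYMMDD strings sort in the same order as the dates themselves
as_yyyymmdd = parsed.dt.strftime('%Y%m%d').sort_values().tolist()

print(as_strings)
print(as_yyyymmdd)
```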
In any case, to solve your issue I would advise using the pandas pivot function first, and then backfilling (bfill) the resulting NaN values.
EDIT: If you want to keep the Country column, it seems that using it as a multi-index with the Date column won't work with pivot. What you can do is to keep the original df and join it with the new one on the Date column.
import pandas as pd
import datetime as dt
# Create DataFrame similar to example
df = pd.DataFrame(data={'Date': ['11/8/2019','2/20/2019','9/22/2017','6/28/2016','6/27/2016','6/24/2016','6/12/2015','6/13/2014'],
                        'Team': ['Team A','Team B','Team A','Team B','Team C','Team A','Team C','Team C'],
                        'Rating': [95,90,90,90,90,95,100,100]})
# Convert strings to datetimes
df['Date'] = df['Date'].map(lambda x: dt.datetime.strptime(x, '%m/%d/%Y'))
df['Country'] = 'United Kingdom'
# Pivot DataFrame
dfp = df.pivot(columns='Team', values='Rating')
# Join with Country from original df
dfp = df[['Date', 'Country']].join(dfp)
# sort descending on Date
dfp.sort_values(by='Date', ascending=False, inplace=True)
# dfp is:
# Date Country Team A Team B Team C
# 2019-11-08 United Kingdom 95.0 NaN NaN
# 2019-02-20 United Kingdom NaN 90.0 NaN
# 2017-09-22 United Kingdom 90.0 NaN NaN
# ...
# Fill NaN values using the "next" row value
dfp.bfill(inplace=True)  # equivalent to fillna(method='bfill'), which recent pandas versions deprecate
# dfp is:
# Date Country Team A Team B Team C
# 2019-11-08 United Kingdom 95.0 90.0 90.0
# 2019-02-20 United Kingdom 90.0 90.0 90.0
# 2017-09-22 United Kingdom 90.0 90.0 90.0
# ...
Basically, what you need is:
data.pivot_table(index=['Country', 'Date'], columns='Team', values='Rating').reset_index()\
    .sort_values(['Country', 'Date'], ascending=False).bfill(axis=0)
This creates a pivot table, sorts the rows in descending order, and fills each missing rating with the value from the next row that has one, i.e. the most recent earlier rating.
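To see what that chain produces, here is a sketch on a small made-up frame. It assumes Date has already been parsed to datetimes so that the descending sort is chronological, and the ratings are invented for the example:

```python
import pandas as pd

# Hypothetical data in the same shape as the question
data = pd.DataFrame({
    'Date': pd.to_datetime(
        ['11/8/2019', '2/20/2019', '9/22/2017', '6/28/2016'],
        format='%m/%d/%Y'),
    'Country': 'United Kingdom',
    'Team': ['Team A', 'Team B', 'Team A', 'Team B'],
    'Rating': [95, 90, 90, 85],
})

wide = (
    data.pivot_table(index=['Country', 'Date'], columns='Team', values='Rating')
        .reset_index()
        .sort_values(['Country', 'Date'], ascending=False)
        .bfill()  # take the value from the next (older) row where missing
)

print(wide)
```

Note that the oldest row keeps its NaN for any team with no earlier rating, because there is nothing below it to fill from.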
I tried a few approaches, and the one that works best so far is:
df['v4'] = df['v2'].apply(myFunction.classify)
Can your function accept a column and output a column? If so you do not need to iterate over your df. Just pass in a column and assign the output to v4.
v4 = myFunction.classify(df['v2'])
df['v4'] = v4
If your function needs individual inputs, create the column 'v4' first and then replace its values as you iterate over the rows. Again, you would not need append here.
Another option in the individual-input case is to use Python's built-in map() to apply your function to every value of df['v2'] and then assign the output as above. On Python 3, map() returns an iterator, so wrap it in list() before assigning.
df['v4'] = list(map(myFunction.classify, df['v2']))
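As a rough sketch of the element-wise options, with a hypothetical classify function standing in for myFunction.classify (its logic and the 'v4_map' column name are invented for the example):

```python
import pandas as pd

# Hypothetical stand-in for myFunction.classify: handles one value at a time
def classify(value):
    return 'high' if value >= 10 else 'low'

df = pd.DataFrame({'v2': [3, 12, 10]})

# Series.apply calls classify once per element
df['v4'] = df['v2'].apply(classify)

# The built-in map does the same; list() materialises the iterator
df['v4_map'] = list(map(classify, df['v2']))

print(df)
```

Either way, no explicit loop over rows and no append is needed.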