Answer from Tom Q. on Stack Overflow:
For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.
This does the trick nicely:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = smf.ols(formula='sales ~ date_delta', data=city_data).fit()
The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
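For instance, here is a minimal sketch (with made-up sample dates) of picking Jan 1st as the zero point, which turns the regressor into a day-of-year value:

```python
import pandas as pd

# Hypothetical data; the point is choosing your own zero date.
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-02-15', '2021-03-01'])})

# Zero each date against Jan 1st of its own year -> day-of-year regressor
year_start = df['date'].dt.to_period('Y').dt.start_time
df['day_of_year'] = (df['date'] - year_start).dt.days
```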
statsmodels also has some support for pandas time series, which you may be able to apply to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html
Also, a quick note: You should be able to read column names directly out of the csv automatically as in the sample code I posted. In your example I see there are spaces between the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
Get date as floating-point year
I prefer a date format that can be understood without context; hence the floating-point year representation.
The nice thing here is that the solution works at the numpy level, so it should be fast.
import numpy as np
import pandas as pd


def dt64_to_float(dt64):
    """Convert numpy.datetime64 to year as float, rounded to days.

    Parameters
    ----------
    dt64 : np.datetime64 or np.ndarray(dtype='datetime64[X]')
        date data

    Returns
    -------
    float or np.ndarray(dtype=float)
        Year in floating point representation
    """
    year = dt64.astype('M8[Y]')
    days = (dt64 - year).astype('timedelta64[D]')
    year_next = year + np.timedelta64(1, 'Y')
    days_of_year = (year_next.astype('M8[D]')
                    - year.astype('M8[D]')).astype('timedelta64[D]')
    dt_float = 1970 + year.astype(float) + days / days_of_year
    return dt_float


if __name__ == "__main__":
    dates = np.array([
        '1970-01-01', '2014-01-01', '2020-12-31', '2019-12-31', '2010-04-28'],
        dtype='datetime64[D]')
    df = pd.DataFrame({
        'date': dates,
        'number': np.arange(5)
    })
    df['date_float'] = dt64_to_float(df['date'].to_numpy())
    print('df:', df, sep='\n')
    print()
    dt64 = np.datetime64("2011-11-11")
    print('dt64:', dt64_to_float(dt64))
output
df:
date number date_float
0 1970-01-01 0 1970.000000
1 2014-01-01 1 2014.000000
2 2020-12-31 2 2020.997268
3 2019-12-31 3 2019.997260
4 2010-04-28 4 2010.320548
dt64: 2011.8602739726027
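If you are working in pandas anyway, roughly the same idea can be sketched with the `dt` accessor; this is an alternative reimplementation, not the function above:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['1970-01-01', '2020-12-31']))
year = dates.dt.year
# Year boundaries, built from the year number itself
start = pd.to_datetime(year.astype(str) + '-01-01')
end = pd.to_datetime((year + 1).astype(str) + '-01-01')
# Fractional year = year + elapsed days / days in that year
float_year = year + (dates - start).dt.days / (end - start).dt.days
```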
You can use time deltas to do this more directly:
In [11]: s = pd.Series(["00:10:30"])
In [12]: s = pd.to_timedelta(s)
In [13]: s
Out[13]:
0 00:10:30
dtype: timedelta64[ns]
In [14]: s / pd.offsets.Minute(1)
Out[14]:
0 10.5
dtype: float64
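The same division works for any unit; `pd.Timedelta` strings are an equivalent way to spell the divisor:

```python
import pandas as pd

s = pd.to_timedelta(pd.Series(["00:10:30"]))
minutes = s / pd.Timedelta('1min')   # 10.5
hours = s / pd.Timedelta('1h')
```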
I would convert the string to a datetime and then use the dt accessor to access the components of the time and generate your minutes column:
In [16]:
df = pd.DataFrame({'time':['00:10:30']})
df['time'] = pd.to_datetime(df['time'])
df['minutes'] = df['time'].dt.hour * 60 + df['time'].dt.minute + df['time'].dt.second/60
df
Out[16]:
time minutes
0 2015-02-05 00:10:30 10.5
One way to do this would be to use `Series.dt.seconds` and `Series.dt.days` and multiply by a factor for the desired unit:
(Series.dt.seconds / 3600) + (Series.dt.days * 24)  # value in hours
You can divide by the timedelta you want to use as the unit:
total_days = my_timedelta / pd.Timedelta('1D')
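A self-contained sketch of that division, with assumed sample values:

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(['1 days 12:00:00', '0 days 06:00:00']))
total_days = td / pd.Timedelta('1D')   # fractional days as float64
```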
Convert to Timedelta and extract the total seconds from dt.total_seconds:
df
date
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
5 2013-01-06
6 2013-01-07
7 2013-01-08
8 2013-01-09
9 2013-01-10
pd.to_timedelta(df.date).dt.total_seconds()
0 1.356998e+09
1 1.357085e+09
2 1.357171e+09
3 1.357258e+09
4 1.357344e+09
5 1.357430e+09
6 1.357517e+09
7 1.357603e+09
8 1.357690e+09
9 1.357776e+09
Name: date, dtype: float64
Or, maybe, the data would be more useful presented as an int type:
pd.to_timedelta(df.date).dt.total_seconds().astype(int)
0 1356998400
1 1357084800
2 1357171200
3 1357257600
4 1357344000
5 1357430400
6 1357516800
7 1357603200
8 1357689600
9 1357776000
Name: date, dtype: int64
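As a sanity check (a sketch using the first two values above), epoch seconds like these can be turned back into datetimes with `unit='s'`:

```python
import pandas as pd

secs = pd.Series([1356998400, 1357084800])
dt = pd.to_datetime(secs, unit='s')   # back to 2013-01-01, 2013-01-02
```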
Use `astype(float)`; i.e., if you have a dataframe like
df = pd.DataFrame({'date': ['1998-03-01 00:00:01', '2001-04-01 00:00:01','1998-06-01 00:00:01','2001-08-01 00:00:01','2001-05-03 00:00:01','1994-03-01 00:00:01'] })
df['date'] = pd.to_datetime(df['date'])
df['x'] = list('abcdef')
df = df.set_index('date')
Then
df.index.values.astype(float)
array([ 8.88710401e+17, 9.86083201e+17, 8.96659201e+17,
9.96624001e+17, 9.88848001e+17, 7.62480001e+17])
pd.to_datetime(df.index.values.astype(float))
DatetimeIndex(['1998-03-01 00:00:01', '2001-04-01 00:00:01',
'1998-06-01 00:00:01', '2001-08-01 00:00:01',
'2001-05-03 00:00:01', '1994-03-01 00:00:01'],
dtype='datetime64[ns]', freq=None)
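Note that float64 cannot represent every nanosecond epoch exactly; if exactness matters, an int64 conversion is a safer sketch:

```python
import pandas as pd

idx = pd.to_datetime(['1998-03-01 00:00:01', '2001-04-01 00:00:01'])
ns = idx.astype('int64')    # exact nanoseconds since the epoch
back = pd.to_datetime(ns)   # lossless round trip
```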
You can use:
df['TradeDate'] = pd.to_datetime(df['TradeDate'], format='%Y%m%d.0')
print (df)
TradeDate
0 2010-03-29
1 2010-03-28
2 2010-03-29
But if there are some bad values, add errors='coerce' to replace them with NaT:
print (df)
TradeDate
0 20100329.0
1 20100328.0
2 20100329.0
3 20153030.0
4 yyy
df['TradeDate'] = pd.to_datetime(df['TradeDate'], format='%Y%m%d.0', errors='coerce')
print (df)
TradeDate
0 2010-03-29
1 2010-03-28
2 2010-03-29
3 NaT
4 NaT
You can use to_datetime with a custom format on a string representation of the values:
import pandas as pd
pd.to_datetime(pd.Series([20100329.0, 20100328.0, 20100329.0]).astype(str), format='%Y%m%d.0')
I am plotting time series data from a pandas dataframe using matplotlib. When I plot the data and open up the figure options window from the matplotlib figure toolbar to adjust axis scales the x-axis (datetime) is given as floats, sometimes with quite a few decimal places.
https://imgur.com/a/eq0PDyo
https://imgur.com/a/JzIbbpv
I want to be able to set my x-scale from this "Figure options" window. How do I figure out the float that corresponds to my desired datetime?
If more info is needed... I am reading a csv file and converting a "Time" column of strings to datetime using `df["Time"] = pd.to_datetime(df["Time"])`
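One way to find that float (a sketch, assuming matplotlib's default epoch) is `matplotlib.dates.date2num`, which converts a datetime to the days-since-epoch float the axis uses; `num2date` inverts it:

```python
import matplotlib.dates as mdates
import pandas as pd

ts = pd.Timestamp('2021-06-15 12:00')
x = mdates.date2num(ts)    # the float shown in the figure-options axis fields
back = mdates.num2date(x)  # round trip (returned as a tz-aware UTC datetime)
```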
When you read the excel file specify the dtype of col itime as a str:
df = pd.read_excel("test.xlsx", dtype={'itime':str})
then you will have a time column of strings looking like:
df = pd.DataFrame({'itime':['2300', '0100', '0500', '1000']})
Then specify the format and convert to time:
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time
itime Time
0 2300 23:00:00
1 0100 01:00:00
2 0500 05:00:00
3 1000 10:00:00
Just an add-on to Chris's answer: if you are unable to convert because there is no leading zero, apply the following to the dataframe.
df['itime'] = df['itime'].apply(lambda x: x.zfill(4))
This is because the original format does not always have four digits, e.g. 945 instead of 0945.
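A compact sketch of the whole fix (sample values assumed), using the vectorized `str.zfill` instead of `apply`:

```python
import pandas as pd

df = pd.DataFrame({'itime': ['945', '2300']})
df['itime'] = df['itime'].str.zfill(4)                           # '945' -> '0945'
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time  # 09:45:00
```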