Skip to content

ENH: resample(..., base='start') for automaticly determining base. #8521

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dimbeto opened this issue Oct 9, 2014 · 8 comments · Fixed by #31809
Closed

ENH: resample(..., base='start') for automaticly determining base. #8521

dimbeto opened this issue Oct 9, 2014 · 8 comments · Fixed by #31809
Milestone

Comments

@dimbeto
Copy link

dimbeto commented Oct 9, 2014

(edit):

import pandas as pd
import datetime as dt
import numpy as np

datetime_start = dt.datetime(2014, 9, 1, 9, 30)
datetime_end = dt.datetime(2014, 9, 1, 16, 0)

tt = pd.date_range(datetime_start, datetime_end, freq='1Min')
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A'])

The first item is at 9:30, which is not divisible by 8 minutes. We'd like df.resample('8T', base='start').first() to be equivalent to df.resample('8T', base=2).

In [13]: df.resample("8T", base=2).first()
Out[13]:
                       A
2014-09-01 09:30:00    0
2014-09-01 09:38:00    8
2014-09-01 09:46:00   16
2014-09-01 09:54:00   24
2014-09-01 10:02:00   32

It seems that for 1Min bar data, resample() with sampling frequency of any multiple of 8 has a bug. The code below illustrates the bug when resampling is done at [3, 5, 6, 8, 16] Min. For both 3 and 5 frequency, the first entry of the resampled dataframe index starts at the base timestamp (9:30 in this case) while for frequencies 8 and 16, the resampled index starts at 9:26 and 9:18 respectively.

import pandas as pd
import datetime as dt
import numpy as np

datetime_start = dt.datetime(2014, 9, 1, 9, 30)
datetime_end = dt.datetime(2014, 9, 1, 16, 0)

tt = pd.date_range(datetime_start, datetime_end, freq='1Min')
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A'])

for freq in [3, 5, 6, 8, 16]:
    print(freq)
    print(df.resample(str(freq) + 'Min', how='first', base=30).head(2))

produces the following output:

3
                     A
2014-09-01 09:30:00  0
2014-09-01 09:33:00  3
5
                     A
2014-09-01 09:30:00  0
2014-09-01 09:35:00  5
6
                     A
2014-09-01 09:30:00  0
2014-09-01 09:36:00  6
8
                     A
2014-09-01 09:26:00  0
2014-09-01 09:34:00  4
16
                     A
2014-09-01 09:18:00  0
2014-09-01 09:34:00  4
@dimbeto
Copy link
Author

dimbeto commented Oct 9, 2014

Note that this is not the same as issue #8371 . Resample seems to work fine with any frequency but a multiple of 8.

@jreback
Copy link
Contributor

jreback commented Oct 9, 2014

@dimbeto not excatly the same as #8371, but its coming from the same code block. fix upcoming.

@jreback jreback added Bug Resample resample method labels Oct 9, 2014
@jreback jreback added this to the 0.15.0 milestone Oct 9, 2014
@jreback
Copy link
Contributor

jreback commented Oct 9, 2014

@dimbeto

if your first date were 09:30:01 how is pandas to know that you DON't want this
e.g. resample snaps the first date to the interval, so both get mapped to 09:28

Since the other intervals 'happen' to evenly divide your first date you are not noticing this.

their are several options to 'fix' this,

e.g. loffset=pd.Timedelta('2min') or base=2
will do what you want.

However, not sure that this is possible with the current option set to do automatically.

You almost want an option to base='start' which will use your own starting date if its an evently divisible snap interval.

@jreback jreback added API Design and removed Bug labels Oct 9, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Oct 9, 2014
@dimbeto
Copy link
Author

dimbeto commented Oct 9, 2014

Thanks.

Base='start' does look like a good idea. Are you assuming that the new frequency is a multiple of the old one?

And currently does pandas have a way to check that the index of a Series is, a concatenated sequence of, say 1 minute dateranges all starting from 9:30 on each day (but not necessarily ending at 16:00 for the case of holidays)? Currently, in order to create 7 minute bars from 1 minute bars, I would have to either create my own concatenated sequence of sub-dateranges, or I could group by day and use resample + between_time.

@jreback
Copy link
Contributor

jreback commented Oct 9, 2014

I will think about this for 0.15.1. But I think base='start' could align the resample to the first value so that the base freq (e.g. 1min) is the same as the first value. w/o having to specify base=2 in this case.

if you check s.index.freq if it is not None then it is a 'regular' freq. However gaps will cause this issues.

Their are various ways do this this, most notably:

In [33]: s = Series(np.arange(10),pd.date_range('20130101 09:30',periods=9,freq='1T').union([Timestamp('20130101 09:50')]))

In [34]: s
Out[34]: 
2013-01-01 09:30:00    0
2013-01-01 09:31:00    1
2013-01-01 09:32:00    2
2013-01-01 09:33:00    3
2013-01-01 09:34:00    4
2013-01-01 09:35:00    5
2013-01-01 09:36:00    6
2013-01-01 09:37:00    7
2013-01-01 09:38:00    8
2013-01-01 09:50:00    9
dtype: int64

In [35]: si = s.index.to_series()

In [36]: si
Out[36]: 
2013-01-01 09:30:00   2013-01-01 09:30:00
2013-01-01 09:31:00   2013-01-01 09:31:00
2013-01-01 09:32:00   2013-01-01 09:32:00
2013-01-01 09:33:00   2013-01-01 09:33:00
2013-01-01 09:34:00   2013-01-01 09:34:00
2013-01-01 09:35:00   2013-01-01 09:35:00
2013-01-01 09:36:00   2013-01-01 09:36:00
2013-01-01 09:37:00   2013-01-01 09:37:00
2013-01-01 09:38:00   2013-01-01 09:38:00
2013-01-01 09:50:00   2013-01-01 09:50:00
dtype: datetime64[ns]

In [37]: si-si.shift(1)
Out[37]: 
2013-01-01 09:30:00        NaT
2013-01-01 09:31:00   00:01:00
2013-01-01 09:32:00   00:01:00
2013-01-01 09:33:00   00:01:00
2013-01-01 09:34:00   00:01:00
2013-01-01 09:35:00   00:01:00
2013-01-01 09:36:00   00:01:00
2013-01-01 09:37:00   00:01:00
2013-01-01 09:38:00   00:01:00
2013-01-01 09:50:00   00:12:00
dtype: timedelta64[ns]

In [38]: (si-si.shift(1))==pd.Timedelta('1min')
Out[38]: 
2013-01-01 09:30:00    False
2013-01-01 09:31:00     True
2013-01-01 09:32:00     True
2013-01-01 09:33:00     True
2013-01-01 09:34:00     True
2013-01-01 09:35:00     True
2013-01-01 09:36:00     True
2013-01-01 09:37:00     True
2013-01-01 09:38:00     True
2013-01-01 09:50:00    False
dtype: bool

@dimbeto
Copy link
Author

dimbeto commented Oct 9, 2014

That works! Thanks.

@jreback
Copy link
Contributor

jreback commented Oct 9, 2014

@dimbeto great!

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@TomAugspurger TomAugspurger changed the title Resample bug when frequency is multiple of 8 ENH: resample(..., base='start') for automaticly determining base. Jul 6, 2018
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@hasB4K
Copy link
Member

hasB4K commented Apr 11, 2020

#31809 should also help to fix this issue:

import pandas as pd
import datetime as dt
import numpy as np

datetime_start = dt.datetime(2014, 9, 1, 9, 30)
datetime_end = dt.datetime(2014, 9, 1, 16, 0)

tt = pd.date_range(datetime_start, datetime_end, freq='1Min')
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A'])

df.resample("8T", origin="start").first()

Outputs:

                       A
2014-09-01 09:30:00    0
2014-09-01 09:38:00    8
2014-09-01 09:46:00   16
2014-09-01 09:54:00   24
2014-09-01 10:02:00   32

[truncated]

EDIT: use start option on origin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants