ENH: resample(..., base='start') for automaticly determining base. #8521

dimbeto · 2014-10-09T14:19:41Z

(edit):

import pandas as pd
import datetime as dt
import numpy as np

datetime_start = dt.datetime(2014, 9, 1, 9, 30)
datetime_end = dt.datetime(2014, 9, 1, 16, 0)

tt = pd.date_range(datetime_start, datetime_end, freq='1Min')
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A'])

The first item is at 9:30, which is not divisible by 8 minutes. We'd like df.resample('8T', base='start').first() to be equivalent to df.resample('8T', base=2).

In [13]: df.resample("8T", base=2).first()
Out[13]:
                       A
2014-09-01 09:30:00    0
2014-09-01 09:38:00    8
2014-09-01 09:46:00   16
2014-09-01 09:54:00   24
2014-09-01 10:02:00   32

It seems that for 1Min bar data, resample() with sampling frequency of any multiple of 8 has a bug. The code below illustrates the bug when resampling is done at [3, 5, 6, 8, 16] Min. For both 3 and 5 frequency, the first entry of the resampled dataframe index starts at the base timestamp (9:30 in this case) while for frequencies 8 and 16, the resampled index starts at 9:26 and 9:18 respectively.

import pandas as pd
import datetime as dt
import numpy as np

datetime_start = dt.datetime(2014, 9, 1, 9, 30)
datetime_end = dt.datetime(2014, 9, 1, 16, 0)

tt = pd.date_range(datetime_start, datetime_end, freq='1Min')
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A'])

for freq in [3, 5, 6, 8, 16]:
    print(freq)
    print(df.resample(str(freq) + 'Min', how='first', base=30).head(2))

produces the following output:

3
                     A
2014-09-01 09:30:00  0
2014-09-01 09:33:00  3
5
                     A
2014-09-01 09:30:00  0
2014-09-01 09:35:00  5
6
                     A
2014-09-01 09:30:00  0
2014-09-01 09:36:00  6
8
                     A
2014-09-01 09:26:00  0
2014-09-01 09:34:00  4
16
                     A
2014-09-01 09:18:00  0
2014-09-01 09:34:00  4

The text was updated successfully, but these errors were encountered:

dimbeto · 2014-10-09T14:21:28Z

Note that this is not the same as issue #8371 . Resample seems to work fine with any frequency but a multiple of 8.

jreback · 2014-10-09T14:33:31Z

@dimbeto not excatly the same as #8371, but its coming from the same code block. fix upcoming.

jreback · 2014-10-09T14:51:37Z

@dimbeto

if your first date were 09:30:01 how is pandas to know that you DON't want this
e.g. resample snaps the first date to the interval, so both get mapped to 09:28

Since the other intervals 'happen' to evenly divide your first date you are not noticing this.

their are several options to 'fix' this,

e.g. loffset=pd.Timedelta('2min') or base=2
will do what you want.

However, not sure that this is possible with the current option set to do automatically.

You almost want an option to base='start' which will use your own starting date if its an evently divisible snap interval.

dimbeto · 2014-10-09T15:27:52Z

Thanks.

Base='start' does look like a good idea. Are you assuming that the new frequency is a multiple of the old one?

And currently does pandas have a way to check that the index of a Series is, a concatenated sequence of, say 1 minute dateranges all starting from 9:30 on each day (but not necessarily ending at 16:00 for the case of holidays)? Currently, in order to create 7 minute bars from 1 minute bars, I would have to either create my own concatenated sequence of sub-dateranges, or I could group by day and use resample + between_time.

jreback · 2014-10-09T15:39:30Z

I will think about this for 0.15.1. But I think base='start' could align the resample to the first value so that the base freq (e.g. 1min) is the same as the first value. w/o having to specify base=2 in this case.

if you check s.index.freq if it is not None then it is a 'regular' freq. However gaps will cause this issues.

Their are various ways do this this, most notably:

In [33]: s = Series(np.arange(10),pd.date_range('20130101 09:30',periods=9,freq='1T').union([Timestamp('20130101 09:50')]))

In [34]: s
Out[34]: 
2013-01-01 09:30:00    0
2013-01-01 09:31:00    1
2013-01-01 09:32:00    2
2013-01-01 09:33:00    3
2013-01-01 09:34:00    4
2013-01-01 09:35:00    5
2013-01-01 09:36:00    6
2013-01-01 09:37:00    7
2013-01-01 09:38:00    8
2013-01-01 09:50:00    9
dtype: int64

In [35]: si = s.index.to_series()

In [36]: si
Out[36]: 
2013-01-01 09:30:00   2013-01-01 09:30:00
2013-01-01 09:31:00   2013-01-01 09:31:00
2013-01-01 09:32:00   2013-01-01 09:32:00
2013-01-01 09:33:00   2013-01-01 09:33:00
2013-01-01 09:34:00   2013-01-01 09:34:00
2013-01-01 09:35:00   2013-01-01 09:35:00
2013-01-01 09:36:00   2013-01-01 09:36:00
2013-01-01 09:37:00   2013-01-01 09:37:00
2013-01-01 09:38:00   2013-01-01 09:38:00
2013-01-01 09:50:00   2013-01-01 09:50:00
dtype: datetime64[ns]

In [37]: si-si.shift(1)
Out[37]: 
2013-01-01 09:30:00        NaT
2013-01-01 09:31:00   00:01:00
2013-01-01 09:32:00   00:01:00
2013-01-01 09:33:00   00:01:00
2013-01-01 09:34:00   00:01:00
2013-01-01 09:35:00   00:01:00
2013-01-01 09:36:00   00:01:00
2013-01-01 09:37:00   00:01:00
2013-01-01 09:38:00   00:01:00
2013-01-01 09:50:00   00:12:00
dtype: timedelta64[ns]

In [38]: (si-si.shift(1))==pd.Timedelta('1min')
Out[38]: 
2013-01-01 09:30:00    False
2013-01-01 09:31:00     True
2013-01-01 09:32:00     True
2013-01-01 09:33:00     True
2013-01-01 09:34:00     True
2013-01-01 09:35:00     True
2013-01-01 09:36:00     True
2013-01-01 09:37:00     True
2013-01-01 09:38:00     True
2013-01-01 09:50:00    False
dtype: bool

dimbeto · 2014-10-09T16:55:02Z

That works! Thanks.

jreback · 2014-10-09T16:55:35Z

@dimbeto great!

hasB4K · 2020-04-11T17:42:50Z

#31809 should also help to fix this issue:

import pandas as pd
import datetime as dt
import numpy as np

datetime_start = dt.datetime(2014, 9, 1, 9, 30)
datetime_end = dt.datetime(2014, 9, 1, 16, 0)

tt = pd.date_range(datetime_start, datetime_end, freq='1Min')
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A'])

df.resample("8T", origin="start").first()

Outputs:

                       A
2014-09-01 09:30:00    0
2014-09-01 09:38:00    8
2014-09-01 09:46:00   16
2014-09-01 09:54:00   24
2014-09-01 10:02:00   32

[truncated]

EDIT: use start option on origin.

jreback added Bug Resample resample method labels Oct 9, 2014

jreback added this to the 0.15.0 milestone Oct 9, 2014

jreback added API Design and removed Bug labels Oct 9, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Oct 9, 2014

jreback added the Enhancement label Dec 4, 2014

jreback modified the milestones: 0.16.0, 0.15.2 Dec 4, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

TomAugspurger changed the title ~~Resample bug when frequency is multiple of 8~~ ENH: resample(..., base='start') for automaticly determining base. Jul 6, 2018

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

hasB4K mentioned this issue Apr 11, 2020

ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809

Merged

9 tasks

jreback modified the milestones: Someday, 1.1 May 10, 2020

jreback closed this as completed in #31809 May 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: resample(..., base='start') for automaticly determining base. #8521

ENH: resample(..., base='start') for automaticly determining base. #8521

dimbeto commented Oct 9, 2014 •

edited by TomAugspurger

Loading

dimbeto commented Oct 9, 2014

jreback commented Oct 9, 2014

jreback commented Oct 9, 2014

dimbeto commented Oct 9, 2014

jreback commented Oct 9, 2014

dimbeto commented Oct 9, 2014

jreback commented Oct 9, 2014

hasB4K commented Apr 11, 2020 •

edited

Loading

ENH: resample(..., base='start') for automaticly determining base. #8521

ENH: resample(..., base='start') for automaticly determining base. #8521

Comments

dimbeto commented Oct 9, 2014 • edited by TomAugspurger Loading

dimbeto commented Oct 9, 2014

jreback commented Oct 9, 2014

jreback commented Oct 9, 2014

dimbeto commented Oct 9, 2014

jreback commented Oct 9, 2014

dimbeto commented Oct 9, 2014

jreback commented Oct 9, 2014

hasB4K commented Apr 11, 2020 • edited Loading

dimbeto commented Oct 9, 2014 •

edited by TomAugspurger

Loading

hasB4K commented Apr 11, 2020 •

edited

Loading