Skip to content

PeriodIndex test keys that aren't strings #10801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

max-sixty
Copy link
Contributor

Closes #10798

Not sure if the testing is done in a good way - I just copy & pasted code to test PeriodIndex & DatetimeIndex (it wouldn't work for LikeDatetimeIndex because of TimeDeltaIndex) - very open to feedback on better ways to do this.

@@ -2959,6 +2959,29 @@ def test_nat(self):
self.assertIs(DatetimeIndex([np.nan])[0], pd.NaT)


def test_contains(self):

i = self.create_index()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of copy-pasting :)

You can make a new mixin-class like DatetimeLike and just include it inDatetimeIndex/PeriodIndex

@max-sixty max-sixty force-pushed the datetime-as-periodindex-key branch from eb98943 to 6019a84 Compare August 12, 2015 02:09
@max-sixty
Copy link
Contributor Author

@jreback ready to go, cheers for the comments

class DatetimeAbsoluteLike(DatetimeLike):

def test_contains(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment
separate the lines when u have s comment

@max-sixty max-sixty force-pushed the datetime-as-periodindex-key branch from 6019a84 to 51b0a6d Compare August 12, 2015 15:09
@max-sixty
Copy link
Contributor Author

@jreback:

Period didn't work for DatetimeIndex, because it tried to coerce to Timestamp and fails

So I changed the get_loc in DatetimeIndex to check if it's a Period and call to_timestamp() if so

Is that the best way? I'm not sure why Timestamp(period) isn't the same as period.to_timestamp()? Feels like this is a hack because of that issue (but imagine there's a good reason) (EDIT: without a freq, there's too much ambiguity. I've since added a freq check

>>> pd.Timestamp(period_index[1])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-22f6327b2caf> in <module>()
----> 1 pd.Timestamp(period_index[1])

pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:7638)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:21357)()

ValueError: Cannot convert Period to Timestamp unambiguously. Use to_timestamp

@jreback jreback added the Period Period data type label Aug 13, 2015
@jreback jreback added this to the 0.17.0 milestone Aug 13, 2015
self.assertTrue(pd.Timestamp('2013-1-1') in i)

# pandas Period
self.assertTrue(pd.Period('2013-1-1') in i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add with a test with a different frequency

@max-sixty max-sixty force-pushed the datetime-as-periodindex-key branch 2 times, most recently from b13bd9f to 2fe6a87 Compare August 13, 2015 23:22
@max-sixty
Copy link
Contributor Author

@jreback updated, cheers!

class DatetimeAbsoluteLike(DatetimeLike):

def test_datetimeabsolute_contains(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment (the original one)

@max-sixty max-sixty force-pushed the datetime-as-periodindex-key branch from 2fe6a87 to 0c2e9ed Compare August 15, 2015 20:03
@max-sixty
Copy link
Contributor Author

@jreback updated, cheers!

@@ -1257,6 +1257,12 @@ def get_loc(self, key, method=None):
'when key is a time object')
return self.indexer_at_time(key)

# check if it's a Period and the frequencies are the same - otherwise a monthly period would match for
# a daily timestamp at the beginning of the month. NB: 'B' and 'D' therefore won't match
if isinstance(key, com.ABCPeriod) and key.freq == self.freq:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok IF the datetimeindex has a freq, but if its None (or a multiple) then what?

In [5]: '2013-03-31' in date_range('2013-01',periods=10,freq='M')[2:4:2]
Out[5]: True

In [10]: '2013-03-31' in date_range('2013-01',periods=10,freq='M')[2:4:2].union([Timestamp('2013-01-02')])
Out[10]: True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback: the additional check is if it's a Period - the examples returning True above aren't a change.

Regarding the desired behavior there - I don't have a confident view here - but I think those probably should return True. That's what DatetimeIndex is - just a series of stamps. I wouldn't have thought the __contains__ method should have any knowledge about the frequency of the data - it's just asking if it's a member of the sequence. Anything else and people should be using PeriodIndex.

Where I could see this being an issue is where you have a DatetimeIndex that has dates at the beginning of each month, and then you ask if a Period is in that. The .to_timestamp() method will convert to the end of month and __contains__ will return False. I think that's an unlikely case given how difficult it seems to be to create such an index, but escalating to you regardless.

Am I missing something in the prior example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not what I meant. I think these should work IF its a Period. My question is that the freq is irrelevant if either the start/end date is in the DatetimeIndex? So give some examples with a freq of None , but that the Period can match

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Do you think any of these should be True? I had thought that they should all be False - that unless there in an explicit periodicity / freq to a DatetimeIndex, it couldn't match a Period.
I'm pushing that in the last case, but it seems that if you start guessing there you enter a murky gray area, and a better way of operating is for the user to specify the freq for explicitness

In [7]: pd.date_range('2013-01',periods=10,freq='M')[2:4:2]
Out[7]: DatetimeIndex(['2013-03-31'], dtype='datetime64[ns]', freq='2M', tz=None)

In [8]: pd.Period('2013-03-31','D') in _
Out[8]: False
In [9]: pd.date_range('2013-01', periods=10, freq='D')[::2]
Out[9]: 
DatetimeIndex(['2013-01-01', '2013-01-03', '2013-01-05', '2013-01-07',
               '2013-01-09'],
              dtype='datetime64[ns]', freq='2D', tz=None)

In [10]: pd.Period('2013-01-01','D') in _
Out[10]: False
In [11]: pd.date_range('2013-01',periods=10,freq='M')
Out[11]: 
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30',
               '2013-05-31', '2013-06-30', '2013-07-31', '2013-08-31',
               '2013-09-30', '2013-10-31'],
              dtype='datetime64[ns]', freq='M', tz=None)

In [12]: pd.Period('2013-10-31','D') in _
Out[12]: False
In [17]: idx=pd.date_range('2013-01',periods=10)

In [18]: idx.freq=None

In [19]: idx
Out[19]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10'],
              dtype='datetime64[ns]', freq=None, tz=None)

In [20]: pd.Period('2013-01-01', 'B') in idx
Out[20]: False

In [21]: pd.Period('2013-01-01', 'D') in idx
Out[21]: False

@max-sixty max-sixty force-pushed the datetime-as-periodindex-key branch from 0c2e9ed to d5499a3 Compare August 16, 2015 05:29
@max-sixty max-sixty force-pushed the datetime-as-periodindex-key branch from d5499a3 to 0412053 Compare August 16, 2015 19:12
@jreback
Copy link
Contributor

jreback commented Aug 18, 2015

@MaximilianR In fact I would go further. I would say ALL Period(...) in DatetimeIndex should be False.

We cannot compare a Timespan with a Point index; If we do also something like that (e.g. IntervalIndex, which is eventually going to be the super-class of PeriodIndex, cc @shoyer ), then should have a method for doing it as its non-obvious how it should work.

I don't think this worked before so this is not an API change then, right?

@shoyer
Copy link
Member

shoyer commented Aug 18, 2015

I agree with @jreback. It doesn't make sense for a period to be contained in any timestamp, because periods represent spans of time and timestamps represent instants in time.

@max-sixty
Copy link
Contributor Author

OK, @jreback, @shoyer, and you're confident these shouldn't be True either? I had thought the intention was fairly clear here, but can appreciate the principle of Period vs Timestamp.

In [5]: pd.date_range('2013-01-01', periods=10, freq='D')
Out[5]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10'],
              dtype='datetime64[ns]', freq='D', tz=None)

In [6]: pd.Period('2013-01-02','D') in _
Out[6]: True
In [9]: pd.date_range('2013-01', periods=10, freq='M')
Out[9]: 
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30',
               '2013-05-31', '2013-06-30', '2013-07-31', '2013-08-31',
               '2013-09-30', '2013-10-31'],
              dtype='datetime64[ns]', freq='M', tz=None)

In [10]: pd.Period('2013-02','M') in _
Out[10]: True

In which case, if we're considering the in to mean 'within the span represented by the index', rather than 'is one of the values in the index', should this be True? Because it's a stamp within the period / interval?
(although we could leave the implementation of this to a separate PR / IntervalIndex)

In [2]: pd.period_range('2013-01',periods=10,freq='M')
Out[2]: 
PeriodIndex(['2013-01', '2013-02', '2013-03', '2013-04', '2013-05', '2013-06',
             '2013-07', '2013-08', '2013-09', '2013-10'],
            dtype='int64', freq='M')

In [3]: pd.Timestamp('2013-01-03','D') in _
Out[3]: False

@jreback
Copy link
Contributor

jreback commented Aug 19, 2015

@MaximilianR I think all of those 3 examples should be False. in needs to be pretty strict.

  • in won't do a point time (Timestamp) in a span (Period), this should be left to another (named) method (.isin sounds good but already taken)
  • similarly just because the freq of a span and a point time match, doesn't meet my definition of strictness (aside from the fact that it makes the in operation communitive which is a good property). The basic issues that full-overlap cannot be distinguished from partial overlap , e.g. what if the freq matches but only 1 point is in the DatetimeIndex, or what if you have a monthly freq DatetimeIndex does a quarterly freq Period match that for in? (assume that all months in the quarter are present).

@max-sixty
Copy link
Contributor Author

OK, that makes sense @jreback. I think it follows that PeriodIndex.__contains__ allowing Timestamp is also invalid. So this should actually be False:

In [16]: pd.period_range('2013-01',periods=10,freq='D')
Out[16]: 
PeriodIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
             '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
             '2013-01-09', '2013-01-10'],
            dtype='int64', freq='D')

In [17]: pd.Timestamp('2013-01-03') in _
Out[17]: False

While this should be True, because the stamp vs period nature of a string isn't yet defined?

In [19]: '2013-01-03' in _
Out[19]: True

How about for datetime objects? I would have thought those would be similar to strings.

Agreed on the additional method, within could be the name, with a partial arg stating whether a partial overlap was acceptable. But that's for the next one / IntervalIndex.

@jreback
Copy link
Contributor

jreback commented Aug 19, 2015

@MaximilianR

Timestamp are a sub-class of datetime, so they should be treated exactly the same.

However a string can be directly coerced to a Period (with the frequency of the period), so is deemed acceptable.

IIRC the original issue was:

period_index
Out[9]:
PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-05', '2015-01-06',
             '2015-01-07', '2015-01-08', '2015-01-09', '2015-01-12',
             '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
             '2015-01-19', '2015-01-20', '2015-01-21', '2015-01-22',
             '2015-01-23', '2015-01-26', '2015-01-27', '2015-01-28',
             '2015-01-29', '2015-01-30'],
            dtype='int64', freq='B')
In [10]:

'2015-01-06' in period_index
Out[10]:
True
In [15]:

dt = datetime.datetime(2015,1,6) 
dt
Out[15]:
datetime.datetime(2015, 1, 6, 0, 0)
In [16]:

dt in period_index
Out[16]:
False
In [17]:

d = datetime.date(2015,1,6) 
d
Out[17]:
datetime.date(2015, 1, 6)
In [18]:

d in period_index
Out[18]:
False

So we are back full-circle. If we don't allow Timestamp/datetime.datetime then datetime.date should not be allowed either. And the converse should be True (IOW if we DO allow Timestamp then it is ok).

And from your example above.

In [1]: p = pd.period_range('2013-01',periods=10,freq='D')

In [2]: p
Out[2]: PeriodIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08', '2013-01-09', '2013-01-10'], dtype='int64', freq='D')

In [3]: '2013-01-02' in p
Out[3]: True

In [4]: Timestamp('2013-01-02') in p
Out[4]: False

In [5]: Timestamp('2013-01-02').to_pydatetime() in p
Out[5]: False

Say we allowed [4], and [5] to be True. What would you do here?

In [3-1]: '2013-01-02 10:00:00' in p
In [4-1]: Timestamp('2013-01-02 10:00:00') in p
In [5-1]: Timestamp('2013-01-02 10:00:00').to_pydatetime() in p

@max-sixty
Copy link
Contributor Author

@jreback I agree overall.

But there's one case here that remains clear in the specific; I'm not sure if it's too specific to be generalizable though.

A datetime.date seems similar in many ways to a day Period. It's a discrete ordinal within a countable set, rather than a single point within a continuous span (which is the difference between this and your most recent example). It's not exactly the same, but it's certainly comparable.

So I would propose that the following is True:

In [11]: pd.period_range('2001-01-01', freq='D', periods=20)
Out[11]: 
PeriodIndex(['2001-01-01', '2001-01-02', '2001-01-03', '2001-01-04',
             '2001-01-05', '2001-01-06', '2001-01-07', '2001-01-08',
             '2001-01-09', '2001-01-10', '2001-01-11', '2001-01-12',
             '2001-01-13', '2001-01-14', '2001-01-15', '2001-01-16',
             '2001-01-17', '2001-01-18', '2001-01-19', '2001-01-20'],
            dtype='int64', freq='D')

In [12]: datetime.date(2001,1,3) in _
Out[12]: True

In our case, our dates come from our DB as datetime.dates, and most of our pandas data is in PeriodIndex form - this is the case that inspired the issue & PR.

If there were other comparable types, like a datetime.month, then I'd hold the parallel logic (but there aren't).

What are your thoughts?

@shoyer
Copy link
Member

shoyer commented Aug 20, 2015

@MaximilianR I would be inclined to agree if not for this fact:

In [36]: import datetime

In [37]: issubclass(datetime.date, datetime.datetime)
Out[37]: False

In [38]: issubclass(datetime.datetime, datetime.date)
Out[38]: True

In Python, datetime is a subclass of date, not the other way around. This seems highly strange to me...

@max-sixty
Copy link
Contributor Author

@shoyer Yes that is odd.
Does that matter in this case though? We could just check type(key)==datetime.date?

@jreback
Copy link
Contributor

jreback commented Aug 21, 2015

@MaximilianR why don't you just

In [1]: Period(datetime.date(2015,1,1),'M')
Out[1]: Period('2015-01', 'M')

to the keys.

@max-sixty
Copy link
Contributor Author

@jreback That's what we do - and this is one of those things we have that seems like it should be in the library for all...

But your guys' call - if you think it doesn't make sense then we should close

@jreback
Copy link
Contributor

jreback commented Aug 26, 2015

@MaximilianR I don't think we can make this change on ONLY datetime.date w/o making it on datetime.datetime and Timestamp as that's way confusing. This brings up the timespan / point-in-time argument. So I think let's close this for now (i'll leave the issue open though) and can revisit in the future.

@MaximilianR thanks for the discussion.

@max-sixty
Copy link
Contributor Author

OK, cheers @jreback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants