
BUG: DatetimeIndex._data should return an ndarray #20912


Merged: 9 commits into pandas-dev:master from the datetimeindex_data branch, Jul 10, 2018

Conversation

reidy-p
Contributor

@reidy-p reidy-p commented May 1, 2018

The change I made seems to fix the case in the original issue without breaking any tests.

On my branch:

In [1]: idx1 = pd.DatetimeIndex(start="2012-01-01", periods=3, freq='D') # date_range kind of construction

In [2]: idx1._data
Out[2]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-02T00:00:00.000000000',
       '2012-01-03T00:00:00.000000000'], dtype='datetime64[ns]')

In [3]: idx2 = pd.DatetimeIndex(idx1)

In [4]: idx2._data
Out[4]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-02T00:00:00.000000000',
       '2012-01-03T00:00:00.000000000'], dtype='datetime64[ns]')

But is the solution too simple or is something more sophisticated required?

And do we need tests for this issue?

@jreback
Contributor

jreback commented May 1, 2018

this is a band aid
it shouldn’t be set in the first place like this

@codecov

codecov bot commented May 2, 2018

Codecov Report

Merging #20912 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #20912      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         160      160              
  Lines       49913    49915       +2     
==========================================
+ Hits        45882    45884       +2     
  Misses       4031     4031

Flag        Coverage Δ
#multiple   90.3% <100%> (ø) ⬆️
#single     42.11% <90.9%> (ø) ⬆️

Impacted Files                      Coverage Δ
pandas/io/pytables.py               92.48% <100%> (ø) ⬆️
pandas/core/indexes/base.py         96.58% <100%> (-0.06%) ⬇️
pandas/core/indexes/datetimes.py    95.21% <100%> (+0.11%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d05e8f2...e18d996. Read the comment docs.

Contributor

@jreback jreback left a comment

see comments

@gfyoung gfyoung added Bug Datetime Datetime data dtype Compat pandas objects compatability with Numpy or Python functions labels May 8, 2018
@reidy-p reidy-p force-pushed the datetimeindex_data branch from 0b635d0 to 6b0b72b Compare May 8, 2018 21:25
tz,
ambiguous=ambiguous)
index = index.view(_NS_DTYPE)
arr = conversion.tz_localize_to_utc(_ensure_int64(index),
Contributor Author

tz_localize_to_utc returns an array, not a DatetimeIndex, so I convert this array to a DatetimeIndex called index. That way I can pass index.values to _simple_new below.

@reidy-p reidy-p force-pushed the datetimeindex_data branch from 6b0b72b to 2735818 Compare June 2, 2018 12:36
@@ -111,3 +111,4 @@ Other

- Tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
- Bug preventing pandas from being importable with -OO optimization (:issue:`21071`)
- ``DatetimeIndex._data`` now returns a numpy array in all cases (:issue:`20810`)
Member

I don't think we need to add this to the whatsnew, since this is not a user-facing change (users should not be aware of or use the _data attribute)

Contributor Author

Good point, thanks.

@reidy-p reidy-p force-pushed the datetimeindex_data branch from 2735818 to 7feeddb Compare June 6, 2018 19:11
@reidy-p reidy-p force-pushed the datetimeindex_data branch 2 times, most recently from 46e18aa to 71694ae Compare June 14, 2018 19:57
@jreback
Contributor

jreback commented Jun 19, 2018

happy to take a patch with a non-band aid fix (or can close for now)

@jorisvandenbossche
Member

> this is a band aid
> it shouldn’t be set in the first place like this

Can you be more specific than this?
I don't think this fix is necessarily a band aid.

Currently, the index object (which at the end of the _generate method is passed to _simple_new) is generated in some different ways:

  • cls._cached_range -> returns DatetimeIndex
  • _generate_regular_range -> returns DatetimeIndex
  • passed through conversion.tz_localize_to_utc -> returns array
  • tools.to_datetime -> returns DatetimeIndex

So possible fixes I see:

  1. make sure that index is a DatetimeIndex in the end in all cases and update the final _simple_new call (this is what @reidy-p did)
  2. make sure that each case results in a datetime64 array (doing that conversion in each place seems like more work)
  3. just before the _simple_new call, check if index is a DatetimeIndex or not, and convert there to ndarray if needed
  4. change _simple_new to convert DatetimeIndex to ndarray if passed one.

Of those options, the first seems reasonable to me. I think 3) is also fine (although it is less explicit).
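To make the difference between options 1 and 3 concrete, here is a minimal sketch that uses toy stand-ins rather than pandas itself. ToyIndex, its _simple_new, and both generate_* helpers are hypothetical names invented for illustration, and a plain list plays the role of the ndarray.

```python
class ToyIndex:
    """Hypothetical stand-in for DatetimeIndex; a list stands in for the ndarray."""

    @classmethod
    def _simple_new(cls, values, name=None):
        # Low-level constructor: stores whatever it is given, without validation.
        obj = object.__new__(cls)
        obj._data = values
        obj.name = name
        return obj

    @property
    def values(self):
        return self._data


def generate_option_1(index, name=None):
    # Option 1: `index` is guaranteed to be a ToyIndex here, so always unwrap it.
    return ToyIndex._simple_new(index.values, name=name)


def generate_option_3(index, name=None):
    # Option 3: check just before the call and unwrap only when needed.
    if isinstance(index, ToyIndex):
        index = index.values
    return ToyIndex._simple_new(index, name=name)


raw = [1, 2, 3]
wrapped = ToyIndex._simple_new(raw)

buggy = ToyIndex._simple_new(wrapped)  # the reported problem: _data holds an index
fixed_1 = generate_option_1(wrapped)
fixed_3a = generate_option_3(wrapped)
fixed_3b = generate_option_3(raw)
```

Option 1 keeps a single, explicit call site at the cost of requiring every branch to produce an index first; option 3 tolerates both input kinds but hides the conversion in a type check.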

@reidy-p reidy-p force-pushed the datetimeindex_data branch 2 times, most recently from 038ca34 to 58c8f5c Compare June 22, 2018 15:10
@reidy-p reidy-p force-pushed the datetimeindex_data branch from 58c8f5c to ccc874d Compare June 29, 2018 19:50
@reidy-p
Contributor Author

reidy-p commented Jun 29, 2018

@jorisvandenbossche thanks for that summary!

@jreback do you agree with the above comment or is this still a band-aid?

@@ -588,7 +588,9 @@ def _generate(cls, start, end, periods, name, freq,
             index = index[1:]
         if not right_closed and len(index) and index[-1] == end:
             index = index[:-1]
-        index = cls._simple_new(index, name=name, freq=freq, tz=tz)
+
+        index = cls._simple_new(index.values, name=name, freq=freq, tz=tz)
Contributor

i don’t like having 2 different construction paths generally: e.g. the input always needs to be either an ndarray or already converted to a DTI by the time _simple_new gets called

what i would do is run all of the index tests and see what the current state is
then probably settle on an ndarray input to _simple_new and put an assertion to validate this

Member

Which 2 different construction paths do you mean?
With the update above, index is always an index, and it's always the values that are passed to _simple_new. So it is ndarray input to _simple_new.

Contributor

my point is that we need an assertion to validate this. it may be that everything is fixed, but we should actually test this.

@reidy-p reidy-p force-pushed the datetimeindex_data branch from ccc874d to 1ab2770 Compare July 6, 2018 21:16
@@ -609,6 +611,8 @@ def _simple_new(cls, values, name=None, freq=None, tz=None,
dtype=dtype, **kwargs)
values = np.array(values, copy=False)

assert isinstance(values, np.ndarray)
Contributor Author

It was suggested above that we should have an assertion to check whether the input to _simple_new is actually always an ndarray with the new changes. It turns out that it's still not guaranteed to be an ndarray. In particular, _shallow_copy sometimes calls _simple_new with a non-ndarray input. Some of these cases are handled by the code directly above this new assert statement but one case that is not handled is the DatetimeIndex (i.e., it is not converted to an ndarray). This is why I have put code to convert a DTI in _shallow_copy to an ndarray, although I realise this may not be the correct way to handle this.

Contributor

can you add a message on the assert as well
can also assert is_integer_dtype(values)

@@ -506,6 +506,9 @@ def _shallow_copy(self, values=None, **kwargs):
attributes.update(kwargs)
if not len(values) and 'dtype' not in kwargs:
attributes['dtype'] = self.dtype
from pandas import DatetimeIndex
Contributor Author

As discussed below, this code converts a DTI to ndarray before calling _simple_new. All the other cases either seem to be an ndarray already or are converted to ndarray in the _simple_new function. I expect that there is probably a better way of handling this.

Contributor

you can just do this, with a comment:

values = getattr(values, 'values', values)
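For readers unfamiliar with the idiom, the following is a small standalone sketch of what getattr with a default does here. Wrapper and unwrap are hypothetical names used only for illustration; a tuple stands in for the ndarray.

```python
class Wrapper:
    """Hypothetical container that exposes its underlying data as `.values`."""

    def __init__(self, values):
        self.values = values


def unwrap(values):
    # Return `values.values` when that attribute exists, else `values` unchanged.
    return getattr(values, 'values', values)


raw = (1, 2, 3)
from_wrapper = unwrap(Wrapper(raw))  # the wrapper is peeled off
from_raw = unwrap(raw)               # a tuple has no `.values`, so it passes through
```

The idiom relies on duck typing: any object with a `values` attribute is unwrapped, which would include a dict, whose `values` is a method. It is therefore only safe where the possible input types are known.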

@reidy-p reidy-p force-pushed the datetimeindex_data branch from 1ab2770 to 22fa07a Compare July 7, 2018 15:50
@pep8speaks

pep8speaks commented Jul 7, 2018

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 09, 2018 at 16:01 Hours UTC

@reidy-p reidy-p force-pushed the datetimeindex_data branch 2 times, most recently from 41fd5ee to deca8a4 Compare July 7, 2018 15:52
@@ -607,6 +611,9 @@ def _simple_new(cls, values, name=None, freq=None, tz=None,
dtype=dtype, **kwargs)
values = np.array(values, copy=False)

# values should be a numpy array
Contributor

use this format, no comment is needed
assert ...., "values are not an np.ndarray"
assert the integer dtype as well

Contributor Author

What's the intention of an assert is_integer_dtype(values)? values is often an ndarray of datetime64[ns] at this stage which means this assert fails very often.

Contributor

oh right, sorry the assert should be assert is_datetime64_dtype, it always should be this

Contributor

it's just better that we have certain guarantees in these low-level constructors
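A rough sketch of the kind of guard being asked for, written against plain NumPy rather than the pandas internals. simple_new_checked is a hypothetical function, and the dtype is compared directly instead of calling the is_datetime64_dtype helper named in the thread.

```python
import numpy as np


def simple_new_checked(values):
    # Hypothetical low-level constructor: validate the input, then store it.
    assert isinstance(values, np.ndarray), "values is not an np.ndarray"
    assert values.dtype == np.dtype('M8[ns]'), "values is not datetime64[ns]"
    return {"_data": values}


good = np.array(['2012-01-01', '2012-01-02'], dtype='datetime64[ns]')
result = simple_new_checked(good)

# An int64 array is an ndarray but has the wrong dtype, so the second assert fires.
try:
    simple_new_checked(np.arange(3, dtype=np.int64))
    rejected = False
except AssertionError:
    rejected = True
```

Note that assert statements are skipped when Python runs with -O, so a check like this is a development-time guarantee rather than runtime validation.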

tz,
ambiguous=ambiguous)

arr = arr.view(_NS_DTYPE)
Contributor

I think we can remove this arr.view(...)

@reidy-p reidy-p force-pushed the datetimeindex_data branch from 9dd0150 to 3d4bc3c Compare July 7, 2018 21:38
@@ -2087,6 +2094,8 @@ def _generate_regular_range(start, end, periods, freq):
"if a 'period' is given.")

data = np.arange(b, e, stride, dtype=np.int64)

# _simple_new is getting an array of int64 here
data = DatetimeIndex._simple_new(data, None, tz=tz)
Contributor Author

@reidy-p reidy-p Jul 7, 2018

There is a new assert statement in _simple_new to check whether the input is an array of datetime64[ns]. But in this case data is an array of int64 so the assert statement fails. Is there a convenient way to rewrite this part to make data an array of datetime64[ns] before calling _simple_new so the assert works?

Contributor

oh yes, .view(_NS_DTYPE)
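As an illustration of the suggestion, this is roughly what viewing an int64 array as datetime64[ns] looks like in plain NumPy. It assumes _NS_DTYPE denotes the datetime64[ns] dtype, as its name and the thread suggest; the one-day step is made up for the example.

```python
import numpy as np

# Nanoseconds since the Unix epoch, one day apart (the step is illustrative).
one_day_ns = 24 * 60 * 60 * 10**9
data = np.arange(0, 3 * one_day_ns, one_day_ns, dtype=np.int64)

# Reinterpret the same 8-byte values as datetime64[ns]; no copy is made.
as_datetimes = data.view('M8[ns]')
```

Because .view only reinterprets the existing buffer, the conversion is cheap and the result shares memory with the int64 array.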

@jreback
Contributor

jreback commented Jul 8, 2018

@reidy-p can you rebase? datetimes have been changing a bit as we get ready for DatetimeArray. cc @jbrockmendel

@reidy-p
Contributor Author

reidy-p commented Jul 8, 2018

Yeah sorry I just saw the new changes. I'll rebase.

@reidy-p reidy-p force-pushed the datetimeindex_data branch from 47d71ef to 98b3e80 Compare July 8, 2018 21:15
@@ -608,12 +610,14 @@ def _simple_new(cls, values, name=None, freq=None, tz=None,
dtype=dtype, **kwargs)
values = np.array(values, copy=False)

if is_object_dtype(values):
return cls(values, name=name, freq=freq, tz=tz,
Contributor

was this just never hit?

Contributor Author

@reidy-p reidy-p Jul 9, 2018

Yes I think it's never hit

@reidy-p reidy-p force-pushed the datetimeindex_data branch from 68e1d46 to ea39791 Compare July 9, 2018 15:55
@reidy-p reidy-p force-pushed the datetimeindex_data branch from ea39791 to e18d996 Compare July 9, 2018 16:01
@jreback jreback added this to the 0.24.0 milestone Jul 9, 2018
@jreback
Contributor

jreback commented Jul 9, 2018

@reidy-p lgtm. ping on green.

@reidy-p
Contributor Author

reidy-p commented Jul 10, 2018

@jreback thanks! This is green now

@jreback jreback merged commit eeab164 into pandas-dev:master Jul 10, 2018
@jreback
Contributor

jreback commented Jul 10, 2018

thanks @reidy-p

values = _ensure_int64(values).view(_NS_DTYPE)

values = getattr(values, 'values', values)
Member

Why is this one still needed? I thought the idea was now to ensure ndarrays are passed to _simple_new and not DatetimeIndexes?

Contributor Author

Well-spotted! I inserted this when I was trying to investigate why some tests were failing and I meant to move it before pushing the commit but forgot. I think we can move this line to just before the call to _simple_new in this file:

def _shallow_copy(self, values=None, **kwargs):
    if values is None:
        # Note: slightly different from Index implementation which defaults
        # to self.values
        values = self._ndarray_values
    attributes = self._get_attributes_dict()
    attributes.update(kwargs)
    if not len(values) and 'dtype' not in kwargs:
        attributes['dtype'] = self.dtype
    return self._simple_new(values, **attributes)

Does this make sense? I did the same thing here:

@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
    if values is None:
        values = self.values
    attributes = self._get_attributes_dict()
    attributes.update(kwargs)
    if not len(values) and 'dtype' not in kwargs:
        attributes['dtype'] = self.dtype
    # _simple_new expects an ndarray
    values = getattr(values, 'values', values)
    return self._simple_new(values, **attributes)

Labels
Bug Compat pandas objects compatability with Numpy or Python functions Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: inconsistent state of DatetimeIndex._data
5 participants