Skip to content

BUG: Fix issue with incorrect groupby handling of NaT #10625

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 3, 2015

Conversation

larvian
Copy link
Contributor

@larvian larvian commented Jul 19, 2015

closes #10590

For groupby the time stamps gets converted to integervalue tslib.iNaT
which is -9223372036854775808. The aggregation is then done using this
value with incorrect result as a consequence. The solution proposed here
is to replace its value by np.nan in case it is a datetime64[ns]

@larvian larvian changed the title BUG: Fix issue with incorrect groupby handling of NaT #10590 BUG: Fix issue with incorrect groupby handling of NaT Jul 19, 2015
@jreback
Copy link
Contributor

jreback commented Jul 19, 2015

test would be nice

@sinhrks sinhrks added Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 20, 2015
@sinhrks sinhrks added this to the 0.17.0 milestone Jul 20, 2015
@jreback jreback removed this from the 0.17.0 milestone Jul 20, 2015
@larvian
Copy link
Contributor Author

larvian commented Jul 23, 2015

I have tried to add a test and update to doc. I am not very sure I have done this correctly :)

@larvian
Copy link
Contributor Author

larvian commented Jul 24, 2015

I don't know what the next step here is? Am I supposed to do something now or am I waiting for someone to have a look at what I have done? There is a This branch has conflicts that must be resolved Can I see what the conflicts are and fix them?

@max-sixty
Copy link
Contributor

@larvian rebase your branch on upstream master to resolve the conflicts

@larvian
Copy link
Contributor Author

larvian commented Jul 24, 2015

@MaximilianR Thanks! I will try to figure out how to do that :)

@jreback
Copy link
Contributor

jreback commented Jul 24, 2015

lots of tips here

@sinhrks
Copy link
Member

sinhrks commented Jul 24, 2015

Also, can you add tests for timedelta which looks affected also?

df = pd.DataFrame({'IX': ['A', 'A'], 'num': [np.nan, 100],
                  'dt': [np.nan, datetime.datetime.now()],
                  'td':[np.nan, datetime.timedelta(days=1)]})
# OK
df.min()
# IX                              A
# dt    2015-07-25 07:41:47.871654
# num                           100
# td                1 days 00:00:00
# dtype: object

# NG
# df.groupby('IX').min()
#     dt  num  td
# IX             
# A  NaT  100 NaT

I think you can use com.is_datetime_or_timedelta_dtype to check dtype.

@larvian
Copy link
Contributor Author

larvian commented Jul 24, 2015

@jreback Thanks!
@sinhrks Ok I will look at that too.

@larvian
Copy link
Contributor Author

larvian commented Jul 25, 2015

Hm... I can't get this rebase to work. It gives conflicts which seems to come back after I fix :(
It seems to get messier the more I try.
Perhaps it is easiest to create a new pandas clone and apply my changes there?
I have to leave it for another day.

@TomAugspurger
Copy link
Contributor

We can help walk you through the git problems. Could be easier in the chat room: https://gitter.im/pydata/pandas

@larvian
Copy link
Contributor Author

larvian commented Jul 26, 2015

Thanks Tom!
I think I need it :)

@larvian larvian force-pushed the fix-issue-groupby-NaT branch 2 times, most recently from d716ba0 to f2759bc Compare July 31, 2015 22:47
@larvian
Copy link
Contributor Author

larvian commented Aug 3, 2015

Just curious, what's next with this PR?
For now is there anything else I should do?

@@ -2608,6 +2608,8 @@ def _cython_agg_blocks(self, how, numeric_only=True):
for block in data.blocks:

values = block._try_operate(block.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to do this differently.

move the ._try_operate call to inside .aggregate (its line 1533). Then you can use do the iNaT to np.nan there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok,
Then I assume I would do the ._try_operate after checking is_datetime_or_timedelta_dtype not to lose dtype.
But ._try_operate on DatetimeBlock is just values.view('i8')
However the code just after is_datetime_or_timedelta_dtype is values = values.view('int64')
I assume both are not needed so I can remove ._try_operate
Hope this doesn't destroy something else.

@larvian larvian force-pushed the fix-issue-groupby-NaT branch 2 times, most recently from 4a10c42 to 476891f Compare August 6, 2015 07:22
def test_first_last_max_min_on_time_data(self):
from datetime import timedelta as td
DF_dt_test=DataFrame({'dt':[nan,'2015-07-24 10:10','2015-07-25 11:11','2015-07-23 12:12',nan],'td':[nan,td(days=1) ,td(days=2),td(days=3),nan]})
DF_dt_test.dt=pd.to_datetime(DF_dt_test.dt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrap this line
use df as a variable name (or maybe df_test)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

@larvian larvian force-pushed the fix-issue-groupby-NaT branch 2 times, most recently from 0c4d1d9 to c200f66 Compare August 8, 2015 21:32
@larvian larvian force-pushed the fix-issue-groupby-NaT branch 2 times, most recently from a4d5cf8 to aa402ae Compare August 19, 2015 20:13
@larvian larvian force-pushed the fix-issue-groupby-NaT branch from aa402ae to b112b3a Compare August 22, 2015 05:13
@jreback
Copy link
Contributor

jreback commented Sep 1, 2015

can you rebase / update

@larvian larvian force-pushed the fix-issue-groupby-NaT branch from b112b3a to 1cf7b2f Compare September 1, 2015 12:57
@larvian
Copy link
Contributor Author

larvian commented Sep 1, 2015

Sure... rebase done. Queued for Travis CI

For groupby the time stamps gets converted to integervalue tslib.iNaT
which is -9223372036854775808. The aggregation is then done using this
value with incorrect result as a consequence. The solution proposed here
is to replace its value by np.nan in case it is a datetime or timedelta.
@larvian larvian force-pushed the fix-issue-groupby-NaT branch from 1cf7b2f to 9c2d1a6 Compare September 2, 2015 14:13
@larvian
Copy link
Contributor Author

larvian commented Sep 2, 2015

Rebased yesterday again and it passed Travis CI.
Today again I have "This branch has conflicts that must be resolved".
This seem to happen all the time due to changes to v0.17.0.txt .
It seems wasteful of Travis time to run CI just because updates due to changes to v0.17.0.txt.
How to avoid the constant conflicts due to changes in this file?
This time I try to insert my line above the last line.

@jreback
Copy link
Contributor

jreback commented Sep 2, 2015

to avoid conflicts, don't insert the line at the end of the notes, put it somewhere else above. This is why I have left lots of blank space (which gradually gets filled up).

@jreback jreback added this to the 0.17.0 milestone Sep 2, 2015
@jreback
Copy link
Contributor

jreback commented Sep 2, 2015

ping when green.

@larvian
Copy link
Contributor Author

larvian commented Sep 3, 2015

This one is green now

jreback added a commit that referenced this pull request Sep 3, 2015
BUG: Fix issue with incorrect groupby handling of NaT
@jreback jreback merged commit 3991731 into pandas-dev:master Sep 3, 2015
@jreback
Copy link
Contributor

jreback commented Sep 3, 2015

thanks!

@behzadnouri
Copy link
Contributor

this added line has broken cython aggregation. You cannot assign nan to an int64 view of an array.

before this patch:

In [1]: from string import ascii_lowercase

In [2]: np.random.seed(2718281)

In [3]: n = 1 << 21

In [4]: dr = date_range('2015-08-30', periods=n // 10, freq='T')

In [5]: df = DataFrame({
   ...:         '1st':np.random.choice(list(ascii_lowercase), n),
   ...:         '2nd':np.random.randint(0, 5, n),
   ...:         '3rd':np.random.choice(dr, n)})

In [6]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan

In [7]: gr = df.groupby(['1st', '2nd'])

In [8]: %timeit gr.count()
The slowest run took 21.22 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 13.3 ms per loop

In [9]: %timeit gr.count()
100 loops, best of 3: 13.8 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+521.g207efc2'

with this patch:

In [8]: %timeit gr.count()
1 loops, best of 3: 144 ms per loop

In [9]: %timeit gr.count()
10 loops, best of 3: 149 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+522.g9c2d1a6'

@jreback
Copy link
Contributor

jreback commented Sep 5, 2015

yes that should be iNaT

odd that the tests didn't pick this up

@behzadnouri
Copy link
Contributor

@jreback it is modifying values which are already iNaT:

values[values == tslib.iNaT] = np.nan

@jreback
Copy link
Contributor

jreback commented Sep 5, 2015

this should already be a float here

@jreback
Copy link
Contributor

jreback commented Sep 5, 2015

ahh I see it is raising, being cause and taking the slow path...ok.thanks for the catch

@larvian
Copy link
Contributor Author

larvian commented Sep 6, 2015

@behzadnouri @jreback Thanks for pointing out the problem.
Is it that the cython aggregation function accepts int64 does not like the nan and raises exception?
Not sure how to deal with it. Should all the NaT be dropped instead before aggregating?
Are there some aggregations where the NaT are important?

@larvian larvian deleted the fix-issue-groupby-NaT branch February 14, 2016 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistency, NaT included in result of groupby method first but not NaN
6 participants