I am not sure whether I should report this here or on numpy, but this is what led me to the problem:
```
In [11]: dAllTags.describe()
Out[11]:
                     finalPeriod
count                      74501
mean   -1 days +02:40:08.792662
std    500 days 06:32:37.640848
min      2 days 00:51:49.730000
25%    498 days 19:11:28.576000
50%    846 days 00:46:56.656000
75%   1245 days 17:11:58.493000
max   2224 days 07:03:26.593000
```
All the values are positive (the minimum is 2 days), but the calculated mean is negative. This happens because the underlying type of `np.timedelta64` is `int64`, which overflows while calculating the mean.
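The overflow can be reproduced with plain numpy. A minimal sketch, using illustrative sizes and values (not the actual data): 74501 periods of roughly 800 days each come to about 5.1e21 nanoseconds, which exceeds the `int64` maximum of about 9.2e18, so the sum silently wraps around.

```python
import numpy as np

# 74501 periods of 800 days each, stored at nanosecond resolution
# (the same representation pandas uses for timedelta columns).
td = np.full(74501, np.timedelta64(800, "D"),
             dtype="timedelta64[D]").astype("timedelta64[ns]")

exact_sum = 74501 * 800 * 86400 * 10**9    # true total in nanoseconds
wrapped_sum = int(td.view("int64").sum())  # int64 sum wraps around

print(exact_sum > np.iinfo(np.int64).max)  # True: total doesn't fit in int64
print(wrapped_sum != exact_sum)            # True: the sum has overflowed
```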
Now, the issue of numerical stability in numpy has a long history. And though some steps have been taken to improve precision (e.g., by providing `fsum` and using pairwise summation), there doesn't seem to be a consensus on using a numerically stable method for `mean`.

I was wondering if something could be done at the Pandas level to resolve this issue.
Currently, I am working around the issue with the rather elaborate scheme

```python
df.finalPeriod.view(int).astype(float).mean()
```

since `timedelta64` cannot be directly converted to `float64`. Is there a better/more intuitive way to do this?
Pull requests are welcome. This just needs to be addressed in `core/nanops.py` by adjusting the precision of `sum` (which is the basis of most of the other ops).
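A hedged sketch of what such a fix could look like, written as a standalone helper rather than the actual `nanops` code: accumulate in `float64`, trading sub-microsecond exactness for an overflow-free result, then convert back to `timedelta64`.

```python
import numpy as np

def timedelta_mean(values):
    """Mean of a timedelta64[ns] array, accumulating in float64 to
    avoid int64 overflow. Hypothetical helper, not pandas' actual code."""
    ints = values.view("int64")               # raw nanosecond counts
    mean_ns = ints.astype("float64").mean()   # float64 cannot wrap around
    return np.timedelta64(int(round(mean_ns)), "ns")

# With every value equal to 800 days, the mean round-trips to 800 days,
# whereas the int64 path above produced a nonsensical (overflowed) result.
td = np.full(74501, np.timedelta64(800, "D"),
             dtype="timedelta64[D]").astype("timedelta64[ns]")
print(timedelta_mean(td) == np.timedelta64(800, "D"))  # True
```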