Numerically unstable mean calculation for Timedeltas. #9670

Closed
musically-ut opened this issue Mar 17, 2015 · 1 comment
Labels: Bug, Duplicate Report (duplicate issue or pull request), Timedelta (Timedelta data type)

Comments

@musically-ut
Contributor

I am not sure whether I should report this here or on numpy, but this is what led me to the problem:

In [11]: dAllTags.describe()
Out[11]:
                     finalPeriod
count                      74501
mean    -1 days +02:40:08.792662
std     500 days 06:32:37.640848
min       2 days 00:51:49.730000
25%     498 days 19:11:28.576000
50%     846 days 00:46:56.656000
75%    1245 days 17:11:58.493000
max    2224 days 07:03:26.593000

All the values are positive (the minimum is 2 days), yet the calculated mean is negative. This happens because the underlying type of np.timedelta64 is int64, and the intermediate sum overflows while the mean is being calculated.
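To make the overflow concrete, here is a minimal sketch using plain numpy int64 values rather than the original data. The 900-day duration is a made-up stand-in for the real column; only the count (74,501) comes from the report above.

```python
import numpy as np

NS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day

# Hypothetical data: 74,501 durations of 900 days each, held as int64
# nanoseconds (the storage layout of timedelta64[ns]).
count = 74_501
durations_ns = np.full(count, 900 * NS_PER_DAY, dtype=np.int64)

# The true total, computed with Python's arbitrary-precision integers.
exact_total = count * 900 * NS_PER_DAY
int64_max = np.iinfo(np.int64).max

# numpy accumulates in int64 and wraps around silently on overflow.
wrapped_total = int(durations_ns.sum())

print("true total exceeds int64 range:", exact_total > int64_max)
print("int64 sum matches true total:  ", wrapped_total == exact_total)
```

Any mean derived from the wrapped total is meaningless, which is consistent with the negative value in the describe() output.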

The issue of numerical stability in numpy has a long history. Although some steps have been taken to improve precision (e.g. by providing fsum and using pairwise summation), there doesn't seem to be a consensus on using a numerically stable method for mean.
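For readers unfamiliar with the two techniques mentioned, the sketch below contrasts naive floating-point summation with `math.fsum` from the Python standard library and with a hand-written pairwise sum. The `pairwise_sum` helper and its block size of 8 are illustrative assumptions, not numpy's implementation. Note that these techniques target floating-point round-off; they do not by themselves prevent the int64 overflow described above.

```python
import math


def pairwise_sum(values, block=8):
    """Illustrative pairwise summation: split the input in half recursively
    and add the partial sums, summing short blocks directly."""
    n = len(values)
    if n <= block:
        total = 0.0
        for v in values:
            total += v
        return total
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


tenths = [0.1] * 10

naive_total = sum(tenths)        # left-to-right accumulation, rounding at each step
fsum_total = math.fsum(tenths)   # tracks exact partial sums internally
pairwise_total = pairwise_sum(tenths)

print(naive_total, fsum_total, pairwise_total)
```

Pairwise summation keeps the rounding error growing roughly with the logarithm of the input length instead of linearly, at almost no extra cost, which is why it is a common compromise.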

I was wondering whether something could be done at the pandas level to resolve this issue.


Currently, I am working around the issue by using the rather elaborate scheme:

df.finalPeriod.view(int).astype(float).mean()

since timedelta64 cannot be directly converted to float64. Is there a better/more intuitive way to do this?
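One possibly more readable variant of the same idea, assuming a reasonably recent pandas and numpy: dividing a timedelta Series by a unit `np.timedelta64` yields a float64 Series, so the mean is accumulated in floating point. The `periods` Series below is a small hypothetical stand-in for `df.finalPeriod`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.finalPeriod.
periods = pd.Series(pd.to_timedelta([2, 499, 846, 1245, 2223], unit="D"))

# Dividing by a unit timedelta gives a float64 Series (here, in days),
# so the mean is computed in floating point rather than int64.
mean_days = (periods / np.timedelta64(1, "D")).mean()

# Convert the float result back to a Timedelta for display.
mean_period = pd.to_timedelta(mean_days, unit="D")

print(mean_days, mean_period)
```

The trade-off is precision: float64 carries about 16 significant digits, so the result is not guaranteed to be exact to the nanosecond for very large durations.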

@jreback
Contributor

jreback commented Mar 17, 2015

this is a dupe of #9442

pull-requests are welcome. This just needs to be addressed in core/nanops.py by adjusting the precision of sum (which is the basis of most of the other ops).
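As a rough illustration of what "adjusting the precision of sum" could look like, the sketch below accumulates the int64 values in a float64 accumulator through the `dtype` argument of numpy's `sum`. The `mean_timedelta_ns` helper is hypothetical and is not the code in core/nanops.py; it only shows the general idea under that assumption.

```python
import numpy as np


def mean_timedelta_ns(values):
    """Hypothetical helper: mean of a timedelta64 array, accumulated in float64."""
    as_int = values.astype("timedelta64[ns]").view(np.int64)
    # A float64 accumulator cannot wrap around the way an int64 one does.
    total = as_int.sum(dtype=np.float64)
    return np.timedelta64(int(round(total / as_int.size)), "ns")


# Hypothetical data: 74,501 durations of 900 days each.
durations = np.full(74_501, 900, dtype=np.int64) * np.timedelta64(1, "D")
result = mean_timedelta_ns(durations)

print(result.astype("timedelta64[D]"))
```

Because float64 has a 53-bit significand, the result may be off by a few nanoseconds for large totals; an exact approach would need a wider integer accumulator or a running-mean formulation.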

@jreback jreback closed this as completed Mar 17, 2015
@jreback jreback added Bug Timedelta Timedelta data type Duplicate Report Duplicate issue or pull request labels Mar 17, 2015