Numerical stability #4694

Closed
Lysovenko opened this issue May 10, 2014 · 7 comments

@Lysovenko

There is no numerical stability in NumPy (version 1.6.2, but perhaps also in future versions).

>>> d=array([ 253.,  253.,  253.,  253.,  253.,  253.,  252.,  253.,  252.,
        252.,  253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  252.,  253.,  253.,  252.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  254.,  253.,  253.,  253.,  252.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  252.,  251.,  254.,  254.,  254.,
        252.,  252.,  253.,  253.,  254.,  253.,  253.,  253.,  254.,
        253.,  252.,  253.,  254.,  253.,  252.,  253.,  253.,  253.,
        253.,  253.,  253.,  252.,  253.,  253.,  253.,  252.,  253.,
        254.,  252.,  252.,  253.,  253.,  253.,  253.,  253.,  252.,
        253.,  253.,  253.,  252.,  253.,  253.,  253.,  252.,  253.,
        253.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  252.,  253.,
        253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  252.,  253.,  253.,  253.,  253.,  253.,  253.,  254.,
        253.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        254.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  254.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  254.,  254.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        254.,  254.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  254.,  254.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  254.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  252.,  252.,  253.,  253.,  253.,  253.,  254.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        252.,  253.,  253.,  253.,  253.,  252.,  253.,  253.,  253.,
        253.,  252.,  253.,  253.,  252.,  252.,  252.,  252.,  252.,
        253.,  252.,  253.,  252.,  253.,  253.,  252.,  253.,  253.,
        252.,  252.,  252.,  252.,  252.,  253.,  253.,  253.,  252.,
        252.,  251.,  252.,  252.,  252.,  252.,  252.,  252.,  253.,
        252.,  252.,  252.,  252.,  252.,  252.,  253.,  253.,  253.,
        252.,  252.,  252.,  253.,  253.,  252.], dtype=float32)
>>> d.sum() / len(d) == d.mean()
True
# this is wrong: when summing a huge number of values, the last digits can be lost
>>> (d ** 2).mean() - d.mean() ** 2
-0.072070897236699238
# unreal bullshit happens: the variance can't be negative
>>> x = 0.
>>> for i, v in enumerate(d, 1):
...  x += (v - x) / i
... 
>>> x
252.90442890442881
>>> d.mean()
252.9044289044289
# x is the numerically stable d.mean()
>>> y = 0.
>>> for i, v in enumerate(d ** 2, 1):
...  y += (v - y) / i
... 
>>> y
63960.853146853144
>>> (d ** 2).mean()
63960.578088578091
>>> y - x ** 2
0.20298737785924459
>>> d.std() ** 2
0.20298681725988854
# at least d.std() has some stability
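
The running update x += (v - x) / i above is the mean half of Welford's one-pass algorithm; a minimal sketch of the full version (illustrative only, not code from the issue) also accumulates the variance stably in a single pass:

def welford_mean_var(values):
    # One-pass, numerically stable mean and population variance (Welford's algorithm).
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n            # same incremental mean update as above
        m2 += delta * (v - mean)     # running sum of squared deviations from the mean
    return mean, m2 / n

This sidesteps both the E[x^2] - E[x]^2 cancellation and the loss of low-order digits in a large running sum.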
@njsmith
Member

njsmith commented May 10, 2014

I think your complaint is just that when you have float32s in numpy, you get more rounding error than when using python's builtin float64s? If so then the solution is just to tell numpy to use float64. If you can't store your data using float32 in general, then you can still ask specific operations to upcast internally by using the dtype argument. Try d.mean(dtype=np.float64), d.sum(dtype=np.float64).

(Also in newer versions numpy does use a somewhat more numerically stable implementation for sum, but I'm not sure it will help for this particular pattern of rounding errors.)
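
(The more stable summation mentioned here is pairwise summation, presumably the #3685 change discussed further down. A rough sketch of the idea, with an illustrative helper name:

def pairwise_sum(a):
    # Split, sum the halves recursively, then combine: rounding error grows
    # roughly like O(log n) instead of O(n) for a plain left-to-right loop.
    if len(a) <= 8:               # small blocks are summed directly
        total = 0.0
        for v in a:
            total += v
        return total
    mid = len(a) // 2
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])
)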

If this doesn't resolve your problem please re-open with more details?

@njsmith njsmith closed this as completed May 10, 2014
@njsmith
Member

njsmith commented May 10, 2014

Err, when I said "If you can't store your data using float32 in general...", of course I meant float64, not float32.

@Lysovenko
Author

dtype=np.float64 does not work in my case. array.mean must be
stabilized rather than array.sum. array.sum can be stabilized by summing a
sorted array, but that is too expensive in CPU time ;-)

@charris
Member

charris commented May 10, 2014

This is probably improved in the upcoming 1.9 version. Could you give that
a try?
The easiest way to get that at the moment is to pull the master branch, but
the first beta should be out in a few days.

Chuck

@njsmith
Member

njsmith commented May 10, 2014

On further checking... your array is all small integers, so it turns out that d.sum() and (d ** 2).sum() return the exactly correct answer, with no rounding error. As a result, the only rounding in d.sum() / len(d) and in (d ** 2).sum() / len(d) occurs in the division. Since primitive operations like / are correctly rounded, this means that d.mean() and (d ** 2).mean() are also correctly rounded -- they have the minimum possible error that any floating point operation can have.

So the only possible way to get a more accurate mean on this array is to switch to a higher-precision floating point format. And for this purpose, dtype=np.float64 works just fine for me:

In [41]: (d ** 2).mean() - d.mean() ** 2
Out[41]: -0.074715096736326814

In [42]: (d ** 2).mean(dtype=np.float64) - d.mean(dtype=np.float64) ** 2
Out[42]: 0.20298737781558884

(This is with numpy 1.8.0.)

@argriffing
Contributor

# this is wrong: when summing a huge number of values, the last digits can be lost
>>> (d ** 2).mean() - d.mean() ** 2
-0.072070897236699238
# unreal bullshit happens: the variance can't be negative

This way of computing the variance of d is a classic example of an unstable formula!

There is a Wikipedia page about the different numerical properties of various mathematically equivalent ways to compute it: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Na.C3.AFve_algorithm. The Python devs seem to agree that this is 'unreal bullshit', so in PEP 450 they are adding a standard library function to compute variance. The PEP has an example, with a comment

And variance should *never* be negative:

        >>> variance(data*100)
        -1239429440.1282566

where variance is implemented with essentially the formula that you are using here. If you are using numpy, a better way is to use np.var(d). This should work in your example even if d has dtype np.float32.
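
A minimal sketch of the difference, using a stand-in float32 array since the original data is long (the variable names are illustrative):

import numpy as np

d = np.asarray([252.0, 253.0, 253.0, 254.0] * 100, dtype=np.float32)  # stand-in data

naive    = (d ** 2).mean() - d.mean() ** 2   # E[x^2] - E[x]^2: two large, nearly equal terms cancel
two_pass = ((d - d.mean()) ** 2).mean()      # subtract the mean first, then square the small residuals
builtin  = np.var(d)                         # NumPy's variance (it also works with deviations from the mean)
print(naive, two_pass, builtin)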

Edit: Sorry, you probably knew this already, and you were wondering why the numpy mean is worse than your handwritten mean. I guess this is mainly because of the data types of the intermediate calculations, and this seems to have a long history in the numpy bug tracker #1033, #1063, #2448.

Edit 2:

This is probably improved in the upcoming 1.9 version.
Could you give that a try?
The easiest way to get that at the moment is to pull the master branch

I tried this and got 0.202987 (float32) for np.var and 0.199219 (float32) for the naive formula. So it now does a better job of preserving the float32 data type and also gets a closer answer. I guess the improved stability is due to #3685?

@argriffing
Contributor

On further checking... your array is all small integers, so it turns out that d.sum()
and (d ** 2).sum() return the exactly correct answer, with no rounding error.

No, after digging around for way too long I see that this is not true for earlier versions of numpy.

It's true that float32 happens to be able to store the sum of squares (27439206) exactly, even though it is slightly larger than 2^24 = 16777216. It's also true that newer versions of numpy (presumably post-#3685) compute the sum exactly in float32 arithmetic using a slightly clever summation algorithm. But if you use the straightforward summation, then after you've accumulated a sum greater than 2^24 you start adding 64008 instead of 64009 every time you try to increment by 253^2, and 63000 instead of 63001 when you try to increment by 251^2. In this data set this happens 118 times, so the computed sum is 27439088, and this error accounts for the 'unreal bullshit' negativeness of the computed variance.
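
A quick way to see the effect (a sketch in plain float32 arithmetic, not the exact accumulation NumPy performs):

import numpy as np

# Between 2**24 and 2**25 float32 can only represent even integers, so adding an
# odd increment such as 253.0**2 == 64009 to a large running sum has to round.
s = np.float32(2 ** 24)
print(s + np.float32(1.0) == s)                      # True: the +1 is lost entirely
print(np.float32(20000000.0) + np.float32(64009.0))  # 20064008.0, not 20064009.0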
