Numerical stability #4694

Closed
Lysovenko opened this issue May 10, 2014 · 7 comments

@Lysovenko

There is no numerical stability in NumPy (version 1.6.2, but perhaps also in future versions).

>>> d=array([ 253.,  253.,  253.,  253.,  253.,  253.,  252.,  253.,  252.,
        252.,  253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  252.,  253.,  253.,  252.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  254.,  253.,  253.,  253.,  252.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  252.,  251.,  254.,  254.,  254.,
        252.,  252.,  253.,  253.,  254.,  253.,  253.,  253.,  254.,
        253.,  252.,  253.,  254.,  253.,  252.,  253.,  253.,  253.,
        253.,  253.,  253.,  252.,  253.,  253.,  253.,  252.,  253.,
        254.,  252.,  252.,  253.,  253.,  253.,  253.,  253.,  252.,
        253.,  253.,  253.,  252.,  253.,  253.,  253.,  252.,  253.,
        253.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  252.,  253.,
        253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  252.,  253.,  253.,  253.,  253.,  253.,  253.,  254.,
        253.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  253.,  253.,  252.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        254.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  254.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  254.,  254.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        254.,  254.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  254.,  254.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  254.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        253.,  253.,  253.,  253.,  253.,  254.,  253.,  253.,  253.,
        253.,  252.,  252.,  253.,  253.,  253.,  253.,  254.,  253.,
        253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,  253.,
        252.,  253.,  253.,  253.,  253.,  252.,  253.,  253.,  253.,
        253.,  252.,  253.,  253.,  252.,  252.,  252.,  252.,  252.,
        253.,  252.,  253.,  252.,  253.,  253.,  252.,  253.,  253.,
        252.,  252.,  252.,  252.,  252.,  253.,  253.,  253.,  252.,
        252.,  251.,  252.,  252.,  252.,  252.,  252.,  252.,  253.,
        252.,  252.,  252.,  252.,  252.,  252.,  253.,  253.,  253.,
        252.,  252.,  252.,  253.,  253.,  252.], dtype=float32)
>>> d.sum() / len(d) == d.mean()
True
# this is wrong: when summing a huge number of values, the last digits can be lost
>>> (d ** 2).mean() - d.mean() ** 2
-0.072070897236699238
# unreal bullshit happens: the variance can't be negative
>>> x = 0.
>>> for i, v in enumerate(d, 1):
...  x += (v - x) / i
... 
>>> x
252.90442890442881
>>> d.mean()
252.9044289044289
# x is the numerically stable d.mean()
>>> y = 0.
>>> for i, v in enumerate(d ** 2, 1):
...  y += (v - y) / i
... 
>>> y
63960.853146853144
>>> (d ** 2).mean()
63960.578088578091
>>> y - x ** 2
0.20298737785924459
>>> d.std() ** 2
0.20298681725988854
# at least d.std() has some stability
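
The running update x += (v - x) / i above is the mean half of Welford's one-pass algorithm; a minimal sketch of the full version (illustrative only, not code from the issue) also accumulates the variance stably in a single pass:

def welford_mean_var(values):
    # One-pass, numerically stable mean and population variance (Welford's algorithm).
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n            # same incremental mean update as above
        m2 += delta * (v - mean)     # running sum of squared deviations from the mean
    return mean, m2 / n

This sidesteps both the E[x^2] - E[x]^2 cancellation and the loss of low-order digits in a large running sum.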
@njsmith
Member

njsmith commented May 10, 2014

I think your complaint is just that when you have float32s in numpy, you get more rounding error than when using python's builtin float64s? If so then the solution is just to tell numpy to use float64. If you can't store your data using float32 in general, then you can still ask specific operations to upcast internally by using the dtype argument. Try d.mean(dtype=np.float64), d.sum(dtype=np.float64).

(Also in newer versions numpy does use a somewhat more numerically stable implementation for sum, but I'm not sure it will help for this particular pattern of rounding errors.)
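
(The more stable summation mentioned here is pairwise summation, presumably the #3685 change discussed further down. A rough sketch of the idea, with an illustrative helper name:

def pairwise_sum(a):
    # Split, sum the halves recursively, then combine: rounding error grows
    # roughly like O(log n) instead of O(n) for a plain left-to-right loop.
    if len(a) <= 8:               # small blocks are summed directly
        total = 0.0
        for v in a:
            total += v
        return total
    mid = len(a) // 2
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])
)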

If this doesn't resolve your problem please re-open with more details?

@njsmith njsmith closed this as completed May 10, 2014
@njsmith
Member

njsmith commented May 10, 2014

Err, when I said "If you can't store your data using float32 in general...", of course I meant float64, not float32.

@Lysovenko
Author

dtype=np.float64 does not work in my case. array.mean must be
stabilized rather than array.sum. array.sum can be stabilized by summing a
sorted array, but that is too expensive in CPU time ;-)

@charris
Member

charris commented May 10, 2014

This is probably improved in the upcoming 1.9 version. Could you give that
a try?
The easiest way to get that at the moment is to pull the master branch, but
the first beta should be out in a few days.

Chuck

@njsmith
Member

njsmith commented May 10, 2014

On further checking... your array is all small integers, so it turns out that d.sum() and (d ** 2).sum() return the exactly correct answer, with no rounding error. As a result, the only rounding in d.sum() / len(d) and in (d ** 2).sum() / len(d) occurs in the division. Since primitive operations like / are correctly rounded, this means that d.mean() and (d ** 2).mean() are also correctly rounded -- they have the minimum possible error that any floating point operation can have.

So the only possible way to get a more accurate mean on this array is to switch to a higher-precision floating point format. And for this purpose, dtype=np.float64 works just fine for me:

In [41]: (d ** 2).mean() - d.mean() ** 2
Out[41]: -0.074715096736326814

In [42]: (d ** 2).mean(dtype=np.float64) - d.mean(dtype=np.float64) ** 2
Out[42]: 0.20298737781558884

(This is with numpy 1.8.0.)

@argriffing
Contributor

# this is wrong: when summing a huge number of values, the last digits can be lost
>>> (d ** 2).mean() - d.mean() ** 2
-0.072070897236699238
# unreal bullshit happens: the variance can't be negative

This way of computing the variance of d is a classic example of an unstable formula!

There is a Wikipedia page about the different numerical properties of various mathematically equivalent ways to compute it: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Na.C3.AFve_algorithm. The Python devs seem to agree that this is 'unreal bullshit', so in PEP 450 they are adding a standard library function to compute variance. The PEP has an example, with a comment

And variance should *never* be negative:

        >>> variance(data*100)
        -1239429440.1282566

where variance is implemented with essentially the formula that you are using here. If you are using numpy, a better way is to use np.var(d). This should work in your example even if d has dtype np.float32.
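
A minimal sketch of the difference, using a stand-in float32 array since the original data is long (the variable names are illustrative):

import numpy as np

d = np.asarray([252.0, 253.0, 253.0, 254.0] * 100, dtype=np.float32)  # stand-in data

naive    = (d ** 2).mean() - d.mean() ** 2   # E[x^2] - E[x]^2: two large, nearly equal terms cancel
two_pass = ((d - d.mean()) ** 2).mean()      # subtract the mean first, then square the small residuals
builtin  = np.var(d)                         # NumPy's variance (it also works with deviations from the mean)
print(naive, two_pass, builtin)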

Edit: Sorry, you probably knew this already, and you were wondering why the numpy mean is worse than your handwritten mean. I guess this is mainly because of the data types of the intermediate calculations, and this seems to have a long history in the numpy bug tracker #1033, #1063, #2448.

Edit 2:

This is probably improved in the upcoming 1.9 version.
Could you give that a try?
The easiest way to get that at the moment is to pull the master branch

I tried this and got 0.202987 (float32) for np.var and 0.199219 (float32) for the naive formula. So it now does a better job of preserving the float32 data type and also gets a closer answer. I guess the improved stability is due to #3685?

@argriffing
Contributor

On further checking... your array is all small integers, so it turns out that d.sum()
and (d ** 2).sum() return the exactly correct answer, with no rounding error.

No, after digging around for way too long I see that this is not true for earlier versions of numpy.

It's true that float32 happens to be able to store the sum of squares (27439206) exactly, even though it is slightly larger than 2^24 = 16777216. It's also true that newer versions of numpy (presumably post-#3685) compute the sum exactly in float32 arithmetic using a slightly clever summation algorithm. But if you use the straightforward summation, then after you've accumulated a sum greater than 2^24 you start adding 64008 instead of 64009 every time you try to increment by 253^2, and 63000 instead of 63001 when you try to increment by 251^2. In this data set this happens 118 times, so the computed sum is 27439088, and this error accounts for the 'unreal bullshit' negativeness of the computed variance.
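
A quick way to see the effect (a sketch in plain float32 arithmetic, not the exact accumulation NumPy performs):

import numpy as np

# Between 2**24 and 2**25 float32 can only represent even integers, so adding an
# odd increment such as 253.0**2 == 64009 to a large running sum has to round.
s = np.float32(2 ** 24)
print(s + np.float32(1.0) == s)                      # True: the +1 is lost entirely
print(np.float32(20000000.0) + np.float32(64009.0))  # 20064008.0, not 20064009.0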
