Resample / upsample behavior diverges from pandas #1631

jhamman · 2017-10-12T19:22:44Z

I've found a few cases where xarray's new resample / upsample functionality diverges from pandas. I think they mostly concern how NaNs are treated. Thoughts from @shoyer, @darothen, and others?

Gist with all the juicy details: https://gist.github.com/jhamman/354f0e5ff32a39550ffd25800e7214fc#file-xarray_resample-ipynb

xref: #1608, #1272

shoyer · 2017-10-13T22:05:41Z

The key difference appears to be:

In xarray, .resample(...).interpolate(...) only interpolates over existing gaps in the data. If a value is already marked as NaN, it doesn't get interpolated.
In pandas, .resample(...).interpolate(...) fills in existing NaNs.
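A minimal sketch of the pandas side of this difference. The series here is a hypothetical stand-in, not the data from the gist, and the behavior shown is what pandas did at the time of this thread; it may differ in other versions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with one value already marked as NaN
# (an assumption for illustration; the gist data is not reproduced here).
index = pd.date_range("2016-01-01", periods=5, freq="D")
s = pd.Series([0.0, 1.0, np.nan, 3.0, 4.0], index=index)

# Upsample to 12-hourly and interpolate linearly.
upsampled = s.resample("12h").interpolate()

# The pre-existing NaN on 2016-01-03 is filled along with the newly
# created 12:00 slots, so no missing values remain.
print(upsampled.loc[pd.Timestamp("2016-01-03")])
print(upsampled.isna().any())
```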

I think this is a bug in pandas, since the behavior is inconsistent with other resample methods like ffill():

>>> s.reindex_like(slike).resample('1D').ffill()
time
2016-01-01     NaN
2016-01-02     0.0
2016-01-03     1.0
2016-01-04     2.0
2016-01-05     3.0
2016-01-06     NaN
2016-01-07     4.0
2016-01-08     5.0
2016-01-09     6.0
2016-01-10     7.0
2016-01-11     8.0
2016-01-12     9.0
2016-01-13    10.0
2016-01-14     NaN
2016-01-15     NaN
Freq: D, dtype: float32

More generally: resample() exists for resampling existing values, not for filling in missing values. If you want to fill in values that are already NaN, you should use one of the existing filling methods (e.g., fillna() or interpolate()). Or you can drop the missing values with .dropna().
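The distinction above can be sketched in pandas as follows. The series is a hypothetical example, not the one from the gist.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a pre-existing NaN on 2016-01-03.
index = pd.date_range("2016-01-01", periods=5, freq="D")
s = pd.Series([0.0, 1.0, np.nan, 3.0, 4.0], index=index)

# resample().ffill() only repeats existing values into the new slots,
# so the value on 2016-01-03 (and its 12:00 slot) is still NaN afterwards.
upsampled = s.resample("12h").ffill()

# Filling values that are already NaN is a separate, explicit step...
filled = s.interpolate()   # linear by default: the NaN becomes 2.0
constant = s.fillna(0.0)   # or fill with a constant

# ...or the missing values can be dropped instead.
dropped = s.dropna()
```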

(This does suggest that xarray could use a direct DataArray.interpolate() method.)

Another example:

>>> s.reindex_like(slike).resample('12H').ffill()
time
2016-01-01 00:00:00     NaN
2016-01-01 12:00:00     NaN
2016-01-02 00:00:00     0.0
2016-01-02 12:00:00     0.0
2016-01-03 00:00:00     1.0
2016-01-03 12:00:00     1.0
2016-01-04 00:00:00     2.0
2016-01-04 12:00:00     2.0
2016-01-05 00:00:00     3.0
2016-01-05 12:00:00     3.0
2016-01-06 00:00:00     NaN
2016-01-06 12:00:00     NaN
2016-01-07 00:00:00     4.0
2016-01-07 12:00:00     4.0
2016-01-08 00:00:00     5.0
2016-01-08 12:00:00     5.0
2016-01-09 00:00:00     6.0
2016-01-09 12:00:00     6.0
2016-01-10 00:00:00     7.0
2016-01-10 12:00:00     7.0
2016-01-11 00:00:00     8.0
2016-01-11 12:00:00     8.0
2016-01-12 00:00:00     9.0
2016-01-12 12:00:00     9.0
2016-01-13 00:00:00    10.0
2016-01-13 12:00:00    10.0
2016-01-14 00:00:00     NaN
2016-01-14 12:00:00     NaN
2016-01-15 00:00:00     NaN
Freq: 12H, dtype: float32

It is useful that pandas's upsampling only repeats values within the previously valid range. Otherwise it would be likely to interpolate over true data gaps.

As another use case: suppose we have a temperature dataset with 3-hourly measurements, and we want to upsample it to 1-hour resolution. Occasionally, measurements are missing for day(s) at a time, which we mark with missing values (suppose the server running the model ran out of disk space). It is useful to be able to resample to a higher resolution without entirely unrealistic interpolation over data gaps.
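One way this use case might be approached in pandas, as a rough sketch rather than an established recipe: interpolate to hourly resolution, then mask every hour whose neighboring 3-hourly measurement on either side is missing. The temperature values and the masking strategy are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-hourly temperatures with a gap at 06:00 and 09:00.
index = pd.date_range("2016-01-01", periods=8, freq="3h")
temps = pd.Series([10.0, 11.0, np.nan, np.nan, 14.0, 15.0, 16.0, 17.0],
                  index=index)

# Hourly linear interpolation; in pandas this also fills the gap.
hourly = temps.resample("1h").interpolate()

# For each hour, check whether the 3-hourly measurement at or before it
# and the one at or after it were both present.
valid = temps.notna()
ok_prev = valid.resample("1h").ffill()
ok_next = valid.resample("1h").bfill()

# Keep interpolated values only where both neighbors were real data,
# so the original gap stays NaN at the higher resolution.
result = hourly.where(ok_prev & ok_next)
```

With this data, 01:00 and 02:00 are interpolated between the 00:00 and 03:00 measurements, while every hour strictly between 03:00 and 12:00 remains NaN.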

jhamman · 2017-10-13T23:48:31Z

Thanks @shoyer. I always appreciated this feature in pandas, so I'm bummed to see it may not have been intentional. I need an xarray interpolate method that fills NaNs, so I'll give that a go. I suspect it will be a widely used feature for dealing with missing data.

shoyer · 2017-10-13T23:54:51Z

Let's see where the pandas discussion ends up. If xarray had a method for interpolating to fill missing values, achieving your desired result would be as simple as chaining another interpolate call, e.g., .resample('1D').interpolate().interpolate_na() or .interpolate_na().resample('1D').interpolate().

darothen · 2017-10-14T13:19:58Z

Thanks for documenting this @jhamman. I think all the logic needed to build out true interpolation (or, really, imputation/infilling) is already in .resample(...).interpolate(). I can jump in if there's any confusion in the code.

mmartini-usgs · 2017-10-30T18:11:58Z

Thanks for posting this @jhamman. It's really helping me understand what is going on with my data when I use xarray. My understanding of pandas is that it should not interpolate by default. However, I am downsampling, and this is stated for upsampling (in Python for Data Analysis).

jhamman added the topic-pandas-like label Oct 12, 2017

shoyer mentioned this issue Oct 13, 2017

resample().interpolate() should not fill pre-existing NaNs pandas-dev/pandas#17868

Open

jhamman mentioned this issue Oct 20, 2017

WIP: Feature/interpolate #1640

Merged

4 tasks

fujiisoup closed this as completed in #1640 Dec 30, 2017
