Commit ee9da17

interpolate_na: Add max_gap support. (#3302)
* interpolate_na: Add maxgap support.
* Add docs.
* Add requires_bottleneck to test.
* Review comments.
* Update xarray/core/dataarray.py Co-Authored-By: Maximilian Roos <[email protected]>
* Update xarray/core/dataset.py Co-Authored-By: Maximilian Roos <[email protected]>
* maxgap → max_gap
* update whats-new
* update computation.rst
* Better support uniformly spaced coordinates. Split lengths, interp test
* Raise error for max_gap and irregularly spaced coordinates + test
* rework.
* Use pandas checks for index duplication and monotonicity.
* Progress + add datetime.
* nicer error message
* A few fstrings.
* finish up timedelta max_gap.
* fix whats-new
* small fixes.
* fix dan's test.
* remove redundant test.
* nicer error message.
* Add xfailed cftime tests
* better error checking and tests.
* typing.
* update docstrings
* scipy intersphinx
* fix tests
* add bottleneck testing decorator.
1 parent 7b4a286 commit ee9da17

7 files changed, with 322 additions and 54 deletions.

doc/computation.rst

Lines changed: 3 additions & 0 deletions
@@ -95,6 +95,9 @@ for filling missing values via 1D interpolation.
9595
Note that xarray slightly diverges from the pandas ``interpolate`` syntax by
9696
providing the ``use_coordinate`` keyword which facilitates a clear specification
9797
of which values to use as the index in the interpolation.
98+
xarray also provides the ``max_gap`` keyword argument to limit the interpolation to
99+
data gaps of length ``max_gap`` or smaller. See :py:meth:`~xarray.DataArray.interpolate_na`
100+
for more.
98101

99102
Aggregation
100103
===========

doc/conf.py

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -340,9 +340,10 @@
340340
# Example configuration for intersphinx: refer to the Python standard library.
341341
intersphinx_mapping = {
342342
"python": ("https://docs.python.org/3/", None),
343-
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
344-
"iris": ("http://scitools.org.uk/iris/docs/latest/", None),
345-
"numpy": ("https://docs.scipy.org/doc/numpy/", None),
346-
"numba": ("https://numba.pydata.org/numba-doc/latest/", None),
347-
"matplotlib": ("https://matplotlib.org/", None),
343+
"pandas": ("https://pandas.pydata.org/pandas-docs/stable", None),
344+
"iris": ("https://scitools.org.uk/iris/docs/latest", None),
345+
"numpy": ("https://docs.scipy.org/doc/numpy", None),
346+
"scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
347+
"numba": ("https://numba.pydata.org/numba-doc/latest", None),
348+
"matplotlib": ("https://matplotlib.org", None),
348349
}

doc/whats-new.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,10 @@ Breaking changes
3838

3939
New Features
4040
~~~~~~~~~~~~
41+
42+
- Added the ``max_gap`` kwarg to :py:meth:`~xarray.DataArray.interpolate_na` and
43+
:py:meth:`~xarray.Dataset.interpolate_na`. This controls the maximum size of the data
44+
gap that will be filled by interpolation. By `Deepak Cherian <https://github.com/dcherian>`_.
4145
- :py:meth:`Dataset.drop_sel` & :py:meth:`DataArray.drop_sel` have been added for dropping labels.
4246
:py:meth:`Dataset.drop_vars` & :py:meth:`DataArray.drop_vars` have been added for
4347
dropping variables (including coordinates). The existing ``drop`` methods remain as a backward compatible

xarray/core/dataarray.py

Lines changed: 42 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -2018,44 +2018,69 @@ def fillna(self, value: Any) -> "DataArray":
20182018

20192019
def interpolate_na(
20202020
self,
2021-
dim=None,
2021+
dim: Hashable = None,
20222022
method: str = "linear",
20232023
limit: int = None,
20242024
use_coordinate: Union[bool, str] = True,
2025+
max_gap: Union[int, float, str, pd.Timedelta, np.timedelta64] = None,
20252026
**kwargs: Any,
20262027
) -> "DataArray":
2027-
"""Interpolate values according to different methods.
2028+
"""Fill in NaNs by interpolating according to different methods.
20282029
20292030
Parameters
20302031
----------
20312032
dim : str
20322033
Specifies the dimension along which to interpolate.
2033-
method : {'linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
2034-
'polynomial', 'barycentric', 'krog', 'pchip',
2035-
'spline', 'akima'}, optional
2034+
method : str, optional
20362035
String indicating which method to use for interpolation:
20372036
20382037
- 'linear': linear interpolation (Default). Additional keyword
2039-
arguments are passed to ``numpy.interp``
2040-
- 'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
2041-
'polynomial': are passed to ``scipy.interpolate.interp1d``. If
2042-
method=='polynomial', the ``order`` keyword argument must also be
2038+
arguments are passed to :py:func:`numpy.interp`
2039+
- 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'polynomial':
2040+
are passed to :py:func:`scipy.interpolate.interp1d`. If
2041+
``method='polynomial'``, the ``order`` keyword argument must also be
20432042
provided.
2044-
- 'barycentric', 'krog', 'pchip', 'spline', and `akima`: use their
2045-
respective``scipy.interpolate`` classes.
2046-
use_coordinate : boolean or str, default True
2043+
- 'barycentric', 'krog', 'pchip', 'spline', 'akima': use their
2044+
respective :py:class:`scipy.interpolate` classes.
2045+
use_coordinate : bool, str, default True
20472046
Specifies which index to use as the x values in the interpolation
20482047
formulated as `y = f(x)`. If False, values are treated as if
2049-
eqaully-spaced along `dim`. If True, the IndexVariable `dim` is
2050-
used. If use_coordinate is a string, it specifies the name of a
2048+
eqaully-spaced along ``dim``. If True, the IndexVariable `dim` is
2049+
used. If ``use_coordinate`` is a string, it specifies the name of a
20512050
coordinate variariable to use as the index.
20522051
limit : int, default None
20532052
Maximum number of consecutive NaNs to fill. Must be greater than 0
2054-
or None for no limit.
2053+
or None for no limit. This filling is done regardless of the size of
2054+
the gap in the data. To only interpolate over gaps less than a given length,
2055+
see ``max_gap``.
2056+
max_gap: int, float, str, pandas.Timedelta, numpy.timedelta64, default None.
2057+
Maximum size of gap, a continuous sequence of NaNs, that will be filled.
2058+
Use None for no limit. When interpolating along a datetime64 dimension
2059+
and ``use_coordinate=True``, ``max_gap`` can be one of the following:
2060+
2061+
- a string that is valid input for pandas.to_timedelta
2062+
- a :py:class:`numpy.timedelta64` object
2063+
- a :py:class:`pandas.Timedelta` object
2064+
Otherwise, ``max_gap`` must be an int or a float. Use of ``max_gap`` with unlabeled
2065+
dimensions has not been implemented yet. Gap length is defined as the difference
2066+
between coordinate values at the first data point after a gap and the last value
2067+
before a gap. For gaps at the beginning (end), gap length is defined as the difference
2068+
between coordinate values at the first (last) valid data point and the first (last) NaN.
2069+
For example, consider::
2070+
2071+
<xarray.DataArray (x: 9)>
2072+
array([nan, nan, nan, 1., nan, nan, 4., nan, nan])
2073+
Coordinates:
2074+
* x (x) int64 0 1 2 3 4 5 6 7 8
2075+
2076+
The gap lengths are 3-0 = 3; 6-3 = 3; and 8-6 = 2 respectively
2077+
kwargs : dict, optional
2078+
parameters passed verbatim to the underlying interpolation function
20552079
20562080
Returns
20572081
-------
2058-
DataArray
2082+
interpolated: DataArray
2083+
Filled in DataArray.
20592084
20602085
See also
20612086
--------
@@ -2070,6 +2095,7 @@ def interpolate_na(
20702095
method=method,
20712096
limit=limit,
20722097
use_coordinate=use_coordinate,
2098+
max_gap=max_gap,
20732099
**kwargs,
20742100
)
20752101
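
The gap-length definition in the ``max_gap`` docstring above can be checked by hand. The following is a minimal plain-Python sketch of that definition, not the xarray implementation; ``nan_gap_lengths`` is a hypothetical helper name, and the data are the nine-point example from the docstring.

```python
import math


def nan_gap_lengths(values, coords):
    """Return, for every position, the length of the NaN gap it belongs to.

    Valid (non-NaN) positions get 0. Hypothetical helper for illustration.
    """
    n = len(values)
    lengths = [0] * n
    i = 0
    while i < n:
        if not math.isnan(values[i]):
            i += 1
            continue
        # Find the end of this run of NaNs: positions i .. j-1 are all NaN.
        j = i
        while j < n and math.isnan(values[j]):
            j += 1
        # Interior gap: last valid point before the gap to first valid point after.
        # Leading gap: first NaN to first valid point.
        # Trailing gap: last valid point to last NaN.
        left = coords[i - 1] if i > 0 else coords[i]
        right = coords[j] if j < n else coords[j - 1]
        for k in range(i, j):
            lengths[k] = right - left
        i = j
    return lengths


nan = float("nan")
values = [nan, nan, nan, 1.0, nan, nan, 4.0, nan, nan]
coords = list(range(9))

lengths = nan_gap_lengths(values, coords)
# Positions that a max_gap of 2 would leave eligible for filling.
fillable = [math.isnan(v) and g <= 2 for v, g in zip(values, lengths)]
```

With ``max_gap=2`` only the trailing gap stays eligible. The mask expresses eligibility only; whether an eligible point is actually filled still depends on the interpolation method.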

xarray/core/dataset.py

Lines changed: 42 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -3908,42 +3908,65 @@ def interpolate_na(
39083908
method: str = "linear",
39093909
limit: int = None,
39103910
use_coordinate: Union[bool, Hashable] = True,
3911+
max_gap: Union[int, float, str, pd.Timedelta, np.timedelta64] = None,
39113912
**kwargs: Any,
39123913
) -> "Dataset":
3913-
"""Interpolate values according to different methods.
3914+
"""Fill in NaNs by interpolating according to different methods.
39143915
39153916
Parameters
39163917
----------
3917-
dim : Hashable
3918+
dim : str
39183919
Specifies the dimension along which to interpolate.
3919-
method : {'linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
3920-
'polynomial', 'barycentric', 'krog', 'pchip',
3921-
'spline'}, optional
3920+
method : str, optional
39223921
String indicating which method to use for interpolation:
39233922
39243923
- 'linear': linear interpolation (Default). Additional keyword
3925-
arguments are passed to ``numpy.interp``
3926-
- 'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
3927-
'polynomial': are passed to ``scipy.interpolate.interp1d``. If
3928-
method=='polynomial', the ``order`` keyword argument must also be
3924+
arguments are passed to :py:func:`numpy.interp`
3925+
- 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'polynomial':
3926+
are passed to :py:func:`scipy.interpolate.interp1d`. If
3927+
``method='polynomial'``, the ``order`` keyword argument must also be
39293928
provided.
3930-
- 'barycentric', 'krog', 'pchip', 'spline': use their respective
3931-
``scipy.interpolate`` classes.
3932-
use_coordinate : boolean or str, default True
3929+
- 'barycentric', 'krog', 'pchip', 'spline', 'akima': use their
3930+
respective :py:class:`scipy.interpolate` classes.
3931+
use_coordinate : bool, str, default True
39333932
Specifies which index to use as the x values in the interpolation
39343933
formulated as `y = f(x)`. If False, values are treated as if
3935-
eqaully-spaced along `dim`. If True, the IndexVariable `dim` is
3936-
used. If use_coordinate is a string, it specifies the name of a
3934+
eqaully-spaced along ``dim``. If True, the IndexVariable `dim` is
3935+
used. If ``use_coordinate`` is a string, it specifies the name of a
39373936
coordinate variariable to use as the index.
39383937
limit : int, default None
39393938
Maximum number of consecutive NaNs to fill. Must be greater than 0
3940-
or None for no limit.
3941-
kwargs : any
3942-
parameters passed verbatim to the underlying interplation function
3939+
or None for no limit. This filling is done regardless of the size of
3940+
the gap in the data. To only interpolate over gaps less than a given length,
3941+
see ``max_gap``.
3942+
max_gap: int, float, str, pandas.Timedelta, numpy.timedelta64, default None.
3943+
Maximum size of gap, a continuous sequence of NaNs, that will be filled.
3944+
Use None for no limit. When interpolating along a datetime64 dimension
3945+
and ``use_coordinate=True``, ``max_gap`` can be one of the following:
3946+
3947+
- a string that is valid input for pandas.to_timedelta
3948+
- a :py:class:`numpy.timedelta64` object
3949+
- a :py:class:`pandas.Timedelta` object
3950+
Otherwise, ``max_gap`` must be an int or a float. Use of ``max_gap`` with unlabeled
3951+
dimensions has not been implemented yet. Gap length is defined as the difference
3952+
between coordinate values at the first data point after a gap and the last value
3953+
before a gap. For gaps at the beginning (end), gap length is defined as the difference
3954+
between coordinate values at the first (last) valid data point and the first (last) NaN.
3955+
For example, consider::
3956+
3957+
<xarray.DataArray (x: 9)>
3958+
array([nan, nan, nan, 1., nan, nan, 4., nan, nan])
3959+
Coordinates:
3960+
* x (x) int64 0 1 2 3 4 5 6 7 8
3961+
3962+
The gap lengths are 3-0 = 3; 6-3 = 3; and 8-6 = 2 respectively
3963+
kwargs : dict, optional
3964+
parameters passed verbatim to the underlying interpolation function
39433965
39443966
Returns
39453967
-------
3946-
Dataset
3968+
interpolated: Dataset
3969+
Filled in Dataset.
39473970
39483971
See also
39493972
--------
@@ -3959,6 +3982,7 @@ def interpolate_na(
39593982
method=method,
39603983
limit=limit,
39613984
use_coordinate=use_coordinate,
3985+
max_gap=max_gap,
39623986
**kwargs,
39633987
)
39643988
return new
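
For a datetime64 dimension, the commit converts ``max_gap`` to a float number of nanoseconds so it can be compared with gap lengths computed from the index cast to float64. The following is a stdlib-only analogue of that unit conversion, assuming ``datetime.timedelta`` stands in for ``pandas.Timedelta`` and ``numpy.timedelta64``; ``timedelta_to_ns`` and ``gap_is_fillable`` are hypothetical names, not part of xarray.

```python
from datetime import datetime, timedelta


def timedelta_to_ns(gap: timedelta) -> float:
    """Express a timedelta as a float number of nanoseconds."""
    # Integer arithmetic avoids the rounding that total_seconds() can introduce.
    micros = (gap.days * 86_400 + gap.seconds) * 1_000_000 + gap.microseconds
    return float(micros * 1_000)


def gap_is_fillable(before: datetime, after: datetime, max_gap: timedelta) -> bool:
    """True if the span between the valid points around a gap is within max_gap."""
    return timedelta_to_ns(after - before) <= timedelta_to_ns(max_gap)


t0 = datetime(2001, 1, 1, 0)
# A 2-hour gap fits within a 3-hour max_gap; a 6-hour gap does not.
short_gap = gap_is_fillable(t0, t0 + timedelta(hours=2), timedelta(hours=3))
long_gap = gap_is_fillable(t0, t0 + timedelta(hours=6), timedelta(hours=3))
```

Putting both sides in the same unit before comparing is the point of the sketch; the commit does this with ``pandas.to_timedelta`` and ``numpy.timedelta64`` rather than the standard library.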

xarray/core/missing.py

Lines changed: 98 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,46 @@
11
import warnings
22
from functools import partial
3-
from typing import Any, Callable, Dict, Sequence
3+
from numbers import Number
4+
from typing import Any, Callable, Dict, Hashable, Sequence, Union
45

56
import numpy as np
67
import pandas as pd
78

89
from . import utils
9-
from .common import _contains_datetime_like_objects
10+
from .common import _contains_datetime_like_objects, ones_like
1011
from .computation import apply_ufunc
1112
from .duck_array_ops import dask_array_type
1213
from .utils import OrderedSet, is_scalar
1314
from .variable import Variable, broadcast_variables
1415

1516

17+
def _get_nan_block_lengths(obj, dim: Hashable, index: Variable):
18+
"""
19+
Return an object where each NaN element in 'obj' is replaced by the
20+
length of the gap the element is in.
21+
"""
22+
23+
# make variable so that we get broadcasting for free
24+
index = Variable([dim], index)
25+
26+
# algorithm from https://github.com/pydata/xarray/pull/3302#discussion_r324707072
27+
arange = ones_like(obj) * index
28+
valid = obj.notnull()
29+
valid_arange = arange.where(valid)
30+
cumulative_nans = valid_arange.ffill(dim=dim).fillna(index[0])
31+
32+
nan_block_lengths = (
33+
cumulative_nans.diff(dim=dim, label="upper")
34+
.reindex({dim: obj[dim]})
35+
.where(valid)
36+
.bfill(dim=dim)
37+
.where(~valid, 0)
38+
.fillna(index[-1] - valid_arange.max())
39+
)
40+
41+
return nan_block_lengths
42+
43+
1644
class BaseInterpolator:
1745
"""Generic interpolator class for normalizing interpolation methods
1846
"""
@@ -178,7 +206,7 @@ def _apply_over_vars_with_dim(func, self, dim=None, **kwargs):
178206
return ds
179207

180208

181-
def get_clean_interp_index(arr, dim, use_coordinate=True):
209+
def get_clean_interp_index(arr, dim: Hashable, use_coordinate: Union[str, bool] = True):
182210
"""get index to use for x values in interpolation.
183211
184212
If use_coordinate is True, the coordinate that shares the name of the
@@ -195,23 +223,33 @@ def get_clean_interp_index(arr, dim, use_coordinate=True):
195223
index = arr.coords[use_coordinate]
196224
if index.ndim != 1:
197225
raise ValueError(
198-
"Coordinates used for interpolation must be 1D, "
199-
"%s is %dD." % (use_coordinate, index.ndim)
226+
f"Coordinates used for interpolation must be 1D, "
227+
f"{use_coordinate} is {index.ndim}D."
200228
)
229+
index = index.to_index()
230+
231+
# TODO: index.name is None for multiindexes
232+
# set name for nice error messages below
233+
if isinstance(index, pd.MultiIndex):
234+
index.name = dim
235+
236+
if not index.is_monotonic:
237+
raise ValueError(f"Index {index.name!r} must be monotonically increasing")
238+
239+
if not index.is_unique:
240+
raise ValueError(f"Index {index.name!r} has duplicate values")
201241

202242
# raise if index cannot be cast to a float (e.g. MultiIndex)
203243
try:
204244
index = index.values.astype(np.float64)
205245
except (TypeError, ValueError):
206246
# pandas raises a TypeError
207-
# xarray/nuppy raise a ValueError
247+
# xarray/numpy raise a ValueError
208248
raise TypeError(
209-
"Index must be castable to float64 to support"
210-
"interpolation, got: %s" % type(index)
249+
f"Index {index.name!r} must be castable to float64 to support "
250+
f"interpolation, got {type(index).__name__}."
211251
)
212-
# check index sorting now so we can skip it later
213-
if not (np.diff(index) > 0).all():
214-
raise ValueError("Index must be monotonicly increasing")
252+
215253
else:
216254
axis = arr.get_axis_num(dim)
217255
index = np.arange(arr.shape[axis], dtype=np.float64)
@@ -220,7 +258,13 @@ def get_clean_interp_index(arr, dim, use_coordinate=True):
220258

221259

222260
def interp_na(
223-
self, dim=None, use_coordinate=True, method="linear", limit=None, **kwargs
261+
self,
262+
dim: Hashable = None,
263+
use_coordinate: Union[bool, str] = True,
264+
method: str = "linear",
265+
limit: int = None,
266+
max_gap: Union[int, float, str, pd.Timedelta, np.timedelta64] = None,
267+
**kwargs,
224268
):
225269
"""Interpolate values according to different methods.
226270
"""
@@ -230,6 +274,40 @@ def interp_na(
230274
if limit is not None:
231275
valids = _get_valid_fill_mask(self, dim, limit)
232276

277+
if max_gap is not None:
278+
max_type = type(max_gap).__name__
279+
if not is_scalar(max_gap):
280+
raise ValueError("max_gap must be a scalar.")
281+
282+
if (
283+
dim in self.indexes
284+
and isinstance(self.indexes[dim], pd.DatetimeIndex)
285+
and use_coordinate
286+
):
287+
if not isinstance(max_gap, (np.timedelta64, pd.Timedelta, str)):
288+
raise TypeError(
289+
f"Underlying index is DatetimeIndex. Expected max_gap of type str, pandas.Timedelta or numpy.timedelta64 but received {max_type}"
290+
)
291+
292+
if isinstance(max_gap, str):
293+
try:
294+
max_gap = pd.to_timedelta(max_gap)
295+
except ValueError:
296+
raise ValueError(
297+
f"Could not convert {max_gap!r} to timedelta64 using pandas.to_timedelta"
298+
)
299+
300+
if isinstance(max_gap, pd.Timedelta):
301+
max_gap = np.timedelta64(max_gap.value, "ns")
302+
303+
max_gap = np.timedelta64(max_gap, "ns").astype(np.float64)
304+
305+
if not use_coordinate:
306+
if not isinstance(max_gap, (Number, np.number)):
307+
raise TypeError(
308+
f"Expected integer or floating point max_gap since use_coordinate=False. Received {max_type}."
309+
)
310+
233311
# method
234312
index = get_clean_interp_index(self, dim, use_coordinate=use_coordinate)
235313
interp_class, kwargs = _get_interpolator(method, **kwargs)
@@ -253,6 +331,14 @@ def interp_na(
253331
if limit is not None:
254332
arr = arr.where(valids)
255333

334+
if max_gap is not None:
335+
if dim not in self.coords:
336+
raise NotImplementedError(
337+
"max_gap not implemented for unlabeled coordinates yet."
338+
)
339+
nan_block_lengths = _get_nan_block_lengths(self, dim, index)
340+
arr = arr.where(nan_block_lengths <= max_gap)
341+
256342
return arr
257343

258344
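
The ``get_clean_interp_index`` hunk above replaces the single ``np.diff(index) > 0`` test with two pandas checks, ``is_monotonic`` and ``is_unique``. A non-decreasing index with no duplicates is strictly increasing, so the accepted inputs are the same, but each failure now gets its own error message. The following is a plain-Python sketch of that logic, assuming a sequence of numbers in place of a pandas index; ``check_interp_index`` is a hypothetical name.

```python
def check_interp_index(values, name="x"):
    """Validate an interpolation index and return it as a list of floats.

    Hypothetical stand-in for the pandas-based checks in the commit.
    """
    values = list(values)
    pairs = list(zip(values, values[1:]))

    # Non-strict monotonicity: no element may be smaller than its predecessor.
    if any(b < a for a, b in pairs):
        raise ValueError(f"Index {name!r} must be monotonically increasing")

    # In a non-decreasing sequence, duplicates are always adjacent.
    if any(a == b for a, b in pairs):
        raise ValueError(f"Index {name!r} has duplicate values")

    return [float(v) for v in values]


clean = check_interp_index([0, 1, 3])

errors = {}
for label, bad in (("unsorted", [0, 2, 1]), ("duplicated", [0, 1, 1])):
    try:
        check_interp_index(bad)
    except ValueError as err:
        errors[label] = str(err)
```

Checking monotonicity first means the duplicate test only needs to compare neighbours.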
