-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Sortby #1389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sortby #1389
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice docs and tests! I think the logic inside sortby
could be simplified a bit (it will take a couple iterations) but this is a very nice start.
xarray/core/dataset.py
Outdated
labels. | ||
""" | ||
from .dataarray import DataArray | ||
if isinstance(variables, (str, unicode, DataArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce the number of separate code paths, I would suggest a normalization based flow. For example:
if not isinstance(variables, list):
variables = [variables]
# variables is now a list of str/DataArray
variables = [v if isinstance(v, DataArray) else self[v]
for v in variables]
# variables is now a list of DataArray
Now you can avoid the if
/else
branch inside the loop.
xarray/core/dataset.py
Outdated
Parameters | ||
---------- | ||
variables: (str, DataArray, or iterable of either) | ||
Name of a 1D variable in coords/data_vars whose values are used to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be Name(s)
xarray/core/dataset.py
Outdated
|
||
Parameters | ||
---------- | ||
variables: (str, DataArray, or iterable of either) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to limit this to only lists, rather than arbitrary iterables. Arbitrary iterables makes the dispatch rules more complicated because both strings and DataArray objects are iterable.
xarray/core/dataset.py
Outdated
raise ValueError("Input DataArray must have same " | ||
"length as dimension it is sorting.") | ||
dims.append((key, val)) | ||
return self.isel(**dict(dims)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there are multiple keys provided along the same dimension? This is equivalent to sorting by multiple columns in a spreadsheet.
It's OK not to support it, but we should be sure to raise an informative error, because I'm sure somebody will try it. Note that when you put multiple items in dict
with the same name Python ignores the first argument:
In [1]: dict([('x', 1), ('x', 2)])
Out[1]: {'x': 2}
If you do want to support it, probably the easiest way is with np.lexsort
xarray/tests/test_dataarray.py
Outdated
expected = DataArray([[4, 3], [2, 1]], | ||
[('x', ['a', 'b']), ('y', [0, 1])]) | ||
|
||
actual = da.sortby(['x', 'y']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to include a test for a single sorted, too, just da.sortby('x')
. Likewise da.sortby(dax)
.
doc/indexing.rst
Outdated
@@ -523,3 +523,24 @@ labels: | |||
|
|||
array | |||
array.get_index('x') | |||
|
|||
Sort | |||
------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the length here needs to line up with the previous line.
doc/indexing.rst
Outdated
@@ -523,3 +523,24 @@ labels: | |||
|
|||
array | |||
array.get_index('x') | |||
|
|||
Sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexing.rst
is already our longest doc page, so I would put this section in "Reshaping and reorganizing data" instead (reshaping.rst
).
doc/indexing.rst
Outdated
Sort | ||
------------------ | ||
|
||
One may sort a dataarray/dataset via :py:meth:`~xarray.DataArray.sortby` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One may
-> You can
We want this as |
Weird. Travis seems to fail in places unrelated to my changes. I'm rather agnostic to |
Indeed, please ignore the first CI build on Travis. For some inexplicable reason it is installing the pandas 0.20rc, which is failing due to #1386. I am slightly in favor of |
xarray/core/dataset.py
Outdated
else: | ||
key = d.dims[0] | ||
val = d.argsort() if ascending else d.argsort()[::-1] | ||
if len(val) != len(self[key]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use self.dims[key]
instead of len(self[key])
(the later has significantly more overhead, since it constructs a pandas.Index
object).
xarray/core/dataset.py
Outdated
else: | ||
key = d.dims[0] | ||
val = d.argsort() if ascending else d.argsort()[::-1] | ||
if len(val) != len(self[key]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another minor point: it's better to fail earlier is possible. Can you move this check up one line, before the sort?
xarray/core/dataset.py
Outdated
vs = variables | ||
vs = [v if isinstance(v, DataArray) else self[v] for v in vs] | ||
|
||
dims = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this variable would be better called indices
?
xarray/core/dataset.py
Outdated
raise ValueError("Input DataArray must have same " | ||
"length as dimension it is sorting.") | ||
if key in dims: | ||
dims[key] = np.lexsort((val, dims[key])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is calculating the right thing. I think you need to lexsort d
, not val
(the result of argsort
). Otherwise, lexsort does some sort of shuffle of the sort indices:
>>> np.lexsort(([2, 1, 0],))
array([2, 1, 0])
Maybe inverse_permutation
would fix this.
Alternatively, this might be clearer -- and maybe slightly faster -- if this was done with a single call to lexsort
per dimension. Something like:
vars_by_dim = collections.defaultdict(list)
for d in vs:
...
vars_by_dim[key].append(d)
indices = {}
for key, vars in vars_by_dim.items():
order = np.lexsort(tuple(reversed(vars))
indices[key] = order if ascending else order[::-1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Not only should I have lexsorted d
, there was another logical error later in that code block as well (.update
should've been after an else
). Fixed those and pushed.
…rroneous code. Also addressed some reviewer comments.
Almost time to merge this baby in? :) |
xarray/tests/test_dataarray.py
Outdated
# dax0 is sorted first to give indices of [1, 2, 0] | ||
# and then dax1 would be used to move index 2 ahead of 1 | ||
dax0 = DataArray([100, 95, 95], [('x', [0, 1, 2])]) | ||
dax1 = DataArray([0, 1, 0], [('x', [7, 8, 9])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example actually raises a good question, because dax1
has different coordinate labels for x
than dax0
and da
.
Should we align sortby arguments before using them, or ensure that they are already aligned? I think this is probably a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I have a dumb question: why do they need to be aligned? In this case, we're sorting da
along x
using the values of two dataarrays: dax0
and dax1
. So as long as both of those dataarrays have x
and the right length, we should be able to perform our sort. The coordinate labels feel irrelevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would mostly be for consistency with other xarray options, almost all of which require aligned objects. Indexing is the one exception, but I want to clean that up (#974).
I think I would prefer to align all of objects in variables
with the object being sorted, if they aren't already aligned (in most cases they will be, if they are pulled out by name). You could do this with xarray.align(self, *variables, join='left')
. Just note that missing sort labels will get sorted to the end, consistent with how NumPy's sort handles NaN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick question: If we're aligning the input args, does that mean we allow the input dataarrays to be N-D as well, since the left-join will force matching of the dims. Then of course it opens up the possibility of using 1 input dataarray to sort multiple dimensions. Should we support that?
Align also opens up a small can of worms, because the left-join could introduce nan
into the array values, complicating the sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align
doesn't change the dimensions on or dimensionality of any of its arguments -- it only changes coordinate labels and dimension sizes.
Then of course it opens up the possibility of using 1 input dataarray to sort multiple dimensions.
What would this look like?
Align also opens up a small can of worms, because the left-join could introduce nan into the array values, complicating the sort.
Indeed, this is why we would need to document how we sort NaN. Fortunately, NumPy already has a well defined sort order for NaN (it gets moved to the end).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon more thinking I think the N-D sort using 1 dataarray wouldn't make much sense, and even if it does would be an edge case. Never mind that.
xarray/core/dataset.py
Outdated
@@ -2741,6 +2741,56 @@ def roll(self, **shifts): | |||
|
|||
return self._replace_vars_and_dims(variables) | |||
|
|||
def sortby(self, variables, ascending=True): | |||
""" | |||
Sorts the dataset, either along specified dimensions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a short one-line description at the top? That is the standard numpy docstring format.
xarray/core/dataset.py
Outdated
and the FIRST key in the sequence is used as the primary sort key, | ||
followed by the 2nd key, etc. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should have only one space.
xarray/core/dataset.py
Outdated
vars_by_dim[key].append(d) | ||
|
||
indices = {} | ||
for key, ds in vars_by_dim.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a minor point -- but can you use slightly longer variable names instead of the abbreviations for variables that exist outside of one line? e.g.,
vs
-> variables
d
-> data_array
ds
-> arrays (this one is especially confusing, because ds
is normally used for xarray.Dataset
)
This would help for readability (also follows PEP8 guidelines)
xarray/core/dataset.py
Outdated
|
||
indices = {} | ||
for key, ds in vars_by_dim.items(): | ||
order = np.lexsort(tuple(reversed(ds))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add a test case verifying that this sorts a pandas.MultiIndex
properly.
I've added a tiny bit of extra docstring just to ease my discomfort about the NaN sort. If you feel it's redundant I can remove it. |
Perhaps it's finally ready for prime time? |
Looks like one the Travis-CI tests failing on the sortby tests: https://travis-ci.org/pydata/xarray/jobs/229457729#L342 It's not immediately clear to me what's going on there, but maybe it's an issue with the old version of NumPy? The error is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, once we figure out that test failure!
xarray/core/dataset.py
Outdated
variables = [variables] | ||
else: | ||
variables = variables | ||
variables = [v if isinstance(v, DataArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I prefer writing this like:
variables = [v if isinstance(v, DataArray) else self[v]
for v in variables]
because the else
clause is part of the if
conditional statement.
@@ -2519,6 +2519,69 @@ def test_combine_first(self): | |||
[('x', ['a', 'b', 'd']), ('y', [-1, 0])]) | |||
self.assertDataArrayEqual(actual, expected) | |||
|
|||
def test_sortby(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you delete the test cases here that are redundant with those in test_dataset.py
? Given the implementation (only a small wrapped for DataArray.sortby()
), I am OK having a less comprehensive set of unit tests for the DataArray implementation -- and fewer unit tests make things easier to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I would say the entire test_dataarray::test_sortby
is redundant, because it's essentially a carbon copy of test_dataset::test_sortby
. I've gotten rid of a few lines in the latest push, but could trim more if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just add a single very basic test to verify that it works, and add a comment noting that more advanced functionality is tested in test_dataset.py
.
Does it work on numpy 1.11? We could potentially (partially) drop compatibility for older numpy releases. Or if we can identify the issue, we can raise an informative error or use a work-around. |
Ah, good to know. In that case, let's add a check for |
Think I just copied something over from |
xarray/core/dataset.py
Outdated
A new dataset where all the specified dims are sorted by dim | ||
labels. | ||
""" | ||
if LooseVersion(np.__version__) < LooseVersion('1.11.0'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this check inside the loop, if data_array.dtype == object
? That way, users of old versions of numpy can at least sort other arrays (e.g., numbers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll still need to adjust the test suite so it passes on older numpy.
xarray/core/dataset.py
Outdated
for data_array in aligned_other_vars: | ||
if len(data_array.dims) > 1: | ||
raise ValueError("Input DataArray has more than 1 dimension.") | ||
elif data_array.dtype == object and\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer parentheses to \
for the line continuation:
https://www.python.org/dev/peps/pep-0008/#maximum-line-length
xarray/core/dataset.py
Outdated
aligned_other_vars = aligned_vars[1:] | ||
vars_by_dim = defaultdict(list) | ||
for data_array in aligned_other_vars: | ||
if len(data_array.dims) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be data_array.ndim != 1
. Scalar arrays would also be problematic.
xarray/core/dataset.py
Outdated
'requires numpy 1.11.0 or later to support ' | ||
'object data-type.') | ||
else: | ||
key = data_array.dims[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider switching this to (key,) = data_array.dims
. This removes the otherwise mysterious 0
literal and serves as an implicit assert statement.
xarray/core/dataset.py
Outdated
for data_array in aligned_other_vars: | ||
if len(data_array.dims) > 1: | ||
raise ValueError("Input DataArray has more than 1 dimension.") | ||
elif data_array.dtype == object and\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you make this just if
instead of elif
? This isn't a natural alternative to the dimensionality check, so it makes more sense in a separate if
.
xarray/core/dataset.py
Outdated
'sortby uses np.lexsort under the hood, which ' | ||
'requires numpy 1.11.0 or later to support ' | ||
'object data-type.') | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another nit: Can you remove the else
block here? It doesn't make sense to conditionally enter this part of the code, and it's especially weird considering the line below, which would have key
as an undefined variable if it ever gets run.
Looks ready now. |
@chunweiyuan have you run flake8 on this? |
Today is the first time I've heard of flake8. Is this how you guys standardize quality checks? If so, do you run once for python 3.5 and once for 2.7? |
It's in the checklist for every new PR :).
Yes. It would be nice to run this as a continuous integration test but we haven't set that up that. Running it once with either Python 2 or 3 is fine -- it should give the same output either way. |
I just made some futile attempts to find that checklist, to no avail. Mind helping me with a link? :) |
Scroll to the top of this PR and look at your first post :)
…On Wed, May 10, 2017 at 11:08 AM, chunweiyuan ***@***.***> wrote:
I just made some futile attempts to find that checklist, to no avail. Mind
helping me with a link? :)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1389 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1mF7Jjb2NmZ_5LZpRvUP_nwG-khgks5r4f0hgaJpZM4NMGMU>
.
|
Haha, brilliant. |
I've played around with it a bit but not seen any changes to the complaints. What is going on? Should I even worry about the |
It's very strange that fake8 complains about non-Python files. I think you
can safely ignore those.
…On Wed, May 10, 2017 at 8:55 PM chunweiyuan ***@***.***> wrote:
git diff upstream/master | flake8 --diff only complains about a bunch of
lines in whats-new.rst and reshaping.rst. But these complaints don't make
any sense to me. Some examples:
doc/reshaping.rst:195:1: E112 expected an indented block
doc/reshaping.rst:203:1: E101 indentation contains mixed spaces and tabs
doc/reshaping.rst:203:1: W191 indentation contains tabs
doc/reshaping.rst:203:2: E113 unexpected indentation
I've played around with it a bit but not seen any changes to the
complaints. What is going on? Should I even worry about the .rst files?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1389 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1qjB4dK0tHLBMrX6sURnlt_T5QX0ks5r4obPgaJpZM4NMGMU>
.
|
If that's the case then I have nothing more to add. I've run flake8 on the individual files I touched I as well, such as |
One last thing -- this needs an entry in |
|
Indeed, I will add it right now! Thanks for your contribution!! |
Thank you very much for your patience! BTW, I think I messed up my co-worker's name in |
No worries, |
git diff upstream/master | flake8 --diff