Skip to content

Sortby #1389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
May 12, 2017
Merged

Sortby #1389

merged 25 commits into from
May 12, 2017

Conversation

chunweiyuan
Copy link
Contributor

@chunweiyuan chunweiyuan commented Apr 29, 2017

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice docs and tests! I think the logic inside sortby could be simplified a bit (it will take a couple iterations) but this is a very nice start.

labels.
"""
from .dataarray import DataArray
if isinstance(variables, (str, unicode, DataArray)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the number of separate code paths, I would suggest a normalization based flow. For example:

if not isinstance(variables, list):
    variables = [variables]
# variables is now a list of str/DataArray
variables = [v if isinstance(v, DataArray) else self[v]
             for v in variables]
# variables is now a list of DataArray

Now you can avoid the if/else branch inside the loop.

Parameters
----------
variables: (str, DataArray, or iterable of either)
Name of a 1D variable in coords/data_vars whose values are used to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Name(s)


Parameters
----------
variables: (str, DataArray, or iterable of either)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to limit this to only lists, rather than arbitrary iterables. Arbitrary iterables makes the dispatch rules more complicated because both strings and DataArray objects are iterable.

raise ValueError("Input DataArray must have same "
"length as dimension it is sorting.")
dims.append((key, val))
return self.isel(**dict(dims))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if there are multiple keys provided along the same dimension? This is equivalent to sorting by multiple columns in a spreadsheet.

It's OK not to support it, but we should be sure to raise an informative error, because I'm sure somebody will try it. Note that when you put multiple items in dict with the same name Python ignores the first argument:

In [1]: dict([('x', 1), ('x', 2)])
Out[1]: {'x': 2}

If you do want to support it, probably the easiest way is with np.lexsort

expected = DataArray([[4, 3], [2, 1]],
[('x', ['a', 'b']), ('y', [0, 1])])

actual = da.sortby(['x', 'y'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to include a test for a single sorted, too, just da.sortby('x'). Likewise da.sortby(dax).

doc/indexing.rst Outdated
@@ -523,3 +523,24 @@ labels:

array
array.get_index('x')

Sort
------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the length here needs to line up with the previous line.

doc/indexing.rst Outdated
@@ -523,3 +523,24 @@ labels:

array
array.get_index('x')

Sort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indexing.rst is already our longest doc page, so I would put this section in "Reshaping and reorganizing data" instead (reshaping.rst).

doc/indexing.rst Outdated
Sort
------------------

One may sort a dataarray/dataset via :py:meth:`~xarray.DataArray.sortby` and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One may -> You can

@max-sixty
Copy link
Collaborator

We want this as sortby, rather than sort_by? Generally the latter would be more idiomatic, but if there is prior sortby out there, I would vote for consistency

@chunweiyuan
Copy link
Contributor Author

Weird. Travis seems to fail in places unrelated to my changes.

I'm rather agnostic to sortby vs. sort_by.

@shoyer
Copy link
Member

shoyer commented May 1, 2017

Indeed, please ignore the first CI build on Travis. For some inexplicable reason it is installing the pandas 0.20rc, which is failing due to #1386.

I am slightly in favor of sortby but also could go either way. The PEP8 guidance on function/method naming is "lowercase with words separated by underscores as necessary to improve readability." I don't think an underscore is necessary for readability here.

else:
key = d.dims[0]
val = d.argsort() if ascending else d.argsort()[::-1]
if len(val) != len(self[key]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use self.dims[key] instead of len(self[key]) (the later has significantly more overhead, since it constructs a pandas.Index object).

else:
key = d.dims[0]
val = d.argsort() if ascending else d.argsort()[::-1]
if len(val) != len(self[key]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another minor point: it's better to fail earlier is possible. Can you move this check up one line, before the sort?

vs = variables
vs = [v if isinstance(v, DataArray) else self[v] for v in vs]

dims = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this variable would be better called indices?

raise ValueError("Input DataArray must have same "
"length as dimension it is sorting.")
if key in dims:
dims[key] = np.lexsort((val, dims[key]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is calculating the right thing. I think you need to lexsort d, not val (the result of argsort). Otherwise, lexsort does some sort of shuffle of the sort indices:

>>> np.lexsort(([2, 1, 0],))
array([2, 1, 0])

Maybe inverse_permutation would fix this.

Alternatively, this might be clearer -- and maybe slightly faster -- if this was done with a single call to lexsort per dimension. Something like:

vars_by_dim = collections.defaultdict(list)
for d in vs:
    ...
    vars_by_dim[key].append(d)

indices = {}
for key, vars in vars_by_dim.items():
    order = np.lexsort(tuple(reversed(vars))
    indices[key] = order if ascending else order[::-1]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Not only should I have lexsorted d, there was another logical error later in that code block as well (.update should've been after an else). Fixed those and pushed.

Chun-Wei Yuan added 2 commits May 1, 2017 14:08
@chunweiyuan
Copy link
Contributor Author

Almost time to merge this baby in? :)

# dax0 is sorted first to give indices of [1, 2, 0]
# and then dax1 would be used to move index 2 ahead of 1
dax0 = DataArray([100, 95, 95], [('x', [0, 1, 2])])
dax1 = DataArray([0, 1, 0], [('x', [7, 8, 9])])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example actually raises a good question, because dax1 has different coordinate labels for x than dax0 and da.

Should we align sortby arguments before using them, or ensure that they are already aligned? I think this is probably a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I have a dumb question: why do they need to be aligned? In this case, we're sorting da along x using the values of two dataarrays: dax0 and dax1. So as long as both of those dataarrays have x and the right length, we should be able to perform our sort. The coordinate labels feel irrelevant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would mostly be for consistency with other xarray options, almost all of which require aligned objects. Indexing is the one exception, but I want to clean that up (#974).

I think I would prefer to align all of objects in variables with the object being sorted, if they aren't already aligned (in most cases they will be, if they are pulled out by name). You could do this with xarray.align(self, *variables, join='left'). Just note that missing sort labels will get sorted to the end, consistent with how NumPy's sort handles NaN.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: If we're aligning the input args, does that mean we allow the input dataarrays to be N-D as well, since the left-join will force matching of the dims. Then of course it opens up the possibility of using 1 input dataarray to sort multiple dimensions. Should we support that?

Align also opens up a small can of worms, because the left-join could introduce nan into the array values, complicating the sort.

Copy link
Member

@shoyer shoyer May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align doesn't change the dimensions on or dimensionality of any of its arguments -- it only changes coordinate labels and dimension sizes.

Then of course it opens up the possibility of using 1 input dataarray to sort multiple dimensions.

What would this look like?

Align also opens up a small can of worms, because the left-join could introduce nan into the array values, complicating the sort.

Indeed, this is why we would need to document how we sort NaN. Fortunately, NumPy already has a well defined sort order for NaN (it gets moved to the end).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon more thinking I think the N-D sort using 1 dataarray wouldn't make much sense, and even if it does would be an edge case. Never mind that.

@@ -2741,6 +2741,56 @@ def roll(self, **shifts):

return self._replace_vars_and_dims(variables)

def sortby(self, variables, ascending=True):
"""
Sorts the dataset, either along specified dimensions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a short one-line description at the top? That is the standard numpy docstring format.

and the FIRST key in the sequence is used as the primary sort key,
followed by the 2nd key, etc.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should have only one space.

vars_by_dim[key].append(d)

indices = {}
for key, ds in vars_by_dim.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor point -- but can you use slightly longer variable names instead of the abbreviations for variables that exist outside of one line? e.g.,
vs -> variables
d -> data_array
ds -> arrays (this one is especially confusing, because ds is normally used for xarray.Dataset)

This would help for readability (also follows PEP8 guidelines)


indices = {}
for key, ds in vars_by_dim.items():
order = np.lexsort(tuple(reversed(ds)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a test case verifying that this sorts a pandas.MultiIndex properly.

@chunweiyuan
Copy link
Contributor Author

I've added a tiny bit of extra docstring just to ease my discomfort about the NaN sort. If you feel it's redundant I can remove it.

@chunweiyuan
Copy link
Contributor Author

Perhaps it's finally ready for prime time?

@shoyer
Copy link
Member

shoyer commented May 9, 2017

Looks like one the Travis-CI tests failing on the sortby tests: https://travis-ci.org/pydata/xarray/jobs/229457729#L342

It's not immediately clear to me what's going on there, but maybe it's an issue with the old version of NumPy? The error is TypeError: merge sort not available for item 0.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, once we figure out that test failure!

variables = [variables]
else:
variables = variables
variables = [v if isinstance(v, DataArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I prefer writing this like:

        variables = [v if isinstance(v, DataArray) else self[v]
                     for v in variables]

because the else clause is part of the if conditional statement.

@@ -2519,6 +2519,69 @@ def test_combine_first(self):
[('x', ['a', 'b', 'd']), ('y', [-1, 0])])
self.assertDataArrayEqual(actual, expected)

def test_sortby(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you delete the test cases here that are redundant with those in test_dataset.py? Given the implementation (only a small wrapped for DataArray.sortby()), I am OK having a less comprehensive set of unit tests for the DataArray implementation -- and fewer unit tests make things easier to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I would say the entire test_dataarray::test_sortby is redundant, because it's essentially a carbon copy of test_dataset::test_sortby. I've gotten rid of a few lines in the latest push, but could trim more if you like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add a single very basic test to verify that it works, and add a comment noting that more advanced functionality is tested in test_dataset.py.

@chunweiyuan
Copy link
Contributor Author

If I downgrade to numpy 1.10, I get that error. Once upgraded to 1.12, it goes away......
screen shot 2017-05-09 at 11 39 20 am
screen shot 2017-05-09 at 11 39 31 am

@shoyer
Copy link
Member

shoyer commented May 9, 2017

If I downgrade to numpy 1.10, I get that error. Once upgraded to 1.12, it goes away......

Does it work on numpy 1.11?

We could potentially (partially) drop compatibility for older numpy releases. Or if we can identify the issue, we can raise an informative error or use a work-around.

@shoyer
Copy link
Member

shoyer commented May 9, 2017

Ah, good to know.

In that case, let's add a check for dtype == object based on numpy.__version__ (search for LooseVersion for examples in the xarray codebase) and raise an informative NotImplementedError when necessary.

@chunweiyuan
Copy link
Contributor Author

Think I just copied something over from variable.py and changed the message a little bit...

A new dataset where all the specified dims are sorted by dim
labels.
"""
if LooseVersion(np.__version__) < LooseVersion('1.11.0'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this check inside the loop, if data_array.dtype == object? That way, users of old versions of numpy can at least sort other arrays (e.g., numbers).

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll still need to adjust the test suite so it passes on older numpy.

for data_array in aligned_other_vars:
if len(data_array.dims) > 1:
raise ValueError("Input DataArray has more than 1 dimension.")
elif data_array.dtype == object and\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer parentheses to \ for the line continuation:
https://www.python.org/dev/peps/pep-0008/#maximum-line-length

aligned_other_vars = aligned_vars[1:]
vars_by_dim = defaultdict(list)
for data_array in aligned_other_vars:
if len(data_array.dims) > 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be data_array.ndim != 1. Scalar arrays would also be problematic.

'requires numpy 1.11.0 or later to support '
'object data-type.')
else:
key = data_array.dims[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider switching this to (key,) = data_array.dims. This removes the otherwise mysterious 0 literal and serves as an implicit assert statement.

for data_array in aligned_other_vars:
if len(data_array.dims) > 1:
raise ValueError("Input DataArray has more than 1 dimension.")
elif data_array.dtype == object and\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you make this just if instead of elif? This isn't a natural alternative to the dimensionality check, so it makes more sense in a separate if.

'sortby uses np.lexsort under the hood, which '
'requires numpy 1.11.0 or later to support '
'object data-type.')
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nit: Can you remove the else block here? It doesn't make sense to conditionally enter this part of the code, and it's especially weird considering the line below, which would have key as an undefined variable if it ever gets run.

@chunweiyuan
Copy link
Contributor Author

Looks ready now.

@shoyer
Copy link
Member

shoyer commented May 10, 2017

@chunweiyuan have you run flake8 on this?

@chunweiyuan
Copy link
Contributor Author

Today is the first time I've heard of flake8. Is this how you guys standardize quality checks? If so, do you run once for python 3.5 and once for 2.7?

@shoyer
Copy link
Member

shoyer commented May 10, 2017

Today is the first time I've heard of flake8.

It's in the checklist for every new PR :).

Is this how you guys standardize quality checks?

Yes. It would be nice to run this as a continuous integration test but we haven't set that up that.

Running it once with either Python 2 or 3 is fine -- it should give the same output either way.

@chunweiyuan
Copy link
Contributor Author

I just made some futile attempts to find that checklist, to no avail. Mind helping me with a link? :)

@shoyer
Copy link
Member

shoyer commented May 10, 2017 via email

@chunweiyuan
Copy link
Contributor Author

Haha, brilliant.

@chunweiyuan
Copy link
Contributor Author

git diff upstream/master | flake8 --diff only complains about a bunch of lines in whats-new.rst and reshaping.rst. But these complaints don't make any sense to me. Some examples:

doc/reshaping.rst:195:1: E112 expected an indented block
doc/reshaping.rst:203:1: E101 indentation contains mixed spaces and tabs
doc/reshaping.rst:203:1: W191 indentation contains tabs
doc/reshaping.rst:203:2: E113 unexpected indentation

I've played around with it a bit but not seen any changes to the complaints. What is going on? Should I even worry about the .rst files?

@shoyer
Copy link
Member

shoyer commented May 11, 2017 via email

@chunweiyuan
Copy link
Contributor Author

If that's the case then I have nothing more to add. I've run flake8 on the individual files I touched I as well, such as flake8 xarray/core/dataset.py, and none of my changes are flagged.

@shoyer
Copy link
Member

shoyer commented May 11, 2017

One last thing -- this needs an entry in docs/api.rst in order for the doc page to get built.

@chunweiyuan
Copy link
Contributor Author

api.rst should've been a checklist item :)

@shoyer shoyer merged commit 80ddad9 into pydata:master May 12, 2017
@shoyer
Copy link
Member

shoyer commented May 12, 2017

api.rst should've been a checklist item :)

Indeed, I will add it right now!

Thanks for your contribution!!

@shoyer shoyer mentioned this pull request May 12, 2017
4 tasks
@chunweiyuan
Copy link
Contributor Author

Thank you very much for your patience!

BTW, I think I messed up my co-worker's name in whats-new.rst. The spelling is right (Kyle Heuton), but the extra indentation shouldn't be there. As a result his name appears a bit out of place. Would be very much obliged if you could fix that in your next push to master. Gracias.

@shoyer
Copy link
Member

shoyer commented May 12, 2017

BTW, I think I messed up my co-worker's name in whats-new.rst. The spelling is right (Kyle Heuton), but the extra indentation shouldn't be there. As a result his name appears a bit out of place. Would be very much obliged if you could fix that in your next push to master. Gracias.

No worries, whats-new.rst usually requires a fix-up before each release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sortby() or sort_index() method for Dataset and DataArray
3 participants