Sortby #1389

chunweiyuan · 2017-04-29T00:44:01Z

closes sortby() or sort_index() method for Dataset and DataArray #967
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

shoyer

Nice docs and tests! I think the logic inside sortby could be simplified a bit (it will take a couple iterations) but this is a very nice start.

shoyer · 2017-04-29T01:24:24Z

xarray/core/dataset.py

+            labels.
+        """
+        from .dataarray import DataArray
+        if isinstance(variables, (str, unicode, DataArray)):


To reduce the number of separate code paths, I would suggest a normalization based flow. For example:

    if not isinstance(variables, list):
        variables = [variables]
    # variables is now a list of str/DataArray
    variables = [v if isinstance(v, DataArray) else self[v] for v in variables]
    # variables is now a list of DataArray

Now you can avoid the if/else branch inside the loop.
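
A minimal sketch of that normalization, runnable without xarray. It assumes a hypothetical Array class standing in for DataArray and a plain dict standing in for the self[v] lookup; neither name comes from the PR.

```python
class Array:
    """Hypothetical stand-in for xarray.DataArray."""

    def __init__(self, name):
        self.name = name


def normalize(variables, dataset):
    """Return a list of Array objects from a name, an Array, or a list of either."""
    # Step 1: wrap a single key so everything downstream sees a list.
    if not isinstance(variables, list):
        variables = [variables]
    # Step 2: look up names; pass Array objects through unchanged.
    return [v if isinstance(v, Array) else dataset[v] for v in variables]


dataset = {"x": Array("x"), "y": Array("y")}  # stands in for self[...]
extra = Array("extra")

print([a.name for a in normalize("x", dataset)])           # ['x']
print([a.name for a in normalize(["y", extra], dataset)])  # ['y', 'extra']
```

After the two steps every element has the same type, so the loop that follows needs no per-element branching.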

shoyer · 2017-04-29T01:32:26Z

xarray/core/dataset.py

+        Parameters
+        ----------
+        variables: (str, DataArray, or iterable of either)
+            Name of a 1D variable in coords/data_vars whose values are used to


Should be Name(s)

shoyer · 2017-04-29T01:33:21Z

xarray/core/dataset.py

+
+        Parameters
+        ----------
+        variables: (str, DataArray, or iterable of either)


It might make sense to limit this to only lists, rather than arbitrary iterables. Arbitrary iterables makes the dispatch rules more complicated because both strings and DataArray objects are iterable.
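
To illustrate the ambiguity with plain Python only: a string passes an Iterable check, so a dispatcher that accepts arbitrary iterables cannot tell one name from a sequence of names. The helper name as_key_list below is hypothetical.

```python
from collections.abc import Iterable

# A single name is itself iterable, and iterating it yields characters.
print(isinstance("time", Iterable))  # True
print(list("time"))                  # ['t', 'i', 'm', 'e']


def as_key_list(variables):
    """Accept one key or a list of keys; anything that is not a list is one key."""
    return variables if isinstance(variables, list) else [variables]


print(as_key_list("time"))         # ['time']
print(as_key_list(["time", "x"]))  # ['time', 'x']
```

Restricting the multi-key form to list keeps the dispatch to a single isinstance check.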

shoyer · 2017-04-29T01:34:47Z

xarray/core/dataset.py

+                        raise ValueError("Input DataArray must have same "
+                                         "length as dimension it is sorting.")
+            dims.append((key, val))
+        return self.isel(**dict(dims))


What happens if there are multiple keys provided along the same dimension? This is equivalent to sorting by multiple columns in a spreadsheet.

It's OK not to support it, but we should be sure to raise an informative error, because I'm sure somebody will try it. Note that when you put multiple items with the same key in a dict, Python silently keeps only the last one:

In [1]: dict([('x', 1), ('x', 2)])
Out[1]: {'x': 2}

If you do want to support it, probably the easiest way is with np.lexsort
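
If multiple keys along one dimension were left unsupported, the silent overwrite could be turned into an explicit error along these lines. This is only a sketch: check_unique_dims is a hypothetical helper, and the (dimension, indexer) pairs mirror the dims list in the diff above.

```python
from collections import Counter


def check_unique_dims(pairs):
    """Build a dict from (dim, indexer) pairs, refusing repeated dims."""
    counts = Counter(dim for dim, _ in pairs)
    repeated = sorted(dim for dim, n in counts.items() if n > 1)
    if repeated:
        raise ValueError(
            "sorting by more than one key along the same dimension "
            "is not supported: %s" % ", ".join(repeated))
    return dict(pairs)


print(check_unique_dims([("x", [1, 0]), ("y", [0, 1])]))
try:
    check_unique_dims([("x", [1, 0]), ("x", [0, 1])])
except ValueError as err:
    print(err)
```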

shoyer · 2017-04-29T01:36:37Z

xarray/tests/test_dataarray.py

+        expected = DataArray([[4, 3], [2, 1]],
+                             [('x', ['a', 'b']), ('y', [0, 1])])
+
+        actual = da.sortby(['x', 'y'])


It would be good to include a test for a single sort key, too, just da.sortby('x'). Likewise da.sortby(dax).

shoyer · 2017-04-29T01:38:09Z

doc/indexing.rst

@@ -523,3 +523,24 @@ labels:

    array
    array.get_index('x')
+
+Sort
+------------------


I think the length here needs to line up with the previous line.

shoyer · 2017-04-29T01:39:24Z

doc/indexing.rst

@@ -523,3 +523,24 @@ labels:

    array
    array.get_index('x')
+
+Sort


indexing.rst is already our longest doc page, so I would put this section in "Reshaping and reorganizing data" instead (reshaping.rst).

max-sixty · 2017-04-29T11:02:51Z

doc/indexing.rst

+Sort
+------------------
+
+One may sort a dataarray/dataset via :py:meth:`~xarray.DataArray.sortby` and


One may -> You can

max-sixty · 2017-04-29T11:05:30Z

We want this as sortby, rather than sort_by? Generally the latter would be more idiomatic, but if there is a precedent for sortby out there, I would vote for consistency.

chunweiyuan · 2017-05-01T16:18:55Z

Weird. Travis seems to fail in places unrelated to my changes.

I'm rather agnostic to sortby vs. sort_by.

shoyer · 2017-05-01T16:49:43Z

Indeed, please ignore the first CI build on Travis. For some inexplicable reason it is installing the pandas 0.20rc, which is failing due to #1386.

I am slightly in favor of sortby but also could go either way. The PEP8 guidance on function/method naming is "lowercase with words separated by underscores as necessary to improve readability." I don't think an underscore is necessary for readability here.

shoyer · 2017-05-01T16:52:55Z

xarray/core/dataset.py

+            else:
+                key = d.dims[0]
+                val = d.argsort() if ascending else d.argsort()[::-1]
+                if len(val) != len(self[key]):


Use self.dims[key] instead of len(self[key]) (the latter has significantly more overhead, since it constructs a pandas.Index object).

shoyer · 2017-05-01T17:05:54Z

xarray/core/dataset.py

+            else:
+                key = d.dims[0]
+                val = d.argsort() if ascending else d.argsort()[::-1]
+                if len(val) != len(self[key]):


Another minor point: it's better to fail earlier if possible. Can you move this check up one line, before the sort?

shoyer · 2017-05-01T17:07:38Z

xarray/core/dataset.py

+            vs = variables
+        vs = [v if isinstance(v, DataArray) else self[v] for v in vs]
+
+        dims = {}


Maybe this variable would be better called indices?

shoyer · 2017-05-01T17:21:16Z

xarray/core/dataset.py

+                    raise ValueError("Input DataArray must have same "
+                                     "length as dimension it is sorting.")
+            if key in dims:
+                dims[key] = np.lexsort((val, dims[key]))


I'm not sure this is calculating the right thing. I think you need to lexsort d, not val (the result of argsort). Otherwise, lexsort does some sort of shuffle of the sort indices:

>>> np.lexsort(([2, 1, 0],))
array([2, 1, 0])

Maybe inverse_permutation would fix this.

Alternatively, this might be clearer -- and maybe slightly faster -- if this was done with a single call to lexsort per dimension. Something like:

    vars_by_dim = collections.defaultdict(list)
    for d in vs:
        ...
        vars_by_dim[key].append(d)
    indices = {}
    for key, vars in vars_by_dim.items():
        order = np.lexsort(tuple(reversed(vars)))
        indices[key] = order if ascending else order[::-1]
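
The difference between the two approaches can be seen on a small case. The sketch below uses made-up values and assumes only NumPy. np.lexsort treats the last key in the sequence as the primary one, which is why keys given in priority order are reversed first.

```python
import numpy as np

primary = np.array([100, 95, 95])  # first sort key
secondary = np.array([0, 1, 0])    # tie-breaker

# Lexsort the values themselves; the LAST key passed is the primary one,
# so keys listed in priority order have to be reversed.
order = np.lexsort(tuple(reversed([primary, secondary])))
print(order.tolist())           # [2, 1, 0]
print(primary[order].tolist())  # [95, 95, 100]

# Lexsorting the result of argsort instead yields the inverse permutation
# of the sort indices, which in general does not sort the data.
indices = primary.argsort(kind="mergesort")  # stable sort: [1, 2, 0]
shuffled = np.lexsort((indices,))
print(shuffled.tolist())           # [2, 0, 1]
print(primary[shuffled].tolist())  # [95, 100, 95]
```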

Good catch. Not only should I have lexsorted d, there was another logical error later in that code block as well (.update should've been after an else). Fixed those and pushed.

chunweiyuan · 2017-05-04T16:01:45Z

Almost time to merge this baby in? :)

shoyer · 2017-05-04T17:31:25Z

xarray/tests/test_dataarray.py

+        # dax0 is sorted first to give indices of [1, 2, 0]
+        # and then dax1 would be used to move index 2 ahead of 1
+        dax0 = DataArray([100, 95, 95], [('x', [0, 1, 2])])
+        dax1 = DataArray([0, 1, 0], [('x', [7, 8, 9])])


This example actually raises a good question, because dax1 has different coordinate labels for x than dax0 and da.

Should we align sortby arguments before using them, or ensure that they are already aligned? I think this is probably a good idea.

hmm, I have a dumb question: why do they need to be aligned? In this case, we're sorting da along x using the values of two dataarrays: dax0 and dax1. So as long as both of those dataarrays have x and the right length, we should be able to perform our sort. The coordinate labels feel irrelevant.

It would mostly be for consistency with other xarray operations, almost all of which require aligned objects. Indexing is the one exception, but I want to clean that up (#974).

I think I would prefer to align all of the objects in variables with the object being sorted, if they aren't already aligned (in most cases they will be, if they are pulled out by name). You could do this with xarray.align(self, *variables, join='left'). Just note that missing sort labels will get sorted to the end, consistent with how NumPy's sort handles NaN.

Sounds good. Will do.

Quick question: If we're aligning the input args, does that mean we allow the input dataarrays to be N-D as well, since the left-join will force matching of the dims? Then of course it opens up the possibility of using 1 input dataarray to sort multiple dimensions. Should we support that?

Align also opens up a small can of worms, because the left-join could introduce nan into the array values, complicating the sort.

align doesn't change the dimensions on or dimensionality of any of its arguments -- it only changes coordinate labels and dimension sizes.

> Then of course it opens up the possibility of using 1 input dataarray to sort multiple dimensions.

What would this look like?

> Align also opens up a small can of worms, because the left-join could introduce nan into the array values, complicating the sort.

Indeed, this is why we would need to document how we sort NaN. Fortunately, NumPy already has a well defined sort order for NaN (it gets moved to the end).
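
A quick check of that NaN ordering, using made-up values and assuming only NumPy. It also shows one consequence worth documenting: reversing an ascending order, as order[::-1] does for a descending sort, moves NaN to the front.

```python
import numpy as np

values = np.array([3.0, np.nan, 1.0])

# Ascending argsort: NaN is placed after every finite value.
order = np.argsort(values)
print(order.tolist())          # [2, 0, 1]
print(values[order].tolist())  # [1.0, 3.0, nan]

# Reversing the ascending order puts the NaN entry first.
descending = order[::-1]
print(descending.tolist())     # [1, 0, 2]
```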

Upon more thinking I think the N-D sort using 1 dataarray wouldn't make much sense, and even if it does would be an edge case. Never mind that.

shoyer · 2017-05-04T17:32:01Z

xarray/core/dataset.py

@@ -2741,6 +2741,56 @@ def roll(self, **shifts):

        return self._replace_vars_and_dims(variables)

+    def sortby(self, variables, ascending=True):
+        """
+        Sorts the dataset, either along specified dimensions,


Can you please add a short one-line description at the top? That is the standard numpy docstring format.

shoyer · 2017-05-04T17:32:09Z

xarray/core/dataset.py

+        and the FIRST key in the sequence is used as the primary sort key,
+        followed by the 2nd key, etc.
+
+


should have only one space.

shoyer · 2017-05-04T17:36:54Z

xarray/core/dataset.py

+            vars_by_dim[key].append(d)
+
+        indices = {}
+        for key, ds in vars_by_dim.items():


This is a minor point -- but can you use slightly longer variable names instead of the abbreviations for variables that exist outside of one line? e.g.,
vs -> variables
d -> data_array
ds -> arrays (this one is especially confusing, because ds is normally used for xarray.Dataset)

This would help for readability (also follows PEP8 guidelines)

shoyer · 2017-05-04T17:37:35Z

xarray/core/dataset.py

+
+        indices = {}
+        for key, ds in vars_by_dim.items():
+            order = np.lexsort(tuple(reversed(ds)))


It would be nice to add a test case verifying that this sorts a pandas.MultiIndex properly.

chunweiyuan · 2017-05-06T15:04:31Z

I've added a tiny bit of extra docstring just to ease my discomfort about the NaN sort. If you feel it's redundant I can remove it.

chunweiyuan · 2017-05-09T00:39:44Z

Perhaps it's finally ready for prime time?

shoyer · 2017-05-09T00:52:59Z

Looks like one of the Travis-CI tests is failing on the sortby tests: https://travis-ci.org/pydata/xarray/jobs/229457729#L342

It's not immediately clear to me what's going on there, but maybe it's an issue with the old version of NumPy? The error is TypeError: merge sort not available for item 0.

shoyer

This looks good to me, once we figure out that test failure!

shoyer · 2017-05-09T00:55:09Z

xarray/core/dataset.py

+            variables = [variables]
+        else:
+            variables = variables
+        variables = [v if isinstance(v, DataArray)


nit: I prefer writing this like:

variables = [v if isinstance(v, DataArray) else self[v] for v in variables]

because the else clause is part of the if conditional statement.

shoyer · 2017-05-09T00:57:19Z

xarray/tests/test_dataarray.py

@@ -2519,6 +2519,69 @@ def test_combine_first(self):
                             [('x', ['a', 'b', 'd']), ('y', [-1, 0])])
        self.assertDataArrayEqual(actual, expected)

+    def test_sortby(self):


Can you delete the test cases here that are redundant with those in test_dataset.py? Given the implementation (only a small wrapper for DataArray.sortby()), I am OK having a less comprehensive set of unit tests for the DataArray implementation -- and fewer unit tests make things easier to maintain.

To be honest, I would say the entire test_dataarray::test_sortby is redundant, because it's essentially a carbon copy of test_dataset::test_sortby. I've gotten rid of a few lines in the latest push, but could trim more if you like.

I would just add a single very basic test to verify that it works, and add a comment noting that more advanced functionality is tested in test_dataset.py.

chunweiyuan · 2017-05-09T18:57:51Z

If I downgrade to numpy 1.10, I get that error. Once upgraded to 1.12, it goes away......

shoyer · 2017-05-09T19:02:43Z

> If I downgrade to numpy 1.10, I get that error. Once upgraded to 1.12, it goes away......

Does it work on numpy 1.11?

We could potentially (partially) drop compatibility for older numpy releases. Or if we can identify the issue, we can raise an informative error or use a work-around.

shoyer · 2017-05-09T20:32:10Z

Ah, good to know.

In that case, let's add a check for dtype == object based on numpy.__version__ (search for LooseVersion for examples in the xarray codebase) and raise an informative NotImplementedError when necessary.

chunweiyuan · 2017-05-09T21:27:06Z

Think I just copied something over from variable.py and changed the message a little bit...

shoyer · 2017-05-09T21:36:12Z

xarray/core/dataset.py

+            A new dataset where all the specified dims are sorted by dim
+            labels.
+        """
+        if LooseVersion(np.__version__) < LooseVersion('1.11.0'):


Can you put this check inside the loop, if data_array.dtype == object? That way, users of old versions of numpy can at least sort other arrays (e.g., numbers).
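
One way to picture the per-array check, using only the standard library. The names require_object_sort_support and MIN_NUMPY are hypothetical, dtypes are represented as plain strings, and version strings are compared as integer tuples here rather than with LooseVersion, because distutils is no longer shipped with Python 3.12 and later.

```python
MIN_NUMPY = (1, 11)  # threshold taken from the error message in this PR


def version_tuple(version):
    """Turn '1.10.4' into (1, 10); only major and minor are compared."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def require_object_sort_support(dtypes, numpy_version):
    """Raise only when an object-dtype sort key meets an old NumPy."""
    for dtype in dtypes:
        if dtype == "object" and version_tuple(numpy_version) < MIN_NUMPY:
            raise NotImplementedError(
                "sorting by object arrays requires numpy 1.11.0 or later")


require_object_sort_support(["int64", "float64"], "1.10.4")  # numbers still work
require_object_sort_support(["object"], "1.12.1")            # new enough
try:
    require_object_sort_support(["int64", "object"], "1.10.4")
except NotImplementedError as err:
    print(err)
```

Because the check runs per array, numeric sort keys pass on any version and only the object-dtype case is rejected.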

shoyer

I think you'll still need to adjust the test suite so it passes on older numpy.

shoyer · 2017-05-09T21:55:26Z

xarray/core/dataset.py

+        for data_array in aligned_other_vars:
+            if len(data_array.dims) > 1:
+                raise ValueError("Input DataArray has more than 1 dimension.")
+            elif data_array.dtype == object and\


Prefer parentheses to \ for the line continuation:
https://www.python.org/dev/peps/pep-0008/#maximum-line-length

shoyer · 2017-05-09T21:56:19Z

xarray/core/dataset.py

+        aligned_other_vars = aligned_vars[1:]
+        vars_by_dim = defaultdict(list)
+        for data_array in aligned_other_vars:
+            if len(data_array.dims) > 1:


Should be data_array.ndim != 1. Scalar arrays would also be problematic.

shoyer · 2017-05-09T21:57:12Z

xarray/core/dataset.py

+                        'requires numpy 1.11.0 or later to support '
+                        'object data-type.')
+            else:
+                key = data_array.dims[0]


Consider switching this to (key,) = data_array.dims. This removes the otherwise mysterious 0 literal and serves as an implicit assert statement.
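
A small pure-Python illustration of why the unpacking form doubles as a check. The dims tuples are made-up examples and single_dim is a hypothetical helper.

```python
def single_dim(dims):
    """Return the only dimension name, or fail if there is not exactly one."""
    (key,) = dims  # raises ValueError unless dims has exactly one element
    return key


print(single_dim(("x",)))  # x

# Both a scalar (no dims) and a 2-D array (two dims) are rejected.
for bad in [(), ("x", "y")]:
    try:
        single_dim(bad)
    except ValueError as err:
        print("rejected %r: %s" % (bad, err))
```

By contrast, dims[0] would silently accept the two-dimensional case and raise a less informative IndexError for the scalar one.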

shoyer · 2017-05-09T21:57:40Z

xarray/core/dataset.py

+        for data_array in aligned_other_vars:
+            if len(data_array.dims) > 1:
+                raise ValueError("Input DataArray has more than 1 dimension.")
+            elif data_array.dtype == object and\


Also, can you make this just if instead of elif? This isn't a natural alternative to the dimensionality check, so it makes more sense in a separate if.

shoyer · 2017-05-09T21:58:04Z

xarray/core/dataset.py

+                        'sortby uses np.lexsort under the hood, which '
+                        'requires numpy 1.11.0 or later to support '
+                        'object data-type.')
+            else:


Another nit: Can you remove the else block here? It doesn't make sense to conditionally enter this part of the code, and it's especially weird considering the line below, which would have key as an undefined variable if it ever gets run.

chunweiyuan · 2017-05-10T16:16:28Z

Looks ready now.

shoyer · 2017-05-10T17:18:54Z

@chunweiyuan have you run flake8 on this?

chunweiyuan · 2017-05-10T17:30:32Z

Today is the first time I've heard of flake8. Is this how you guys standardize quality checks? If so, do you run once for python 3.5 and once for 2.7?

shoyer · 2017-05-10T17:48:37Z

> Today is the first time I've heard of flake8.

It's in the checklist for every new PR :).

> Is this how you guys standardize quality checks?

Yes. It would be nice to run this as a continuous integration test but we haven't set that up yet.

Running it once with either Python 2 or 3 is fine -- it should give the same output either way.

chunweiyuan · 2017-05-10T18:08:33Z

I just made some futile attempts to find that checklist, to no avail. Mind helping me with a link? :)

shoyer · 2017-05-10T20:16:49Z

Scroll to the top of this PR and look at your first post :)

chunweiyuan · 2017-05-10T20:20:12Z

Haha, brilliant.

chunweiyuan · 2017-05-11T03:55:58Z

git diff upstream/master | flake8 --diff only complains about a bunch of lines in whats-new.rst and reshaping.rst. But these complaints don't make any sense to me. Some examples:

doc/reshaping.rst:195:1: E112 expected an indented block
doc/reshaping.rst:203:1: E101 indentation contains mixed spaces and tabs
doc/reshaping.rst:203:1: W191 indentation contains tabs
doc/reshaping.rst:203:2: E113 unexpected indentation

I've played around with it a bit but not seen any changes to the complaints. What is going on? Should I even worry about the .rst files?

shoyer · 2017-05-11T04:38:19Z

It's very strange that flake8 complains about non-Python files. I think you can safely ignore those.

chunweiyuan · 2017-05-11T16:51:34Z

If that's the case then I have nothing more to add. I've run flake8 on the individual files I touched as well, such as flake8 xarray/core/dataset.py, and none of my changes are flagged.

shoyer · 2017-05-11T17:01:30Z

One last thing -- this needs an entry in doc/api.rst in order for the doc page to get built.

chunweiyuan · 2017-05-11T23:59:29Z

api.rst should've been a checklist item :)

shoyer · 2017-05-12T00:29:42Z

> api.rst should've been a checklist item :)

Indeed, I will add it right now!

Thanks for your contribution!!

chunweiyuan · 2017-05-12T17:18:49Z

Thank you very much for your patience!

BTW, I think I messed up my co-worker's name in whats-new.rst. The spelling is right (Kyle Heuton), but the extra indentation shouldn't be there. As a result his name appears a bit out of place. Would be very much obliged if you could fix that in your next push to master. Gracias.

shoyer · 2017-05-12T18:36:30Z

> BTW, I think I messed up my co-worker's name in whats-new.rst. The spelling is right (Kyle Heuton), but the extra indentation shouldn't be there. As a result his name appears a bit out of place. Would be very much obliged if you could fix that in your next push to master. Gracias.

No worries, whats-new.rst usually requires a fix-up before each release

Chun-Wei Yuan added 5 commits April 27, 2017 14:51

First commit of .sort_index() for both dataarray.py and dataset.py.

45a34bf

Committing changes prior to switching to sortby().

dfb7735

Finished and passed tests for sortby().

57630bd

Adding to doc.

e0e6e73

Revising whats-new.rst

2715219

shoyer reviewed Apr 29, 2017

Chun-Wei Yuan added 2 commits April 28, 2017 20:40

Fixed some coordinate labeling in tests for clarification.

8c168dd

Addressed some review comments, and moved doc to reshape.rst.

64ebb4d

max-sixty reviewed Apr 29, 2017

Adding lexsort support in test.

fd0ce66

shoyer reviewed May 1, 2017

Chun-Wei Yuan added 2 commits May 1, 2017 14:08

Fixed erroneous code, and the erroneous test to failed to catch the erroneous code. Also addressed some reviewer comments.

e91c35e

Merge branch 'master' into sortby

082923c

shoyer reviewed May 4, 2017

Chun-Wei Yuan and others added 4 commits May 4, 2017 16:07

Adding test for pandas.MultiIndex. Addressed some review comments.

dec6eff

Align input args before sort. Also added a test on pd.MultiIndex.

09f43e4

Merge branch 'master' into sortby

3afa454

Minor addition to docstring.

e615636

shoyer approved these changes May 9, 2017

Simplified test_dataarray::test_sortby a bit.

a816b60

Putting dax back.

f9c71c6

Merge branch 'master' into sortby

3dd7366

NotImplementedError for < numpy 1.11.0

e35a934

shoyer reviewed May 9, 2017

Move LooseVersion check into the loop.

9db7918

shoyer reviewed May 9, 2017

LooseVersion in tests.

02aa024

shoyer added 2 commits May 10, 2017 10:00

Fix indentation, docstring for dataset.py

64391bb

dataarray.py docstring fixup

0446687

Chun-Wei Yuan added 3 commits May 11, 2017 10:08

Adding to api.rst

3dcbc3a

Merge branch 'sortby' of https://github.com/chunweiyuan/xarray into sortby

518971b

Merge branch 'master' into sortby

124df48

shoyer merged commit 80ddad9 into pydata:master May 12, 2017

shoyer mentioned this pull request May 12, 2017

Mention api.rst in PR template #1407

Merged

4 tasks
