DataArray.set_index throws error on documented input #3176

Closed
gwgundersen opened this issue Aug 2, 2019 · 7 comments · Fixed by #3228
Comments

@gwgundersen
Contributor

Problem Description

Docs for DataArray.set_index describe the main indexes argument as:

Mapping from names matching dimensions and values given by (lists of) the names of existing coordinates or variables to set as new (multi-)index.

This suggests that one can set a DataArray instance's coordinates by passing in a dimension and a list-like object of coordinates.

MCVE

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: arr = xr.DataArray(data=np.ones((2, 3)), dims=['x', 'y'])

In [4]: arr.dims
Out[4]: ('x', 'y')

In [5]: arr.set_index({'x': range(2)})
KeyError   
...
    144         for n in var_names:
--> 145             var = variables[n]
    146             if (current_index_variable is not None and
    147                     var.dims != current_index_variable.dims):

KeyError: 0

At first, I thought it might be because coords and _coords were not being set in this case:

In [18]: arr.coords
Out[18]: 
Coordinates:
    *empty*

In [19]: arr._coords
Out[19]: OrderedDict()

but even if I set the coordinates first and then try to re-index, it fails:

In [20]: arr = xr.DataArray(data=np.ones((2, 3)), dims=['x', 'y'], coords={'x': range(2), 'y': range(3)})
In [21]: arr.set_index({'x': ['a', 'b', 'c']})
...
    144         for n in var_names:
--> 145             var = variables[n]
    146             if (current_index_variable is not None and
    147                     var.dims != current_index_variable.dims):
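The failure above can be reproduced as a standalone script. This is only a sketch: the helper name `try_set_index` is hypothetical, and it assumes the exception may be either the `KeyError` shown in the traceback or a `ValueError` in other xarray releases, so it catches both.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    data=np.ones((2, 3)),
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1, 2]},
)


def try_set_index(array, indexes):
    """Return the name of the exception raised by set_index, or None on success.

    Hypothetical helper for illustration only.
    """
    try:
        array.set_index(indexes)
    except (KeyError, ValueError) as err:
        # Assumption: which of the two is raised depends on the xarray version.
        return type(err).__name__
    return None


# 'a', 'b', 'c' are not names of existing coordinates, so the call fails.
print(try_set_index(arr, {"x": ["a", "b", "c"]}))
```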

Expected Output

I expect my MCVE to work based on the documentation.

Problem Solution

My guess is that the issue is that Xarray uses the merge_indexes function (see here) from the Dataset module, and there is no concept of a variable in a DataArray.

@max-sixty
Collaborator

Thanks for the issue @gwgundersen

I think the docs are potentially a bit unclear, and maybe the error message is too. The existing intention of set_index is to set existing variables as indexes, rather than to create new ones. For example, to extend your case:

In [16]: arr = xr.DataArray(data=np.ones((2, 3)), dims=['x', 'y'], coords={'x': range(2), 'y': range(3), 'a': ('x', [3,4])})

In [17]: arr
Out[17]:
<xarray.DataArray (x: 2, y: 3)>
array([[1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * x        (x) int64 0 1
  * y        (y) int64 0 1 2
    a        (x) int64 3 4

In [18]: arr.set_index(x='a')
Out[18]:
<xarray.DataArray (x: 2, y: 3)>
array([[1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * x        (x) int64 3 4
  * y        (y) int64 0 1 2

We'd definitely be keen on a PR improving the error message (i.e. something like "'a' is not the name of an existing variable"), and very open to feedback on the docs & the method's functionality; let us know if you'd be interested in that PR.
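For the use case in the original MCVE (attaching brand-new labels to a dimension), a minimal sketch of the distinction follows. It assumes `DataArray.assign_coords` is the appropriate tool for creating labels, which this thread does not itself state; the label values are arbitrary.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(data=np.ones((2, 3)), dims=["x", "y"])

# assign_coords creates new labels for a dimension; no prior coordinate is needed.
labeled = arr.assign_coords(x=[10, 20])

# set_index only promotes a coordinate that already exists on the array,
# so the non-index coordinate 'a' is attached first.
with_aux = labeled.assign_coords(a=("x", [3, 4]))
reindexed = with_aux.set_index(x="a")

print(labeled.x.values)
print(reindexed.x.values)

# After set_index, selection along x uses the values that came from 'a'.
print(reindexed.sel(x=3).shape)
```

In other words, `assign_coords` answers "give this dimension labels", while `set_index` answers "use these existing labels as the index".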

@gwgundersen
Contributor Author

Thanks for the explanation. I'll create a PR and link to this issue this evening.

@gwgundersen
Contributor Author

Looking at this now, and I'm a little surprised at the verbiage. In your example, do you consider a to be a "variable"? I thought variables were individual DataArray objects "inside" Dataset objects. My colleagues and I have been referring to objects such as a as "alternative" or "auxiliary" dimensions. Basically, a different labeling of the same coordinates. You also seem to call these "multidimensional coordinates"?

But I do think I see the use case. The point is that you can take an existing dimension's coordinates and set them as the coordinates for an alternative dimension?

@max-sixty
Collaborator

> Looking at this now, and I'm a little surprised at the verbiage. In your example, do you consider a to be a "variable"? I thought variables were individual DataArray objects "inside" Dataset objects. My colleagues and I have been referring to objects such as a as "alternative" or "auxiliary" dimensions. Basically, a different labeling of the same coordinates. You also seem to call these "multidimensional coordinates"?

You're not alone; the proliferation and overlap of terms can be confusing at the least. Maybe we should have a glossary somewhere. Briefly:

  • Dimensions are like x & y above (a is not a dimension)
  • Coordinates are labels along dimensions. These can be either index or non-index coordinates. a above is a non-index coordinate; x & y are indexes. Currently indexes are always named the same as their dimension.
  • (not 100% sure about this one, @pydata/xarray correct me where I'm wrong) Data Variables are indeed the objects inside a dataset. All the objects are Variables, including Coordinates.
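The three terms above can be inspected directly on the example array. This is a sketch assuming a reasonably recent xarray release; the variable name `"v"` is hypothetical and is only needed to convert the array to a Dataset.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    data=np.ones((2, 3)),
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1, 2], "a": ("x", [3, 4])},
)

print(arr.dims)              # dimension names: x and y, but not a
print(list(arr.coords))      # all coordinates, index and non-index
print(list(arr.indexes))     # only the index coordinates

ds = arr.to_dataset(name="v")    # "v" is a hypothetical variable name
print(list(ds.data_vars))        # data variables only
print(sorted(ds.variables))      # every Variable, coordinates included
```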

> But I do think I see the use case. The point is that you can take an existing dimension's coordinates and set them as the coordinates for an alternative dimension?

💯

@gwgundersen
Contributor Author

Thanks for these answers! On a related point, I'd be keen to open a PR for improved documentation for whatever object a is. It seems like the documented Xarray terminology is "multidimensional coordinate", right? To me, "non-index coordinate" and "multidimensional coordinate" are both pretty vague until you're more familiar with Xarray's way of thinking.

What do you think of the terminology "alternative" or "auxiliary" dimension? a is clearly a dimension in the sense that it has coordinates or labels for all the "tick marks" along the x dimension. At the very least, I'd love to add a lot more examples of how to actually use these things.

@gwgundersen
Contributor Author

Looks like the idea of a glossary is already being discussed in #2410.

@max-sixty
Collaborator

Good work on finding that issue. I think even if we can get something brief in, that would be helpful.

On the specific definitions:

> What do you think of the terminology "alternative" or "auxiliary" dimension? a is clearly a dimension in the sense that it has coordinates or labels for all the "tick marks" along the x dimension.

For me 'dimension' has a precise definition from traditional sciences, so having our 'coordinate' be an additional / auxiliary / alternative dimension wouldn't be consistent with that (e.g. a 4-dimensional array would still be 4 dimensional regardless of how many coordinates it had).
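A small sketch of that point, with arbitrary coordinate names `a` and `b` chosen for illustration: attaching extra coordinates changes neither the number of dimensions nor their names.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.ones((2, 3)), dims=["x", "y"])

# Two extra non-index coordinates, both labeling positions along x.
more = arr.assign_coords(a=("x", [3, 4]), b=("x", ["p", "q"]))

print(arr.ndim, more.ndim)   # the dimensionality is the same for both arrays
print(more.dims)             # still only x and y
```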

> At the very least, I'd love to add a lot more examples of how to actually use these things.

👍
