Dataset constructor can take pandas objects #677
def test_constructor_pandas_single(self):

    ds = DataArray(np.random.rand(4,3), dims=['a', 'b']).to_dataset('a')
    #'aself.make_example_math_dataset()['foo']
this line should probably be deleted
Miscreant line removed @shoyer
Maybe add a test that this works on a Series?
Yes - will do
This doesn't work well for Series actually, because you have a Dataset with no coords - it's just a single value in each array in the Dataset. I've added a test for Panels. Let me know if you think that's sufficient, or it's worth spending more time on the Series.
Isn't this exactly what you would expect? Series is dict like with single elements (scalars) as values.
I imagine the rule for the Dataset constructor from pandas objects as removing one dimension.
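The "removing one dimension" rule can be seen in pandas alone. This is a minimal sketch, not the PR's implementation: the names `frame`, `frame_items`, and `series_items` are illustrative, and it only shows what the dict-like view of each pandas object contains.

```python
import numpy as np
import pandas as pd

# A 2D DataFrame viewed as a mapping: each value is a 1D Series.
frame = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])
frame_items = dict(frame.items())  # column label -> 1D Series

# A 1D Series viewed as a mapping: each value is a scalar (0D).
series = frame["a"]
series_items = dict(series.items())  # index label -> scalar

print({name: col.ndim for name, col in frame_items.items()})
print({label: np.ndim(value) for label, value in series_items.items()})
```

Under this reading, a DataFrame would give a Dataset of 1D arrays, and a Series would give a Dataset of scalars with no coords, which matches the behaviour described above.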
Ah, I understand now -- series fails your unit test. I think it still gives the expected result, though, e.g.,
Yes, my last comment wasn't clear. I think it's something to do with ChainMap - Potentially because Regardless I'll add a note in the docs for DataFrame & Panel, and the Series can wait for the moment.
In [30]: series=pd.Series(pd.np.random.rand(4))
In [31]: dict(series)
Out[31]:
{0: 0.26874240805523286,
1: 0.3110026841777368,
2: 0.22873881434409837,
3: 0.9946345046609677}
In [34]: dict(xray.core.utils.ChainMap(series))
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py:805: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
type(self).__name__),FutureWarning)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
520 try:
--> 521 result = self.index.get_value(self, key)
522
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
1591 if is_float(k) and not self.is_floating():
-> 1592 raise KeyError
1593
KeyError:
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-34-2a2c45b6f2cd> in <module>()
----> 1 dict(xray.core.utils.ChainMap(series))
/Users/maximilianroos/Dropbox/workspace/xray/xray/core/utils.py in __getitem__(self, key)
310 for mapping in self.maps:
311 try:
--> 312 return mapping[key]
313 except KeyError:
314 pass
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
545
546 # we can try to coerce the indexer (or this will raise)
--> 547 new_key = self.index._convert_scalar_indexer(key,kind='getitem')
548 if type(new_key) != type(key):
549 return self.__getitem__(new_key)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in _convert_scalar_indexer(self, key, kind)
804 warnings.warn("scalar indexers for index type {0} should be integers and not floating point".format(
805 type(self).__name__),FutureWarning)
--> 806 return to_int()
807
808 return key
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in to_int()
787 ikey = int(key)
788 if ikey != key:
--> 789 return self._invalid_indexer('label', key)
790 return ikey
791
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in _invalid_indexer(self, form, key)
942 klass=type(self),
943 key=key,
--> 944 kind=type(key)))
945
946 def get_duplicates(self):
TypeError: cannot do label indexing on <class 'pandas.core.index.Int64Index'> with these indexers [0.26874240805523286] of <class 'numpy.float64'>
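One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the failing key `0.26874240805523286` is a value of the series, not one of its labels. `dict(series)` reads labels through `keys()`, whereas code that iterates the object directly (as a mapping wrapper typically does) receives the values, because iterating a Series yields values. A minimal sketch with an illustrative string-indexed series:

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75], index=["a", "b", "c"])

# dict() uses .keys() when the object provides it, so it sees the labels.
keys_seen_by_dict = list(series.keys())

# Plain iteration yields the values, which are then (wrongly) used as keys
# by any wrapper that assumes iter(mapping) produces keys.
keys_seen_by_iter = list(iter(series))

print(keys_seen_by_dict)
print(keys_seen_by_iter)
```

If this is the cause, passing `dict(series.items())` instead of the Series itself would sidestep the mismatch, though that was not tested in this thread.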
@shoyer I made some more improvements to the docs, although they need a review
  dimensionality equal to the length of ``dims``.
- ``attrs`` is an arbitrary Python dictionary for storing metadata associated
  with a particular array.
``coords`` are supplied as dictionary of ``{coord_name: coord}`` where the values are scalar values,
You should also be able to supply DataArrays (and maybe pandas objects?) as coords
@shoyer Should we define something as "DataArray-like" if it's:
- A :py:class:`~xray.DataArray`
- A tuple of the form `(dims, data[, attrs])`
- A pandas object
- A numpy array, whose dimensions will be labelled `dim0`, `dim1`, etc.
...and then use that definition throughout? There are currently a few references to that (although currently written differently in different places).
I think the only exception is a `dim`, which is a 1D version of those.
It would indeed be worth formalizing what it means to be "DataArray-like". The trouble is that it depends a bit on context:
For `coords` or `data_vars`, you can supply:
- An `xray.DataArray` or `xray.Variable`
- A tuple of the form `(dims, data[, attrs])`
- A pandas object
- 1D numpy arrays, which are assumed to be along the given dimension
- Scalars
We don't automatically label dimensions (except for 1D arrays), because that's probably user error rather than what they would like to see.
For casting with the `DataArray` constructor, you can use:
- An `xray.DataArray` or `xray.Variable`
- A pandas object
- Scalars or NumPy arrays, whose dimensions will be labeled `dim0`, `dim1`, etc. (unless `dims` or `coords` is supplied)
Here, we don't accept tuples `(dims, data[, attrs])` because there's another, more explicit place for such arguments in the `DataArray` constructor.
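The two lists above can be sketched as follows. This assumes the `xarray` package (the later name of `xray`); behaviour in the `xray` release this PR targets may differ, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr  # assumption: successor package to xray

# Values accepted for data_vars / coords: a (dims, data) tuple,
# a pandas object, a scalar, and a 1D array for a coordinate.
labelled = pd.Series([10.0, 20.0, 30.0], index=pd.Index([0, 1, 2], name="x"))
ds = xr.Dataset(
    data_vars={
        "from_tuple": (("x",), np.arange(3)),
        "from_series": labelled,
        "from_scalar": 1.5,
    },
    coords={"x": np.array([0, 1, 2])},
)

# Casting with the DataArray constructor: a pandas object keeps its
# index name as the dimension; a bare NumPy array gets default names.
from_pandas = xr.DataArray(labelled)
from_numpy = xr.DataArray(np.zeros((2, 3)))

print(ds)
print(from_pandas.dims, from_numpy.dims)
```

The default dimension names for the bare NumPy array are printed rather than stated here, since their exact spelling may vary between versions.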
Super, thanks for that.
I'll use those lists. Do you think it would make sense to define the first list as 'DataArray-like', and use it for `coords` & `data_vars`? I don't think it's a problem that the DataArray constructor can't be constructed with all of them. But introducing terms tends to be a one-way process, so let's do it deliberately if we do.
@shoyer gentle ping on this
Dataset constructor can take pandas objects
This is great, thanks!
OK, there are still some improvements to make re the comments above, but that can be for the next iteration
Closes a 'first-step' of #676. Works only for simple, non-MultiIndexed, pandas objects.