Dataset constructor can take pandas objects #677

Merged on Jan 2, 2016 (3 commits)

max-sixty
Collaborator

Closes a 'first-step' of #676. Works only for simple, non-MultiIndexed, pandas objects.

def test_constructor_pandas_single(self):

ds = DataArray(np.random.rand(4,3), dims=['a', 'b']).to_dataset('a')
#'aself.make_example_math_dataset()['foo']
Member

this line should probably be deleted

@max-sixty
Collaborator Author

Miscreant line removed @shoyer

@shoyer
Member

shoyer commented Dec 11, 2015

Maybe add a test that this works on a Series?

@max-sixty
Collaborator Author

Yes - will do

@max-sixty
Collaborator Author

This doesn't work well for Series actually, because you have a Dataset with no coords - it's just a single value in each array in the Dataset.

I've added a test for Panels. Let me know if you think that's sufficient, or it's worth spending more time on the Series.

@shoyer
Member

shoyer commented Dec 19, 2015

> you have a Dataset with no coords - it's just a single value in each array in the Dataset

Isn't this exactly what you would expect? Series is dict like with single elements (scalars) as values.

@shoyer
Member

shoyer commented Dec 19, 2015

I imagine the rule for the Dataset constructor from pandas objects as removing one dimension.

@shoyer
Member

shoyer commented Dec 19, 2015

Ah, I understand now -- series fails your unit test. I think it still gives the expected result, though, e.g., Dataset(Series(range(3))).equals(Dataset(dict(enumerate(range(3))))). In any case this is probably sufficient :).
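
The "removing one dimension" rule can be illustrated without pandas or xray at all. The following is only a sketch under stated assumptions: nested lists stand in for pandas objects, and `ndim` and `to_variables` are hypothetical helpers, not part of either library.

```python
def ndim(obj):
    """Nesting depth of a list-of-lists; scalars have depth 0."""
    depth = 0
    while isinstance(obj, list) and obj:
        depth += 1
        obj = obj[0]
    return depth


def to_variables(keys, data):
    """Split `data` along its first axis, pairing each slice with a key.

    Hypothetical stand-in for what a Dataset constructor might do with a
    pandas object: each resulting value has one dimension fewer than `data`.
    """
    return dict(zip(keys, data))


frame_like = [[1, 2, 3], [4, 5, 6]]  # 2-D stand-in for a DataFrame (two columns)
series_like = [10, 20, 30]           # 1-D stand-in for a Series

frame_vars = to_variables(["a", "b"], frame_like)  # values are 1-D
series_vars = to_variables(range(3), series_like)  # values are scalars (0-D)
```

Under this reading, a Series naturally becomes a mapping of scalars, which is the same shape as `dict(enumerate(...))` applied to its values.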

@max-sixty
Collaborator Author

Yes, my last comment wasn't clear. I think it's something to do with ChainMap - dict(series) gives the expected result, but dict(ChainMap(series)) throws an error (actually two...).

Potentially because list(series) gives values (but list(df) gives the keys)?

Regardless I'll add a note in the docs for DataFrame & Panel, and the Series can wait for the moment.

In [30]: series=pd.Series(pd.np.random.rand(4))

In [31]: dict(series)
Out[31]: 
{0: 0.26874240805523286,
 1: 0.3110026841777368,
 2: 0.22873881434409837,
 3: 0.9946345046609677}

In [34]: dict(xray.core.utils.ChainMap(series))
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py:805: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
  type(self).__name__),FutureWarning)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
   1591         if is_float(k) and not self.is_floating():
-> 1592             raise KeyError
   1593 

KeyError: 

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-34-2a2c45b6f2cd> in <module>()
----> 1 dict(xray.core.utils.ChainMap(series))

/Users/maximilianroos/Dropbox/workspace/xray/xray/core/utils.py in __getitem__(self, key)
    310         for mapping in self.maps:
    311             try:
--> 312                 return mapping[key]
    313             except KeyError:
    314                 pass

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
    545 
    546                 # we can try to coerce the indexer (or this will raise)
--> 547                 new_key = self.index._convert_scalar_indexer(key,kind='getitem')
    548                 if type(new_key) != type(key):
    549                     return self.__getitem__(new_key)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in _convert_scalar_indexer(self, key, kind)
    804                 warnings.warn("scalar indexers for index type {0} should be integers and not floating point".format(
    805                     type(self).__name__),FutureWarning)
--> 806             return to_int()
    807 
    808         return key

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in to_int()
    787             ikey = int(key)
    788             if ikey != key:
--> 789                 return self._invalid_indexer('label', key)
    790             return ikey
    791 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in _invalid_indexer(self, form, key)
    942                                                            klass=type(self),
    943                                                            key=key,
--> 944                                                            kind=type(key)))
    945 
    946     def get_duplicates(self):

TypeError: cannot do label indexing on <class 'pandas.core.index.Int64Index'> with these indexers [0.26874240805523286] of <class 'numpy.float64'>
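
The hypothesis above (iteration yielding values rather than keys) can be reproduced with the standard library alone. This is a minimal sketch, not xray's or pandas' actual code: `ValueIterating` is a hypothetical stand-in for a Series, and `SimpleChain` is a hypothetical chain map whose `__getitem__` mirrors the loop visible in the traceback but whose `__iter__` is an assumption.

```python
from collections.abc import Mapping


class ValueIterating:
    """Hypothetical Series stand-in: lookup by label, iteration over values."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        # Like list(series): yields the values, not the labels.
        return iter(self._data.values())


class SimpleChain(Mapping):
    """Hypothetical minimal chain map that discovers keys by iterating."""

    def __init__(self, *maps):
        self.maps = maps

    def __getitem__(self, key):
        for mapping in self.maps:
            try:
                return mapping[key]
            except KeyError:
                pass
        raise KeyError(key)

    def __iter__(self):
        seen = set()
        for mapping in self.maps:
            for key in mapping:  # assumes iteration yields keys
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return sum(1 for _ in self)


series_like = ValueIterating({0: 0.25, 1: 0.5})

# dict() finds a .keys() method and uses it, so this works.
direct = dict(series_like)

# The chain iterates the wrapped object, receives values, and then tries
# to look those values up as if they were keys.
try:
    dict(SimpleChain(series_like))
    chained_error = None
except KeyError as err:
    chained_error = err
```

In this sketch the failed lookup surfaces as a `KeyError`; with the pandas version in the session above, the float lookup on an `Int64Index` ends in a `TypeError` instead, but the mechanism would be the same if the hypothesis holds.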

@max-sixty
Collaborator Author

@shoyer I made some more improvements to the docs, although they need a review

dimensionality equal to the length of ``dims``.
- ``attrs`` is an arbitrary Python dictionary for storing metadata associated
with a particular array.
``coords`` are supplied as dictionary of ``{coord_name: coord}`` where the values are scalar values,
Member

You should also be able to supply DataArrays (and maybe pandas objects?) as coords

Collaborator Author

@shoyer Should we define something as "DataArray-like" if it's:

  • A :py:class:`~xray.DataArray`
  • A tuple of the form (dims, data[, attrs])
  • A pandas object
  • A numpy array, whose dimensions will be labelled dim0, dim1, etc

...and then use that definition throughout? There are currently a few references to that (although currently written differently in different places).

I think the only exception is a dim, which is a 1D version of those.

Member

It would indeed be worth formalizing what it means to be "DataArray-like". The trouble is that it depends a bit on context:

For coords or data_vars, you can supply:

  • An xray.DataArray or xray.Variable
  • A tuple of the form (dims, data[, attrs])
  • A pandas object
  • 1D numpy arrays, which are assumed to be along the given dimension
  • Scalars

We don't automatically label dimensions (except for 1D arrays), because that's probably user error rather than what they would like to see.

For casting with the DataArray constructor, you can use:

  • An xray.DataArray or xray.Variable
  • A pandas object
  • Scalars or NumPy arrays, whose dimensions will be labeled dim0, dim1, etc. (unless dims or coords is supplied)

Here, we don't accept tuples (dims, data[, attrs]) because there's another, more explicit place for such arguments in the DataArray constructor.
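
To make the `(dims, data[, attrs])` notation concrete, here is a rough sketch of how such a tuple could be unpacked. `parse_variable_tuple` is a hypothetical helper for illustration only and not xray's implementation; treating a bare string as a single dimension name is also an assumption.

```python
def parse_variable_tuple(value):
    """Unpack a (dims, data[, attrs]) tuple; attrs defaults to an empty dict.

    Hypothetical helper for illustration only, not part of xray.
    """
    if not isinstance(value, tuple) or not 2 <= len(value) <= 3:
        raise ValueError("expected a tuple of the form (dims, data[, attrs])")
    dims, data = value[0], value[1]
    attrs = dict(value[2]) if len(value) == 3 else {}
    # Assumption: a single dimension name is shorthand for a one-element tuple.
    if isinstance(dims, str):
        dims = (dims,)
    return tuple(dims), data, attrs


with_attrs = parse_variable_tuple((("x", "y"), [[1, 2], [3, 4]], {"units": "m"}))
without_attrs = parse_variable_tuple(("time", [0, 1, 2]))
```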

Collaborator Author

Super, thanks for that.

I'll use those lists. Do you think it would make sense to define the first list as 'DataArray-like', and use it for coords & data_vars? I don't think it's a problem that the DataArray constructor can't be constructed with all of them. But introducing terms tends to be a one-way process, so let's do it deliberately if we do.

Collaborator Author

@shoyer gentle ping on this

shoyer added a commit that referenced this pull request Jan 2, 2016
Dataset constructor can take pandas objects
@shoyer shoyer merged commit eb8d179 into pydata:master Jan 2, 2016
@shoyer
Member

shoyer commented Jan 2, 2016

This is great, thanks!

@max-sixty
Collaborator Author

OK, there are still some improvements to make re the comments above, but that can be for the next iteration

@max-sixty max-sixty deleted the allow-pandas-to-ds-constructor branch January 2, 2016 07:44