Dataset constructor can take pandas objects #677

max-sixty · 2015-12-10T23:22:32Z

Closes a 'first-step' of #676. Works only for simple, non-MultiIndexed, pandas objects.

shoyer · 2015-12-11T00:35:56Z

xray/test/test_dataset.py

+    def test_constructor_pandas_single(self):
+
+        ds = DataArray(np.random.rand(4,3), dims=['a', 'b']).to_dataset('a')
+        #'aself.make_example_math_dataset()['foo']


this line should probably be deleted

max-sixty · 2015-12-11T01:41:40Z

Miscreant line removed @shoyer

shoyer · 2015-12-11T01:46:53Z

Maybe add a test that this works on a Series?

max-sixty · 2015-12-15T22:44:57Z

Yes - will do

max-sixty · 2015-12-19T03:45:10Z

This doesn't work well for Series actually, because you have a Dataset with no coords - it's just a single value in each array in the Dataset.

I've added a test for Panels. Let me know if you think that's sufficient, or it's worth spending more time on the Series.

shoyer · 2015-12-19T03:47:29Z

> you have a Dataset with no coords - it's just a single value in each array in the Dataset

Isn't this exactly what you would expect? Series is dict like with single elements (scalars) as values.

shoyer · 2015-12-19T03:49:11Z

I imagine the rule for the Dataset constructor from pandas objects as removing one dimension.

shoyer · 2015-12-19T03:51:51Z

Ah, I understand now -- Series fails your unit test. I think it still gives the expected result, though, e.g., Dataset(Series(range(3))).equals(Dataset(dict(enumerate(range(3))))). In any case this is probably sufficient :).
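
The "removing one dimension" rule can be illustrated without pandas or xray. The following is a minimal sketch in plain Python, using nested lists as stand-ins for pandas objects; `ndim` and `to_mapping` are hypothetical helpers invented for this illustration, not part of either library.

```python
def ndim(obj):
    """Number of nesting levels in a regular, non-empty nested list (a scalar has 0)."""
    depth = 0
    while isinstance(obj, list):
        obj = obj[0]
        depth += 1
    return depth


def to_mapping(labels, data):
    """Treat the outermost axis as dict keys, as a dict-like pandas object would."""
    return dict(zip(labels, data))


# Stand-in for a 1D Series with the default integer index 0, 1, 2.
series_values = [0, 1, 2]
series_map = to_mapping(range(3), series_values)

# Stand-in for a 2D DataFrame with columns "x" and "y", stored column-wise.
frame_columns = [[1.0, 2.0], [3.0, 4.0]]
frame_map = to_mapping(["x", "y"], frame_columns)

# Each value has one dimension fewer than the object it came from.
print(ndim(series_values), "->", {k: ndim(v) for k, v in series_map.items()})
print(ndim(frame_columns), "->", {k: ndim(v) for k, v in frame_map.items()})
```

Under this reading, the Series case reduces to a mapping of labels to scalars, which is the same mapping as `dict(enumerate(range(3)))`.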

max-sixty · 2015-12-19T03:57:41Z

Yes, my last comment wasn't clear. I think it's something to do with ChainMap - dict(series) gives the expected result, but dict(ChainMap(series)) throws an error (actually two...).

Potentially because list(series) gives values (but list(df) gives the keys)?

Regardless I'll add a note in the docs for DataFrame & Panel, and the Series can wait for the moment.

In [30]: series=pd.Series(pd.np.random.rand(4))

In [31]: dict(series)
Out[31]: 
{0: 0.26874240805523286,
 1: 0.3110026841777368,
 2: 0.22873881434409837,
 3: 0.9946345046609677}

In [34]: dict(xray.core.utils.ChainMap(series))
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py:805: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
  type(self).__name__),FutureWarning)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
   1591         if is_float(k) and not self.is_floating():
-> 1592             raise KeyError
   1593 

KeyError: 

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-34-2a2c45b6f2cd> in <module>()
----> 1 dict(xray.core.utils.ChainMap(series))

/Users/maximilianroos/Dropbox/workspace/xray/xray/core/utils.py in __getitem__(self, key)
    310         for mapping in self.maps:
    311             try:
--> 312                 return mapping[key]
    313             except KeyError:
    314                 pass

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
    545 
    546                 # we can try to coerce the indexer (or this will raise)
--> 547                 new_key = self.index._convert_scalar_indexer(key,kind='getitem')
    548                 if type(new_key) != type(key):
    549                     return self.__getitem__(new_key)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in _convert_scalar_indexer(self, key, kind)
    804                 warnings.warn("scalar indexers for index type {0} should be integers and not floating point".format(
    805                     type(self).__name__),FutureWarning)
--> 806             return to_int()
    807 
    808         return key

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in to_int()
    787             ikey = int(key)
    788             if ikey != key:
--> 789                 return self._invalid_indexer('label', key)
    790             return ikey
    791 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in _invalid_indexer(self, form, key)
    942                                                            klass=type(self),
    943                                                            key=key,
--> 944                                                            kind=type(key)))
    945 
    946     def get_duplicates(self):

TypeError: cannot do label indexing on <class 'pandas.core.index.Int64Index'> with these indexers [0.26874240805523286] of <class 'numpy.float64'>
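
The suspicion above (iterating a Series yields values, while a chained mapping expects iteration to yield keys) can be reproduced without pandas. This is a minimal sketch under stated assumptions: `SeriesLike` and `MiniChainMap` are hypothetical stand-ins written for this illustration, and `MiniChainMap.__iter__` is a guess at how a chain map might collect keys, not xray's actual implementation. The sketch raises a plain `KeyError`, whereas the real traceback ends in a `TypeError` because pandas first tries to coerce the float key.

```python
from collections.abc import Mapping


class SeriesLike:
    """Hypothetical stand-in for pandas.Series: iteration yields values, not labels."""

    def __init__(self, values):
        self._data = dict(enumerate(values))

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data.values())


class MiniChainMap(Mapping):
    """Simplified chain map for illustration only."""

    def __init__(self, *maps):
        self.maps = list(maps)

    def __getitem__(self, key):
        for mapping in self.maps:
            try:
                return mapping[key]
            except KeyError:
                pass
        raise KeyError(key)

    def __iter__(self):
        seen = set()
        for mapping in self.maps:
            for key in mapping:  # assumes that iterating a mapping yields its keys
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return sum(1 for _ in self)


series = SeriesLike([0.25, 0.5, 0.75])

# dict() finds .keys() on the object and looks up each label, so this works.
direct = dict(series)

# The chain map iterates the object instead, gets values, and then uses them as keys.
try:
    dict(MiniChainMap(series))
    chained_error = None
except KeyError as err:
    chained_error = err

print(direct)
print(repr(chained_error))
```

If this is indeed the cause, a DataFrame would not hit the problem because iterating it yields column labels.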

max-sixty · 2015-12-19T05:07:20Z

@shoyer I made some more improvements to the docs, although they need a review

shoyer · 2015-12-19T05:48:34Z

doc/data-structures.rst

-  dimensionality equal to the length of ``dims``.
- ``attrs`` is an arbitrary Python dictionary for storing metadata associated
-  with a particular array.
+``coords`` are supplied as dictionary of ``{coord_name: coord}`` where the values are scalar values,


You should also be able to supply DataArrays (and maybe pandas objects?) as coords

max-sixty · 2015-12-21

@shoyer Should we define something as "DataArray-like" if it's:

- A :py:class:`~xray.DataArray`
- A tuple of the form (dims, data[, attrs])
- A pandas object
- A numpy array, whose dimensions will be labelled dim0, dim1, etc

...and then use that definition throughout? There are currently a few references to that (although they're written differently in different places).

I think the only exception is a dim, which is a 1D version of those.

shoyer · 2015-12-22

It would indeed be worth formalizing what it means to be "DataArray-like". The trouble is that it depends a bit on context:

For coords or data_vars, you can supply:

- An xray.DataArray or xray.Variable
- A tuple of the form (dims, data[, attrs])
- A pandas object
- 1D numpy arrays, which are assumed to be along the given dimension
- Scalars

We don't automatically label dimensions (except for 1D arrays), because unlabeled multi-dimensional input is probably user error rather than what they would like to see.

For casting with the DataArray constructor, you can use:

- An xray.DataArray or xray.Variable
- A pandas object
- Scalars or NumPy arrays, whose dimensions will be labeled dim0, dim1, etc (unless dims or coords is supplied)

Here, we don't accept tuples (dims, data[, attrs]) because there's another, more explicit place for such arguments in the DataArray constructor.
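
The coords / data_vars rules listed above could be sketched as a single normalization step. This is a hypothetical illustration in plain Python, not xray code: `normalize_entry` is an invented name, lists stand in for NumPy arrays, "the given dimension" is assumed to mean the dimension named by the entry's key, and DataArray, Variable, and pandas inputs are left out.

```python
def normalize_entry(name, obj):
    """Return (dims, data, attrs) for one coords/data_vars entry."""
    if isinstance(obj, tuple):
        # Explicit (dims, data[, attrs]) form.
        if len(obj) not in (2, 3):
            raise ValueError(f"{name!r}: expected (dims, data[, attrs])")
        dims, data = obj[0], obj[1]
        attrs = dict(obj[2]) if len(obj) == 3 else {}
        if isinstance(dims, str):
            dims = (dims,)
        return tuple(dims), data, attrs
    if isinstance(obj, list):
        if any(isinstance(item, list) for item in obj):
            # Multi-dimensional data without dims is treated as user error.
            raise ValueError(f"{name!r}: dims are required for multi-dimensional data")
        # A 1D sequence is assumed to lie along the dimension it is named after.
        return (name,), obj, {}
    # Anything else is treated as a scalar with no dimensions.
    return (), obj, {}


entries = {
    "x": [10, 20, 30],
    "temperature": (("x",), [1.5, 2.5, 3.5], {"units": "K"}),
    "station": "A",
}
normalized = {name: normalize_entry(name, obj) for name, obj in entries.items()}
```

Rejecting unlabeled multi-dimensional input, rather than inventing dim0, dim1 names, mirrors the reasoning given above for coords and data_vars.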

max-sixty · 2015-12-29

Super, thanks for that.

I'll use those lists. Do you think it would make sense to define the first list as 'DataArray-like', and use it for coords & data_vars? I don't think it's a problem that the DataArray constructor can't accept all of them. But introducing terms tends to be a one-way process, so let's do it deliberately if we do.

max-sixty · 2016-01-02

@shoyer gentle ping on this

shoyer · 2016-01-02T07:37:20Z

This is great, thanks!

max-sixty · 2016-01-02T07:44:05Z

OK, there are still some improvements to make re the comments above, but that can be for the next iteration

shoyer reviewed Dec 11, 2015

shoyer reviewed Dec 19, 2015

max-sixty and others added 3 commits January 2, 2016 02:25

simple non-MultiIndexed pandas objects in DS constructor

2e27b86

doc improvements, particulary dataset docs

cec30e8

what's new

3e56789

shoyer added a commit that referenced this pull request Jan 2, 2016

Merge pull request #677 from SixtyCapital/allow-pandas-to-ds-constructor

eb8d179

Dataset constructor can take pandas objects

shoyer merged commit eb8d179 into pydata:master Jan 2, 2016

max-sixty deleted the allow-pandas-to-ds-constructor branch January 2, 2016 07:44

shoyer mentioned this pull request Feb 2, 2016

Support pandas.Series in the Dataset constructor #740

Closed
