Dataset constructor can take pandas objects #677

Merged
merged 3 commits into from
Jan 2, 2016
53 changes: 31 additions & 22 deletions doc/data-structures.rst
@@ -74,9 +74,11 @@ in index values in the same way.
Coordinates can take the following forms:

- A list of ``(dim, ticks[, attrs])`` pairs with length equal to the number of dimensions
-- A dictionary of ``{coord_name: coord}`` where the values are scaler values,
-1D arrays or tuples (tuples in the same form as above). This form lets you supply other
-coordinates than those corresponding to dimensions (more on these later).
+- A dictionary of ``{coord_name: coord}`` where the values are each a scalar value,
+a 1D array or a tuple. Tuples are be in the same form as the above, and
+multiple dimensions can be supplied with the form ``(dims, data[, attrs])``.
+Supplying as a tuple allows other coordinates than those corresponding to
+dimensions (more on these later).

As a list of tuples:

@@ -92,6 +94,14 @@ As a dictionary:
'ranking': ('space', [1, 2, 3])},
dims=['time', 'space'])

+As a dictionary with coords across multiple dimensions:
+
+.. ipython:: python
+
+xray.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
+'ranking': (('space', 'time'), np.arange(12).reshape(4,3))},
+dims=['time', 'space'])
+
If you create a ``DataArray`` by supplying a pandas
:py:class:`~pandas.Series`, :py:class:`~pandas.DataFrame` or
:py:class:`~pandas.Panel`, any non-specified arguments in the
@@ -194,8 +204,7 @@ to access any variable in a dataset, datasets have four key properties:
each dimension (e.g., ``{'x': 6, 'y': 6, 'time': 8}``)
- ``data_vars``: a dict-like container of DataArrays corresponding to variables
- ``coords``: another dict-like container of DataArrays intended to label points
-used in ``data_vars`` (e.g., 1-dimensional arrays of numbers, datetime
-objects or strings)
+used in ``data_vars`` (e.g., arrays of numbers, datetime objects or strings)
- ``attrs``: an ``OrderedDict`` to hold arbitrary metadata

The distinction between whether a variables falls in data or coordinates
@@ -223,18 +232,16 @@ Creating a Dataset
~~~~~~~~~~~~~~~~~~

To make an :py:class:`~xray.Dataset` from scratch, supply dictionaries for any
-variables, coordinates and attributes you would like to insert into the
-dataset.
+variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``).

-For the ``data_vars`` and ``coords`` arguments, keys should be the name of the
-variable and values should be scalars, 1d arrays or tuples of the form
-``(dims, data[, attrs])`` sufficient to label each array:
+``data_vars`` are supplied as a dictionary with each key as the name of the variable and each
+value as one of:
+- A :py:class:`~xray.DataArray`
+- A tuple of the form ``(dims, data[, attrs])``
+- A pandas object

-- ``dims`` should be a sequence of strings.
-- ``data`` should be a numpy.ndarray (or array-like object) that has a
-dimensionality equal to the length of ``dims``.
-- ``attrs`` is an arbitrary Python dictionary for storing metadata associated
-with a particular array.
+``coords`` are supplied as dictionary of ``{coord_name: coord}`` where the values are scalar values,
Member

You should also be able to supply DataArrays (and maybe pandas objects?) as coords

Collaborator Author

@shoyer Should we define something as "DataArray-like" if it's:

  • A :py:class:`~xray.DataArray`
  • A tuple of the form (dims, data[, attrs])
  • A pandas object
  • A numpy array, whose dimensions will be labelled dim0, dim1, etc

...and then use that definition throughout? There are currently a few references to that (although currently written differently in different places).

I think the only exception is a dim, which is a 1D version of those.

Member

It would indeed be worth formalizing what it means to be "DataArray-like". The trouble is that it depends a bit on context:

For coords or data_vars, you can supply:

  • An xray.DataArray or xray.Variable
  • A tuple of the form (dims, data[, attrs])
  • A pandas object
  • 1D numpy arrays, which are assumed to be along the given dimension
  • Scalars

We don't automatically label dimensions (except for 1D arrays), because that's probably user error rather than what they would like to see.

For casting with the DataArray constructor, you can use:

  • An xray.DataArray or xray.Variable
  • A pandas object
  • Scalars or NumPy arrays, whose dimensions will be labeled dim0, dim1, etc. (unless dims or coords is supplied)

Here, we don't accept tuples (dims, data[, attrs]) because there's another, more explicit place for such arguments in the DataArray constructor.

Collaborator Author

Super, thanks for that.

I'll use those lists. Do you think it would make sense to define the first list as 'DataArray-like', and use it for coords & data_vars? I don't think it's a problem that the DataArray constructor can't be constructed with all of them. But introducing terms tends to be a one-way process, so let's do it deliberately if we do.

Collaborator Author

@shoyer gentle ping on this

+arrays or tuples in the form of ``(dims, data[, attrs])``.

Let's create some fake data for the example we show above:

@@ -259,8 +266,8 @@ Notice that we did not explicitly include coordinates for the "x" or "y"
dimensions, so they were filled in array of ascending integers of the proper
length.

-We can also pass :py:class:`xray.DataArray` objects or a pandas object as values
-in the dictionary instead of tuples:
+Here we pass :py:class:`xray.DataArray` objects or a pandas object as values
+in the dictionary:

.. ipython:: python

@@ -271,13 +278,15 @@ in the dictionary instead of tuples:

xray.Dataset({'bar': foo.to_pandas()})

-Where a pandas object is supplied, the names of its indexes are used as dimension
+Where a pandas object is supplied as a value, the names of its indexes are used as dimension
names, and its data is aligned to any existing dimensions.

-You can also create an dataset from a :py:class:`pandas.DataFrame` with
-:py:meth:`Dataset.from_dataframe <xray.Dataset.from_dataframe>` or from a
-netCDF file on disk with :py:func:`~xray.open_dataset`. See
-:ref:`pandas` and :ref:`io`.
+You can also create an dataset from:
+- A :py:class:`pandas.DataFrame` or :py:class:`pandas.Panel` along its columns and items
+respectively, by passing it into the :py:class:`xray.Dataset` directly
+- A :py:class:`pandas.DataFrame` with :py:meth:`Dataset.from_dataframe <xray.Dataset.from_dataframe>`,
+which will additionally handle MultiIndexes See :ref:`pandas`
+- A netCDF file on disk with :py:func:`~xray.open_dataset`. See :ref:`io`.

Dataset contents
~~~~~~~~~~~~~~~~
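To make the ``(dims, data[, attrs])`` tuple form discussed in this file more concrete, here is a minimal pure-Python sketch of how such a tuple could be unpacked and checked against the rule that the number of dimension names must match the dimensionality of the data. `nested_shape` and `unpack_variable_tuple` are hypothetical helpers written for illustration only; they are not part of xray, and nested lists stand in for NumPy arrays.

```python
def nested_shape(data):
    """Return the shape of a rectangular nested list; scalars have shape ()."""
    shape = []
    while isinstance(data, (list, tuple)):
        shape.append(len(data))
        if not data:
            break
        data = data[0]
    return tuple(shape)


def unpack_variable_tuple(value):
    """Hypothetical helper: split (dims, data[, attrs]) and check dims against data."""
    if not isinstance(value, tuple) or len(value) not in (2, 3):
        raise TypeError("expected a tuple of the form (dims, data[, attrs])")
    dims, data = value[0], value[1]
    attrs = dict(value[2]) if len(value) == 3 else {}
    if isinstance(dims, str):
        dims = (dims,)  # a single name such as 'space' means one dimension
    dims = tuple(dims)
    shape = nested_shape(data)
    if len(shape) != len(dims):
        raise ValueError(
            f"{len(dims)} dims given for {len(shape)}-dimensional data"
        )
    # Map each dimension name to its length, in the order the names were given.
    return dict(zip(dims, shape)), attrs


# One-dimensional coordinate, as in the 'ranking' example of the docs.
sizes_1d, attrs_1d = unpack_variable_tuple(("space", [1, 2, 3]))

# Two-dimensional value: the order of names in dims must follow the data layout.
sizes_2d, attrs_2d = unpack_variable_tuple(
    (("time", "space"), [[0, 1, 2], [3, 4, 5]], {"units": "rank"})
)
```

The sketch shows why the order of names in ``dims`` matters: each name is paired with the axis at the same position, so swapping the names silently swaps the sizes assigned to them.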
6 changes: 4 additions & 2 deletions doc/whats-new.rst
@@ -94,8 +94,10 @@ Enhancements

Notice that ``shift`` moves data independently of coordinates, but ``roll``
moves both data and coordinates.
-- Assigning a ``pandas`` object to a ``Dataset`` directly is now permitted. Its
-index names correspond to the `dims`` of the ``Dataset``, and its data is aligned
+- Assigning a ``pandas`` object to the variable of ``Dataset`` directly is now permitted. Its
+index names correspond to the ``dims`` of the ``Dataset``, and its data is aligned
+- Passing a :py:class:`pandas.DataFrame` or :py:class:`pandas.Panel` to a Dataset constructor
+is now permitted
- New function :py:func:`~xray.broadcast` for explicitly broadcasting
``DataArray`` and ``Dataset`` objects against each other. For example:

2 changes: 1 addition & 1 deletion xray/core/dataset.py
@@ -205,7 +205,7 @@ def __init__(self, data_vars=None, coords=None, attrs=None,
             data_vars = {}
         if coords is None:
             coords = set()
-        if data_vars or coords:
+        if data_vars is not None or coords is not None:
Member

This was easier than I thought!

             self._set_init_vars_and_dims(data_vars, coords, compat)
         if attrs is not None:
             self.attrs = attrs
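The PR does not state why this one-line change is needed. One likely reason is that truth-testing a `pandas.DataFrame` raises `ValueError`, so a guard written as `if data_vars or coords:` fails as soon as a DataFrame is passed as `data_vars`. The sketch below reproduces that effect with a hypothetical stand-in class so that it runs without pandas. `AmbiguousTruthTable`, `guard_by_truthiness` and `guard_by_identity` are illustrative names, not xray code, and the sketch assumes the line just above the hunk is `if data_vars is None:`.

```python
class AmbiguousTruthTable:
    """Hypothetical stand-in for a pandas.DataFrame: truth-testing it raises."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __bool__(self):
        raise ValueError("The truth value of a table is ambiguous.")


def guard_by_truthiness(data_vars=None, coords=None):
    """Mirror of the old guard: `or` truth-tests data_vars via __bool__."""
    if data_vars is None:
        data_vars = {}
    if coords is None:
        coords = set()
    return bool(data_vars or coords)


def guard_by_identity(data_vars=None, coords=None):
    """Mirror of the new guard: identity checks never call __bool__."""
    if data_vars is None:
        data_vars = {}
    if coords is None:
        coords = set()
    return data_vars is not None or coords is not None


table = AmbiguousTruthTable({"a": [1, 2], "b": [3, 4]})

try:
    guard_by_truthiness(table)
    old_guard_failed = False
except ValueError:
    old_guard_failed = True

new_guard_result = guard_by_identity(table)
```

Note that once the `None` defaults have been replaced, the identity-based condition is always true in this sketch, so the setup step would run even for an empty call; the old truthiness-based guard skipped it in that case.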
15 changes: 14 additions & 1 deletion xray/test/test_dataset.py
@@ -199,7 +199,7 @@ def test_constructor_auto_align(self):
         with self.assertRaisesRegexp(ValueError, 'conflicting sizes'):
             Dataset({'a': a, 'b': b, 'e': e})

-    def test_constructor_pandas(self):
+    def test_constructor_pandas_sequence(self):

         ds = self.make_example_math_dataset()
         pandas_objs = OrderedDict(
@@ -214,6 +214,19 @@ def test_constructor_pandas(self):
         ds_based_on_pandas = Dataset(variables=pandas_objs, coords=ds.coords, attrs=ds.attrs)
         self.assertDatasetEqual(ds, ds_based_on_pandas)

+    def test_constructor_pandas_single(self):
+
+        das = [
+            DataArray(np.random.rand(4,3), dims=['a', 'b']), # df
+            DataArray(np.random.rand(4,3,2), dims=['a','b','c']), # panel
+            ]
+
+        for da in das:
+            pandas_obj = da.to_pandas()
+            ds_based_on_pandas = Dataset(pandas_obj)
+            for dim in ds_based_on_pandas.data_vars:
+                self.assertArrayEqual(ds_based_on_pandas[dim], pandas_obj[dim])
+

     def test_constructor_compat(self):
         data = OrderedDict([('x', DataArray(0, coords={'y': 1})),