ENH: Added to_json_schema #14904
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,3 +4,5 @@ pathlib | |
backports.lzma | ||
py | ||
PyCrypto | ||
mock | ||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,3 +18,4 @@ pymysql | |
psycopg2 | ||
s3fs | ||
beautifulsoup4 | ||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,3 +18,4 @@ pymysql | |
beautifulsoup4 | ||
s3fs | ||
xarray | ||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -60,6 +60,7 @@ JSON | |
:toctree: generated/ | ||
|
||
json_normalize | ||
build_table_schema | ||
|
||
.. currentmodule:: pandas | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2033,6 +2033,126 @@ using Hadoop or Spark. | |
df | ||
df.to_json(orient='records', lines=True) | ||
|
||
|
||
.. _io.table_schema: | ||
|
||
Table Schema | ||
'''''''''''' | ||
|
||
.. versionadded:: 0.20.0 | ||
|
||
`Table Schema`_ is a spec for describing tabular datasets as a JSON | ||
object. The JSON includes information on the field names, types, and | ||
other attributes. You can use the orient ``table`` to build | ||
|
||
a JSON string with two fields, ``schema`` and ``data``. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame( | ||
{'A': [1, 2, 3], | ||
'B': ['a', 'b', 'c'], | ||
'C': pd.date_range('2016-01-01', freq='d', periods=3), | ||
}, index=pd.Index(range(3), name='idx')) | ||
df | ||
df.to_json(orient='table', date_format="iso") | ||
|
||
The ``schema`` field contains the ``fields`` key, which itself contains | ||
a list of column name to type pairs, including the ``Index`` or ``MultiIndex`` | ||
(see below for a list of types). | ||
The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index | ||
is unique. | ||
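The shape of the ``schema`` field described above can be sketched with the standard library alone. This is an illustration, not the pandas implementation: `build_schema` is a hypothetical helper, and the field names and types are made up for the example.

```python
import json


def build_schema(fields, index_name, index_values):
    """Assemble a Table Schema-like dict (hypothetical helper, not pandas API)."""
    schema = {"fields": [{"name": n, "type": t} for n, t in fields]}
    # primaryKey is only included when the index values are unique.
    if len(set(index_values)) == len(index_values):
        schema["primaryKey"] = [index_name]
    return schema


columns = [("idx", "integer"), ("A", "integer")]
unique = build_schema(columns, "idx", [0, 1, 2])
dupes = build_schema(columns, "idx", [1, 1])
print(json.dumps(unique))
print("primaryKey" in dupes)
```

The second call omits ``primaryKey`` because its index labels repeat.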
|
||
The second field, ``data``, contains the serialized data with the ``records`` | ||
orient. | ||
The index is included, and any datetimes are ISO 8601 formatted, as required | ||
by the Table Schema spec. | ||
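As a rough illustration of ISO 8601 formatting, the sketch below formats standard-library datetimes as UTC strings. `to_iso_utc` is a hypothetical helper, and treating naive values as already being in UTC is an assumption of this sketch; it does not reproduce pandas' serializer.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(ts):
    """Format a datetime as an ISO 8601 UTC string with millisecond precision."""
    if ts.tzinfo is None:
        # Assumption: naive values are taken to already be in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc)
    # %f yields microseconds; drop the last three digits to keep milliseconds.
    return ts.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


naive = datetime(2016, 1, 1)
central = datetime(2016, 1, 1, tzinfo=timezone(timedelta(hours=-6)))
print(to_iso_utc(naive))
print(to_iso_utc(central))
```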
|
||
The full list of supported types is described in the Table Schema | ||
spec. This table shows the mapping from pandas types: | ||
|
||
=============== ================= | ||
Pandas type     Table Schema type | ||
=============== ================= | ||
int64           integer | ||
float64         number | ||
bool            boolean | ||
datetime64[ns]  datetime | ||
timedelta64[ns] duration | ||
categorical     any | ||
object          string | ||
=============== ================= | ||
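The mapping in this table can be expressed as a plain lookup. This is only a sketch keyed on the labels used in the table: the names `PANDAS_TO_TABLE_SCHEMA` and `table_schema_type` are hypothetical, ``object`` is mapped to the spec's ``string`` type, and falling back to ``any`` for unlisted types is an assumption.

```python
# Lookup keyed on the pandas type labels from the table above.
PANDAS_TO_TABLE_SCHEMA = {
    "int64": "integer",
    "float64": "number",
    "bool": "boolean",
    "datetime64[ns]": "datetime",
    "timedelta64[ns]": "duration",
    "categorical": "any",
    "object": "string",
}


def table_schema_type(pandas_type):
    """Return the Table Schema type name for a pandas type label."""
    # Assumption: anything not listed falls back to the permissive "any".
    return PANDAS_TO_TABLE_SCHEMA.get(pandas_type, "any")


print(table_schema_type("float64"))
print(table_schema_type("timedelta64[ns]"))
```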
|
||
A few notes on the generated table schema: | ||
|
||
- The ``schema`` object contains a ``pandas_version`` field. This contains | ||
the version of pandas' dialect of the schema, and will be incremented | ||
with each revision. | ||
- All dates are converted to UTC when serializing, even timezone-naive values, | ||
which are treated as UTC with an offset of 0. | ||
|
||
.. ipython:: python | ||
|
||
from pandas.io.json import build_table_schema | ||
s = pd.Series(pd.date_range('2016', periods=4)) | ||
build_table_schema(s) | ||
|
||
- Datetimes with a timezone (before serializing) include an additional field | ||
``tz`` with the time zone name (e.g. ``'US/Central'``). | ||
|
||
.. ipython:: python | ||
|
||
s_tz = pd.Series(pd.date_range('2016', periods=12, | ||
tz='US/Central')) | ||
build_table_schema(s_tz) | ||
|
||
- Periods are converted to timestamps before serialization, and so have the | ||
same behavior of being converted to UTC. In addition, periods will contain | ||
an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``. | ||
|
||
.. ipython:: python | ||
|
||
s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC', | ||
periods=4)) | ||
build_table_schema(s_per) | ||
|
||
- Categoricals use the ``any`` type and an ``enum`` constraint listing | ||
the set of possible values. Additionally, an ``ordered`` field is included. | ||
|
||
.. ipython:: python | ||
|
||
s_cat = pd.Series(pd.Categorical(['a', 'b', 'a'])) | ||
build_table_schema(s_cat) | ||
|
||
- A ``primaryKey`` field, containing an array of labels, is included | ||
*if the index is unique*: | ||
|
||
.. ipython:: python | ||
|
||
s_dupe = pd.Series([1, 2], index=[1, 1]) | ||
build_table_schema(s_dupe) | ||
|
||
- The ``primaryKey`` behavior is the same with MultiIndexes, but in this | ||
case the ``primaryKey`` is an array: | ||
|
||
.. ipython:: python | ||
|
||
s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'), | ||
(0, 1)])) | ||
build_table_schema(s_multi) | ||
|
||
- The default naming roughly follows these rules: | ||
|
||
+ For Series, the ``object.name`` is used. If that is ``None``, then the | ||
name is ``values``. | ||
+ For DataFrames, the stringified version of the column name is used. | ||
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a | ||
fallback to ``index`` if that is ``None``. | ||
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name, | ||
then ``level_<i>`` is used. | ||
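The default naming rules listed above can be sketched as two small pure-Python functions. Both `series_field_name` and `index_field_names` are hypothetical helpers written for this illustration; they mirror the rules as stated rather than the actual pandas code.

```python
def series_field_name(name):
    """Field name for a Series: its name, or 'values' when the name is None."""
    return "values" if name is None else str(name)


def index_field_names(names):
    """Field names for an index, given its list of level names.

    A single-element list stands for a plain Index; longer lists stand
    for the levels of a MultiIndex.
    """
    if len(names) == 1:
        return ["index" if names[0] is None else str(names[0])]
    return ["level_{}".format(i) if n is None else str(n)
            for i, n in enumerate(names)]


print(series_field_name(None))
print(index_field_names([None]))
print(index_field_names(["a", None]))
```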
|
||
|
||
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/ | ||
|
||
HTML | ||
---- | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,7 @@ Highlights include: | |
- Building pandas for development now requires ``cython >= 0.23`` (:issue:`14831`) | ||
- The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>` | ||
- Switched the test framework to `pytest`_ (:issue:`13097`) | ||
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>` | ||
|
||
.. _pytest: http://doc.pytest.org/en/latest/ | ||
|
||
|
@@ -154,6 +155,40 @@ New Behavior: | |
|
||
df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum() | ||
|
||
.. _whatsnew_0200.enhancements.table_schema: | ||
|
||
Table Schema Output | ||
^^^^^^^^^^^^^^^^^^^ | ||
|
||
The new orient ``'table'`` for :meth:`DataFrame.to_json` | ||
will generate a `Table Schema`_ compatible string representation of | ||
the data. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame( | ||
{'A': [1, 2, 3], | ||
'B': ['a', 'b', 'c'], | ||
'C': pd.date_range('2016-01-01', freq='d', periods=3), | ||
}, index=pd.Index(range(3), name='idx')) | ||
df | ||
df.to_json(orient='table') | ||
Review comment: This will raise an error, as you didn't specify
Review comment: Yeah, I don't like it either. How about this: I change the default
Review comment: see my comment above about this. you can make the default |
||
|
||
|
||
See :ref:`IO: Table Schema for more<io.table_schema>`. | ||
|
||
Additionally, the repr for ``DataFrame`` and ``Series`` can now publish | ||
this JSON Table Schema representation of the Series or DataFrame if you are | ||
using IPython (or another frontend like `nteract`_ using the Jupyter messaging | ||
protocol). | ||
This gives frontends like the Jupyter notebook and `nteract`_ | ||
more flexibility in how they display pandas objects, since they have | ||
more information about the data. | ||
You must enable this by setting the ``display.html.table_schema`` option to ``True``. | ||
|
||
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/ | ||
.. _nteract: http://nteract.io/ | ||
|
||
.. _whatsnew_0200.enhancements.other: | ||
|
||
Other enhancements | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ | |
import operator | ||
import weakref | ||
import gc | ||
import json | ||
|
||
import numpy as np | ||
import pandas.lib as lib | ||
|
@@ -129,6 +130,37 @@ def __init__(self, data, axes=None, copy=False, dtype=None, | |
object.__setattr__(self, '_data', data) | ||
object.__setattr__(self, '_item_cache', {}) | ||
|
||
def _ipython_display_(self): | ||
    try: | ||
        from IPython.display import display | ||
    except ImportError: | ||
        return None | ||
|
||
    # Series doesn't define _repr_html_ or _repr_latex_ | ||
    latex = self._repr_latex_() if hasattr(self, '_repr_latex_') else None | ||
    html = self._repr_html_() if hasattr(self, '_repr_html_') else None | ||
    table_schema = self._repr_table_schema_() | ||
Review comment: Should the
in the resulting output? Oh nevermind, I see the |
||
    # We need the initial newline since we aren't going through the | ||
    # usual __repr__. See | ||
    # https://github.com/pandas-dev/pandas/pull/14904#issuecomment-277829277 | ||
    text = "\n" + repr(self) | ||
|
||
    reprs = {"text/plain": text, "text/html": html, "text/latex": latex, | ||
             "application/vnd.dataresource+json": table_schema} | ||
    reprs = {k: v for k, v in reprs.items() if v} | ||
    display(reprs, raw=True) | ||
Review comment: Would the best way to test this to be mocking
Review comment: I hope so, cause that's what I do here https://github.com/pandas-dev/pandas/pull/14904/files#diff-81a94f6a5e3a0de7887baaab7b55f579R145 😉
Review comment: Weird, when I was viewing this the codecov extension was showing this segment as not covered.
Review comment: Good catch! We didn't have IPython installed in the build that runs the coverage report, so it was skipped. Just pushed a commit adding it. |
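The mocking approach discussed in this thread can be sketched without IPython installed. The `Publisher` class below is a hypothetical stand-in that builds a MIME bundle the same way as the diff (dropping empty representations) and hands it to whatever `display` callable it is given, so a `unittest.mock.Mock` can capture the call. It is not the test from the pull request.

```python
from unittest import mock


class Publisher:
    """Hypothetical stand-in for an object with an IPython-style display hook."""

    def __init__(self, html=None, table_schema=None):
        self.html = html
        self.table_schema = table_schema

    def publish(self, display):
        reprs = {"text/plain": "\n" + repr(self),
                 "text/html": self.html,
                 "application/vnd.dataresource+json": self.table_schema}
        # Drop representations that are None or empty before publishing.
        reprs = {k: v for k, v in reprs.items() if v}
        display(reprs, raw=True)


fake_display = mock.Mock()
Publisher(html="<table></table>").publish(fake_display)
bundle = fake_display.call_args[0][0]
print(sorted(bundle))
```

Because no table schema was supplied, the captured bundle carries only the plain-text and HTML entries.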
||
|
||
def _repr_table_schema_(self): | ||
    """ | ||
    Not a real Jupyter special repr method, but we use the same | ||
    naming convention. | ||
Review comment: 😄 one step towards general adoption I think. 😉 |
||
    """ | ||
    if config.get_option("display.html.table_schema"): | ||
        data = self.head(config.get_option('display.max_rows')) | ||
        payload = json.loads(data.to_json(orient='table'), | ||
                             object_pairs_hook=collections.OrderedDict) | ||
        return payload | ||
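The ``object_pairs_hook`` argument used here makes ``json.loads`` build every JSON object, including nested ones, as an ``OrderedDict``, so key order from the serialized string is preserved. A minimal sketch with a hand-written payload (not captured pandas output):

```python
import json
from collections import OrderedDict

# Hand-written stand-in for the string returned by to_json(orient='table').
raw = '{"schema": {"fields": [], "pandas_version": "0.20.0"}, "data": []}'

payload = json.loads(raw, object_pairs_hook=OrderedDict)
print(list(payload))
print(type(payload["schema"]).__name__)
```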
|
||
def _validate_dtype(self, dtype): | ||
""" validate the passed dtype """ | ||
|
||
|
@@ -1094,7 +1126,7 @@ def __setstate__(self, state): | |
strings before writing. | ||
""" | ||
|
||
-def to_json(self, path_or_buf=None, orient=None, date_format='epoch', | ||
+def to_json(self, path_or_buf=None, orient=None, date_format=None, | ||
double_precision=10, force_ascii=True, date_unit='ms', | ||
default_handler=None, lines=False): | ||
""" | ||
|
@@ -1129,10 +1161,17 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch', | |
- index : dict like {index -> {column -> value}} | ||
- columns : dict like {column -> {index -> value}} | ||
- values : just the values array | ||
- table : dict like {'schema': {schema}, 'data': {data}} | ||
describing the data, and the data component is | ||
like ``orient='records'``. | ||
|
||
date_format : {'epoch', 'iso'} | ||
.. versionchanged:: 0.20.0 | ||
|
||
date_format : {None, 'epoch', 'iso'} | ||
Type of date conversion. `epoch` = epoch milliseconds, | ||
`iso`` = ISO8601, default is epoch. | ||
`iso` = ISO8601. The default depends on the `orient`. For | ||
`orient='table'`, the default is `'iso'`. For all other orients, | ||
the default is `'epoch'`. | ||
double_precision : The number of decimal places to use when encoding | ||
floating point values, default 10. | ||
force_ascii : force encoded string to be ASCII, default True. | ||
|
@@ -1151,14 +1190,53 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch', | |
|
||
.. versionadded:: 0.19.0 | ||
|
||
|
||
Returns | ||
------- | ||
same type as input object with filtered info axis | ||
|
||
See Also | ||
-------- | ||
pd.read_json | ||
|
||
Examples | ||
-------- | ||
|
||
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']], | ||
... index=['row 1', 'row 2'], | ||
... columns=['col 1', 'col 2']) | ||
>>> df.to_json(orient='split') | ||
'{"columns":["col 1","col 2"], | ||
"index":["row 1","row 2"], | ||
"data":[["a","b"],["c","d"]]}' | ||
|
||
Encoding/decoding a DataFrame using ``'index'`` formatted JSON: | ||
|
||
>>> df.to_json(orient='index') | ||
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}' | ||
|
||
Encoding/decoding a DataFrame using ``'records'`` formatted JSON. | ||
Note that index labels are not preserved with this encoding. | ||
|
||
>>> df.to_json(orient='records') | ||
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]' | ||
Review comment: nice examples! |
||
|
||
Encoding with Table Schema | ||
|
||
>>> df.to_json(orient='table') | ||
'{"schema": {"fields": [{"name": "index", "type": "string"}, | ||
{"name": "col 1", "type": "string"}, | ||
Review comment: shouldn't the version be here? |
||
{"name": "col 2", "type": "string"}], | ||
"primaryKey": "index", | ||
"pandas_version": "0.20.0"}, | ||
Review comment: is |
||
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"}, | ||
{"index": "row 2", "col 1": "c", "col 2": "d"}]}' | ||
""" | ||
|
||
from pandas.io import json | ||
if date_format is None and orient == 'table': | ||
    date_format = 'iso' | ||
elif date_format is None: | ||
    date_format = 'epoch' | ||
return json.to_json(path_or_buf=path_or_buf, obj=self, orient=orient, | ||
                    date_format=date_format, | ||
                    double_precision=double_precision, | ||
|
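The orient-dependent default introduced in this hunk can be isolated as a small pure function. `resolve_date_format` is a hypothetical name used only for this sketch; the branching mirrors the lines added to ``to_json`` above.

```python
def resolve_date_format(orient=None, date_format=None):
    """Pick the effective date_format when the caller did not pass one."""
    if date_format is None and orient == 'table':
        # Table Schema requires ISO 8601 dates.
        return 'iso'
    elif date_format is None:
        return 'epoch'
    # An explicit value always wins.
    return date_format


print(resolve_date_format(orient='table'))
print(resolve_date_format(orient='records'))
print(resolve_date_format(orient='table', date_format='epoch'))
```

Using ``None`` as the sentinel keeps an explicit ``date_format='epoch'`` distinguishable from "not specified", which a literal default of ``'epoch'`` could not do.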
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
from .json import to_json, read_json, loads, dumps # noqa | ||
from .normalize import json_normalize # noqa | ||
from .table_schema import build_table_schema # noqa | ||
|
||
-del json, normalize # noqa | ||
+del json, normalize, table_schema # noqa |
Review comment: you may want a ref tag here