Add DatasetDict.to_pandas #5312
|
@@ -6,11 +6,13 @@ | |||
import warnings | ||||
from io import BytesIO | ||||
from pathlib import Path | ||||
from typing import Callable, Dict, List, Optional, Tuple, Union | ||||
from typing import Callable, Dict, Iterator, List, Optional, Tuple, Union | ||||
|
||||
import fsspec | ||||
import numpy as np | ||||
import pandas as pd | ||||
from huggingface_hub import HfApi | ||||
from typing_extensions import Literal | ||||
|
||||
from datasets.utils.metadata import DatasetMetadata | ||||
|
||||
|
@@ -36,7 +38,11 @@ | |||
logger = logging.get_logger(__name__) | ||||
|
||||
|
||||
class DatasetDict(dict): | ||||
class SplitsError(ValueError): | ||||
pass | ||||
|
||||
|
||||
class DatasetDict(Dict[str, Dataset]): | ||||
"""A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)""" | ||||
|
||||
def _check_values_type(self): | ||||
|
@@ -1417,6 +1423,61 @@ def push_to_hub( | |||
revision=branch, | ||||
) | ||||
|
||||
def to_pandas( | ||||
self, | ||||
splits: Optional[Union[Literal["all"], List[str]]] = None, | ||||
batch_size: Optional[int] = None, | ||||
batched: bool = False, | ||||
) -> Union[pd.DataFrame, Iterator[pd.DataFrame]]: | ||||
"""Returns the dataset as a :class:`pandas.DataFrame`. Can also return a generator for large datasets. | ||||
|
||||
You must specify which splits to convert if the dataset is made of multiple splits. | ||||
|
||||
Args: | ||||
splits (:obj:`Union[Literal["all"], List[str]]`, optional): List of splits to convert to a DataFrame. | ||||
You don't need to specify the splits if there's only one. | ||||
Use splits="all" to convert all the splits (they will be converted in the order of the dictionary). | ||||
batched (:obj:`bool`): Set to :obj:`True` to return a generator that yields the dataset as batches | ||||
of ``batch_size`` rows. Defaults to :obj:`False` (returns the whole dataset at once) | ||||
batch_size (:obj:`int`, optional): The size (number of rows) of the batches if ``batched`` is `True`. | ||||
Defaults to :obj:`datasets.config.DEFAULT_MAX_BATCH_SIZE`. | ||||
|
||||
Returns: | ||||
`pandas.DataFrame` or `Iterator[pandas.DataFrame]` | ||||
|
||||
Example: | ||||
|
||||
If the dataset has one split: | ||||
```py | ||||
>>> df = dataset_dict.to_pandas() | ||||
``` | ||||
|
||||
If the dataset has multiple splits: | ||||
```py | ||||
>>> df_train = dataset_dict["train"].to_pandas() | ||||
>>> df_all = dataset_dict.to_pandas(splits="all") | ||||
>>> df_train_test = dataset_dict.to_pandas(splits=["train", "test"]) | ||||
``` | ||||
""" | ||||
self._check_values_type() | ||||
self._check_values_features() | ||||
if splits is None and len(self) > 1: | ||||
raise SplitsError( | ||||
Review comment: maybe invent a more specific name for this type of error? something like |
||||
"Failed to convert to pandas: please choose which splits to convert. " | ||||
f"Available splits: {list(self)}. For example:" | ||||
'\n df = ds["train"].to_pandas()' | ||||
'\n df = ds.to_pandas(splits=["train", "test"])' | ||||
|
||||
'\n df = ds.to_pandas(splits="all")' | ||||
) | ||||
splits = splits if splits is not None and splits != "all" else list(self) | ||||
bad_splits = list(set(splits) - set(self)) | ||||
if bad_splits: | ||||
raise ValueError(f"Can't convert those splits to pandas : {bad_splits}. Available splits: {list(self)}.") | ||||
Review comment: maybe raise a custom error here too, to be aligned with datasets/src/datasets/utils/info_utils.py (line 48 in cb8dd98), subclassing SplitsError defined above?
|
||||
if batched: | ||||
return (df for split in splits for df in self[split].to_pandas(batch_size=batch_size, batched=batched)) | ||||
else: | ||||
return pd.concat([self[split].to_pandas() for split in splits]) | ||||
|
||||
|
||||
|
||||
class IterableDatasetDict(dict): | ||||
def with_format( | ||||
|
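The split-selection checks and the batched branch of `to_pandas` in the diff above can be sketched without the `datasets` library. This is a minimal illustration under stated assumptions, not the merged implementation: `resolve_splits`, `iter_batches` and `UnknownSplitsError` are hypothetical names (the last one only shows what the suggested subclass of `SplitsError` might look like), and plain lists of integers stand in for the per-split DataFrames.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Union


class SplitsError(ValueError):
    """Raised when several splits exist and none was chosen (name taken from the diff)."""


class UnknownSplitsError(SplitsError):
    """Hypothetical subclass for requested split names that do not exist."""


def resolve_splits(
    available: Sequence[str],
    splits: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Return the split names to convert, following the checks in the diff."""
    available = list(available)
    if splits is None and len(available) > 1:
        raise SplitsError(
            f"Please choose which splits to convert. Available splits: {available}."
        )
    if splits is None or splits == "all":
        # A single split, or an explicit "all": keep the dictionary order.
        return available
    # The diff uses a set difference; a list comprehension is used here
    # so that unknown names are reported in the caller's order.
    bad_splits = [name for name in splits if name not in available]
    if bad_splits:
        raise UnknownSplitsError(
            f"Can't convert those splits: {bad_splits}. Available splits: {available}."
        )
    return list(splits)


def iter_batches(
    data: Dict[str, List[int]], splits: List[str], batch_size: int
) -> Iterator[List[int]]:
    """Yield batches split after split, like the generator in the batched branch."""
    for split in splits:
        rows = data[split]
        for start in range(0, len(rows), batch_size):
            yield rows[start : start + batch_size]
```

Because `UnknownSplitsError` derives from `SplitsError`, which derives from `ValueError`, callers that already catch `ValueError` would keep working if the second `raise` switched to a custom error.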
Review comment: We could also add our `Split` to the type hint.

Review comment: Should we also allow `str`?

Review comment: If one wants to choose one split they already do `dataset_dict["test"].to_pandas()` - I don't think that introducing `splits="test"` would make it particularly easier. Although since we don't support the `Split` API fully (e.g. doing `"train+test[:20]"`) I wouldn't necessarily add `Split` in the type hint.
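One detail relevant to the question of allowing a bare `str`: as far as the diff shows, `bad_splits` is computed with `set(splits) - set(self)`, and a Python string passed to `set()` is iterated character by character. The sketch below reproduces only that set difference on plain lists. `find_bad_splits` and `normalize_splits` are hypothetical helpers written for illustration; they are not part of the PR or of the `datasets` API.

```python
from typing import List, Union


def find_bad_splits(available: List[str], splits: Union[str, List[str]]) -> List[str]:
    """Same set difference as in the diff; sorted only to make the result deterministic."""
    return sorted(set(splits) - set(available))


def normalize_splits(splits: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: wrap a bare split name in a list.

    The special value "all" would still need to be handled before calling this.
    """
    return [splits] if isinstance(splits, str) else list(splits)


available = ["train", "test"]

# A bare string is split into characters, so every letter looks like an unknown split.
print(find_bad_splits(available, "test"))  # ['e', 's', 't']

# Once wrapped in a list, the name is matched as a whole.
print(find_bad_splits(available, normalize_splits("test")))  # []
```

With the code as shown in the diff, `splits="test"` would therefore hit the `ValueError` branch and list single characters as unknown splits, which is an argument for either normalizing strings or rejecting them with a clearer message.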