Add DatasetDict.to_pandas #5312
Conversation
The current implementation is what I had in mind, i.e. concatenate all splits by default. However, I think most tabular datasets come as a single split, so for that use case it wouldn't change the UX if we raise when there is more than one split. And for multiple splits, the user either passes a list, or they can pass `"all"`.
I think it's better to raise an error when there are multiple splits but no split is specified, so that users know for sure which data they are working with. I imagine a case where a user loads a dataset they don't know much about (e.g. what splits it has); if they get a concatenation of everything, it might lead to incorrect processing or interpretations that would be hard to notice.
I just changed it to raise an error if there are multiple splits. The error shows an example of how to choose a split to convert.
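For illustration, a hedged sketch of the behaviour described above; `DatasetDict.to_pandas` is the method proposed in this PR, and the dataset name and error text below are placeholders, not the PR's actual message:

```python
from datasets import load_dataset

ds = load_dataset("glue", "mrpc")  # any dataset with several splits
ds.to_pandas()
# Raises something like:
#   ValueError: please pick the split(s) to convert, e.g.
#   ds["train"].to_pandas() or ds.to_pandas(splits=["train"])
```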
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
src/datasets/dataset_dict.py (outdated)

@@ -1417,6 +1422,42 @@ def push_to_hub(
            revision=branch,
        )

    def to_pandas(
        self, batch_size: Optional[int] = None, batched: bool = False
Could we add a `splits` parameter here? And let users get an output which is all splits attached together?

df = load_dataset("blah").to_pandas(splits='all')
# or
df = load_dataset("blah").to_pandas(splits=["a", "b", "c"])
Just added support for `splits='all'` and `splits=["a", "b", "c"]` :)
Let me know if it sounds good to you!
Thanks, this looks awesome! ❤️
Thanks for this enhancement that will improve the UX for tabular data!!
Below are some comments, questions, and nits...
@@ -1417,6 +1423,57 @@ def push_to_hub(
            revision=branch,
        )

    def to_pandas(
        self,
        splits: Optional[Union[Literal["all"], List[str]]] = None,
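For context, a minimal sketch of what such a method could look like, assuming the selected splits are converted individually and stacked with `pandas.concat`; this is illustrative, not the PR's actual implementation (batched conversion and feature checks are omitted):

```python
from typing import List, Literal, Optional, Union

import pandas as pd


def to_pandas(self, splits: Optional[Union[Literal["all"], List[str]]] = None) -> pd.DataFrame:
    # Refuse to silently concatenate when several splits exist and none was chosen.
    if splits is None and len(self) > 1:
        raise ValueError(
            "Please pick the splits to convert, e.g. to_pandas(splits=['train']) "
            "or to_pandas(splits='all')."
        )
    split_names = list(self) if splits is None or splits == "all" else splits
    # Convert each selected split with Dataset.to_pandas and stack the frames.
    return pd.concat([self[name].to_pandas() for name in split_names], ignore_index=True)
```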
We could also add our `Split` to the type hint.

df_all = dataset_dict.to_pandas(splits=Split.ALL)
Should we also allow `str`?

df_test = dataset_dict.to_pandas(splits="test")
If one wants to choose one split they can already do `dataset_dict["test"].to_pandas()` - I don't think that introducing `splits="test"` would make it particularly easier.
Although since we don't support the `Split` API fully (e.g. doing `"train+test[:20]"`), I wouldn't necessarily add `Split` to the type hint.
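For reference, the split-string syntax mentioned above is the one accepted by `load_dataset`; the dataset name below is just an example:

```python
from datasets import load_dataset

# All of "train" plus the first 20 examples of "test", returned as a single Dataset.
subset = load_dataset("imdb", split="train+test[:20]")
```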
Co-authored-by: Albert Villanova del Moral <[email protected]>
Thanks for the review, I've updated the type hint and added a line to raise an error on bad splits :)
Merging #5301 would eliminate the need for this PR, no? In the meantime, I find the current API cleaner.
Let me know if it sounds good to you @mariosasko @albertvillanova :)
Like it! Added a small suggestion about errors; feel free to ignore if you think it's redundant.
self._check_values_type()
self._check_values_features()
if splits is None and len(self) > 1:
    raise SplitsError(
Maybe invent a more specific name for this type of error? Smth like `SplitsNotSpecifiedError`/`SplitsNotProvidedError`? (subclassing `SplitsError`?)
splits = splits if splits is not None and splits != "all" else list(self)
bad_splits = list(set(splits) - set(self))
if bad_splits:
    raise ValueError(f"Can't convert those splits to pandas : {bad_splits}. Available splits: {list(self)}.")
Maybe raise a custom error here too, to be aligned with the `UnexpectedSplits` exception in `info_utils.py`:

datasets/src/datasets/utils/info_utils.py, line 48 in cb8dd98:
class UnexpectedSplits(SplitsVerificationException):

(subclassing `SplitsError` defined above?)
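A possible shape for the suggested hierarchy; the class names come from the review comments above and are not existing classes in the library:

```python
class SplitsError(ValueError):
    """Base error for split-related problems in DatasetDict.to_pandas."""


class SplitsNotSpecifiedError(SplitsError):
    """Raised when the DatasetDict has several splits and none was selected."""


class UnexpectedSplitsError(SplitsError):
    """Raised when the requested split names are not present in the DatasetDict."""
```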
I'm still not convinced.
@mariosasko the issue we're dealing with is that in tabular scenarios we often don't have splits in the dataset, and imposing that concept on people using the library hampers adoption.
@adrinjalali This PR proposes a solution inconsistent with the existing API (in other words, a solution that clutters our API 🙂). Moreover, our library primarily focuses on larger-than-RAM datasets, and tabular datasets don't (directly) fall into this group. Instead of the temporary "fix" proposed here, it makes much more sense to align …
Closing this one for now.
From discussions in #5189, for tabular data it doesn't really make sense to have to pick a split first (e.g. `ds["train"].to_pandas()`), because many datasets are not split.

In this PR I added `to_pandas` to `DatasetDict`, which returns the DataFrame. If there's only one split, you don't need to specify the split name.

EDIT: and if a dataset has multiple splits, you can select the ones to convert (see the sketch below).

I do have one question though @merveenoyan @adrinjalali @mariosasko:
Should we raise an error if there are multiple splits and ask the user to choose one explicitly?
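A hedged sketch of the usage described above; the dataset names are placeholders, and `DatasetDict.to_pandas` is the API proposed in this PR rather than something already in the library:

```python
from datasets import load_dataset

# Before: even an unsplit tabular dataset required picking a split first.
df = load_dataset("some/tabular-dataset")["train"].to_pandas()

# With this PR, a single-split dataset converts directly.
df = load_dataset("some/tabular-dataset").to_pandas()

# With multiple splits, you choose what to convert explicitly.
ds = load_dataset("some/multi-split-dataset")
df_all = ds.to_pandas(splits="all")
df_subset = ds.to_pandas(splits=["train", "test"])
```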