Add DatasetDict.to_pandas #5312


Closed · wants to merge 6 commits into from

Conversation

@lhoestq (Member) commented Nov 29, 2022

From discussions in #5189, for tabular data it doesn't really make sense to have to do

df = load_dataset(...)["train"].to_pandas()

because many datasets are not split.

In this PR I added to_pandas to DatasetDict, which returns the DataFrame:

If there's only one split, you don't need to specify the split name:

df = load_dataset(...).to_pandas()

EDIT: and if a dataset has multiple splits:

df = load_dataset(...).to_pandas(splits=["train", "test"])
# or
df = load_dataset(...).to_pandas(splits="all")

# raises an error because you need to select the split(s) to convert
load_dataset(...).to_pandas()

I do have one question though @merveenoyan @adrinjalali @mariosasko:

Should we raise an error if there are multiple splits and ask the user to choose one explicitly?

@adrinjalali

The current implementation is what I had in mind, i.e. concatenate all splits by default.

However, I think most tabular datasets would come as a single split. So for that use case, it wouldn't change the UX if we raise when there is more than one split.

And for multiple splits, the user either passes a list, or they can pass splits="all" to have all splits concatenated.

@polinaeterna (Contributor) commented Nov 29, 2022

I think it's better to raise an error when there are multiple splits but no split is specified, so that users know for sure which data they are working with. Imagine a user loads a dataset they don't know much about (e.g. what splits it has): if they silently get a concatenation of everything, it could lead to incorrect processing or interpretations that would be hard to notice.
("Explicit is better than implicit.")

@lhoestq (Member Author) commented Nov 29, 2022

I just changed to raise an error if there are multiple splits. The error shows an example of how to choose a split to convert.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@@ -1417,6 +1422,42 @@ def push_to_hub(
revision=branch,
)

def to_pandas(
self, batch_size: Optional[int] = None, batched: bool = False


could we add a splits parameter here? And let users get an output which is all splits attached together?

df = load_dataset("blah").to_pandas(splits='all')
# or
df = load_dataset("blah").to_pandas(splits=["a", "b", "c"])

Member Author

Just added support for splits='all' and splits=["a", "b", "c"] :)

Let me know if it sounds good to you!

@adrinjalali left a comment

Thanks, this looks awesome! ❤️

@albertvillanova (Member) left a comment

Thanks for this enhancement that will improve UX for tabular data!!

Below are some comments, questions, and nits...

@@ -1417,6 +1423,57 @@ def push_to_hub(
revision=branch,
)

def to_pandas(
self,
splits: Optional[Union[Literal["all"], List[str]]] = None,
Member

We could also add our Split to the type hint.

df_all = dataset_dict.to_pandas(splits=Split.ALL)

Member

Should we also allow str?

df_test = dataset_dict.to_pandas(splits="test")

Member Author

If one wants to choose one split they already do dataset_dict["test"].to_pandas() - I don't think that introducing splits="test" would make it particularly easier.

Although since we don't support the Split API fully (e.g. doing "train+test[:20]"), I wouldn't necessarily add Split to the type hint.

@lhoestq (Member Author) commented Dec 5, 2022

Thanks for the review, I've updated the type hint and added a line to raise an error on bad splits :)

@mariosasko (Collaborator) commented Dec 5, 2022

Merging #5301 would eliminate the need for this PR, no?

In the meantime, I find the current API cleaner.

@lhoestq (Member Author) commented Dec 6, 2022

This solution is simpler than #5301 and covers most cases for tabular datasets, so I'm in favor of merging this one and putting #5301 on standby.

@lhoestq (Member Author) commented Dec 7, 2022

Let me know if it sounds good to you @mariosasko @albertvillanova :)

@polinaeterna (Contributor) left a comment

like it! added a small suggestion about errors, feel free to ignore if you think it's redundant.

self._check_values_type()
self._check_values_features()
if splits is None and len(self) > 1:
raise SplitsError(
Contributor

Maybe invent a more specific name for this type of error? Something like SplitsNotSpecifiedError/SplitsNotProvidedError? (Subclassing SplitsError?)

splits = splits if splits is not None and splits != "all" else list(self)
bad_splits = list(set(splits) - set(self))
if bad_splits:
raise ValueError(f"Can't convert those splits to pandas : {bad_splits}. Available splits: {list(self)}.")
Contributor

maybe raise a custom error here too? to be aligned with UnexpectedSplits exception in info_utils.py:

class UnexpectedSplits(SplitsVerificationException):

(subclassing SplitsError defined above?)
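The suggestion above can be sketched as a small error hierarchy plus a validation helper. All names here are illustrative (not from the actual `datasets` codebase), assuming `SplitsError` is the base class the PR defines:

```python
class SplitsError(ValueError):
    """Base class for split-selection errors in DatasetDict.to_pandas."""


class SplitsNotSpecifiedError(SplitsError):
    """Several splits exist but none was selected."""


class UnexpectedSplitsError(SplitsError):
    """A requested split does not exist in the DatasetDict."""


def check_splits(requested, available):
    """Validate a split selection, raising the specific errors above."""
    # Several splits but no explicit selection: force the user to choose.
    if requested is None and len(available) > 1:
        raise SplitsNotSpecifiedError(
            f"Please select split(s) to convert; available: {sorted(available)}"
        )
    # A single split, or 'all', resolves to every available split.
    requested = list(available) if requested in (None, "all") else requested
    unexpected = sorted(set(requested) - set(available))
    if unexpected:
        raise UnexpectedSplitsError(f"Unknown splits: {unexpected}")
    return requested
```

Since both errors subclass `SplitsError` (itself a `ValueError`), existing callers catching `ValueError` keep working while new code can catch the specific cases.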

@mariosasko (Collaborator)

I'm still not convinced. If DatasetDict needs this method and there is no other way, then IMO it would make more sense to return a dictionary with the splits converted to pd.DataFrame.

@adrinjalali

@mariosasko the issue we're dealing with is that in tabular scenarios, we often don't have splits in the dataset, and imposing that concept to people dealing with the library hampers adoption.

@mariosasko (Collaborator)

@adrinjalali This PR proposes a solution inconsistent with the existing API (in other words, a solution that clutters our API 🙂). Moreover, our library primarily focuses on larger-than-RAM datasets, and tabular datasets don't (directly) fall into this group.

Instead of the temporary "fix" proposed here, it makes much more sense to align load_dataset with both tabular and DL workflows "in a consistent way", so I suggest we continue our discussion from #5189 to have this resolved by version 3.0.

@lhoestq (Member Author) commented Jan 25, 2023

closing this one for now

@lhoestq lhoestq closed this Jan 25, 2023
@albertvillanova albertvillanova deleted the add-datasetdict-topandas branch September 24, 2023 10:06