Add DatasetDict.to_pandas #5312


Closed · wants to merge 6 commits into from

Conversation

@lhoestq (Member) commented Nov 29, 2022

From discussions in #5189, for tabular data it doesn't really make sense to have to do

df = load_dataset(...)["train"].to_pandas()

because many datasets are not split.

In this PR I added to_pandas to DatasetDict, which returns the DataFrame:

If there's only one split, you don't need to specify the split name:

df = load_dataset(...).to_pandas()

EDIT: and if a dataset has multiple splits:

df = load_dataset(...).to_pandas(splits=["train", "test"])
# or
df = load_dataset(...).to_pandas(splits="all")

# raises an error because you need to select the split(s) to convert
load_dataset(...).to_pandas()

I do have one question though @merveenoyan @adrinjalali @mariosasko:

Should we raise an error if there are multiple splits and ask the user to choose one explicitly?

@adrinjalali

The current implementation is what I had in mind, i.e. concatenate all splits by default.

However, I think most tabular datasets would come as a single split. So for that use case, it wouldn't change the UX if we raise when there is more than one split.

And for multiple splits, the user either passes a list, or they can pass splits="all" to have all splits concatenated.

@polinaeterna (Contributor) commented Nov 29, 2022

I think it's better to raise an error when there are multiple splits but no split is specified, so that users know for sure which data they are working with. Imagine a user loads a dataset they don't know much about (e.g. what splits it has): if they silently get a concatenation of everything, it could lead to incorrect processing or interpretations that would be hard to notice.
("Explicit is better than implicit.")

@lhoestq (Member Author) commented Nov 29, 2022

I just changed to raise an error if there are multiple splits. The error shows an example of how to choose a split to convert.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@@ -1417,6 +1422,42 @@ def push_to_hub(
revision=branch,
)

def to_pandas(
self, batch_size: Optional[int] = None, batched: bool = False


could we add a splits parameter here? And let users get an output which is all splits attached together?

df = load_dataset("blah").to_pandas(splits='all')
# or
df = load_dataset("blah").to_pandas(splits=["a", "b", "c"])

Member Author

Just added support for splits='all' and splits=["a", "b", "c"] :)

Let me know if it sounds good to you!

@adrinjalali left a comment

Thanks, this looks awesome! ❤️

@albertvillanova (Member) left a comment

Thanks for this enhancement that will improve UX for tabular data!!

Below are some comments, questions, and nits...

@@ -1417,6 +1423,57 @@ def push_to_hub(
revision=branch,
)

def to_pandas(
self,
splits: Optional[Union[Literal["all"], List[str]]] = None,
Member

We could also add our Split to the type hint.

df_all = dataset_dict.to_pandas(splits=Split.ALL)

Member

Should we also allow str?

df_test = dataset_dict.to_pandas(splits="test")

Member Author

If one wants to choose one split they already do dataset_dict["test"].to_pandas() - I don't think that introducing splits="test" would make it particularly easier.

Although since we don't support the Split API fully (e.g. doing "train+test[:20]"), I wouldn't necessarily add Split to the type hint.

@lhoestq (Member Author) commented Dec 5, 2022

Thanks for the review, I've updated the type hint and added a line to raise an error on bad splits :)

@mariosasko (Collaborator) commented Dec 5, 2022

Merging #5301 would eliminate the need for this PR, no?

In the meantime, I find the current API cleaner.

@lhoestq (Member Author) commented Dec 6, 2022

This solution is simpler than #5301 and covers most cases for tabular datasets, so I'm in favor of merging this one and putting #5301 on standby.

@lhoestq (Member Author) commented Dec 7, 2022

Let me know if it sounds good to you @mariosasko @albertvillanova :)

@polinaeterna (Contributor) left a comment

like it! added a small suggestion about errors, feel free to ignore if you think it's redundant.

self._check_values_type()
self._check_values_features()
if splits is None and len(self) > 1:
raise SplitsError(
Contributor

Maybe invent a more specific name for this type of error? Something like SplitsNotSpecifiedError/SplitsNotProvidedError? (Subclassing SplitsError?)

splits = splits if splits is not None and splits != "all" else list(self)
bad_splits = list(set(splits) - set(self))
if bad_splits:
raise ValueError(f"Can't convert those splits to pandas : {bad_splits}. Available splits: {list(self)}.")
Contributor

maybe raise a custom error here too? to be aligned with UnexpectedSplits exception in info_utils.py:

class UnexpectedSplits(SplitsVerificationException):

(subclassing SplitsError defined above?)
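The suggestion above can be sketched as a small error hierarchy plus a validation helper. All names here are illustrative (not from the actual `datasets` codebase), assuming `SplitsError` is the base class the PR defines:

```python
class SplitsError(ValueError):
    """Base class for split-selection errors in DatasetDict.to_pandas."""


class SplitsNotSpecifiedError(SplitsError):
    """Several splits exist but none was selected."""


class UnexpectedSplitsError(SplitsError):
    """A requested split does not exist in the DatasetDict."""


def check_splits(requested, available):
    """Validate a split selection, raising the specific errors above."""
    # Several splits but no explicit selection: force the user to choose.
    if requested is None and len(available) > 1:
        raise SplitsNotSpecifiedError(
            f"Please select split(s) to convert; available: {sorted(available)}"
        )
    # A single split, or 'all', resolves to every available split.
    requested = list(available) if requested in (None, "all") else requested
    unexpected = sorted(set(requested) - set(available))
    if unexpected:
        raise UnexpectedSplitsError(f"Unknown splits: {unexpected}")
    return requested
```

Since both errors subclass `SplitsError` (itself a `ValueError`), existing callers catching `ValueError` keep working while new code can catch the specific cases.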

@mariosasko (Collaborator)

I'm still not convinced. If DatasetDict needs this method and there is no other way, then IMO it would make more sense to return a dictionary with the splits converted to pd.DataFrame.

@adrinjalali

@mariosasko the issue we're dealing with is that in tabular scenarios, we often don't have splits in the dataset, and imposing that concept to people dealing with the library hampers adoption.

@mariosasko (Collaborator)

@adrinjalali This PR proposes a solution inconsistent with the existing API (in other words, a solution that clutters our API 🙂). Moreover, our library primarily focuses on larger-than-RAM datasets, and tabular datasets don't (directly) fall into this group.

Instead of the temporary "fix" proposed here, it makes much more sense to align load_dataset with both tabular and DL workflows "in a consistent way", so I suggest we continue our discussion from #5189 to have this resolved by version 3.0.

@lhoestq (Member Author) commented Jan 25, 2023

closing this one for now

@lhoestq lhoestq closed this Jan 25, 2023
@albertvillanova albertvillanova deleted the add-datasetdict-topandas branch September 24, 2023 10:06