Return a split Dataset in load_dataset #5301
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Just noticed that we now have to deal with datasets that are both indexed and split. The remaining tests are failing because accessing a split of a dataset made of indexed splits should return an indexed dataset (right now the index is just trashed).
fn_kwargs=fn_kwargs,
num_proc=num_proc,
suffix_template=suffix_template,
new_fingerprint=new_fingerprint + f"-{split_name}",
Suggested change:
- new_fingerprint=new_fingerprint + f"-{split_name}",
+ new_fingerprint=f"{new_fingerprint}-{split_name}" if new_fingerprint else None,
Otherwise, when new_fingerprint is None, this would raise TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
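As a minimal sketch of the guard suggested above, the snippet below wraps the expression in a standalone function. The name split_fingerprint is hypothetical and only used for illustration; it is not part of the PR or of the library.

```python
from typing import Optional


def split_fingerprint(new_fingerprint: Optional[str], split_name: str) -> Optional[str]:
    """Hypothetical helper: derive a per-split fingerprint, passing None through."""
    # Only build the suffixed string when a fingerprint was actually provided;
    # evaluating `None + "-train"` would raise a TypeError.
    return f"{new_fingerprint}-{split_name}" if new_fingerprint else None


print(split_fingerprint("abc123", "train"))  # abc123-train
print(split_fingerprint(None, "train"))      # None
```

Note that the truthiness check also maps an empty string to None, which matches the suggested one-liner.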
...instead of a DatasetDict.

I implemented Dataset.__getitem__ and IterableDataset.__getitem__ to be able to get a split from a dataset. The splits are defined by the ds.info.splits dictionary.

Therefore a dataset is a table that optionally has some splits defined in the dataset info. And a split dataset is the concatenation of all its splits.
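The idea of "a table plus optional split metadata" can be sketched with a toy class. This is not the implementation in this PR: ToyDataset, its rows list, and its splits mapping (split name to number of examples, standing in for ds.info.splits) are assumptions made for illustration, and column access by name is left out.

```python
from typing import Any, Dict, Iterator, List, Optional, Union

Example = Dict[str, Any]


class ToyDataset:
    """Toy stand-in for a table with optional split metadata (not the real Dataset)."""

    def __init__(self, rows: List[Example], splits: Optional[Dict[str, int]] = None):
        self.rows = rows
        # Assumed layout: split name -> number of examples, in concatenation order.
        self.splits = splits or {}
        if self.splits and sum(self.splits.values()) != len(rows):
            raise ValueError("split sizes must add up to the number of rows")

    def __len__(self) -> int:
        return len(self.rows)

    def __iter__(self) -> Iterator[Example]:
        return iter(self.rows)

    def __getitem__(self, key: Union[int, str]) -> Union[Example, "ToyDataset"]:
        if isinstance(key, str):
            # Walk the splits in order and slice out the matching range of rows.
            offset = 0
            for name, size in self.splits.items():
                if name == key:
                    return ToyDataset(self.rows[offset:offset + size])
                offset += size
            raise KeyError(key)
        return self.rows[key]


ds = ToyDataset(
    [{"text": "a"}, {"text": "b"}, {"text": "c"}],
    splits={"train": 2, "test": 1},
)
train = ds["train"]
print(len(train), ds["test"][0])  # 2 {'text': 'c'}
```

Here the whole dataset is the concatenation of its splits, so a split is recovered by slicing at the cumulative offsets.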
I made as few breaking changes as possible. Notable breaking changes:
- load_dataset("potato").keys() / .items() / .values() don't work anymore, since we don't return a dict
- for split_name in load_dataset("potato") no longer yields split names, since we now iterate on the examples

TODO:
This is related to #5189, to extend load_dataset to return datasets without splits
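To make the breaking changes listed above concrete, here is a small sketch that contrasts dict-like iteration with example-level iteration. It uses a plain dict as a stand-in for the old DatasetDict and a hypothetical ToySplitDataset class for the new return type; neither reflects the library's actual code, and the splits attribute is only a stand-in for ds.info.splits.

```python
from typing import Any, Dict, Iterator, List

Example = Dict[str, Any]


class ToySplitDataset:
    """Hypothetical model of the new return type: one table, splits kept as metadata."""

    def __init__(self, rows: List[Example], splits: Dict[str, int]):
        self.rows = rows
        self.splits = splits  # stand-in for ds.info.splits

    def __iter__(self) -> Iterator[Example]:
        # Iterating yields examples, not split names.
        return iter(self.rows)


# Old behavior, modeled with a plain dict keyed by split name.
old = {"train": [{"text": "a"}, {"text": "b"}], "test": [{"text": "c"}]}

# New behavior, modeled as the concatenation of the splits.
new = ToySplitDataset(old["train"] + old["test"], {"train": 2, "test": 1})

old_names = list(old)         # iterating the dict yields split names
new_items = list(new)         # iterating the dataset yields examples
new_names = list(new.splits)  # split names now come from the split metadata

print(old_names)  # ['train', 'test']
print(new_items)  # [{'text': 'a'}, {'text': 'b'}, {'text': 'c'}]
print(new_names)  # ['train', 'test']
```

Under this model, code that relied on .keys(), .items(), or .values() would read the split names from the split metadata instead.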