Return a split Dataset in load_dataset #5301

Closed
wants to merge 20 commits into from

lhoestq (Member) commented Nov 25, 2022

...instead of a DatasetDict.

from datasets import load_dataset

# now supported: index and iterate over examples directly
ds = load_dataset("squad")
ds[0]
for example in ds:
    pass

# still works
ds["train"]
ds["validation"]

# new
ds.splits  # Dict[str, Dataset] | None

# soon to be supported (not in this PR)
ds = load_dataset("dataset_with_no_splits")
ds[0]
for example in ds:
    pass

I implemented Dataset.__getitem__ and IterableDataset.__getitem__ to be able to get a split from a dataset.
The splits are defined by the ds.info.splits dictionary.

Therefore a dataset is a table that optionally has some splits defined in the dataset info. And a split dataset is the concatenation of all its splits.
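To make the model above concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `SplitTable` and `split_sizes` are hypothetical names, and a plain list of dicts stands in for the underlying table. It only illustrates the idea that a dataset is one table, that splits are optional metadata over contiguous row ranges, and that a string key returns a split while an integer key returns an example.

```python
from typing import Dict, List, Optional, Union


class SplitTable:
    """Toy stand-in for a dataset with optional splits (hypothetical, not the `datasets` API)."""

    def __init__(self, rows: List[dict], split_sizes: Optional[Dict[str, int]] = None):
        self.rows = rows
        # Assumption: rows are stored split after split, so each split is a
        # contiguous range described by its size (playing the role of ds.info.splits).
        self.split_sizes = split_sizes

    def __len__(self) -> int:
        return len(self.rows)

    def __iter__(self):
        # Iterating yields examples, not split names.
        return iter(self.rows)

    @property
    def splits(self) -> Optional[Dict[str, "SplitTable"]]:
        # None when no splits are defined, otherwise a dict of sub-tables.
        if self.split_sizes is None:
            return None
        out: Dict[str, "SplitTable"] = {}
        start = 0
        for name, size in self.split_sizes.items():
            out[name] = SplitTable(self.rows[start:start + size])
            start += size
        return out

    def __getitem__(self, key: Union[int, str]):
        if isinstance(key, str):
            splits = self.splits
            if splits is None or key not in splits:
                raise KeyError(key)
            return splits[key]
        return self.rows[key]


ds = SplitTable(
    rows=[{"id": 0}, {"id": 1}, {"id": 2}],
    split_sizes={"train": 2, "validation": 1},
)
first_example = ds[0]        # an example from the concatenated table
train = ds["train"]          # a split, itself a table without splits
all_ids = [example["id"] for example in ds]
```

In this sketch the whole table is the concatenation of its splits, so `ds[0]` and `ds["train"][0]` refer to the same row.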

I made as few breaking changes as possible. Notable breaking changes:

  • load_dataset("potato").keys() / .items() / .values() don't work anymore, since we don't return a dict
  • for split_name in load_dataset("potato") no longer iterates over split names, since we now iterate on the examples
  • ..
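For code that relied on the dict-like behavior, one possible migration path is a small helper that accepts either return type. This is only a sketch: `split_items` is a hypothetical name, and it assumes the `.splits` attribute proposed in this PR (a dict of splits, or None when no splits are defined).

```python
from collections.abc import Mapping
from types import SimpleNamespace


def split_items(ds) -> dict:
    """Return {split_name: split} for either return type (hypothetical helper)."""
    if isinstance(ds, Mapping):
        # Old behavior: a dict-like object keyed by split name.
        return dict(ds)
    # Proposed behavior (assumption): a `.splits` attribute that is a dict or None.
    splits = getattr(ds, "splits", None)
    return dict(splits) if splits is not None else {}


# Stand-ins for the two return types; real datasets are not needed for the sketch.
old_style = {"train": [1, 2], "validation": [3]}
new_style = SimpleNamespace(splits={"train": [1, 2], "validation": [3]})
no_splits = SimpleNamespace(splits=None)

old_names = list(split_items(old_style))
new_names = list(split_items(new_style))
none_names = list(split_items(no_splits))
```

Callers can then loop over `split_items(ds).items()` regardless of which object `load_dataset` returned.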

TODO:

  • Update push_to_hub
  • Update save_to_disk/load_from_disk
  • check for other breaking changes
  • fix existing tests
  • add new tests
  • docs

This is related to #5189, to extend load_dataset to return datasets without splits

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq (Member, Author) commented Nov 28, 2022

Just noticed that we now have to deal with datasets that are both indexed and split. The remaining tests are failing because accessing a split of a dataset made of indexed splits should return an indexed dataset (right now the index is just trashed).

fn_kwargs=fn_kwargs,
num_proc=num_proc,
suffix_template=suffix_template,
new_fingerprint=new_fingerprint + f"-{split_name}",
Contributor

Suggested change
- new_fingerprint=new_fingerprint + f"-{split_name}",
+ new_fingerprint=f"{new_fingerprint}-{split_name}" if new_fingerprint else None,

Otherwise this would raise "unsupported operand type(s) for +: 'NoneType' and 'str'" when new_fingerprint is None.
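The failure mode and the guarded version can be reproduced in isolation. The sketch below is standalone Python, not code from the PR; `split_fingerprint` is a hypothetical helper that wraps the suggested expression.

```python
from typing import Optional


def split_fingerprint(new_fingerprint: Optional[str], split_name: str) -> Optional[str]:
    """Derive a per-split fingerprint, passing None through unchanged."""
    return f"{new_fingerprint}-{split_name}" if new_fingerprint else None


# The unguarded concatenation fails when no fingerprint was provided.
unguarded_failed = False
try:
    fingerprint = None
    fingerprint + "-train"
except TypeError:
    unguarded_failed = True

with_value = split_fingerprint("abc123", "train")
without_value = split_fingerprint(None, "train")
```

Note that the truthiness check also maps an empty string to None, which matches the suggested expression.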

@lhoestq lhoestq closed this Feb 21, 2023
@albertvillanova albertvillanova deleted the merge-ds-and-dsdict branch September 24, 2023 10:06
3 participants