Return a split Dataset in load_dataset #5301

lhoestq · 2022-11-25T16:35:54Z

...instead of a DatasetDict.

# now supported
ds = load_dataset("squad")
ds[0]  
for example in ds:
    pass

# still works
ds["train"]
ds["validation"]

# new
ds.splits  # Dict[str, Dataset] | None

# soon to be supported (not in this PR)
ds = load_dataset("dataset_with_no_splits")
ds[0]
for example in ds:
    pass

I implemented Dataset.__getitem__ and IterableDataset.__getitem__ to be able to get a split from a dataset.
The splits are defined by the ds.info.splits dictionary.

A dataset is therefore a table that optionally has splits defined in its dataset info, and a split dataset is the concatenation of all its splits.
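
A minimal sketch of this model, not the PR's actual implementation: ToySplitDataset is a hypothetical stand-in class that keeps its rows in a list and reads split row counts from a plain dict in place of ds.info.splits. Integer indexing reads from the concatenated table, string indexing returns one split, and iteration yields examples.

```python
from typing import Dict, List, Optional, Union


class ToySplitDataset:
    """Hypothetical stand-in: one table whose splits are consecutive row ranges."""

    def __init__(self, rows: List[dict], split_sizes: Optional[Dict[str, int]] = None):
        self.rows = rows
        # Stand-in for ds.info.splits: split name -> number of rows, in table order.
        self.split_sizes = split_sizes
        if split_sizes is not None and sum(split_sizes.values()) != len(rows):
            raise ValueError("split sizes must add up to the number of rows")

    def __len__(self) -> int:
        return len(self.rows)

    def __iter__(self):
        # Iterating yields examples, not split names.
        return iter(self.rows)

    def __getitem__(self, key: Union[int, str]):
        if isinstance(key, str):
            if self.split_sizes is None or key not in self.split_sizes:
                raise KeyError(key)
            start = 0
            for name, size in self.split_sizes.items():
                if name == key:
                    # The returned split is itself a dataset with no splits.
                    return ToySplitDataset(self.rows[start:start + size])
                start += size
        return self.rows[key]

    @property
    def splits(self) -> Optional[Dict[str, "ToySplitDataset"]]:
        if self.split_sizes is None:
            return None
        return {name: self[name] for name in self.split_sizes}


ds = ToySplitDataset(
    rows=[{"id": 0}, {"id": 1}, {"id": 2}],
    split_sizes={"train": 2, "validation": 1},
)
```

Here ds[0] is the first row of the whole table, while ds["validation"][0] is the first row of the validation split only.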

I made as few breaking changes as possible. Notable breaking changes:

load_dataset("potato").keys() / .items() / .values() don't work anymore, since we no longer return a dict
for split_name in load_dataset("potato") no longer yields split names, since we now iterate over the examples
..
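
Code that relied on the dict interface could go through ds.splits instead. The helper below is only a sketch under that assumption: iter_named_splits is a hypothetical name, not part of the library, and the objects used to exercise it are stand-ins, not real datasets. It accepts either an old-style mapping or an object exposing the splits attribute described above, and yields nothing for a dataset without splits.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def iter_named_splits(ds):
    """Yield (split_name, split) pairs from a dict-like or a split dataset.

    Hypothetical helper: assumes the new object exposes `.splits`
    as Dict[str, Dataset] | None, as described in this PR.
    """
    if isinstance(ds, Mapping):
        # Old behaviour: load_dataset returned a DatasetDict (a dict subclass).
        yield from ds.items()
        return
    splits = getattr(ds, "splits", None)
    if splits is not None:
        yield from splits.items()


# Stand-ins for demonstration only; no real dataset is loaded.
old_style = {"train": [1, 2], "validation": [3]}
new_style = SimpleNamespace(splits={"train": [1, 2], "validation": [3]})
no_splits = SimpleNamespace(splits=None)
```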

TODO:

This is related to #5189, to extend load_dataset to return datasets without splits

HuggingFaceDocBuilderDev · 2022-11-25T16:41:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq · 2022-11-28T19:19:57Z

Just noticed that now we have to deal with indexed & split datasets. The remaining tests are failing because one should be able to get an indexed dataset when accessing the split of a dataset made of indexed splits (right now the index is just trashed)

polinaeterna · 2022-11-30T16:53:33Z

src/datasets/arrow_dataset.py

+                    fn_kwargs=fn_kwargs,
+                    num_proc=num_proc,
+                    suffix_template=suffix_template,
+                    new_fingerprint=new_fingerprint + f"-{split_name}",


Suggested change

-                    new_fingerprint=new_fingerprint + f"-{split_name}",
+                    new_fingerprint=f"{new_fingerprint}-{split_name}" if new_fingerprint else None,

Otherwise, when new_fingerprint is None, this raises TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
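
The failure and the guard can be reproduced in isolation. In this sketch both functions are hypothetical helpers, not code from arrow_dataset.py; they only mirror the two expressions in the suggestion.

```python
from typing import Optional


def split_fingerprint(new_fingerprint: Optional[str], split_name: str) -> Optional[str]:
    # Mirrors the suggested change: derive a per-split fingerprint
    # only when the caller actually supplied one.
    return f"{new_fingerprint}-{split_name}" if new_fingerprint else None


def split_fingerprint_unguarded(new_fingerprint: Optional[str], split_name: str) -> str:
    # Mirrors the original line; raises TypeError when new_fingerprint is None.
    return new_fingerprint + f"-{split_name}"
```

With the guard, a missing fingerprint stays None for every split, so the caller can fall back to whatever default it already uses.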

lhoestq added 5 commits November 25, 2022 17:20

Return a split Dataset in load_dataset

37fe9d1

update return type

7b1b03f

update some tests

3937c42

fix filter

9f52dba

removed unused import

9174b92

lhoestq and others added 15 commits November 28, 2022 16:20

Merge branch 'main' into merge-ds-and-dsdict

d8b0b31

revert removal of .split

952234c

make the reader return a split that is not split

cdbe443

fix builder.as_streaming_dataset

c9c7150

fix old DatasetDict

bbcdc09

don't resolve features in IterableDataset.from_splits

72c57a1

make sure SplitDict is JSON serializable

5cafbd2

first Dataset.from-dict test

5950ab4

update builder tests

acebe0b

update fs tests

0fc72df

update iterable tests

7e6c712

update load tests

e8ffd6d

update split tests

9f89ef4

update push_to_hub tests

fbb028e

update image/audio tests

764ecbd

lhoestq mentioned this pull request Nov 28, 2022

Reduce friction in tabular dataset workflow by eliminating having splits when dataset is loaded #5189

Open

polinaeterna reviewed Nov 30, 2022

mariosasko mentioned this pull request Dec 5, 2022

Add DatasetDict.to_pandas #5312

Closed

lhoestq closed this Feb 21, 2023

albertvillanova deleted the merge-ds-and-dsdict branch September 24, 2023 10:06
