Reduce friction in tabular dataset workflows by eliminating splits when a dataset is loaded #5189



Open
merveenoyan opened this issue Nov 2, 2022 · 33 comments
Labels
enhancement (New feature or request)

Comments

@merveenoyan
Contributor

Feature request

Sorry for the cryptic name, but I'd like to explain using the code itself. When I want to load a specific dataset from a repository (for instance, this one: https://huggingface.co/datasets/inria-soda/tabular-benchmark):

from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
print(next(iter(dataset["train"])))

The datasets library is essentially designed for people who'd like to use benchmark datasets of various modalities to fine-tune their models, and these benchmark datasets usually come with pre-defined train and test splits. For tabular workflows, however, relying on a fixed validation split usually ends up with the model overfitting to it, so users typically create their own splits and use validation techniques like stratified k-fold cross-validation (StratifiedKFold), or GridSearchCV when tuning hyperparameters (see the sketch below the example). Even in this paper a benchmark is introduced, but the splitting is done by the authors themselves.
It's a bit confusing for the average tabular user to load a dataset and see "train", so it would be nice if we did not load datasets into a split called train by default.

from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
-print(next(iter(dataset["train"])))
+print(next(iter(dataset)))
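
For context, here is a minimal sketch of the cross-validation workflow described above (assuming scikit-learn; the "target" column name is illustrative, and categorical columns may need encoding first):

from datasets import load_dataset
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Today you still have to go through the artificial "train" split.
df = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"])["train"].to_pandas()
X, y = df.drop(columns=["target"]), df["target"]  # illustrative target column name
# Create your own folds instead of relying on a fixed validation split.
scores = cross_val_score(HistGradientBoostingRegressor(), X, y, cv=5)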

Motivation

I explained it above 😅

Your contribution

I think this is quite a big change that seems small (e.g. how do we determine which datasets should not be loaded into a train split?), so it's best if we discuss it first!

@merveenoyan added the enhancement (New feature or request) label on Nov 2, 2022
@mariosasko
Collaborator

I have to admit I'm not a fan of this idea, as it would result in inconsistent behavior between tabular and non-tabular datasets, which is confusing without the context you provided. Instead, we could consider returning a Dataset object rather than a DatasetDict if there is only one split in the generated dataset. But then again, I think this lib is a bit too old to make such changes. @lhoestq @albertvillanova WDYT?

@lhoestq
Member

lhoestq commented Nov 4, 2022

We can brainstorm here to see how we could make it happen, and then, depending on the options, see if it's a change we can do.

I'll start with a first pass at the reasoning:

Currently, not passing split= in load_dataset means "return a dict with each split".

Now, what would happen if a dataset has no splits? Ideally it should return a single Dataset, and passing split= would make no sense. So depending on the dataset content, not passing split= should return either a dict or a Dataset. In particular, these two cases should work:

# case 1: dataset without split
ds = load_dataset("dataset_without_split")
ds[0], ds["column_name"], list(ds) # we want this

# case 2: dataset with splits
ds = load_dataset("dataset_with_splits")
ds["train"] # this works and can't be changed
ds = load_dataset("dataset_with_splits", split="train")
ds[0], ds["column_name"], list(ds) # this works and can't be changed

I can see several ideas:

  1. allow load_dataset to return a different object based on the dataset content - either a Dataset or a DatasetDict
    • we can update get_dataset_split_names to return None or a list so users can know in advance what object will be returned; they can also use isinstance a posteriori (see the sketch after this list)
    • but in this case we expect users to be careful when loading datasets and to always take extra steps to check whether they got a Dataset or a DatasetDict
  2. merge the Dataset and DatasetDict objects
    • they already share many functions: map, filter, push_to_hub, etc.
    • we can define ds[0] to be the first item of the first split, and consider that the user accesses rows from the full table of all the splits concatenated
    • however, there is a collision when doing ds["column_name"] or ds["train"] that we need to address: the first returns a list, while the other returns a Dataset.
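
Here is what the isinstance check from idea 1 could look like in user code (sketch; the repo id is a placeholder):

from datasets import load_dataset, DatasetDict

ds = load_dataset("some_dataset")  # placeholder repo id
if isinstance(ds, DatasetDict):
    train_ds = ds["train"]  # the dataset has named splits
else:
    train_ds = ds  # a single unsplit Dataset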

What are your opinions on these two ideas? Do you have other ideas in mind?

@mariosasko
Collaborator

I like the first idea more (concatenating splits doesn't seem that useful, no?). This is a significant breaking change, so I think we should run a poll (or something similar) to gather more info on the actual "expected behavior", and wait for Datasets 3.0 if we decide to implement it.

PS: @thomwolf also suggested the same thing a while ago (#743 (comment)).

@thomwolf
Member

thomwolf commented Nov 4, 2022

I think it's an interesting improvement to the user experience for a case that comes up often (no splits), so I would definitely support it.

Indeed, I would be more in favor of option 2, carefully handling the possible collisions, rather than having load_dataset return various types of objects.

@severo
Collaborator

severo commented Nov 4, 2022

Related: if a dataset only has one split, we don't show the split selector in the dataset viewer on the Hub; e.g. compare https://huggingface.co/datasets/hf-internal-testing/fixtures_image_utils/viewer/image/test with https://huggingface.co/datasets/glue/viewer/mnli/test.

See https://github.com/huggingface/moon-landing/pull/3858 for more details (internal)

@merveenoyan
Contributor Author

I feel like the second idea is a bit overkill.
@severo I would say it's a bit irrelevant to the problem we have, but it's a separate problem that @polinaeterna is solving at the moment. 😅 (also discussed on Slack)

@severo
Collaborator

severo commented Nov 7, 2022

OK, sorry for polluting the thread. The relation I saw with the dataset viewer is that, from a UX point of view, we hide the concepts of split and configuration whenever possible; this issue feels like doing the same in the datasets library.

@adrinjalali

I would agree that returning different types based on the content of the dataset might be confusing.

We could do something similar to what the fetch_* or load_* functions from sklearn.datasets do, which is to have an argument that changes the type of the returned object. For instance, load_iris returns a dict-like object, but load_iris(..., return_X_y=True) returns a tuple.

Here we could have a similar argument, such as return_X, which would then only return a single Dataset or an array.
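
For reference, the scikit-learn pattern mentioned above looks like this (return_X is a hypothetical datasets-side analog, not an existing parameter):

from sklearn.datasets import load_iris

bunch = load_iris()                # dict-like Bunch by default
X, y = load_iris(return_X_y=True)  # the flag switches the return type to a tuple

# hypothetical analog for datasets:
# ds = load_dataset("some_dataset", return_X=True)  # -> a single Dataset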

@lhoestq
Member

lhoestq commented Nov 7, 2022

I feel like the second idea is a bit overkill.

Overkill in what sense?

Here we can have a similar arg such as return_X which would then only return a single DataSet or an array.

Right now one can already pass split="all" to get one Dataset object with all the data in it (unsplit). We could also have something like return_all=True to make the API clearer.

Indeed, I would be more in favor of option 2, carefully handling the possible collisions, rather than having load_dataset return various types of objects.

I think it would be OK to handle the collision by allowing both ds["train"] and ds["column_name"] (and maybe adding something like ds.splits for those who want to iterate over the splits or add new ones).
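
For example (split="all" works today, as noted above; the accessors below it are the hypothetical part):

from datasets import load_dataset

ds = load_dataset("dataset_with_splits", split="all")  # one Dataset with all the splits concatenated

# hypothetical accessors under option 2:
# ds["train"]        -> the train split (a Dataset)
# ds["column_name"]  -> a column (a list)
# ds.splits          -> the split names, to iterate over the splits or add new ones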

@thomasw21
Contributor

Would it make sense to remove the notion of "split" from load_dataset? I feel a lot of it comes from wanting some sort of grouping of more or less similar datasets. "train"/"test"/"validation" are the traditional ones, but some datasets have many more splits.

Would it make sense to force load_dataset to only load a single Dataset object, and fail if it doesn't point to one? And to have another method, something like load_dataset_group_info, that can return a fairly arbitrary info class (Dict, List, whatever), whose individual entries you'd then pass to load_dataset to run anything? Typically, I don't think DatasetDict.map is really that helpful, but that's my personal opinion. This would help make things more readable (typically, knowing whether an object is a Dataset or a DatasetDict).

@lhoestq
Member

lhoestq commented Nov 8, 2022

Would it make sense to remove the notion of "split" in load_dataset?

I think we need to keep it - though in practice people can name the splits whatever they want anyway.

Would it make sense to force load_dataset to only load a single Dataset object, and fail if it doesn't point to one.

Ideally we need to keep backward compatibility - in particular the load_dataset + ds["train"] one

@thomasw21
Contributor

I think we need to keep it - though in practice people can name the splits whatever they want anyway.

It was my understanding that the whole issue was that load_dataset returned multiple types of objects.

Ideally we need to keep backward compatibility - in particular the load_dataset + ds["train"] one

Yeah, sorry, I meant ideally. One could always develop a load_dataset_v2, deprecate the first one, and remove it in the longer term.

@lhoestq
Member

lhoestq commented Nov 8, 2022

It was my understanding that the whole issue was that load_dataset returned multiple types of objects.

Yes indeed, but we still want to keep a way to load the train/val/test/whatever splits alone ;)

@merveenoyan
Contributor Author

@thomasw21's solution is good, but it would break backwards compatibility. 😅

@lhoestq
Member

lhoestq commented Nov 23, 2022

Started experimenting with merging Dataset and DatasetDict. My plan is to define the splits of a Dataset in Dataset.info.splits (which already exists, but was never used). A Dataset would then be the concatenation of its splits, if they exist.

Not sure yet this is the way to go. My plan is to play with it and share it with you, so we can see if it makes sense from a UX point of view.

@thomasw21
Contributor

thomasw21 commented Nov 25, 2022

So, just to make sure that I understand the current direction: people will have to be extra careful when handling splits, right?
Imagine "potato", a dataset containing train/validation splits:

load_dataset("potato") # returns the concatenation of all the splits

Previously the design would force you to choose a split (it would raise otherwise), or to manually concatenate the splits if you really wanted to play with the concatenation. Now it could potentially run without raising for a while, until you figure out that you've been training on both the train and the validation split.

Would it make sense to use a dataset-specific default instead of the concatenation? Typically, the "potato" dataset's default would be train:

load_dataset("potato") # returns "train" split
load_dataset("potato", split="train") # returns "train" split
load_dataset("potato", split="validation") # returns "validation" split
concatenate_datasets([load_dataset("potato", split="train"), load_dataset("potato", split="validation")]) # returns concatenation

@lhoestq
Member

lhoestq commented Nov 25, 2022

load_dataset("potato") # returns "train" split

To avoid a breaking change we need to be able to do load_dataset("potato")["validation"] as well.

In that case I'd wonder where the validation split comes from, since according to your example the rows of the dataset wouldn't contain it. That's why I'm more in favor of concatenating.

A dataset is one table, which optionally has some split info about its subsets (e.g. for training and evaluation).

This also allows anyone to re-split the dataset the way they want if they're not happy with the default:

ds = load_dataset("potato").train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]

@adrinjalali

Just thinking about this: we could simply add to_dataframe(), as in load_dataset("blah").to_dataframe(), to get the whole dataset, and not change anything else.

@lhoestq
Member

lhoestq commented Nov 28, 2022

I have a first implementation of option 2 (merging Dataset and DatasetDict) in this PR: #5301

Feel free to play with it if you're interested, and let me know what you think. In this PR, a dataset is one table that optionally has some split info about subsets.

@merveenoyan
Contributor Author

merveenoyan commented Nov 29, 2022

@adrinjalali we already have to_pandas AFAIK, which essentially does the same thing (for a Dataset, not for a DatasetDict). I was wondering if it makes sense to have this, as I don't know what portion of people load non-tabular datasets into dataframes. @lhoestq I saw your PR and it will break a lot of things IMO. WDYT of this option?

@lhoestq
Member

lhoestq commented Nov 29, 2022

we already have to_pandas AFAIK, which essentially does the same thing (for a Dataset, not for a DatasetDict)

yes correct :)

I saw your PR and it will break a lot of things IMO

Do you have concrete examples you can share?

WDYT of this option?

The to_dataframe option? I think it's not enough, since you'd still get a DatasetDict({"train": Dataset()}) if you load a dataset with no splits (e.g. a single CSV), and this doesn't really make sense.

Note that in the PR I opened you can do

ds = load_dataset("dataset_with_just_one_csv")  # Dataset type
df = load_dataset("dataset_with_just_one_csv").to_pandas()  # DataFrame type

@merveenoyan
Contributor Author

merveenoyan commented Nov 29, 2022

@lhoestq no, I think @adrinjalali and I meant that when the user calls to_dataframe, if there's only a train split in the DatasetDict, we could load that directly into a dataframe. This might cause some confusion given there's already to_pandas, but I think it's more intuitive and the least breaking change (given that people who use datasets for tabular workflows will eventually call to_pandas anyway).

@lhoestq
Member

lhoestq commented Nov 29, 2022

So in that case it would be fine to still end up with a dataset dict with a "train" split?

@adrinjalali

yeah what I mean is this:

dataset = load_dataset("blah")

# deal with a split of the dataset
train = dataset["train"]
train_df = dataset["train"].to_dataframe()

# deal with the whole dataset
dataset_df = dataset.to_dataframe()

So we'd do two things to improve the tabular experience:

  • allow datasets to have a single split
  • add to_dataframe at the root dict level, so that users can simply call df = load_dataset("blah").to_dataframe() and get a pandas.DataFrame object.

@lhoestq
Member

lhoestq commented Nov 29, 2022

OK! Note that we already have Dataset.to_pandas(), so for consistency I'd call it DatasetDict.to_pandas() as well. Does that sound good to you? This is something we can add pretty easily.
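
A minimal sketch of what DatasetDict.to_pandas() could do, assuming it simply concatenates the per-split frames (a sketch, not the final implementation):

import pandas as pd
from datasets import DatasetDict

def datasetdict_to_pandas(ds: DatasetDict) -> pd.DataFrame:
    # Concatenate each split's dataframe into a single frame.
    return pd.concat((split.to_pandas() for split in ds.values()), ignore_index=True)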

@adrinjalali

yeah that sounds perfect @lhoestq !

@lhoestq
Member

lhoestq commented Dec 1, 2022

So, just to make sure that I understand the current direction: people will have to be extra careful when handling splits, right?

We could raise an error if someone does load_dataset(...)[0] when the dataset is made of several splits, and return the first example if there's one split or none (i.e. when it's not ambiguous). I got this idea from the discussions in #5312. WDYT @thomasw21?
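
Roughly, the indexing rule could look like this (a sketch of the idea only; the helper names are made up and not the #5301 API):

def getitem(ds, key):
    if isinstance(key, str):
        if key in (ds.info.splits or {}):  # ds["train"] -> a split
            return ds._select_split(key)   # made-up helper
        return ds._column(key)             # ds["column_name"] -> a list
    # Integer indexing is only unambiguous with zero or one split.
    if ds.info.splits and len(ds.info.splits) > 1:
        raise ValueError("Dataset has multiple splits; pick one first.")
    return ds._row(key)                    # made-up helper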

@thomasw21
Contributor

thomasw21 commented Dec 2, 2022

We could raise an error if someone does load_dataset(...)[0] when the dataset is made of several splits,

But then how is that different from having the distinction between DatasetDict and Dataset? Is it just "default behavior: when there are no splits or a single split, return the split directly, since there's no ambiguity"?

Also, I was wondering whether the concatenation could have heavy impacts when running mapping/filtering functions in batches. Typically, could batches somehow be mixed across splits?

@lhoestq
Member

lhoestq commented Dec 2, 2022

But then how is that different from having the distinction between DatasetDict and Dataset?

Because it doesn't make sense to be able to do example = ds[0] or examples = list(ds) on a class named DatasetDict of type Dict[str, Dataset].

Also I was wondering how the concatenation could have heavy impacts when running mapping functions/filtering in batch? Typically can batch be somehow mixed?

No, we run each function on each split separately

@thomasw21
Contributor

thomasw21 commented Dec 2, 2022

Because it doesn't make sense to be able to do example = ds[0] or examples = list(ds) on a class named DatasetDict of type Dict[str, Dataset].

Hmm, but you're still going to raise an exception in both those cases with your current change, no? (actually, list(ds) would return the names of the splits, no?)

No, we run each function on each split separately

Nice!

@lhoestq
Member

lhoestq commented Dec 2, 2022

Hmm, but you're still going to raise an exception in both those cases with your current change, no?

only if there are multiple splits - because then you need to pick one

(actually, list(ds) would return the names of the splits, no?)

The goal is to be able to iterate over a dataset without having to specify ["train"] when it doesn't make sense.
So list(ds) would return the list of examples.

@thomasw21
Contributor

So what if a dataset has both train and validation splits: what does list(ds) return? Does it raise? Does it give you the list of splits? If it's the latter, it really just looks like a dict ...

@lhoestq
Member

lhoestq commented Dec 6, 2022

So what if a dataset has both train and validation splits: what does list(ds) return? Does it raise? Does it give you the list of splits? If it's the latter, it really just looks like a dict ...

It would raise and ask you to pick a split, and once you pick a split it would return the list of examples.
And if there's only one split, it would iterate over the examples directly.


Btw, from the discussion in #5301, we may put the Dataset/DatasetDict merge on standby, since we found a simple solution for tabular datasets using to_pandas (PR here).
