Reduce friction in tabular dataset workflows by eliminating splits when a dataset is loaded #5189



Open
merveenoyan opened this issue Nov 2, 2022 · 33 comments
Labels
enhancement (New feature or request)

Comments

@merveenoyan
Contributor

Feature request

Sorry for the cryptic name, but I'd like to explain using the code itself. When I want to load a specific dataset from a repository (for instance, this one: https://huggingface.co/datasets/inria-soda/tabular-benchmark):

from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
print(next(iter(dataset["train"])))

The datasets library is essentially designed for people who'd like to use benchmark datasets of various modalities to fine-tune their models, and these benchmark datasets usually come with pre-defined train and test splits. For tabular workflows, however, relying on a fixed validation split usually ends up with the model overfitting to it, so users typically create their own splits and use validation techniques like stratified k-fold cross-validation (StratifiedKFold), or GridSearchCV when tuning hyperparameters (see the sketch below the example). Even in this paper a benchmark is introduced, but the splitting is done by the authors themselves.
It's a bit confusing for the average tabular user to load a dataset and see "train", so it would be nice if we did not load datasets into a split called train by default.

from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
-print(next(iter(dataset["train"])))
+print(next(iter(dataset)))
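
For context, here is a minimal sketch of the cross-validation workflow described above (assuming scikit-learn; the "target" column name is illustrative, and categorical columns may need encoding first):

from datasets import load_dataset
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Today you still have to go through the artificial "train" split.
df = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"])["train"].to_pandas()
X, y = df.drop(columns=["target"]), df["target"]  # illustrative target column name
# Create your own folds instead of relying on a fixed validation split.
scores = cross_val_score(HistGradientBoostingRegressor(), X, y, cv=5)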

Motivation

I explained it above 😅

Your contribution

I think this is quite a big change that seems small (e.g. how do we determine which datasets should not be loaded into a train split?), so it's best if we discuss it first!

@merveenoyan added the enhancement (New feature or request) label on Nov 2, 2022
@mariosasko
Collaborator

I have to admit I'm not a fan of this idea, as it would result in inconsistent behavior between tabular and non-tabular datasets, which is confusing without the context you provided. Instead, we could consider returning a Dataset object rather than a DatasetDict if there is only one split in the generated dataset. But then again, I think this lib is a bit too old to make such changes. @lhoestq @albertvillanova WDYT?

@lhoestq
Member

lhoestq commented Nov 4, 2022

We can brainstorm here to see how we could make it happen, and then, depending on the options, see if it's a change we can do.

I'll start with a first pass at the reasoning:

Currently, not passing split= in load_dataset means "return a dict with each split".

Now, what would happen if a dataset has no splits? Ideally it should return a single Dataset, and passing split= would make no sense. So depending on the dataset content, not passing split= should return either a dict or a Dataset. In particular, these two cases should work:

# case 1: dataset without split
ds = load_dataset("dataset_without_split")
ds[0], ds["column_name"], list(ds) # we want this

# case 2: dataset with splits
ds = load_dataset("dataset_with_splits")
ds["train"] # this works and can't be changed
ds = load_dataset("dataset_with_splits", split="train")
ds[0], ds["column_name"], list(ds) # this works and can't be changed

I can see several ideas:

  1. allow load_dataset to return a different object based on the dataset content - either a Dataset or a DatasetDict
    • we can update get_dataset_split_names to return None or a list so users can know in advance what object will be returned; they can also use isinstance a posteriori (see the sketch after this list)
    • but in this case we expect users to be careful when loading datasets and to always take extra steps to check whether they got a Dataset or a DatasetDict
  2. merge the Dataset and DatasetDict objects
    • they already share many functions: map, filter, push_to_hub, etc.
    • we can define ds[0] to be the first item of the first split, and consider that the user accesses rows from the full table of all the splits concatenated
    • however, there is a collision when doing ds["column_name"] or ds["train"] that we need to address: the first returns a list, while the other returns a Dataset.
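
Here is what the isinstance check from idea 1 could look like in user code (sketch; the repo id is a placeholder):

from datasets import load_dataset, DatasetDict

ds = load_dataset("some_dataset")  # placeholder repo id
if isinstance(ds, DatasetDict):
    train_ds = ds["train"]  # the dataset has named splits
else:
    train_ds = ds  # a single unsplit Dataset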

What are your opinions on these two ideas? Do you have other ideas in mind?

@mariosasko
Collaborator

I like the first idea more (concatenating splits doesn't seem that useful, no?). This is a significant breaking change, so I think we should run a poll (or something similar) to gather more info on the actual "expected behavior", and wait for Datasets 3.0 if we decide to implement it.

PS: @thomwolf also suggested the same thing a while ago (#743 (comment)).

@thomwolf
Member

thomwolf commented Nov 4, 2022

I think it's an interesting improvement to the user experience for a case that comes up often (no splits), so I would definitely support it.

Indeed, I would be more in favor of option 2, carefully handling the possible collisions, rather than having load_dataset return various types of objects.

@severo
Collaborator

severo commented Nov 4, 2022

Related: if a dataset only has one split, we don't show the split selector in the dataset viewer on the Hub; e.g. compare https://huggingface.co/datasets/hf-internal-testing/fixtures_image_utils/viewer/image/test with https://huggingface.co/datasets/glue/viewer/mnli/test.

See https://github.com/huggingface/moon-landing/pull/3858 for more details (internal)

@merveenoyan
Contributor Author

I feel like the second idea is a bit overkill.
@severo I would say it's a bit irrelevant to the problem we have, but it's a separate problem that @polinaeterna is solving at the moment. 😅 (also discussed on Slack)

@severo
Collaborator

severo commented Nov 7, 2022

OK, sorry for polluting the thread. The relation I saw with the dataset viewer is that, from a UX point of view, we hide the concepts of split and configuration whenever possible; this issue feels like doing the same in the datasets library.

@adrinjalali

I would agree that returning different types based on the content of the dataset might be confusing.

We could do something similar to what the fetch_* or load_* functions from sklearn.datasets do, which is to have an argument that changes the type of the returned object. For instance, load_iris returns a dict-like object, but load_iris(..., return_X_y=True) returns a tuple.

Here we could have a similar argument, such as return_X, which would then only return a single Dataset or an array.
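
For reference, the scikit-learn pattern mentioned above looks like this (return_X is a hypothetical datasets-side analog, not an existing parameter):

from sklearn.datasets import load_iris

bunch = load_iris()                # dict-like Bunch by default
X, y = load_iris(return_X_y=True)  # the flag switches the return type to a tuple

# hypothetical analog for datasets:
# ds = load_dataset("some_dataset", return_X=True)  # -> a single Dataset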

@lhoestq
Member

lhoestq commented Nov 7, 2022

I feel like the second idea is a bit overkill.

Overkill in what sense?

Here we can have a similar arg such as return_X which would then only return a single DataSet or an array.

Right now one can already pass split="all" to get one Dataset object with all the data in it (unsplit). We could also have something like return_all=True to make the API clearer.

Indeed, I would be more in favor of option 2, carefully handling the possible collisions, rather than having load_dataset return various types of objects.

I think it would be OK to handle the collision by allowing both ds["train"] and ds["column_name"] (and maybe adding something like ds.splits for those who want to iterate over the splits or add new ones).
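
For example (split="all" works today, as noted above; the accessors below it are the hypothetical part):

from datasets import load_dataset

ds = load_dataset("dataset_with_splits", split="all")  # one Dataset with all the splits concatenated

# hypothetical accessors under option 2:
# ds["train"]        -> the train split (a Dataset)
# ds["column_name"]  -> a column (a list)
# ds.splits          -> the split names, to iterate over the splits or add new ones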

@thomasw21
Contributor

Would it make sense to remove the notion of "split" from load_dataset? I feel a lot of it comes from wanting some sort of grouping of more or less similar datasets. "train"/"test"/"validation" are the traditional ones, but some datasets have many more splits.

Would it make sense to force load_dataset to only load a single Dataset object, and fail if it doesn't point to one? And to have another method, something like load_dataset_group_info, that can return a fairly arbitrary info class (Dict, List, whatever), whose individual entries you'd then pass to load_dataset to run anything? Typically, I don't think DatasetDict.map is really that helpful, but that's my personal opinion. This would help make things more readable (typically, knowing whether an object is a Dataset or a DatasetDict).

@lhoestq
Member

lhoestq commented Nov 8, 2022

Would it make sense to remove the notion of "split" in load_dataset?

I think we need to keep it - though in practice people can name the splits whatever they want anyway.

Would it make sense to force load_dataset to only load a single Dataset object, and fail if it doesn't point to one.

Ideally we need to keep backward compatibility - in particular the load_dataset + ds["train"] one

@thomasw21
Contributor

I think we need to keep it - though in practice people can name the splits whatever they want anyway.

It was my understanding that the whole issue was that load_dataset returned multiple types of objects.

Ideally we need to keep backward compatibility - in particular the load_dataset + ds["train"] one

Yeah, sorry, I meant ideally. One could always develop a load_dataset_v2, deprecate the first one, and remove it in the longer term.

@lhoestq
Member

lhoestq commented Nov 8, 2022

It was my understanding that the whole issue was that load_dataset returned multiple types of objects.

Yes indeed, but we still want to keep a way to load the train/val/test/whatever splits alone ;)

@merveenoyan
Contributor Author

@thomasw21's solution is good, but it would break backwards compatibility. 😅

@lhoestq
Member

lhoestq commented Nov 23, 2022

Started experimenting with merging Dataset and DatasetDict. My plan is to define the splits of a Dataset in Dataset.info.splits (which already exists, but was never used). A Dataset would then be the concatenation of its splits, if they exist.

Not sure yet this is the way to go. My plan is to play with it and share it with you, so we can see if it makes sense from a UX point of view.

@thomasw21
Contributor

thomasw21 commented Nov 25, 2022

So, just to make sure that I understand the current direction: people will have to be extra careful when handling splits, right?
Imagine "potato", a dataset containing train/validation splits:

load_dataset("potato") # returns the concatenation of all the splits

Previously the design would force you to choose a split (it would raise otherwise), or to manually concatenate the splits if you really wanted to play with the concatenation. Now it could potentially run without raising for a while, until you figure out that you've been training on both the train and the validation split.

Would it make sense to use a dataset-specific default instead of the concatenation? Typically, the "potato" dataset's default would be train:

load_dataset("potato") # returns "train" split
load_dataset("potato", split="train") # returns "train" split
load_dataset("potato", split="validation") # returns "validation" split
concatenate_datasets([load_dataset("potato", split="train"), load_dataset("potato", split="validation")]) # returns concatenation

@lhoestq
Member

lhoestq commented Nov 25, 2022

load_dataset("potato") # returns "train" split

To avoid a breaking change we need to be able to do load_dataset("potato")["validation"] as well.

In that case I'd wonder where the validation split comes from, since according to your example the rows of the dataset wouldn't contain it. That's why I'm more in favor of concatenating.

A dataset is one table, which optionally has some split info about its subsets (e.g. for training and evaluation).

This also allows anyone to re-split the dataset the way they want if they're not happy with the default:

ds = load_dataset("potato").train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]

@adrinjalali

Just thinking about this: we could simply add to_dataframe(), as in load_dataset("blah").to_dataframe(), to get the whole dataset, and not change anything else.

@lhoestq
Member

lhoestq commented Nov 28, 2022

I have a first implementation of option 2 (merging Dataset and DatasetDict) in this PR: #5301

Feel free to play with it if you're interested, and let me know what you think. In this PR, a dataset is one table that optionally has some split info about subsets.

@merveenoyan
Contributor Author

merveenoyan commented Nov 29, 2022

@adrinjalali we already have to_pandas AFAIK, which essentially does the same thing (for a Dataset, not for a DatasetDict). I was wondering if it makes sense to have this, as I don't know what portion of people load non-tabular datasets into dataframes. @lhoestq I saw your PR and it will break a lot of things IMO. WDYT of this option?

@lhoestq
Member

lhoestq commented Nov 29, 2022

we already have to_pandas AFAIK, which essentially does the same thing (for a Dataset, not for a DatasetDict)

yes correct :)

I saw your PR and it will break a lot of things IMO

Do you have concrete examples you can share?

WDYT of this option?

The to_dataframe option? I think it's not enough, since you'd still get a DatasetDict({"train": Dataset()}) if you load a dataset with no splits (e.g. a single CSV), and this doesn't really make sense.

Note that in the PR I opened you can do

ds = load_dataset("dataset_with_just_one_csv")  # Dataset type
df = load_dataset("dataset_with_just_one_csv").to_pandas()  # DataFrame type

@merveenoyan
Contributor Author

merveenoyan commented Nov 29, 2022

@lhoestq no, I think @adrinjalali and I meant that when the user calls to_dataframe, if there's only a train split in the DatasetDict, we could load that directly into a dataframe. This might cause some confusion given there's already to_pandas, but I think it's more intuitive and the least breaking change (given that people who use datasets for tabular workflows will eventually call to_pandas anyway).

@lhoestq
Member

lhoestq commented Nov 29, 2022

So in that case it would be fine to still end up with a dataset dict with a "train" split?

@adrinjalali

yeah what I mean is this:

dataset = load_dataset("blah")

# deal with a split of the dataset
train = dataset["train"]
train_df = dataset["train"].to_dataframe()

# deal with the whole dataset
dataset_df = dataset.to_dataframe()

So we'd do two things to improve the tabular experience:

  • allow datasets to have a single split
  • add to_dataframe at the root dict level, so that users can simply call df = load_dataset("blah").to_dataframe() and get a pandas.DataFrame object.

@lhoestq
Member

lhoestq commented Nov 29, 2022

OK! Note that we already have Dataset.to_pandas(), so for consistency I'd call it DatasetDict.to_pandas() as well. Does that sound good to you? This is something we can add pretty easily.
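
A minimal sketch of what DatasetDict.to_pandas() could do, assuming it simply concatenates the per-split frames (a sketch, not the final implementation):

import pandas as pd
from datasets import DatasetDict

def datasetdict_to_pandas(ds: DatasetDict) -> pd.DataFrame:
    # Concatenate each split's dataframe into a single frame.
    return pd.concat((split.to_pandas() for split in ds.values()), ignore_index=True)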

@adrinjalali

yeah that sounds perfect @lhoestq !

@lhoestq
Member

lhoestq commented Dec 1, 2022

So, just to make sure that I understand the current direction: people will have to be extra careful when handling splits, right?

We could raise an error if someone does load_dataset(...)[0] when the dataset is made of several splits, and return the first example if there's one split or none (i.e. when it's not ambiguous). I got this idea from the discussions in #5312. WDYT @thomasw21?
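
Roughly, the indexing rule could look like this (a sketch of the idea only; the helper names are made up and not the #5301 API):

def getitem(ds, key):
    if isinstance(key, str):
        if key in (ds.info.splits or {}):  # ds["train"] -> a split
            return ds._select_split(key)   # made-up helper
        return ds._column(key)             # ds["column_name"] -> a list
    # Integer indexing is only unambiguous with zero or one split.
    if ds.info.splits and len(ds.info.splits) > 1:
        raise ValueError("Dataset has multiple splits; pick one first.")
    return ds._row(key)                    # made-up helper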

@thomasw21
Contributor

thomasw21 commented Dec 2, 2022

We could raise an error if someone does load_dataset(...)[0] when the dataset is made of several splits,

But then how is that different from having the distinction between DatasetDict and Dataset? Is it just "default behavior: when there are no splits or a single split, return the split directly, since there's no ambiguity"?

Also, I was wondering whether the concatenation could have heavy impacts when running mapping/filtering functions in batches. Typically, could batches somehow be mixed across splits?

@lhoestq
Member

lhoestq commented Dec 2, 2022

But then how is that different from having the distinction between DatasetDict and Dataset?

Because it doesn't make sense to be able to do example = ds[0] or examples = list(ds) on a class named DatasetDict of type Dict[str, Dataset].

Also I was wondering how the concatenation could have heavy impacts when running mapping functions/filtering in batch? Typically can batch be somehow mixed?

No, we run each function on each split separately

@thomasw21
Contributor

thomasw21 commented Dec 2, 2022

Because it doesn't make sense to be able to do example = ds[0] or examples = list(ds) on a class named DatasetDict of type Dict[str, Dataset].

Hmm, but you're still going to raise an exception in both those cases with your current change, no? (actually, list(ds) would return the names of the splits, no?)

No, we run each function on each split separately

Nice!

@lhoestq
Member

lhoestq commented Dec 2, 2022

Hmm, but you're still going to raise an exception in both those cases with your current change, no?

only if there are multiple splits - because then you need to pick one

(actually, list(ds) would return the names of the splits, no?)

The goal is to be able to iterate over a dataset without having to specify ["train"] when it doesn't make sense.
So list(ds) would return the list of examples.

@thomasw21
Contributor

So what if a dataset has both train and validation splits: what does list(ds) return? Does it raise? Does it give you the list of splits? If it's the latter, it really just looks like a dict ...

@lhoestq
Member

lhoestq commented Dec 6, 2022

So what if a dataset has both train and validation splits: what does list(ds) return? Does it raise? Does it give you the list of splits? If it's the latter, it really just looks like a dict ...

It would raise and ask you to pick a split, and once you pick a split it would return the list of examples.
And if there's only one split, it would iterate over the examples directly.


Btw, from the discussion in #5301, we may put the Dataset/DatasetDict merge on standby, since we found a simple solution for tabular datasets using to_pandas (PR here).
