Reduce friction in tabular dataset workflow by eliminating having splits when dataset is loaded #5189
Comments
I have to admit I'm not a fan of this idea, as it would result in inconsistent behavior between tabular and non-tabular datasets, which is confusing if done without the context you provided. Instead, we could consider returning a … |
We can brainstorm here to see how we could make it happen, and then depending on the options we can see if it's a change we can do. I'm starting with a first line of reasoning: currently, not passing `split` returns a `DatasetDict`, while passing `split` returns a `Dataset`. Now what would happen if a dataset has no split? Ideally it should return one `Dataset`.

```python
# case 1: dataset without split
ds = load_dataset("dataset_without_split")
ds[0], ds["column_name"], list(ds)  # we want this

# case 2: dataset with splits
ds = load_dataset("dataset_with_splits")
ds["train"]  # this works and can't be changed
ds = load_dataset("dataset_with_splits", split="train")
ds[0], ds["column_name"], list(ds)  # this works and can't be changed
```

I can see several ideas:

1. make `load_dataset` return a `Dataset` instead of a `DatasetDict` when the dataset has no splits
2. merge `Dataset` and `DatasetDict` into a single class

What are your opinions on those two ideas? Do you have other ideas in mind? |
I like the first idea more (concatenating splits doesn't seem useful, no?). This is a significant breaking change, so I think we should do a poll (or something similar) to gather more info on the actual "expected behavior" and wait for Datasets 3.0 if we decide to implement it. PS: @thomwolf also suggested the same thing a while ago (#743 (comment)). |
I think it's an interesting improvement to the user experience for a case that comes up often (no split), so I would definitely support it. I would be more in favor of option 2 rather than returning various types of objects from `load_dataset` and carefully handling the possible collisions, indeed. |
Related: if a dataset only has one split, we don't show the splits select control in the dataset viewer on the Hub, eg. compare https://huggingface.co/datasets/hf-internal-testing/fixtures_image_utils/viewer/image/test with https://huggingface.co/datasets/glue/viewer/mnli/test. See https://github.com/huggingface/moon-landing/pull/3858 for more details (internal) |
I feel like the second idea is a bit more overkill. |
OK, sorry for polluting the thread. The relation I saw with the dataset viewer is that from a UX point of view, we hide the concepts of split and configuration whenever possible -> this issue feels like doing the same in the datasets library. |
I would agree that returning different types based on the content of the dataset might be confusing. We can do something similar to what … does. Here we can have a similar arg such as … |
Overkill in what sense?
Right now one can already pass …
I think it would be ok to handle the collision by allowing both … |
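To make the collision question concrete, here is a minimal sketch (hypothetical code, not the actual `datasets` implementation) of how a single merged object could resolve a string key as either a split name or a column name, raising only when the key is genuinely ambiguous:

```python
# Hypothetical sketch, not the real `datasets` API: one way a merged
# Dataset/DatasetDict object could resolve string keys, raising only
# when a split name collides with a column name.
class MergedDataset:
    def __init__(self, splits):
        # splits: mapping of split name -> list of row dicts
        self._splits = splits
        self._columns = {name for rows in splits.values() for row in rows for name in row}

    def __getitem__(self, key):
        if key in self._splits and key in self._columns:
            raise KeyError(f"{key!r} is both a split and a column, please disambiguate")
        if key in self._splits:
            return self._splits[key]
        if key in self._columns:
            # concatenate the column across all splits
            return [row[key] for rows in self._splits.values() for row in rows]
        raise KeyError(key)

ds = MergedDataset({"train": [{"text": "a"}, {"text": "b"}], "test": [{"text": "c"}]})
```

Only the truly ambiguous case (a split and a column sharing a name) would need an explicit error or an escape hatch.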
Would it make sense to remove the notion of "split" in …? Would it make sense to force …? |
I think we need to keep it, though in practice people can name the splits whatever they want anyway.
We need to keep backward compatibility ideally, in particular the `load_dataset` + `ds["train"]` one. |
It was my understanding that the whole issue was that … |
Yeah sorry, I meant ideally. One can always start developing … |
Yes indeed, but we still want to keep a way to load the train/val/test/whatever splits alone ;) |
@thomasw21's solution is good but it will break backwards compatibility. 😅 |
Started to experiment with merging Dataset and DatasetDict. My plan is to define the splits of a Dataset in `Dataset.info.splits` (which already exists, but is never used). A Dataset would then be the concatenation of its splits, if they exist. Not sure yet this is the way to go. My plan is to play with it and share it with you, so we can see if it makes sense from a UX point of view. |
So just to make sure that I understand the current direction: people will have to be extra careful when handling splits, right?
Previously the design would force you to choose a split (it would raise otherwise), or to manually concat the splits if you really wanted to play with concatenated splits. Now it would potentially run without raising for a while, until you figure out that you've been training on both the train and validation splits. Would it make sense to use a dataset-specific default instead of the concatenation? Typically, the "potato" dataset's default would be train.
|
To avoid a breaking change we need to be able to do … In that case I'd wonder where the validation split comes from, since the rows of the dataset wouldn't contain the validation split according to your example. That's why I'm more in favor of concatenating. A dataset is one table that optionally has some split info about subsets (e.g. for training and evaluation). This also allows anyone to re-split the dataset the way they want if they're not happy with the default:

```python
ds = load_dataset("potato").train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]
```
|
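`Dataset.train_test_split` is a real method in `datasets`; as a toy illustration of the underlying idea (the helper below is hypothetical, not the library's implementation), a re-split is just a seeded shuffle plus a slice:

```python
import random

def toy_train_test_split(rows, test_size=0.2, seed=42):
    # toy illustration of the idea behind train_test_split:
    # shuffle the indices, then slice off a test fraction
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(rows) * test_size)
    test_idx = set(indices[:n_test])
    return {
        "train": [rows[i] for i in range(len(rows)) if i not in test_idx],
        "test": [rows[i] for i in sorted(test_idx)],
    }

splits = toy_train_test_split(list(range(10)), test_size=0.2)
```

The point is that once a dataset is "one table", users who dislike the default splits can always derive their own.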
Just thinking about this, we could just have … |
I have a first implementation of option 2 (merging Dataset and DatasetDict) in this PR: #5301 Feel free to play with it if you're interested, and let me know what you think. In this PR, a dataset is one table that optionally has some split info about subsets. |
@adrinjalali we already have `to_pandas`, which AFAIK essentially does the same thing (for a dataset, not for a dataset dict). I was wondering if it makes sense to have this, as I don't know the portion of people who load non-tabular datasets into dataframes. @lhoestq I saw your PR and it will break a lot of things imo. WDYT of this option? |
yes correct :)

Do you have concrete examples you can share?

The `to_dataframe` option? I think it's not enough, since you'd still get a … Note that in the PR I opened you can do:

```python
ds = load_dataset("dataset_with_just_one_csv")  # Dataset type
df = load_dataset("dataset_with_just_one_csv").to_pandas()  # DataFrame type
```
|
@lhoestq no, I think @adrinjalali and I meant when the user calls … |
So in that case it would be fine to still end up with a dataset dict with a "train" split? |
yeah what I mean is this:

```python
dataset = load_dataset("blah")

# deal with a split of the dataset
train = dataset["train"]
train_df = dataset["train"].to_dataframe()

# deal with the whole dataset
dataset_df = dataset.to_dataframe()
```

So we do two things to improve the tabular experience: …
|
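Whatever name such a method ends up with (`to_dataframe` here is the proposal from the comment above, not an existing `datasets` API), the conversion conceptually pivots row-oriented records into columns; a minimal stdlib sketch:

```python
def rows_to_columns(rows):
    # pivot a list of row dicts into a column-oriented mapping,
    # which is essentially the shape a DataFrame is built from
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

cols = rows_to_columns([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])
```

A column-oriented mapping like this is what `pandas.DataFrame` accepts directly as its constructor argument.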
Ok! Note that we already have … |
yeah that sounds perfect @lhoestq ! |
We can raise an error if someone does … |
But then how is that different from keeping the distinction between DatasetDict and Dataset? Is it just that the default behaviour, when there are no splits or a single split, is to return the split directly since there's no ambiguity? Also I was wondering how the concatenation could have heavy impacts when running mapping functions / filtering in batches. Typically, can batches somehow be mixed? |
Because it doesn't make sense to be able to do …
No, we run each function on each split separately. |
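A small sketch of what running each function on each split separately means in practice (hypothetical code, not the `datasets` internals): the function is applied per split, so batches never mix rows from different splits even if the dataset presents itself as one concatenated table.

```python
def map_per_split(dataset_dict, fn):
    # apply fn to every example of each split independently;
    # rows from different splits are never batched together
    return {name: [fn(example) for example in rows]
            for name, rows in dataset_dict.items()}

mapped = map_per_split({"train": [1, 2], "validation": [3]}, lambda x: x * 10)
```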
Hum, but you're still going to raise an exception in both those cases with your current change, no? (actually `list(ds)` would return the names of the splits, no?)
Nice! |
only if there are multiple splits - because you need to pick one
The goal is to be able to iterate on a dataset without having to specify "[train]" when it doesn't make sense. |
So what if a dataset has both … |
It would raise and ask you to pick a split, and when you pick a split it returns the list of examples. Btw, from the discussion in #5301 we may put the Dataset/DatasetDict merge on standby, since we found a simple solution for tabular datasets using … |
Feature request

Sorry for the cryptic name, but I'd like to explain using code itself. When I want to load a specific dataset from a repository (for instance, this: https://huggingface.co/datasets/inria-soda/tabular-benchmark) …

The `datasets` library is essentially designed for people who'd like to use benchmark datasets on various modalities to fine-tune their models, and these benchmark datasets usually have pre-defined train and test splits. However, for tabular workflows, having train and test splits usually ends up with the model overfitting to the validation split, so the users would like to use validation techniques like `StratifiedKFoldCrossValidation`, or when they tune hyperparameters they do `GridSearchCrossValidation`, so often the behavior is to create their own splits. Even in this paper a benchmark is introduced, but the split is done by the authors. It's a bit confusing for the average tabular user to try and load a dataset and see `"train"`, so it would be nice if we would not load the dataset into a split called `train` by default.

Motivation

I explained it above 😅

Your contribution

I think this is quite a big change that seems small (e.g. how to determine the datasets that will not be loaded into a `train` split?), so it's best if we discuss first!