Language tags #771

Merged 9 commits on Jul 8, 2022
stephenbach (Member)

This PR adds a language tagging feature, so that users can annotate prompts with the language(s) used in the prompt.

Even though this is motivated by the eval hackathon, this PR targets main because it affects all prompts. All existing prompts in main are tagged with English. After merging into main, another PR should merge main into eval hackathon. This will require somewhat careful coordination because new prompts on that branch will need to have their metadata updated in the .yaml before they will work in the UI.
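A minimal sketch of the metadata update described above, assuming each prompt's metadata is available as a plain dictionary and that the tags live under a `languages` key. The key name, the `"en"` code, and the helper `ensure_language_tags` are assumptions for illustration, not the code in this PR:

```python
from typing import Any, Dict, List, Optional


def ensure_language_tags(
    metadata: Dict[str, Any], default: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Return a copy of `metadata` that is guaranteed to carry a `languages` list.

    Hypothetical helper: metadata written before the feature existed gets the
    default tag; metadata that already declares languages is left unchanged.
    """
    default = ["en"] if default is None else default
    updated = dict(metadata)  # shallow copy so the caller's dict is not mutated
    if "languages" not in updated:
        updated["languages"] = list(default)
    return updated


old_prompt = {"original_task": True, "choices_in_prompt": False}
tagged_prompt = {"original_task": True, "languages": ["ar"]}

print(ensure_language_tags(old_prompt)["languages"])     # ['en']
print(ensure_language_tags(tagged_prompt)["languages"])  # ['ar']
```

On a branch with non-English prompts, a blanket English default would be wrong, which is why the default is a parameter here and why already-tagged prompts are never overwritten.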

Regarding the tags themselves, the eval group requested using the subtags in this list. I took the liberty of changing how the tags are displayed in the UI by appending the English names in parentheses, but this can be changed.
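For illustration, the display convention could be sketched as below. The mapping `ENGLISH_NAMES` and the function `format_language_tag` are hypothetical names, and the three entries are only a tiny sample, not the list the eval group requested:

```python
from typing import Dict

# Tiny illustrative subset; a real list of tags would be far longer.
ENGLISH_NAMES: Dict[str, str] = {
    "en": "English",
    "fr": "French",
    "ar": "Arabic",
}


def format_language_tag(tag: str, names: Dict[str, str] = ENGLISH_NAMES) -> str:
    """Return 'tag (English name)', or the bare tag when no name is known."""
    name = names.get(tag)
    return f"{tag} ({name})" if name else tag


print([format_language_tag(t) for t in ["en", "ar", "xx"]])
# ['en (English)', 'ar (Arabic)', 'xx']
```

Keeping the stored value as the bare tag and adding the name only at display time means the label format can change without touching any prompt metadata.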

[Screenshot: Screen Shot 2022-05-12 at 1 48 37 PM]

stephenbach requested review from awebson and VictorSanh on May 12, 2022 at 17:49
awebson (Contributor) left a comment

Looks great to me! Thanks so much Steve!

I think we agreed to emphasize, either in the UI or in the contribution guide, that the language tag describes the language(s) of each prompt, not the languages of the dataset examples (which should already be documented by the datasets themselves)?

VictorSanh (Member) left a comment

Looks good to me, thank you for adding this @stephenbach!

Regarding the language codes, the HF ecosystem (including datasets) uses ISO 639 for language codes (see https://huggingface.co/languages). Could we use the same here?
That would make it possible (or at least much smoother) to run analyses across the languages of prompts and datasets.
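As a rough sketch of the kind of analysis shared codes would allow: if prompts and datasets use the same codes, comparing them reduces to set operations. The function names and the sample data below are made up for illustration:

```python
from collections import Counter
from typing import Iterable, List, Set


def count_prompts_by_language(prompt_tags: Iterable[List[str]]) -> Counter:
    """Count how many prompts are tagged with each language code."""
    counts: Counter = Counter()
    for tags in prompt_tags:
        counts.update(set(tags))  # a prompt counts once per language
    return counts


def languages_missing_prompts(
    dataset_languages: Set[str], prompt_tags: Iterable[List[str]]
) -> Set[str]:
    """Return dataset languages for which no prompt carries a matching tag."""
    covered = set(count_prompts_by_language(prompt_tags))
    return dataset_languages - covered


# Hypothetical data: language tags of three prompts and one dataset.
prompts = [["en"], ["en", "fr"], ["ar"]]
dataset = {"en", "ar", "de"}

print(count_prompts_by_language(prompts)["en"])             # 2
print(sorted(languages_missing_prompts(dataset, prompts)))  # ['de']
```

The set difference is only meaningful when both sides use the same code scheme; with mixed schemes, a mapping step would be needed first.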

stephenbach merged commit 0cc4b0c into main on Jul 8, 2022
stephenbach deleted the language_tags branch on July 8, 2022 at 21:29
stephenbach added a commit that referenced this pull request on Jul 12, 2022
* Accelerate `get_infos` by caching the `DatasetInfosDict`s (#778)

* accelerate `get_infos` by caching the `DatasetInfosDict`s

* quality

* consistency

* fix `filter_english_datasets` since `languages` became `language` in dataset metadata

* fix empty documents - multi_news (#793)

* fix empty documents - multi_news

* fix test - unrecognized variable

* Language tags (#771)

* Added languages widget to UI.

* Style fixes.

* Added English tag to existing datasets.

* Add languages to viewer mode.

* Update language codes.

* Update CONTRIBUTING.md.

* Update screenshot.

* Add "Prompt" to UI to clarify languages tag usage.

* Add blank languages list.

Co-authored-by: Victor SANH
stephenbach added a commit that referenced this pull request on Oct 26, 2022
* remove language restrictions

* add arabic dataset to primary_task

* Accelerate `get_infos` by caching the `DatasetInfosDict`s (#778)

* accelerate `get_infos` by caching the `DatasetInfosDict`s

* quality

* consistency

* add arabic prompts

* cleaning

* Consistency in prompt naming.

* cleaning

* fix `filter_english_datasets` since `languages` became `language` in dataset metadata

* fix empty documents - multi_news (#793)

* fix empty documents - multi_news

* fix test - unrecognized variable

* Language tags (#771)

* Added languages widget to UI.

* Style fixes.

* Added English tag to existing datasets.

* Add languages to viewer mode.

* Update language codes.

* Update CONTRIBUTING.md.

* Update screenshot.

* Add "Prompt" to UI to clarify languages tag usage.

* update

* update prompts

* Remove duplicate lines

* update

* regenerate prompts

* cleaning

* lang tag missing

Co-authored-by: Victor SANH
Co-authored-by: Stephen Bach