Language tags #771
Looks great to me! Thanks so much Steve!
I think we agreed to emphasize somewhere either in the UI or in the contribution guide that the language tag should be about the languages of each prompt, not the languages of the dataset examples (which should be already documented by the datasets themselves)?
Looks good to me, thank you for adding this @stephenbach!
Regarding the language codes, the HF ecosystem (including datasets) is using ISO 639 for language codes (see https://huggingface.co/languages). Could we use the same here?
It will make running analysis on languages of prompts and datasets actually possible (or at least it will be smoother).
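As a rough illustration of the kind of analysis shared codes would allow, here is a minimal sketch that compares the language tags on a prompt with those declared by its dataset. The function name and the sample tag lists are hypothetical and not taken from this repository; the only assumption is that both sides use the same code scheme.

```python
def languages_missing_from_dataset(prompt_languages, dataset_languages):
    """Return prompt language codes that the dataset does not declare.

    Both arguments are iterables of language codes. This only works
    if prompts and datasets share one code scheme (an assumption here).
    """
    return sorted(set(prompt_languages) - set(dataset_languages))


# Hypothetical tags, for illustration only.
prompt_tags = ["en", "ar"]
dataset_tags = ["en"]
print(languages_missing_from_dataset(prompt_tags, dataset_tags))
```

If the two sides used different code schemes, a set difference like this would report spurious mismatches, which is the friction the comment above is pointing at.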
* Accelerate `get_infos` by caching the `DataseInfoDict`s (#778)
* accelerate `get_infos` by caching the `DataseInfoDict`s
* quality
* consistency
* fix `filter_english_datasets` since `languages` became `language` in dataset metadatas
* fix empty documents - multi_news (#793)
* fix empty documents - multi_news
* fix test - unrecognized variable
* Language tags (#771)
* Added languages widget to UI.
* Style fixes.
* Added English tag to existing datasets.
* Add languages to viewer mode.
* Update language codes.
* Update CONTRIBUTING.md.
* Update screenshot.
* Add "Prompt" to UI to clarify languages tag usage.
* Add blank languages list.

Co-authored-by: Victor SANH <[email protected]>
* remove language restrictions
* add arabic dataset to primary_task
* Accelerate `get_infos` by caching the `DataseInfoDict`s (#778)
* accelerate `get_infos` by caching the `DataseInfoDict`s
* quality
* consistency
* add arabic prompts
* cleaning
* Consistency in prompt naming.
* cleaning
* fix `filter_english_datasets` since `languages` became `language` in dataset metadatas
* fix empty documents - multi_news (#793)
* fix empty documents - multi_news
* fix test - unrecognized variable
* Language tags (#771)
* Added languages widget to UI.
* Style fixes.
* Added English tag to existing datasets.
* Add languages to viewer mode.
* Update language codes.
* Update CONTRIBUTING.md.
* Update screenshot.
* Add "Prompt" to UI to clarify languages tag usage.
* update
* update prompts
* Remove duplicates lines
* update
* regenerate prompts
* cleaning
* lang tag missing

Co-authored-by: Victor SANH <[email protected]>
Co-authored-by: Stephen Bach <[email protected]>
This PR adds a language tagging feature, so that users can annotate prompts with the language(s) used in the prompt.
Even though this is motivated by the eval hackathon, this PR targets main because it affects all prompts. All existing prompts in main are tagged with English. After merging into main, another PR should merge main into eval hackathon. This will require somewhat careful coordination because new prompts on that branch will need to have their metadata updated in the .yaml before they will work in the UI.
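To make the coordination step more concrete, here is a minimal sketch of a check for prompts whose metadata still lacks a language tag. It works on plain dictionaries rather than the actual .yaml files. The `languages` key, the function name, and the sample template names are all assumptions for illustration, not the repository's actual schema.

```python
def templates_missing_languages(templates):
    """Return the names of templates whose metadata has no language tag.

    `templates` maps a template name to its metadata dict. The
    "languages" key is an assumed field name, used for illustration.
    """
    missing = []
    for name, metadata in templates.items():
        if "languages" not in metadata:
            missing.append(name)
    return sorted(missing)


# Hypothetical metadata, as it might look after loading a .yaml file.
templates = {
    "summarize": {"languages": ["en"]},
    "new_arabic_prompt": {},
}
print(templates_missing_languages(templates))
```

A check along these lines could be run on the branch after the merge to list the prompts that still need updating.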
Regarding the tags themselves, the eval group requested using the subtags in this list. I took the liberty of changing how the tags are displayed in the UI, appending the English names in parentheses, but this can be changed.
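The display format described above can be sketched as follows. This is an illustration, not the code in this PR: the `LANGUAGE_NAMES` mapping and the `format_language_tag` function are hypothetical, and the three entries are sample data rather than the requested subtag list.

```python
# Hypothetical subset of a subtag-to-English-name mapping.
LANGUAGE_NAMES = {"en": "English", "ar": "Arabic", "fr": "French"}


def format_language_tag(code, names=LANGUAGE_NAMES):
    """Return the tag with its English name in parentheses.

    Falls back to the bare tag when no English name is known.
    """
    name = names.get(code)
    return f"{code} ({name})" if name else code


print(format_language_tag("en"))
print(format_language_tag("xx"))
```

Keeping the stored value as the bare subtag and adding the English name only at display time means the display format can change later without touching the prompt metadata.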