Language tags #771

stephenbach · 2022-05-12T17:49:32Z

This PR adds a language tagging feature, so that users can annotate prompts with the language(s) used in the prompt.

Even though this is motivated by the eval hackathon, this PR targets main because it affects all prompts. All existing prompts in main are tagged with English. After this is merged into main, another PR should merge main into the eval hackathon branch. That will require somewhat careful coordination, because new prompts on that branch will need their metadata updated in the .yaml before they will work in the UI.
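As a rough illustration of the metadata update described above, here is a minimal sketch that backfills a language list on a prompt's metadata. It works on plain dictionaries only. The `languages` key, the `backfill_languages` helper, and the default of `"en"` are assumptions made for illustration, not the project's actual loading code.

```python
def backfill_languages(metadata, default=("en",)):
    """Return a copy of a prompt-metadata dict that has a 'languages' list.

    Hypothetical helper: the 'languages' key name is an assumption.
    Existing, non-empty tags are left untouched.
    """
    updated = dict(metadata)
    if not updated.get("languages"):
        updated["languages"] = list(default)
    return updated


# Hypothetical metadata entries, for illustration only.
untagged = {"original_task": True}
tagged = {"original_task": True, "languages": ["ar"]}

print(backfill_languages(untagged))
print(backfill_languages(tagged))
```

A default only makes sense for prompts that really are in that language. Prompts written in other languages would still need their tags set by hand.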

Regarding the tags themselves, the eval group requested using the subtags in this list. I took the liberty of changing how the tags are displayed in the UI by appending the English names in parentheses, but this can be changed.
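The display convention mentioned above (subtag followed by the English name in parentheses) could look roughly like the sketch below. The lookup table and both function names are hypothetical, and this is not necessarily how the UI code does it.

```python
# Hypothetical lookup table; the real list of subtags is not reproduced here.
LANGUAGE_NAMES = {"en": "English", "ar": "Arabic", "fr": "French"}


def format_language_tag(code, names=LANGUAGE_NAMES):
    """Return 'code (English name)', or the bare code if no name is known."""
    name = names.get(code)
    return f"{code} ({name})" if name else code


def parse_language_tag(label):
    """Recover the bare subtag from a displayed label."""
    return label.split(" (", 1)[0]


print(format_language_tag("en"))
print(parse_language_tag("ar (Arabic)"))
```

Storing only the bare subtag and adding the name at display time keeps the saved metadata independent of how the UI chooses to render it.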

awebson

Looks great to me! Thanks so much Steve!

I think we agreed to emphasize somewhere, either in the UI or in the contribution guide, that the language tag should describe the languages of each prompt, not the languages of the dataset examples (which should already be documented by the datasets themselves)?

VictorSanh

Looks good to me, thank you for adding this @stephenbach!

Regarding the language codes, the HF ecosystem (including datasets) is using ISO 639 for language codes (see https://huggingface.co/languages). Could we use the same here?
It will make running analysis on languages of prompts and datasets actually possible (or at least it will be smoother).

* Accelerate `get_infos` by caching the `DataseInfoDict`s (#778)
* accelerate `get_infos` by caching the `DataseInfoDict`s
* quality
* consistency
* fix `filter_english_datasets` since `languages` became `language` in dataset metadatas
* fix empty documents - multi_news (#793)
* fix empty documents - multi_news
* fix test - unrecognized variable
* Language tags (#771)
* Added languages widget to UI.
* Style fixes.
* Added English tag to existing datasets.
* Add languages to viewer mode.
* Update language codes.
* Update CONTRIBUTING.md.
* Update screenshot.
* Add "Prompt" to UI to clarify languages tag usage.
* Add blank languages list.

Co-authored-by: Victor SANH <[email protected]>

* remove language restrictions
* add arabic dataset to primary_task
* Accelerate `get_infos` by caching the `DataseInfoDict`s (#778)
* accelerate `get_infos` by caching the `DataseInfoDict`s
* quality
* consistency
* add arabic prompts
* cleaning
* Consistency in prompt naming.
* cleaning
* fix `filter_english_datasets` since `languages` became `language` in dataset metadatas
* fix empty documents - multi_news (#793)
* fix empty documents - multi_news
* fix test - unrecognized variable
* Language tags (#771)
* Added languages widget to UI.
* Style fixes.
* Added English tag to existing datasets.
* Add languages to viewer mode.
* Update language codes.
* Update CONTRIBUTING.md.
* Update screenshot.
* Add "Prompt" to UI to clarify languages tag usage.
* update
* update prompts
* Remove duplicates lines
* update
* regenerate prompts
* cleaning
* lang tag missing

Co-authored-by: Victor SANH <[email protected]>
Co-authored-by: Stephen Bach <[email protected]>

stephenbach added 4 commits May 12, 2022 13:11

Added languages widget to UI.

a2d10ce

Style fixes.

fca8ce8

Added English tag to existing datasets.

d923e0b

Add languages to viewer mode.

3ded075

stephenbach requested review from awebson and VictorSanh May 12, 2022 17:49

awebson approved these changes May 12, 2022

VictorSanh approved these changes May 12, 2022

jon-tow mentioned this pull request Jun 13, 2022

Add multilingual tokenization for ROUGE bigscience-workshop/lm-evaluation-harness#79

Draft

stephenbach added 5 commits July 7, 2022 12:21

Merge branch 'main' into language_tags

245bc78

Update language codes.

c848b7f

Update CONTRIBUTING.md.

08f5869

Update screenshot.

a8a0128

Add "Prompt" to UI to clarify languages tag usage.

b47c5bb

stephenbach merged commit 0cc4b0c into main Jul 8, 2022

stephenbach deleted the language_tags branch July 8, 2022 21:29

stephenbach mentioned this pull request Aug 9, 2022

Flag in Promptsource for PromptLanguage #764

Closed
