Add additional tests for stop sequences #8

Closed
abetlen opened this issue Apr 2, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@abetlen
Owner

abetlen commented Apr 2, 2023

Stop sequence implementation is currently a little complicated due to needing to support streaming, and the behavior is ill-defined.

@abetlen abetlen added the bug Something isn't working label Apr 2, 2023
@MillionthOdin16
Contributor

I noticed I was getting complaints when stop was set in the default params as well as somewhere else (the completion call itself, I think). I couldn't exactly figure out what the purpose was, but I remember running into it in the past with another repo.

Can you give a quick summary of the actual vs intended behavior?

@abetlen abetlen added the good first issue Good for newcomers label Apr 4, 2023
@abetlen
Owner Author

abetlen commented Apr 4, 2023

Quick spec for stop sequences as I understand them. If anything here seems wrong, please let me know and I'll correct it.

  • Stop sequences are an array of strings that, once generated, cause the language model text completion to finish.
  • When a completion is finished by a stop sequence, the returned string is truncated to just before the stop sequence.
  • This behaviour appears to remain consistent in streaming mode, so e.g. you can't just generate a token, detokenize it, and return the string, as it may contain part of a stop sequence.

What is unclear to me, and requires some testing (against OpenAI for example), is what should happen when streaming, stop sequences, and other stop conditions such as max_tokens and the eos token are combined together.
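A minimal sketch of the buffering idea from that last bullet, assuming a token generator and a detokenize function (both are placeholders, not the library's actual internals):

def stream_with_stop(tokens, detokenize, stop=()):
    """Yield text chunks, holding back any suffix that could still
    grow into a stop sequence, so partial matches are never emitted."""
    buffer = ""
    for token in tokens:
        buffer += detokenize(token)
        # A completed stop sequence ends the stream; truncate before it.
        positions = [buffer.find(s) for s in stop if s in buffer]
        if positions:
            yield buffer[: min(positions)]
            return
        # Longest suffix of the buffer that is a prefix of some stop sequence.
        hold = 0
        for s in stop:
            for k in range(min(len(s), len(buffer)), 0, -1):
                if buffer.endswith(s[:k]):
                    hold = max(hold, k)
                    break
        if len(buffer) > hold:
            yield buffer[: len(buffer) - hold]
            buffer = buffer[len(buffer) - hold:]
    # Generation ended some other way (eos or max_tokens); flush what's left.
    if buffer:
        yield buffer

A test along these lines could stream token by token and assert that no yielded chunk ever contains any fragment of a stop sequence.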

@MillionthOdin16
Contributor

I agree with your explanation. Here's my approach from the streaming/chat/interactive (continuous conversation) perspective.

  • Stop Sequences: The model continuously monitors the generated text to check for stop sequences. If a stop sequence is found, text generation is halted and the generated text is truncated before the stop sequence.
  • Max Tokens (max_tokens): If max_tokens is reached before a stop sequence or an eos token is generated, text generation is halted and the output is returned as-is up to max_tokens.
  • EOS Token: If the model generates an eos token, text generation may be halted. However, this behavior can vary based on the configuration and whether it's operating in interactive mode.

What is unclear to me, and requires some testing (against OpenAI for example), is what should happen when streaming, stop sequences, and other stop conditions such as max_tokens and the eos token are combined together.

While I agree it's important to mostly follow the approaches OpenAI uses for completions, we might also have cases where we need to ignore EOS, or at least have the ability to bypass it via the API, because our models are smaller and can benefit from proceeding past the EOS. Especially in a chat-type mode, I normally generate until I hit a reverse prompt looking for user input (e.g. User:) or just hit max_tokens.
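As a rough, purely illustrative sketch (none of these names are the library's actual internals; the finish reasons just follow OpenAI's "stop"/"length" convention), the per-token check combining these conditions could look like:

def check_finish(token, text, n_generated, *, eos_token, stop, max_tokens, ignore_eos=False):
    # Return a finish reason, or None to keep generating.
    if not ignore_eos and token == eos_token:
        return "stop"    # model signalled end of text
    if any(s in text for s in stop):
        return "stop"    # a stop sequence appeared; the caller truncates before it
    if max_tokens is not None and n_generated >= max_tokens:
        return "length"  # token budget exhausted; text is returned as-is
    return None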

Relevant snippets from llama.cpp main:

Ignoring EOS by zeroing the logit at index llama_token_eos() (usually token 2), which effectively removes the EOS functionality:

if (params.ignore_eos) {
    logits[llama_token_eos()] = 0;
}

Handling interactive mode, where instead of stopping on EOS we loop back and request input from the user:
(There's also some code that replaces EOS with a newline that I didn't include for brevity.)

// end of text token
if (embd.back() == llama_token_eos()) {
    if (params.instruct) {
        is_interacting = true;
    } else {
        fprintf(stderr, " [end of text]\n");
        break;
    }
}

While generation is occurring, check that we're still below max_tokens; otherwise return the generated text and await user input:

// In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
if (params.interactive && n_remain <= 0 && params.n_predict != -1) {
    n_remain = params.n_predict;
    is_interacting = true;
}

https://github.com/ggerganov/llama.cpp/blob/53dbba769537e894ead5c6913ab2fd3a4658b738/examples/main/main.cpp#LL443C2-L447C10

@abetlen
Owner Author

abetlen commented Apr 4, 2023

@MillionthOdin16 with regard to what you're saying about the eos token, I agree that I don't want our hands tied by OpenAI compatibility (so we can reap the benefits of the local model), but I don't want to change the existing __call__ / create_completion / create_chat_completion API.

For this issue I'm just focusing on the functionality of those methods.
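For reference, a hedged sketch of what one such test could look like against the current API (the model path is a placeholder, and I'm assuming only the stop/max_tokens parameters that __call__ / create_completion already expose):

from llama_cpp import Llama

# Placeholder path; any small ggml model is enough for a smoke test.
llm = Llama(model_path="./models/ggml-model.bin")

output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
)

text = output["choices"][0]["text"]
# The completion should be truncated just before any stop sequence.
assert "Q:" not in text and "\n" not in text
assert output["choices"][0]["finish_reason"] in ("stop", "length")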

However, I did create a new issue #22 to track what we discussed, which is similar to interactive mode in llama.cpp.

@MillionthOdin16
Contributor

This makes sense. I asked for clarification over in #22 about interactive mode, and if that's the case then I understand now.

@abetlen abetlen changed the title Fix stop sequence implementation Verify stop sequence implementation Apr 5, 2023
@abetlen abetlen changed the title Verify stop sequence implementation Add additional tests for stop sequences Apr 5, 2023
@abetlen abetlen added enhancement New feature or request and removed bug Something isn't working good first issue Good for newcomers labels Apr 5, 2023
@abetlen abetlen pinned this issue Apr 5, 2023
@abetlen abetlen unpinned this issue Apr 6, 2023
@abetlen abetlen closed this as completed May 5, 2023