Add additional tests for stop sequences #8

Closed
abetlen opened this issue Apr 2, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@abetlen
Owner

abetlen commented Apr 2, 2023

Stop sequence implementation is currently a little complicated due to needing to support streaming, and the behavior is ill-defined.

@abetlen abetlen added the bug Something isn't working label Apr 2, 2023
@MillionthOdin16
Contributor

I noticed I was getting complaints when stop was set in the default params as well as somewhere else (the completion call itself, I think). I couldn't exactly figure out what the purpose was, but I remember running into it in the past with another repo.

Can you give a quick summary of the actual vs intended behavior?

@abetlen abetlen added the good first issue Good for newcomers label Apr 4, 2023
@abetlen
Owner Author

abetlen commented Apr 4, 2023

Quick spec for stop sequences as I understand them. If anything here seems wrong, please let me know and I'll correct it.

  • Stop sequences are an array of strings that, once generated, cause the language model text completion to finish.
  • When a completion is finished by a stop sequence, the returned string is truncated to just before the stop sequence.
  • This behaviour appears to remain consistent in streaming mode, so e.g. you can't just generate a token, detokenize it, and return the string, as it may contain part of a stop sequence.

What is unclear to me, and requires some testing (against OpenAI for example), is what should happen when streaming, stop sequences, and other stop conditions such as max_tokens and the eos token are combined together.
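A minimal sketch of the buffering idea from that last bullet, assuming a token generator and a detokenize function (both are placeholders, not the library's actual internals):

def stream_with_stop(tokens, detokenize, stop=()):
    """Yield text chunks, holding back any suffix that could still
    grow into a stop sequence, so partial matches are never emitted."""
    buffer = ""
    for token in tokens:
        buffer += detokenize(token)
        # A completed stop sequence ends the stream; truncate before it.
        positions = [buffer.find(s) for s in stop if s in buffer]
        if positions:
            yield buffer[: min(positions)]
            return
        # Longest suffix of the buffer that is a prefix of some stop sequence.
        hold = 0
        for s in stop:
            for k in range(min(len(s), len(buffer)), 0, -1):
                if buffer.endswith(s[:k]):
                    hold = max(hold, k)
                    break
        if len(buffer) > hold:
            yield buffer[: len(buffer) - hold]
            buffer = buffer[len(buffer) - hold:]
    # Generation ended some other way (eos or max_tokens); flush what's left.
    if buffer:
        yield buffer

A test along these lines could stream token by token and assert that no yielded chunk ever contains any fragment of a stop sequence.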

@MillionthOdin16
Contributor

I agree with your explanation. Here's my approach from the streaming/chat/interactive (continuous conversation) perspective.

  • Stop Sequences: The model continuously monitors the generated text to check for stop sequences. If a stop sequence is found, text generation is halted and the generated text is truncated before the stop sequence.
  • Max Tokens (max_tokens): If max_tokens is reached before a stop sequence or an eos token is generated, text generation is halted and the output is returned as-is up to max_tokens.
  • EOS Token: If the model generates an eos token, text generation may be halted. However, this behavior can vary based on the configuration and whether it's operating in interactive mode.

What is unclear to me, and requires some testing (against OpenAI for example), is what should happen when streaming, stop sequences, and other stop conditions such as max_tokens and the eos token are combined together.

While I agree it's important to mostly follow the approaches OpenAI uses for completions, we might also have cases where we need to ignore EOS, or at least have the ability to bypass it via the API, because our models are smaller and can benefit from proceeding past the EOS. Especially in a chat-type mode, I normally generate until I hit a reverse prompt looking for user input (e.g. User:) or just hit max_tokens.
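As a rough, purely illustrative sketch (none of these names are the library's actual internals; the finish reasons just follow OpenAI's "stop"/"length" convention), the per-token check combining these conditions could look like:

def check_finish(token, text, n_generated, *, eos_token, stop, max_tokens, ignore_eos=False):
    # Return a finish reason, or None to keep generating.
    if not ignore_eos and token == eos_token:
        return "stop"    # model signalled end of text
    if any(s in text for s in stop):
        return "stop"    # a stop sequence appeared; the caller truncates before it
    if max_tokens is not None and n_generated >= max_tokens:
        return "length"  # token budget exhausted; text is returned as-is
    return None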

Relevant snippets from llama.cpp main:

Ignoring EOS by zeroing the logit at index llama_token_eos() (usually token 2), which effectively removes the EOS functionality:

if (params.ignore_eos) {
    logits[llama_token_eos()] = 0;
}

Handling interactive mode, where instead of stopping on EOS we loop back and request input from the user:
(There's also some code that replaces EOS with a newline that I didn't include for brevity.)

// end of text token
if (embd.back() == llama_token_eos()) {
    if (params.instruct) {
        is_interacting = true;
    } else {
        fprintf(stderr, " [end of text]\n");
        break;
    }
}

While generation is occurring, check that we're still below max_tokens; otherwise return the generated text and await user input:

// In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
if (params.interactive && n_remain <= 0 && params.n_predict != -1) {
    n_remain = params.n_predict;
    is_interacting = true;
}

https://github.com/ggerganov/llama.cpp/blob/53dbba769537e894ead5c6913ab2fd3a4658b738/examples/main/main.cpp#LL443C2-L447C10

@abetlen
Owner Author

abetlen commented Apr 4, 2023

@MillionthOdin16 with regard to what you're saying about the eos token, I agree that I don't want our hands tied by OpenAI compatibility (so we can reap the benefits of the local model), but I don't want to change the existing __call__ / create_completion / create_chat_completion API.

For this issue I'm just focusing on the functionality of those methods.
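For reference, a hedged sketch of what one such test could look like against the current API (the model path is a placeholder, and I'm assuming only the stop/max_tokens parameters that __call__ / create_completion already expose):

from llama_cpp import Llama

# Placeholder path; any small ggml model is enough for a smoke test.
llm = Llama(model_path="./models/ggml-model.bin")

output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
)

text = output["choices"][0]["text"]
# The completion should be truncated just before any stop sequence.
assert "Q:" not in text and "\n" not in text
assert output["choices"][0]["finish_reason"] in ("stop", "length")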

However, I did create a new issue #22 to track what we discussed, which is similar to interactive mode in llama.cpp.

@MillionthOdin16
Contributor

This makes sense. I asked for clarification over in #22 about interactive mode, and if that's the case then I understand now.

@abetlen abetlen changed the title Fix stop sequence implementation Verify stop sequence implementation Apr 5, 2023
@abetlen abetlen changed the title Verify stop sequence implementation Add additional tests for stop sequences Apr 5, 2023
@abetlen abetlen added enhancement New feature or request and removed bug Something isn't working good first issue Good for newcomers labels Apr 5, 2023
@abetlen abetlen pinned this issue Apr 5, 2023
@abetlen abetlen unpinned this issue Apr 6, 2023
@abetlen abetlen closed this as completed May 5, 2023