
[Feature]: guided decoding on TPU #11104


Closed

carlesoctav opened this issue Dec 11, 2024 · 9 comments

@carlesoctav

🚀 The feature, motivation and pitch

I'm not sure if this is possible, but right now the execute_model function on TPUModelRunner returns only the predicted token_ids, rather than the token distribution (logits) that we could sample from with some guidance (e.g., using outlines). Structured output is becoming more common, and most projects built on LLMs need it.
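
For illustration, this is roughly what guided decoding looks like at the logits level in the style outlines uses; `allowed_next_tokens` here is a hypothetical stand-in for the grammar/FSM lookup a real library would provide:

```python
import torch

# Sketch only: a logits processor receives the tokens generated so far plus
# the raw logits for the next token, and masks every token the grammar
# disallows to -inf before sampling.
def guided_logits_processor(past_token_ids: list[int],
                            logits: torch.Tensor) -> torch.Tensor:
    allowed = allowed_next_tokens(past_token_ids)  # hypothetical grammar lookup
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    return logits + mask
```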

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
carlesoctav added the feature request label on Dec 11, 2024
@carlesoctav (Author)

I'm down to work on these features :).

@robertgshaw2-redhat (Collaborator)

Thanks @carlesoctav - the first step is to enable support for LogitsProcessing on TPUs. Do you need a pointer to get started?

@bvrockwell (Contributor)

Thanks so much for lending a hand, @carlesoctav! Indeed, this is super important :)

@carlesoctav (Author)

Hi, I've been working on making this feature viable and settled on the following approach (a rough sketch follows below):

  1. Extract the logits_processors in the prepare_sample function (which previously output just the n, p, t params).
  2. Pass the logits_processors as one of the parameters to ModelWrapper.forward.
  3. Iteratively apply each logits_processor (similar to the _apply_logits_processor function).

However, a LogitsProcessor still needs some parameters that are missing here, mainly prompt_token_ids or past_token_ids, and extracting those requires sample_indices. How can I get these parameters? Do I need to extract them in the prepare_input function?

Also, here's the diff for the changes I made:
carlesoctav@34703fc
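
And a rough sketch of steps 1-3 (shapes and names are illustrative, not the actual TPUModelRunner code):

```python
import torch

def _apply_logits_processors(
    logits: torch.Tensor,             # [num_seqs, vocab_size]
    logits_processors: list,          # per-sequence lists of processor callables
    past_token_ids: list[list[int]],  # tokens generated so far, per sequence
) -> torch.Tensor:
    # Step 3: apply each sequence's processors to its own row of logits
    # before sampling, mirroring what _apply_logits_processor does elsewhere.
    for i, processors in enumerate(logits_processors):
        for processor in processors or []:
            logits[i] = processor(past_token_ids[i], logits[i])
    return logits
```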

@bvrockwell (Contributor)

cc @dyli-google 👍

@bvrockwell (Contributor)

@Chenyaaang

@dyli-google (Contributor)

@carlesoctav

Sorry for the late reply.

Is there any update on this? Is carlesoctav@34703fc still the latest commit?

Also, do you want to create a pull request for this?

Thanks.

@Chenyaaang (Contributor)

I did some investigation yesterday. carlesoctav's approach is workable, but given the PR adding structured-decoding support on GPU V1 (#12388), we only need to do the same thing in v1/worker/tpu_model_runner as in v1/worker/gpu_model_runner. I can implement it after that PR is merged.
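
Conceptually, the V1 approach reduces to applying a per-request token bitmask to the logits before sampling. A minimal sketch, assuming a boolean mask where True means the token is allowed (names and shapes are illustrative, not the actual model-runner code):

```python
import torch

def apply_grammar_bitmask(
    logits: torch.Tensor,   # [num_reqs, vocab_size], float
    bitmask: torch.Tensor,  # [num_reqs, vocab_size], bool; True = allowed
) -> torch.Tensor:
    # Tokens the grammar disallows get -inf, so sampling can never pick them.
    return logits.masked_fill(~bitmask, float("-inf"))
```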

@russellb (Member)

I believe this was completed by #16499

github-project-automation bot moved this from In progress to Done in Structured Output on Apr 23, 2025