Add option to keep output and embed tensors at f16 #8715
Closed
This will prevent conversion to Q8_0 altogether for BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet, because they do not have MODEL_TENSOR.OUTPUT in their tensor type list. Use self.match_model_tensor_name instead.
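For reference, a minimal self-contained sketch of why the membership check matters, using gguf-py's public constants; the helper function name here is hypothetical, not from the PR:

```python
import gguf

# Hypothetical helper mirroring the reviewer's point: checking membership in the
# architecture's tensor list never raises, whereas formatting the name of a
# tensor the arch doesn't declare (as format_tensor_name-style lookups do)
# aborts conversion for models without a separate output.weight.
def has_output_tensor(arch: gguf.MODEL_ARCH) -> bool:
    return gguf.MODEL_TENSOR.OUTPUT in gguf.MODEL_TENSORS[arch]

print(has_output_tensor(gguf.MODEL_ARCH.LLAMA))  # True
print(has_output_tensor(gguf.MODEL_ARCH.GEMMA))  # False: ties output to token_embd
```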
Feel free to modify my idea and code for your needs. As it is, I can use it, but sure, it would be better if done more properly. I just wish to avoid running quantize when it's not strictly needed.
Not on the models I mentioned. Try it, and you should see an error with --outtype q8_0, both with and without --z. Using self.match_model_tensor_name instead of self.format_tensor_name would avoid this problem (self.format_tensor_name is only intended to be used for tensors which will actually be in the model).

I understand; reducing the steps needed to get a desired quant is partly why q8_0 conversion was added in #7234.

I think that a more general type override for token_embd.weight and output.weight might be useful for your purpose, but I'm not yet sure how to implement that in a simple enough way. It might only be possible after a refactor of the type selection logic, which should be done relatively soon anyway to better support types for ternary models.

The currently proposed flag only has an effect along with --outtype q8_0, which can be confusing UX-wise.
Well... I could add options like in the quantize program... --emdt, --outt, or something like that? I don't know...
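For illustration only, one possible shape for such a general override, loosely mirroring the quantize tool's per-tensor type options; the flag names below are hypothetical and not part of this PR:

```python
import argparse

# Hypothetical CLI sketch: explicit per-tensor overrides instead of a
# q8_0-only --z switch. Flag names are illustrative, not from the PR.
parser = argparse.ArgumentParser()
parser.add_argument("--outtype", choices=["f32", "f16", "q8_0"], default="f16")
parser.add_argument("--token-embedding-type", choices=["f32", "f16", "q8_0"])
parser.add_argument("--output-tensor-type", choices=["f32", "f16", "q8_0"])
args = parser.parse_args()

# When choosing a tensor's type, fall back to the global --outtype.
embd_type = args.token_embedding_type or args.outtype
output_type = args.output_tensor_type or args.outtype
```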
It's also true that if you quantize to q8_0 without --z you get a q8_0 as before; if you quantize at f16 then you don't need --z, and if for some reason you quantize at f32, --z has no reason to exist... that's why I used it only for q8_0.
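In other words, the flag's effect reduces to a single special case on the q8_0 path. A sketch with illustrative names (keep_f16 standing in for --z):

```python
def pick_tensor_type(outtype: str, keep_f16: bool, is_embd_or_output: bool) -> str:
    # keep_f16 only matters when the global outtype is q8_0: at f16 these
    # tensors are already f16, and at f32 downgrading them makes no sense.
    if outtype == "q8_0" and keep_f16 and is_embd_or_output:
        return "f16"
    return outtype
```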