Add option to keep output and embed tensors at f16 #8715
Add option to keep output and embed tensors at f16
Example:
How much of a difference does this make compared to pure q8_0?
I'm not convinced f16 is measurably better than Q8_0 for the token embeddings and output tensors when the requested type is Q8_0.

If the goal is to keep these tensors unchanged from the original model, why not also offer to use bf16 when appropriate?

--outtype q8_0 in convert_hf_to_gguf.py is intended to give the same result as llama-quantize with q8_0 output.
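To illustrate, the two routes below are meant to end up with equivalent Q8_0 output (model paths and file names here are placeholders, and exact bit-equality can depend on the intermediate type):

# direct conversion to Q8_0
$ python3 convert_hf_to_gguf.py models/MyModel/ --outtype q8_0 --outfile MyModel-q8_0.gguf

# convert first, then quantize
$ python3 convert_hf_to_gguf.py models/MyModel/ --outtype f16 --outfile MyModel-f16.gguf
$ ./llama-quantize MyModel-f16.gguf MyModel-q8_0.gguf q8_0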
I could see why this option could be useful, but:
- I expect there would be close to no difference in inference quality. Measuring this could be useful.
- It should have a better name than --z.
- It should not break Q8_0 conversion for models without output.weight.
- (maybe) It should allow choosing the overridden type of output.weight and token_embd.weight instead of always using f16.
@@ -319,7 +321,7 @@ def prepare_tensors(self):
                         assert data.dtype == np.int16
                         data_qtype = gguf.GGMLQuantizationType.BF16

-                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data):
+                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
This will prevent conversion to Q8_0 altogether for BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet, because they do not have MODEL_TENSOR.OUTPUT in their tensor type list.

Use self.match_model_tensor_name instead.
Feel free to modify my idea and code for your needs. As it is, I can use it, but sure, it would be better if done more properly. I just wish not to use quantize when it's not strictly needed.
As it is I can use it.

Not on the models I mentioned (BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet). Try it, and you should see an error with --outtype q8_0, both with and without --z.
$ python3 convert_hf_to_gguf.py models/OpenELM-270M-Instruct/ --dry-run --outtype q8_0 --z
INFO:hf-to-gguf:Loading model: OpenELM-270M-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:blk.0.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {64}
Traceback (most recent call last):
...
File "convert_hf_to_gguf.py", line 324, in prepare_tensors
elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "convert_hf_to_gguf.py", line 179, in format_tensor_name
raise ValueError(f"Missing {key!r} for MODEL_TENSORS of {self.model_arch!r}")
ValueError: Missing <MODEL_TENSOR.OUTPUT: 5> for MODEL_TENSORS of <MODEL_ARCH.OPENELM: 34>
Using self.match_model_tensor_name instead of self.format_tensor_name would avoid this problem (self.format_tensor_name is only intended to be used for tensors which will actually be in the model).
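As a minimal, untested sketch of that suggestion, the condition could look something like the following, assuming match_model_tensor_name(name, key, bid) from the Model class and the bid value already extracted earlier in prepare_tensors, with the branch body kept as in the existing Q8_0 case; it also inverts the --z check so that plain --outtype q8_0 keeps quantizing when the flag is absent:

elif (
    self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0
    and gguf.can_quantize_to_q8_0(data)
    # quantize to Q8_0 unless --z is given and this is the output or token-embedding tensor
    and not (self.z and (
        self.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.OUTPUT, bid)
        or self.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.TOKEN_EMBD, bid)
    ))
):
    data = gguf.quantize_q8_0(data)
    data_qtype = gguf.GGMLQuantizationType.Q8_0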
I just wish not to use quantize if it's not strictly needed.

I understand; reducing the steps needed to get a desired quant is partly why q8_0 conversion was added in #7234.

I think that a more general type override for token_embd.weight and output.weight might be useful for your purpose, but I'm not yet sure how to implement that in a simple enough way. I think it might only be possible after a refactor of the type selection logic, which should be done relatively soon anyway to better support types for ternary models.

The currently proposed flag only has an effect along with --outtype q8_0, which can be confusing UX-wise.
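Purely as a sketch of what the flag surface of such an override might look like (the names, choices, and wiring here are hypothetical and not part of this PR or of the current convert_hf_to_gguf.py), it could mirror llama-quantize's per-tensor options:

# hypothetical arguments added to the existing argparse parser in convert_hf_to_gguf.py
parser.add_argument("--output-tensor-type", choices=["f32", "f16", "bf16", "q8_0"], default=None,
                    help="(hypothetical) override the type used for output.weight")
parser.add_argument("--token-embd-type", choices=["f32", "f16", "bf16", "q8_0"], default=None,
                    help="(hypothetical) override the type used for token_embd.weight")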
Well... I could add an option like in the quantize program... --emdt, --outt, or something like that? I don't know.
It's also true that if you convert to q8_0 without --z, you get a q8_0 as before. If you convert to f16 you don't need --z, and if for some reason you convert to f32, --z has no reason to be there. That's why I used it only for q8_0.
Co-authored-by: compilade <[email protected]>
@compilade Is it ok now?
In my own experience (chatting with the models, not using automated tools to judge them) it's way better and more similar to the pure f16. A lot of people using my quantizations say the same.
@compilade I'm with you there. I did MMLU-Pro tests of Gemma and found that, besides a couple of outliers, the f16 output/embed performed worse than Q8_0, but he seems to like making them, so more power to him. --z is a terrible CLI arg though, please rename it.
Please change it to your liking.
I did some quick tests comparing conventional q8_0 with q8_0 --z. The values for conventional q8_0 seem to be slightly better, but the differences are so small that I am not comfortable drawing any conclusions from this, since the supposed effects could just be artifacts of the measurement. But these results definitely do not provide any evidence that FP16 output tensors provide a measurable benefit; my personal opinion is that without such evidence we should just keep the code simple and not add the option.
Oh, that's interesting. I didn't know llama-perplexity could also do KLD and token probabilities; that's a ton of extra useful information.
More info here.
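For anyone who wants to reproduce that kind of comparison, the workflow looks roughly like this (file names are placeholders; check llama-perplexity --help for the exact flag spelling):

# save reference logits from the unquantized model
$ ./llama-perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.dat

# evaluate a quantized model against those logits (reports KLD and token-probability statistics)
$ ./llama-perplexity -m model-q8_0.gguf -f wiki.test.raw --kl-divergence-base logits-f16.dat --kl-divergence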
I put --z as an example... obviously that must be changed to whatever @ggerganov likes.
--output-tensor-type f16 --token-embedding-type f16 works fine as is?! The options are great as is, in my opinion: you can test and tune perplexity exactly as needed. Some models do fine with q3 embeddings, some need bf16... not sure why this deserves a whole other config setting.
THOSE are in the QUANTIZE program!
Considering the lack of value added and the fact that it's barely any more work to use the already existing quantize option, I don't see this option getting approved. You should also probably be more polite to people like Nisten, who are trying to provide feedback.
@bartowski1182 Using first convert then quantize is kind of dumb in this case. Anyway, I just wanted to share an option I added for myself. And since the option is already in quantize, it would be nice to have the same option in the "convert" program. That's all.
Add option to keep output and embed tensors at f16
Normally doing this takes 2 steps: convert, then quantize.
This way it's possible, for example, to convert a model directly to q8_0 while keeping the output and embed tensors at f16.
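For comparison, the two routes look roughly like this (model and file names are placeholders, and --z is only the provisional flag name used in this PR):

# current two-step route
$ python3 convert_hf_to_gguf.py models/MyModel/ --outtype f16 --outfile MyModel-f16.gguf
$ ./llama-quantize --output-tensor-type f16 --token-embedding-type f16 MyModel-f16.gguf MyModel-q8_0.gguf q8_0

# with this PR, a single conversion step
$ python3 convert_hf_to_gguf.py models/MyModel/ --outtype q8_0 --z --outfile MyModel-q8_0.gguf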