
Add option to keep output and embed tensors at f16 #8715

Closed
0wwafa wants to merge 3 commits

Conversation

@0wwafa commented Jul 26, 2024

Add option to keep output and embed tensors at f16

Normally this takes two steps: convert, then quantize. With this option it is possible, for example, to convert a model directly to q8_0 while keeping the output and embed tensors at f16.
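For reference, the conventional two-step route this collapses would look roughly like the following (a sketch only: the model path mirrors the example further down, and --output-tensor-type / --token-embedding-type are llama-quantize flags, not options of the convert script):

python3 convert_hf_to_gguf.py --outtype f16 /mnt/i/models/Mistral-7b-Instruct-v0.3 --outfile Mistral-7b-Instruct-v0.3.f16.gguf
llama-quantize --output-tensor-type f16 --token-embedding-type f16 Mistral-7b-Instruct-v0.3.f16.gguf Mistral-7b-Instruct-v0.3.f16.q8_0.gguf q8_0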

Add option to keep output and embed tensors at f16
@github-actions bot added the python label Jul 26, 2024
@0wwafa (Author) commented Jul 27, 2024

Example:

python3 convert_hf_to_gguf.py --z --outtype q8_0 /mnt/i/models/Mistral-7b-Instruct-v0.3  --outfile Mistral-7b-Instruct-v0.3.f16.q8_0.gguf

@JohannesGaessler (Collaborator)

How much of a difference does this make compared to pure q8_0?

@compilade (Collaborator) left a comment

I'm not convinced f16 is measurably better than Q8_0 for the token embeddings and output tensors when the requested type is Q8_0.

If the goal is to keep these tensors unchanged from the original model, why not also offer to use bf16 when appropriate?

--outtype q8_0 in convert_hf_to_gguf.py is intended to give the same result as llama-quantize with q8_0 output.

I could see why this option could be useful, but

  • I expect there would be close to no difference in inference quality.
    • Measuring this could be useful.
  • It should have a better name than --z
  • It should not break Q8_0 conversion for models without output.weight
  • (maybe) it should allow choosing the overridden type of output.weight and token_embd.weight instead of always using f16

@@ -319,7 +321,7 @@ def prepare_tensors(self):
                         assert data.dtype == np.int16
                         data_qtype = gguf.GGMLQuantizationType.BF16

-                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data):
+                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
Collaborator:

This will prevent conversion to Q8_0 altogether for BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet, because they do not have MODEL_TENSOR.OUTPUT in their tensor type list.

Use self.match_model_tensor_name instead.
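A minimal sketch of that suggestion (not the PR's actual code; it assumes match_model_tensor_name(name, key, bid) returns False when the architecture lacks the tensor, and uses a hypothetical keep_f16 flag in place of --z):

import gguf  # gguf-py, as used by convert_hf_to_gguf.py

def keep_at_f16(model, new_name: str, keep_f16: bool) -> bool:
    # Only applies when the (renamed) flag is given.
    if not keep_f16:
        return False
    # Unlike format_tensor_name, match_model_tensor_name simply returns False
    # when the architecture has no such tensor (e.g. no MODEL_TENSOR.OUTPUT
    # for OpenELM), so it cannot raise the ValueError shown in the traceback below.
    return (model.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.OUTPUT, None)
            or model.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.TOKEN_EMBD, None))

# The Q8_0 branch in prepare_tensors() would then read roughly:
#   elif (self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0
#         and gguf.can_quantize_to_q8_0(data)
#         and not keep_at_f16(self, new_name, self.keep_f16)):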

Author:

Feel free to modify my idea and code for your needs. As it is, I can use it, but it would certainly be better if done more properly. I just wish not to use quantize if it's not strictly needed.

Collaborator:

> As it is, I can use it.

Not on the models I mentioned.

> BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet

Try it, and you should see an error with --outtype q8_0, both with and without --z.

$ python3 convert_hf_to_gguf.py models/OpenELM-270M-Instruct/ --dry-run --outtype q8_0 --z
INFO:hf-to-gguf:Loading model: OpenELM-270M-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:blk.0.attn_k_norm.weight,  torch.bfloat16 --> F32, shape = {64}
Traceback (most recent call last):
...
  File "convert_hf_to_gguf.py", line 324, in prepare_tensors
    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "convert_hf_to_gguf.py", line 179, in format_tensor_name
    raise ValueError(f"Missing {key!r} for MODEL_TENSORS of {self.model_arch!r}")
ValueError: Missing <MODEL_TENSOR.OUTPUT: 5> for MODEL_TENSORS of <MODEL_ARCH.OPENELM: 34>

Using self.match_model_tensor_name instead of self.format_tensor_name would avoid this problem (self.format_tensor_name is only intended to be used for tensors which will actually be in the model).

> I just wish not to use quantize if it's not strictly needed.

I understand; reducing the steps needed to get a desired quant is partly why q8_0 conversion was added in #7234.

I think that a more general type override for token_embd.weight and output.weight might be useful for your purpose, but I'm not yet sure how to implement that in a simple enough way. I think it might only be possible after a refactor of the type selection logic, which should be done relatively soon anyway to better support types for ternary models.

The currently proposed flag only has an effect along with --outtype q8_0, which can be confusing UX-wise.
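One possible shape for such an override, purely as a sketch (these flags do not exist in convert_hf_to_gguf.py; the names just mirror llama-quantize's --output-tensor-type / --token-embedding-type, and the type-selection helper is hypothetical):

import argparse
import gguf  # gguf-py package

parser = argparse.ArgumentParser()
parser.add_argument("--outtype", default="f16", choices=("f32", "f16", "bf16", "q8_0"))
parser.add_argument("--output-tensor-type", choices=("f32", "f16", "bf16", "q8_0"),
                    help="override the type of output.weight")
parser.add_argument("--token-embedding-type", choices=("f32", "f16", "bf16", "q8_0"),
                    help="override the type of token_embd.weight")
args = parser.parse_args(["--outtype", "q8_0", "--output-tensor-type", "f16"])

TYPE_MAP = {
    "f32": gguf.GGMLQuantizationType.F32,
    "f16": gguf.GGMLQuantizationType.F16,
    "bf16": gguf.GGMLQuantizationType.BF16,
    "q8_0": gguf.GGMLQuantizationType.Q8_0,
}

def select_type(tensor_name: str) -> gguf.GGMLQuantizationType:
    # A per-tensor override wins; otherwise fall back to --outtype.
    if tensor_name == "output.weight" and args.output_tensor_type:
        return TYPE_MAP[args.output_tensor_type]
    if tensor_name == "token_embd.weight" and args.token_embedding_type:
        return TYPE_MAP[args.token_embedding_type]
    return TYPE_MAP[args.outtype]

print(select_type("output.weight"))        # F16, from the override
print(select_type("blk.0.attn_q.weight"))  # Q8_0, from --outtype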

Author:

Well, I could add an option like in the quantize program... --emdt, --outt, or something like that? I don't know.

Author:

It's also true that if you convert to q8_0 without --z you get a q8_0 as before. If you convert to f16 you don't need --z, and if for some reason you convert to f32, --z has no reason to exist. That's why I used it only for q8_0.

0wwafa and others added 2 commits July 28, 2024 02:03
@0wwafa (Author) commented Jul 27, 2024

@compilade Is it ok now?

@0wwafa (Author) commented Jul 27, 2024

> How much of a difference does this make compared to pure q8_0?

In my own experience (chatting with the models, not using automated tools to judge them) it's way better and more similar to the pure f16. A lot of people using my quantizations say the same.

@bartowski1182 (Contributor)

@compilade I'm with you there. I did MMLU-Pro tests of Gemma and found that, besides a couple of outliers, the f16 output/embed performed worse than Q8_0, but he seems to like making them, so more power to him.

--z is a terrible CLI arg though, please rename it

@0wwafa (Author) commented Jul 28, 2024

> @compilade I'm with you there. I did MMLU-Pro tests of Gemma and found that, besides a couple of outliers, the f16 output/embed performed worse than Q8_0, but he seems to like making them, so more power to him.
>
> --z is a terrible CLI arg though, please rename it

please change it to your liking.

@JohannesGaessler (Collaborator) commented Jul 28, 2024

I did some quick tests with llama-perplexity using LLaMA 3.1 8b.

q8_0 conventional
====== Perplexity statistics ======
Mean PPL(Q)                   :   6.413076 ±   0.039490
Mean PPL(base)                :   6.403059 ±   0.039368
Cor(ln(PPL(Q)), ln(PPL(base))):  99.93%
Mean ln(PPL(Q)/PPL(base))     :   0.001563 ±   0.000231
Mean PPL(Q)/PPL(base)         :   1.001564 ±   0.000231
Mean PPL(Q)-PPL(base)         :   0.010016 ±   0.001484

====== KL divergence statistics ======
Mean    KLD:   0.003026 ±   0.000027
Maximum KLD:   2.694473
99.9%   KLD:   0.067251
99.0%   KLD:   0.020983
Median  KLD:   0.001843
10.0%   KLD:   0.000077
 5.0%   KLD:   0.000019
 1.0%   KLD:   0.000001
Minimum KLD:  -0.000021

====== Token probability statistics ======
Mean    Δp: -0.001 ± 0.004 %
Maximum Δp: 74.656%
99.9%   Δp:  9.160%
99.0%   Δp:  4.755%
95.0%   Δp:  2.479%
90.0%   Δp:  1.530%
75.0%   Δp:  0.366%
Median  Δp:  0.000%
25.0%   Δp: -0.340%
10.0%   Δp: -1.503%
 5.0%   Δp: -2.505%
 1.0%   Δp: -4.942%
 0.1%   Δp: -10.337%
Minimum Δp: -79.540%
RMS Δp    :  1.669 ± 0.022 %
Same top p: 96.944 ± 0.045 %
q8_0 --z
====== Perplexity statistics ======
Mean PPL(Q)                   :   6.418853 ±   0.039522
Mean PPL(base)                :   6.403059 ±   0.039368
Cor(ln(PPL(Q)), ln(PPL(base))):  99.92%
Mean ln(PPL(Q)/PPL(base))     :   0.002464 ±   0.000251
Mean PPL(Q)/PPL(base)         :   1.002467 ±   0.000252
Mean PPL(Q)-PPL(base)         :   0.015793 ±   0.001616

====== KL divergence statistics ======
Mean    KLD:   0.003659 ±   0.000023
Maximum KLD:   1.525696
99.9%   KLD:   0.086219
99.0%   KLD:   0.026490
Median  KLD:   0.002224
10.0%   KLD:   0.000097
 5.0%   KLD:   0.000026
 1.0%   KLD:   0.000002
Minimum KLD:  -0.000019

====== Token probability statistics ======
Mean    Δp: -0.028 ± 0.005 %
Maximum Δp: 78.045%
99.9%   Δp: 10.310%
99.0%   Δp:  5.184%
95.0%   Δp:  2.636%
90.0%   Δp:  1.606%
75.0%   Δp:  0.371%
Median  Δp:  0.000%
25.0%   Δp: -0.401%
10.0%   Δp: -1.673%
 5.0%   Δp: -2.752%
 1.0%   Δp: -5.499%
 0.1%   Δp: -11.852%
Minimum Δp: -47.852%
RMS Δp    :  1.824 ± 0.018 %
Same top p: 96.677 ± 0.047 %

The values for conventional q8_0 seem to be slightly better, but the differences are so small that I am not comfortable drawing any conclusions from this, since the supposed effects could just be artifacts of the measurement. These results definitely do not provide any evidence that FP16 output tensors provide a measurable benefit; my personal opinion is that without such evidence we should just keep the code simple and not add the option.

@bartowski1182 (Contributor)

Oh that's interesting, I didn't know llama-perplexity could also do KLD and token probabilities, that's a ton of extra useful information

@JohannesGaessler (Collaborator)

More info here.
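(For anyone wanting to reproduce this kind of comparison: as far as I understand the tool, you first run llama-perplexity on the unquantized model with --kl-divergence-base to save its logits, then run the quantized model against that file with --kl-divergence; file names below are illustrative.)

llama-perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.dat
llama-perplexity -m model-q8_0.gguf --kl-divergence-base logits-f16.dat --kl-divergence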

@0wwafa (Author) commented Jul 28, 2024

I put --z as an example; obviously that must be changed to whatever @ggerganov likes. So, tell me: do I need to change anything?

@nisten commented Aug 1, 2024

--output-tensor-type f16 --token-embedding-type f16 works fine as is?!

The options are great as is, in my opinion; you can test and tune perplexity exactly as needed. Some models do fine with q3 embeddings, some need bf16... not sure why this deserves a whole other config setting.

@mofosyne added the Review Complexity : Low label Aug 1, 2024
@0wwafa (Author) commented Aug 1, 2024

> --output-tensor-type f16 --token-embedding-type f16 works fine as is?!
>
> The options are great as is, in my opinion; you can test and tune perplexity exactly as needed. Some models do fine with q3 embeddings, some need bf16... not sure why this deserves a whole other config setting.

THOSE are in the QUANTIZE program!
This mod is for the CONVERT program. (at least read the thread before commenting) @nisten

@bartowski1182 (Contributor)

Considering the lack of value add and the fact that it's barely any more work to use the already existing quantize option, I don't see this option getting approved

You should also probably be more polite to people like Nisten, who are trying to provide feedback.

@0wwafa (Author) commented Aug 3, 2024

@bartowski1182 Using convert first and then quantize is kind of dumb in this case. Anyway, I just wanted to share an option I added for myself. Since the option already exists in quantize, it would be nice to have the same option in the convert program. That's all.

@0wwafa closed this by deleting the head repository Oct 13, 2024
Labels: python, Review Complexity : Low
6 participants