
Add option to keep output and embed tensors at f16 #8715

Closed
0wwafa wants to merge 3 commits

Conversation

@0wwafa commented Jul 26, 2024

Add option to keep output and embed tensors at f16

Normally this takes two steps: convert, then quantize. With this option it is possible, for example, to convert a model directly to q8_0 while keeping the output and embed tensors at f16.
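For reference, the conventional two-step route this collapses would look roughly like the following (a sketch only: the model path mirrors the example further down, and --output-tensor-type / --token-embedding-type are llama-quantize flags, not options of the convert script):

python3 convert_hf_to_gguf.py --outtype f16 /mnt/i/models/Mistral-7b-Instruct-v0.3 --outfile Mistral-7b-Instruct-v0.3.f16.gguf
llama-quantize --output-tensor-type f16 --token-embedding-type f16 Mistral-7b-Instruct-v0.3.f16.gguf Mistral-7b-Instruct-v0.3.f16.q8_0.gguf q8_0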

Add option to keep output and embed tensors at f16
@github-actions bot added the python label Jul 26, 2024
@0wwafa (Author) commented Jul 27, 2024

Example:

python3 convert_hf_to_gguf.py --z --outtype q8_0 /mnt/i/models/Mistral-7b-Instruct-v0.3  --outfile Mistral-7b-Instruct-v0.3.f16.q8_0.gguf

@JohannesGaessler (Collaborator)

How much of a difference does this make compared to pure q8_0?

@compilade (Collaborator) left a comment

I'm not convinced f16 is measurably better than Q8_0 for the token embeddings and output tensors when the requested type is Q8_0.

If the goal is to keep these tensors unchanged from the original model, why not also offer to use bf16 when appropriate?

--outtype q8_0 in convert_hf_to_gguf.py is intended to give the same result as llama-quantize with q8_0 output.

I could see why this option could be useful, but

  • I expect there would be close to no difference in inference quality.
    • Measuring this could be useful.
  • It should have a better name than --z
  • It should not break Q8_0 conversion for models without output.weight
  • (maybe) it should allow choosing the overridden type of output.weight and token_embd.weight instead of always using f16

@@ -319,7 +321,7 @@ def prepare_tensors(self):
                         assert data.dtype == np.int16
                         data_qtype = gguf.GGMLQuantizationType.BF16

-                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data):
+                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
Collaborator:

This will prevent conversion to Q8_0 altogether for BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet, because they do not have MODEL_TENSOR.OUTPUT in their tensor type list.

Use self.match_model_tensor_name instead.
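A minimal sketch of that suggestion (not the PR's actual code; it assumes match_model_tensor_name(name, key, bid) returns False when the architecture lacks the tensor, and uses a hypothetical keep_f16 flag in place of --z):

import gguf  # gguf-py, as used by convert_hf_to_gguf.py

def keep_at_f16(model, new_name: str, keep_f16: bool) -> bool:
    # Only applies when the (renamed) flag is given.
    if not keep_f16:
        return False
    # Unlike format_tensor_name, match_model_tensor_name simply returns False
    # when the architecture has no such tensor (e.g. no MODEL_TENSOR.OUTPUT
    # for OpenELM), so it cannot raise the ValueError shown in the traceback below.
    return (model.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.OUTPUT, None)
            or model.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.TOKEN_EMBD, None))

# The Q8_0 branch in prepare_tensors() would then read roughly:
#   elif (self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0
#         and gguf.can_quantize_to_q8_0(data)
#         and not keep_at_f16(self, new_name, self.keep_f16)):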

Author:

Feel free to modify my idea and code for your needs. As it is, I can use it, but it would certainly be better if done more properly. I just wish not to use quantize if it's not strictly needed.

Collaborator:

> As it is, I can use it.

Not on the models I mentioned.

> BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet

Try it, and you should see an error with --outtype q8_0, both with and without --z.

$ python3 convert_hf_to_gguf.py models/OpenELM-270M-Instruct/ --dry-run --outtype q8_0 --z
INFO:hf-to-gguf:Loading model: OpenELM-270M-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:blk.0.attn_k_norm.weight,  torch.bfloat16 --> F32, shape = {64}
Traceback (most recent call last):
...
  File "convert_hf_to_gguf.py", line 324, in prepare_tensors
    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "convert_hf_to_gguf.py", line 179, in format_tensor_name
    raise ValueError(f"Missing {key!r} for MODEL_TENSORS of {self.model_arch!r}")
ValueError: Missing <MODEL_TENSOR.OUTPUT: 5> for MODEL_TENSORS of <MODEL_ARCH.OPENELM: 34>

Using self.match_model_tensor_name instead of self.format_tensor_name would avoid this problem (self.format_tensor_name is only intended to be used for tensors which will actually be in the model).

> I just wish not to use quantize if it's not strictly needed.

I understand; reducing the steps needed to get a desired quant is partly why q8_0 conversion was added in #7234.

I think that a more general type override for token_embd.weight and output.weight might be useful for your purpose, but I'm not yet sure how to implement that in a simple enough way. I think it might only be possible after a refactor of the type selection logic, which should be done relatively soon anyway to better support types for ternary models.

The currently proposed flag only has an effect along with --outtype q8_0, which can be confusing UX-wise.
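One possible shape for such an override, purely as a sketch (these flags do not exist in convert_hf_to_gguf.py; the names just mirror llama-quantize's --output-tensor-type / --token-embedding-type, and the type-selection helper is hypothetical):

import argparse
import gguf  # gguf-py package

parser = argparse.ArgumentParser()
parser.add_argument("--outtype", default="f16", choices=("f32", "f16", "bf16", "q8_0"))
parser.add_argument("--output-tensor-type", choices=("f32", "f16", "bf16", "q8_0"),
                    help="override the type of output.weight")
parser.add_argument("--token-embedding-type", choices=("f32", "f16", "bf16", "q8_0"),
                    help="override the type of token_embd.weight")
args = parser.parse_args(["--outtype", "q8_0", "--output-tensor-type", "f16"])

TYPE_MAP = {
    "f32": gguf.GGMLQuantizationType.F32,
    "f16": gguf.GGMLQuantizationType.F16,
    "bf16": gguf.GGMLQuantizationType.BF16,
    "q8_0": gguf.GGMLQuantizationType.Q8_0,
}

def select_type(tensor_name: str) -> gguf.GGMLQuantizationType:
    # A per-tensor override wins; otherwise fall back to --outtype.
    if tensor_name == "output.weight" and args.output_tensor_type:
        return TYPE_MAP[args.output_tensor_type]
    if tensor_name == "token_embd.weight" and args.token_embedding_type:
        return TYPE_MAP[args.token_embedding_type]
    return TYPE_MAP[args.outtype]

print(select_type("output.weight"))        # F16, from the override
print(select_type("blk.0.attn_q.weight"))  # Q8_0, from --outtype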

Author:

Well, I could add an option like in the quantize program... --emdt, --outt, or something like that? I don't know.

Author:

It's also true that if you convert to q8_0 without --z you get a q8_0 as before. If you convert to f16 you don't need --z, and if for some reason you convert to f32, --z has no reason to exist. That's why I used it only for q8_0.

0wwafa and others added 2 commits July 28, 2024 02:03
@0wwafa (Author) commented Jul 27, 2024

@compilade Is it ok now?

@0wwafa (Author) commented Jul 27, 2024

> How much of a difference does this make compared to pure q8_0?

In my own experience (chatting with the models, not using automated tools to judge them) it's way better and more similar to the pure f16. A lot of people using my quantizations say the same.

@bartowski1182 (Contributor)

@compilade I'm with you there. I did MMLU-Pro tests of Gemma and found that, besides a couple of outliers, the f16 output/embed performed worse than Q8_0, but he seems to like making them, so more power to him.

--z is a terrible CLI arg though, please rename it

@0wwafa (Author) commented Jul 28, 2024

> @compilade I'm with you there. I did MMLU-Pro tests of Gemma and found that, besides a couple of outliers, the f16 output/embed performed worse than Q8_0, but he seems to like making them, so more power to him.
>
> --z is a terrible CLI arg though, please rename it

please change it to your liking.

@JohannesGaessler (Collaborator) commented Jul 28, 2024

I did some quick tests with llama-perplexity using LLaMA 3.1 8b.

q8_0 conventional
====== Perplexity statistics ======
Mean PPL(Q)                   :   6.413076 ±   0.039490
Mean PPL(base)                :   6.403059 ±   0.039368
Cor(ln(PPL(Q)), ln(PPL(base))):  99.93%
Mean ln(PPL(Q)/PPL(base))     :   0.001563 ±   0.000231
Mean PPL(Q)/PPL(base)         :   1.001564 ±   0.000231
Mean PPL(Q)-PPL(base)         :   0.010016 ±   0.001484

====== KL divergence statistics ======
Mean    KLD:   0.003026 ±   0.000027
Maximum KLD:   2.694473
99.9%   KLD:   0.067251
99.0%   KLD:   0.020983
Median  KLD:   0.001843
10.0%   KLD:   0.000077
 5.0%   KLD:   0.000019
 1.0%   KLD:   0.000001
Minimum KLD:  -0.000021

====== Token probability statistics ======
Mean    Δp: -0.001 ± 0.004 %
Maximum Δp: 74.656%
99.9%   Δp:  9.160%
99.0%   Δp:  4.755%
95.0%   Δp:  2.479%
90.0%   Δp:  1.530%
75.0%   Δp:  0.366%
Median  Δp:  0.000%
25.0%   Δp: -0.340%
10.0%   Δp: -1.503%
 5.0%   Δp: -2.505%
 1.0%   Δp: -4.942%
 0.1%   Δp: -10.337%
Minimum Δp: -79.540%
RMS Δp    :  1.669 ± 0.022 %
Same top p: 96.944 ± 0.045 %
q8_0 --z
====== Perplexity statistics ======
Mean PPL(Q)                   :   6.418853 ±   0.039522
Mean PPL(base)                :   6.403059 ±   0.039368
Cor(ln(PPL(Q)), ln(PPL(base))):  99.92%
Mean ln(PPL(Q)/PPL(base))     :   0.002464 ±   0.000251
Mean PPL(Q)/PPL(base)         :   1.002467 ±   0.000252
Mean PPL(Q)-PPL(base)         :   0.015793 ±   0.001616

====== KL divergence statistics ======
Mean    KLD:   0.003659 ±   0.000023
Maximum KLD:   1.525696
99.9%   KLD:   0.086219
99.0%   KLD:   0.026490
Median  KLD:   0.002224
10.0%   KLD:   0.000097
 5.0%   KLD:   0.000026
 1.0%   KLD:   0.000002
Minimum KLD:  -0.000019

====== Token probability statistics ======
Mean    Δp: -0.028 ± 0.005 %
Maximum Δp: 78.045%
99.9%   Δp: 10.310%
99.0%   Δp:  5.184%
95.0%   Δp:  2.636%
90.0%   Δp:  1.606%
75.0%   Δp:  0.371%
Median  Δp:  0.000%
25.0%   Δp: -0.401%
10.0%   Δp: -1.673%
 5.0%   Δp: -2.752%
 1.0%   Δp: -5.499%
 0.1%   Δp: -11.852%
Minimum Δp: -47.852%
RMS Δp    :  1.824 ± 0.018 %
Same top p: 96.677 ± 0.047 %

The values for conventional q8_0 seem to be slightly better, but the differences are so small that I am not comfortable drawing any conclusions from this, since the supposed effects could just be artifacts of the measurement. These results definitely do not provide any evidence that FP16 output tensors provide a measurable benefit; my personal opinion is that without such evidence we should just keep the code simple and not add the option.

@bartowski1182 (Contributor)

Oh that's interesting, I didn't know llama-perplexity could also do KLD and token probabilities, that's a ton of extra useful information

@JohannesGaessler (Collaborator)

More info here.
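(For anyone wanting to reproduce this kind of comparison: as far as I understand the tool, you first run llama-perplexity on the unquantized model with --kl-divergence-base to save its logits, then run the quantized model against that file with --kl-divergence; file names below are illustrative.)

llama-perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.dat
llama-perplexity -m model-q8_0.gguf --kl-divergence-base logits-f16.dat --kl-divergence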

@0wwafa (Author) commented Jul 28, 2024

I put --z as an example; obviously that must be changed to whatever @ggerganov likes. So, tell me: do I need to change anything?

@nisten commented Aug 1, 2024

--output-tensor-type f16 --token-embedding-type f16 works fine as is?!

The options are great as is, in my opinion; you can test and tune perplexity exactly as needed. Some models do fine with q3 embeddings, some need bf16... not sure why this deserves a whole other config setting.

@mofosyne added the Review Complexity : Low label Aug 1, 2024
@0wwafa (Author) commented Aug 1, 2024

> --output-tensor-type f16 --token-embedding-type f16 works fine as is?!
>
> The options are great as is, in my opinion; you can test and tune perplexity exactly as needed. Some models do fine with q3 embeddings, some need bf16... not sure why this deserves a whole other config setting.

THOSE are in the QUANTIZE program!
This mod is for the CONVERT program. (at least read the thread before commenting) @nisten

@bartowski1182 (Contributor)

Considering the lack of value add and the fact that it's barely any more work to use the already existing quantize option, I don't see this option getting approved

You should also probably be more polite to people like Nisten, who are trying to provide feedback.

@0wwafa (Author) commented Aug 3, 2024

@bartowski1182 Using convert first and then quantize is kind of dumb in this case. Anyway, I just wanted to share an option I added for myself. Since the option already exists in quantize, it would be nice to have the same option in the convert program. That's all.

@0wwafa closed this by deleting the head repository Oct 13, 2024
Labels: python, Review Complexity : Low
6 participants