Vulkan: Support fp32 accumulator in quantized matmul to fix GLM4-32B incoherence #13607
Conversation
I can try to repro this on Monday. I wonder if differences in split_k/stream_k might explain the difference vs other backends? If the partial results are all in range then the resolve can accumulate them at f32.
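To illustrate the split_k point, a minimal sketch (assumed structure, not the actual Vulkan shader): the matmul is split into partial products along K, and the resolve pass sums the partial tiles per element in fp32, so the reduction itself cannot overflow as long as each partial result is in fp16 range.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a split_k resolve: each partial tile was produced by a separate
// slice of the K dimension (possibly with fp16 math), and the resolve sums
// them into the final output using an fp32 accumulator.
void split_k_resolve(const std::vector<std::vector<float>> & partials, // [num_splits][tile_elems]
                     std::vector<float> & out) {                       // [tile_elems]
    for (std::size_t i = 0; i < out.size(); ++i) {
        float acc = 0.0f;                // fp32 accumulator for the resolve
        for (const auto & p : partials) {
            acc += p[i];                 // each partial came from an fp16 computation
        }
        out[i] = acc;
    }
}
```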
@JohannesGaessler @ggerganov Did you see any similar problems with CUDA/Metal? GLM4 seems to produce values outside of the range of 16-bit floats, so 32-bit accumulators are needed, even in tensors that have not yet been set to that precision mode. That they have not yet been set to F32 precision suggests that either the issues weren't noticed (and are covered up by some feature in your backends, like clamping?), or that they don't occur in CUDA/Metal.
In CUDA the only place where FP16 accumulators are potentially used is cuBLAS GEMM, but only if the precision is GGML_PREC_DEFAULT.
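For context, a hedged sketch of how a backend can gate its accumulator choice on the requested precision; the helper name and the exact op_params layout are assumptions for illustration, not the actual CUDA code.

```cpp
#include "ggml.h"

// Hedged sketch: gating the accumulator type on the per-op precision request.
// ggml_mul_mat_set_prec() stores the requested precision on the mul_mat
// result tensor (assumed here to live in op_params[0]); GGML_PREC_F32 demands
// fp32 accumulation, GGML_PREC_DEFAULT leaves the choice to the backend.
static bool use_fp16_accumulators(const struct ggml_tensor * dst, bool backend_prefers_fp16) {
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    return backend_prefers_fp16 && prec == GGML_PREC_DEFAULT;
}
```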
Isn't CUDA using int8 MMA for most of these, then converting to fp32 for the accumulators?
This change just allows f32 accumulation if requested; it doesn't enforce it.
I was concerned that we'd be using the f32 path much more often. But I looked at a few models and it seems like a small number of layers use f32, the ones that do are relatively small, and they go away with FA enabled. So I'm less concerned about performance now. So far I haven't been able to repro the remaining issues with fp16 accumulators. Which exact model are you using and what command line?
Integer accumulators are used within a block of quantized data, then the floating point scales are applied to convert the integers to FP32, then those FP32 partial results are added to the FP32 accumulators of the output tiles.
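As a rough illustration of that scheme (a simplified Q8_0-style block, not the actual kernel): the dot product within each block is accumulated in int32, the block scales convert that integer sum to fp32, and only then is it added to the fp32 output accumulator.

```cpp
#include <cstdint>

// Simplified Q8_0-style block: 32 int8 quants plus one fp32 scale per block.
struct block_q8 {
    float  d;       // per-block scale
    int8_t qs[32];  // quantized values
};

// Dot product of two quantized rows: int32 accumulation within each block,
// scales convert the integer sum to fp32, fp32 accumulation across blocks.
float vec_dot_q8(const block_q8 * x, const block_q8 * y, int nblocks) {
    float acc = 0.0f;                       // fp32 accumulator for the output
    for (int b = 0; b < nblocks; ++b) {
        int32_t isum = 0;                   // integer accumulator within the block
        for (int i = 0; i < 32; ++i) {
            isum += (int32_t) x[b].qs[i] * (int32_t) y[b].qs[i];
        }
        acc += x[b].d * y[b].d * (float) isum;  // apply the fp32 scales
    }
    return acc;
}
```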
I've been using a q4_0 quant of GLM4-32B. I didn't set any special parameters apart from offloading fully. I tested with and without flash attention and the output was always incoherent.
With your change applied, I haven't been able to repro any corruption using the Q4_0 or Q4_K model. I've tried scalar/coopmat1/coopmat2 paths. My command line is:
The prompt is too short. Not sure why exactly, but it only triggers with long prompts. |
For me it starts somewhere between 300 and 380 tokens. |
I was able to repro with a longer prompt. IMO the right thing to do about it is to mark the mul_mat as f32 for this model. It's the mul_mat at the top of llm_graph_context::build_lora_mm. I think Metal always uses fp32 accumulators, and CUDA uses the int8 path with fp32 accumulators, so I think it's expected that this only affects Vulkan?
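For reference, ggml already exposes ggml_mul_mat_set_prec for this. A hedged sketch of what marking that matmul could look like; the helper function and the `is_glm4` flag are illustrative, not the actual llama.cpp change.

```cpp
#include "ggml.h"

// Hedged sketch, not the actual llama.cpp patch: after building the attention
// output matmul (the mul_mat at the top of llm_graph_context::build_lora_mm),
// request fp32 accumulation for architectures known to overflow fp16.
// `is_glm4` stands in for whatever architecture check llama.cpp would use.
static ggml_tensor * build_attn_out(ggml_context * ctx, ggml_tensor * wo,
                                    ggml_tensor * cur, bool is_glm4) {
    ggml_tensor * out = ggml_mul_mat(ctx, wo, cur);
    if (is_glm4) {
        // GLM4 produces intermediate values outside the fp16 range here.
        ggml_mul_mat_set_prec(out, GGML_PREC_F32);
    }
    return out;
}
```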
Generally setting the first mul_mat in
This change LGTM. We can try to change the precision for GLM4 in a separate change.
I think the correct way to think about it is that the CPU implementation is the reference, and it uses F32 accumulators for all operations. So the default for the backends would be to also use F32 accumulators. We should probably add an option in
That is not a good idea for performance reasons. CPUs prefer FP32 calculations, GPUs do not. As Jeff Bolz said:
Isn't that how the current precision selection was created?
If I remember correctly, yes it was. But I guess the primary concern should be to have correct results. Performance considerations come after that.
Indeed, even the Metal backend can be noticeably faster with F16 accumulators (see #10220). But this performance comes with some quality loss as can be seen by the PPL figures there.
Thinking some more about this, if an LLM was trained with BF16 range then the best thing to do technically is to run all computations with BF16. So maybe the correct logic is:
Btw, I'm just thinking out loud here - not sure if this makes complete sense, so feel free to object.
I wonder if it's possible to know the dynamic range of the values for each model beforehand. A model might be trained with bf16, but if it turns out everything fits within -100.0 to 100.0 then it'll look kinda silly to enforce bf16. Practical example: TAESD weight values all fit within the range of -5.0 to 5.0.
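A tiny sketch of what such an offline range check over dequantized weights could look like (hypothetical helper, not an existing llama.cpp tool); note that accumulated dot products can still exceed the weight range, so this would only be a heuristic.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical helper: scan a dequantized weight tensor for its value range,
// e.g. as an offline check of whether a model's weights even come close to
// the fp16/bf16 boundaries. Assumes n > 0.
std::pair<float, float> value_range(const float * data, std::size_t n) {
    float lo = data[0];
    float hi = data[0];
    for (std::size_t i = 1; i < n; ++i) {
        lo = std::min(lo, data[i]);
        hi = std::max(hi, data[i]);
    }
    return {lo, hi};
}
```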
fp16 seems to be working pretty well - other than attention, GLM4 is the only model that needs to force fp32 accumulators? IMO it's unjustified to alter the defaults due to one model. The slowdown for using fp32 accumulators on GeForce is 2x (not even counting the increase in register usage), much larger than what you saw in Metal. If we had to default to fp32, we'd probably need to switch to int8 MMA like CUDA and throw away all existing investment/tuning in the current shaders. And other backends would all be faced with similar questions.
I haven't seen noticeable effects from small drops in perplexity, so I think performance is more important. As long as we properly mark tensors that require F32 accumulation, I think the current system is fine.
Currently we only support fp16 accumulators in quantized matmul, which leads to numerical issues with GLM4-32B. Some of these have been addressed with the model precision parameter, which was being ignored by the Vulkan backend.
However, this is a draft because it solves only part of the problem. It seems there are still tensors that fail with fp16 accumulators and have not been set to GGML_PREC_F32 in llama.cpp. I'm not sure why this only affects the Vulkan backend. Forcing all tensors to run with fp32 accumulation resolves the incoherence.
I don't have a good way of finding all of these tensors. The problem seems to be infinity values in the result. Using the internal results checker of the Vulkan backend gives me the first of the problematic tensors: blk.1.attn_output.weight. The trouble with GLM4 seems to be unusually large values. I'm opening this in the hopes that someone can help me figure out the rest of the problem, especially why this doesn't affect CUDA or Metal.