Feature matrix
| | CPU (AVX/AVX2) | CPU (ARM NEON) | Metal | CUDA | ROCm | SYCL | Vulkan | Kompute |
|---|---|---|---|---|---|---|---|---|
| K-quants | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ 🐢⁵ | 🚫 |
| I-quants | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ | ✅ | Partial¹ | ✅ 🐢⁴ | 🚫 |
| Parallel Multi-GPU⁶ | N/A | N/A | N/A | ✅ | ✅ | Sequential only | Sequential only | ❓ |
| K cache quants | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | ✅ | 🚫 |
| MoE architecture | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | ✅ | 🚫 |
| Flash Attention | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | Partial⁷ | 🚫 |
- ✅: feature works
- 🚫: feature does not work
- ❓: unknown, please contribute if you can test it yourself
- 🐢: feature is slow
- ¹: IQ3_S and IQ1_S, see #5886
- ²: Only with `-ngl 0`
- ³: Inference is 50% slower
- ⁴: Slower than K-quants of comparable size
- ⁵: Generally the CUDA or ROCm backends are faster, though there are cases where Vulkan has faster text generation. See #10879 for benchmarks.
- ⁶: By default, all GPU backends can utilize multiple devices by running them sequentially. The CUDA code (which is also used for ROCm via HIP) can additionally run GPUs in parallel via `--split-mode row` (see the example invocation after these notes). However, this path is relatively poorly optimized and is only faster when the interconnect is fast relative to the speed of a single GPU.
- ⁷: This is only implemented for coopmat2. Otherwise the Flash Attention ops will run slowly on the CPU instead.
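To make footnote ⁶ concrete, below is a minimal sketch of the relevant CLI flags, assuming a CUDA or HIP build of llama.cpp. The model path and the `-ngl 99` value are placeholders for illustration, not defaults.

```sh
# Sketch only: assumes a CUDA/HIP build and a placeholder model path.
# Default multi-GPU behavior: work is split across devices and executed
# sequentially (no extra flag needed).
./llama-cli -m models/model.gguf -ngl 99

# Footnote 6: run the GPUs in parallel by splitting tensors by rows.
# Only worthwhile when the interconnect is fast relative to a single GPU.
./llama-cli -m models/model.gguf -ngl 99 --split-mode row
```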