This change introduces a llamafile_mixmul() API that allows tinyBLAS to
speed up "Mixture of Experts" models. On my Threadripper, Mixtral's 8x7b
F16 weights now process prompts 2x faster. I'm also seeing a 60 percent
improvement with Mixtral 8x22b Q4_0. The same applies to Q8_0, which is
also supported by tinyBLAS. MoE models spend the majority of their time
inside MUL_MAT_ID rather than MUL_MAT, which is why llamafile_sgemm was
not able to help them before. llamafile_mixmul works by decomposing the
mixmul operation into sgemm calls.