We're testing the two implemented Triton kernels for causal-conv-1d, covering both the decode-only and the prefill-decode input modes, to extend support for Mamba-based models beyond CUDA GPUs. I am working with @cyang49 and @fabianlim to test this integration.
We have raised a PR for the Triton-only backends in #18218.
We have shown that the Triton-only backends achieve high performance on Bamba. This is also one step closer to complying with the vLLM V1 design, where prefill requests and decode requests are not treated separately.
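For readers unfamiliar with the kernel in question, here is a minimal PyTorch reference of what a causal-conv-1d decode step computes: each request carries a small rolling window of its last few inputs (the convolution state), and a single decode step reduces to a depthwise dot product over that window. This is an illustrative sketch only, not the kernels from the PR; the function and tensor names are made up for the example.

```python
import torch


def causal_conv1d_decode_ref(x, conv_state, weight, bias=None):
    """Reference for one decode step of a depthwise causal conv1d.

    x:          (batch, dim)         features of the single new token
    conv_state: (batch, dim, width)  rolling window of the last `width` inputs
    weight:     (dim, width)         one filter per channel (depthwise)
    bias:       (dim,) or None
    Returns the output for the new token; conv_state is updated in place.
    """
    # Shift the rolling window left by one position and append the new input.
    conv_state.copy_(torch.roll(conv_state, shifts=-1, dims=-1))
    conv_state[:, :, -1] = x
    # A depthwise causal convolution at decode time is a dot product between
    # each channel's filter and that channel's window of recent inputs.
    out = (conv_state * weight.unsqueeze(0)).sum(dim=-1)
    if bias is not None:
        out = out + bias
    return out


# Tiny usage example with random tensors.
batch, dim, width = 2, 8, 4
x = torch.randn(batch, dim)
conv_state = torch.zeros(batch, dim, width)
weight = torch.randn(dim, width)
print(causal_conv1d_decode_ref(x, conv_state, weight).shape)  # torch.Size([2, 8])
```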
Motivation.
Mamba, SSM, and hybrid transformer models are an important path toward models that scale linearly with sequence length. vLLM currently supports many models of this class (Jamba, Mamba, Codestral Mamba, Falcon Mamba, Bamba, Zamba2, MinimaxText01, Plamo2) and should continue to maintain excellent support for them.
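As a quick illustration (not part of the proposal itself), these models run through the standard vLLM entry points. The snippet below uses the offline LLM API; the checkpoint name is only one possible example of a supported Mamba-family model.

```python
from vllm import LLM, SamplingParams

# Example checkpoint only; any supported Mamba / SSM / hybrid model id works here.
llm = LLM(model="state-spaces/mamba-130m-hf")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(
    ["State space models scale linearly with sequence length because"],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```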
The Problem
SSM models are generally less well supported than transformers in vLLM and have several deficiencies.
This RFC proposes several improvements (some already in progress) to SSM models and will additionally serve as an issue tracker.
The major issue is that SSM models are not supported in vLLM V1; they should be supported before V0 is deprecated.
In addition:

- Since the SSM state is not managed by the block manager, SSM models are incompatible with prefix caching, KV cache offloading, and prefill-decode disaggregation (see the sketch below).
- There are major performance issues with chunked prefill.
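To make the first point above concrete, here is a deliberately simplified, hypothetical sketch (these are not vLLM's actual classes): attention KV cache lives in shared, block-addressed storage indexed by per-request block tables, which is what prefix caching, offloading, and disaggregation operate on, whereas SSM/conv state is a fixed-size per-request tensor that never enters the block manager.

```python
import torch

# Hypothetical, simplified structures for illustration only (not vLLM classes).


class PagedKVCache:
    """Attention KV cache: fixed-size blocks plus a per-request block table.
    Because state lives in shared, addressable blocks, a block manager can
    reuse blocks for prefix caching, offload them to CPU, or ship them to
    another instance for prefill-decode disaggregation."""

    def __init__(self, num_blocks: int, block_size: int, head_dim: int):
        self.blocks = torch.zeros(num_blocks, block_size, head_dim)
        self.block_tables: dict[str, list[int]] = {}  # request_id -> block ids


class SSMStateCache:
    """Mamba-style recurrent state: one fixed-size tensor per request, keyed
    by request id and held outside the block manager, so block-level features
    have nothing to index, share, or transfer."""

    def __init__(self, d_inner: int, d_state: int):
        self.d_inner, self.d_state = d_inner, d_state
        self.ssm_state: dict[str, torch.Tensor] = {}

    def allocate(self, request_id: str) -> None:
        self.ssm_state[request_id] = torch.zeros(self.d_inner, self.d_state)
```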
Proposed Change.
Blockers for SSM and hybrid model support in vLLM V1
Other improvements
Feedback Period.
No response
CC List.
@fabianlim @cyang49 @mzusman @yury-tokpanov
Any Other Things.
No response
Before submitting a new issue...