[RFC]: Native support for Mamba, SSM, and hybrid transformer models in vLLM V1 #17140

Open · 4 of 9 tasks
tlrmchlsmth opened this issue Apr 24, 2025 · 3 comments
tlrmchlsmth (Collaborator) commented Apr 24, 2025

Motivation.

Mamba, SSM, and hybrid transformer models are an important path toward models that scale linearly with sequence length. vLLM currently supports many models of this class (Jamba, Mamba, Codestral Mamba, Falcon Mamba, Bamba, Zamba2, MinimaxText01, Plamo2) and should continue to maintain excellent support for them.

The Problem
SSM models are generally less well supported than transformers in vLLM and have several deficiencies.
This RFC proposes several improvements to SSM models (some already in progress) and will also serve as an issue tracker.

The major issue is that SSM models are not supported in vLLM V1; they should be supported before V0 is deprecated.
In addition:

  • SSM state management is somewhat hacky and is handled by the model definition itself.
  • Since the SSM state is not managed by the block manager, SSM models are incompatible with prefix caching, KV cache offloading, and prefill-decode disaggregation (see the sketch after this list).
  • There are major performance issues with chunked prefill.
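
To make the second bullet concrete, below is a minimal, purely illustrative sketch (invented names, not vLLM's actual classes) of what model-managed SSM state looks like: each request needs exactly one fixed-size slot per Mamba layer, updated in place on every decode step, and because that slot is allocated outside the block manager, block-manager features like prefix caching cannot see it.

```python
# Illustrative only: per-layer SSM state kept outside the block manager.
import torch

class ToySSMStateCache:
    """Fixed-size per-request state for one Mamba layer (hypothetical)."""

    def __init__(self, max_requests: int, d_inner: int, d_conv: int, d_state: int):
        # Rolling input window for the causal conv1d.
        self.conv_state = torch.zeros(max_requests, d_inner, d_conv)
        # Recurrent SSM hidden state.
        self.ssm_state = torch.zeros(max_requests, d_inner, d_state)
        self.slot_of_request: dict[str, int] = {}
        self.free_slots = list(range(max_requests))

    def slot(self, request_id: str) -> int:
        # Unlike paged KV blocks, each request needs exactly one fixed-size
        # slot that is updated in place at every decode step.
        if request_id not in self.slot_of_request:
            self.slot_of_request[request_id] = self.free_slots.pop()
        return self.slot_of_request[request_id]

    def release(self, request_id: str) -> None:
        self.free_slots.append(self.slot_of_request.pop(request_id))
```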

Proposed Change.

Blockers for SSM and hybrid model support in vLLM V1

  • Hybrid Allocator: [RFC]: Hybrid Memory Allocator #11382 (initial work is targeted towards sliding-window attention)
  • Once the hybrid allocator has landed, extend it to support SSM and hybrid models (a rough sketch of the idea follows this list)
  • torch.compile support (needed for piecewise CUDA graphs)
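
For the second bullet, here is a rough, hypothetical sketch of the direction: the allocator would receive one cache spec per layer group instead of assuming a single uniform KV block shape. All class and field names below are invented for illustration and do not reflect vLLM's or RFC #11382's actual interfaces.

```python
# Hypothetical sketch: size memory per layer group rather than per uniform KV block.
from dataclasses import dataclass

@dataclass
class FullAttentionSpec:
    block_size: int      # tokens per KV block
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int

    def bytes_per_block(self) -> int:
        # K and V for every token in the block.
        return 2 * self.block_size * self.num_kv_heads * self.head_dim * self.dtype_bytes

@dataclass
class MambaSpec:
    d_inner: int
    d_conv: int
    d_state: int
    dtype_bytes: int

    def bytes_per_request(self) -> int:
        # One fixed-size state slot per request, independent of sequence length.
        return self.d_inner * (self.d_conv + self.d_state) * self.dtype_bytes

# A hybrid model would hand the allocator one spec per layer group, e.g.
# {"attention": FullAttentionSpec(...), "mamba": MambaSpec(...)}, and the
# allocator would carve the GPU memory pool accordingly.
```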

Other improvements

Feedback Period.

No response

CC List.

@fabianlim @cyang49 @mzusman @yury-tokpanov

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
cyang49 (Contributor) commented Apr 25, 2025

Added the draft PR #17146 for the causal_conv1d refactor mentioned by @tlrmchlsmth
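
For context on what that op computes, below is a hedged eager-PyTorch reference of the causal_conv1d prefill semantics (a depthwise convolution with causal left padding, typically followed by SiLU in Mamba); the actual kernel code lives in the linked PR, and the function name here is just for illustration.

```python
# Reference semantics only; not the code in PR #17146.
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x: torch.Tensor, weight: torch.Tensor,
                      bias: torch.Tensor | None = None) -> torch.Tensor:
    """x: (batch, dim, seqlen); weight: (dim, width) depthwise filters."""
    dim, width = weight.shape
    # Left-pad by width - 1 so position t only sees inputs at positions <= t.
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=width - 1, groups=dim)
    out = out[..., : x.shape[-1]]
    return F.silu(out)  # Mamba typically applies SiLU after the conv
```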

thoangtrvn commented Apr 25, 2025

We're testing the two Triton kernels implemented for causal_conv1d, covering the decode-only and mixed prefill-decode input modes, to extend support for Mamba-based models beyond CUDA GPUs. I am working with @cyang49 and @fabianlim to test this integration.
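
As a reference point for the decode-only mode, here is a sketch of the single-token update such a kernel has to compute; shapes and the function name are illustrative, not the PR's exact interface.

```python
# Illustrative decode-step semantics: roll the conv window and take a depthwise dot product.
import torch
import torch.nn.functional as F

def causal_conv1d_update_ref(x: torch.Tensor, conv_state: torch.Tensor,
                             weight: torch.Tensor,
                             bias: torch.Tensor | None = None) -> torch.Tensor:
    """x: (batch, dim) new token; conv_state: (batch, dim, width) rolling window;
    weight: (dim, width). Updates conv_state in place and returns (batch, dim)."""
    # Shift the window left by one and append the new input.
    conv_state.copy_(torch.roll(conv_state, shifts=-1, dims=-1))
    conv_state[..., -1] = x
    # Depthwise dot product over the window, i.e. one causal conv output per channel.
    out = (conv_state * weight.unsqueeze(0)).sum(dim=-1)
    if bias is not None:
        out = out + bias
    return F.silu(out)
```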

@thoangtrvn

We have raised a PR for Triton-only backends at issue #18218.

We have shown that the Triton-only backends achieve high performance on Bamba. This also brings us one step closer to the vLLM V1 design, in which prefill and decode requests are not treated separately.
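
A tiny illustration (not vLLM code) of why a unified path helps: in a varlen batch described by cumulative sequence lengths, a decode request is just a length-1 sequence handled by the same kernel launch as the prefills.

```python
# Illustrative mixed batch: two prefills and one decode in a single varlen launch.
import torch

seq_lens = [7, 3, 1]  # 7- and 3-token prefills, plus one decode token
cu_seqlens = torch.tensor([0, 7, 10, 11], dtype=torch.int32)  # prefix sums of seq_lens

# One kernel walks cu_seqlens and handles every request, rather than
# dispatching to separate prefill and decode code paths.
assert cu_seqlens[-1].item() == sum(seq_lens)
```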
