[Model] Refactor Mamba2 SSD to improve chunked prefill performance #16942
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Performance results
TL;DR: this PR gives a 1.53x throughput improvement when chunked prefill is ON.
Main (d9ac9e3) with chunked prefill ON
PR with chunked prefill ON
PR with chunked prefill OFF
Output quality
Bamba-9B
Main (d9ac9e3)
This PR
It makes the results slightly better..?
Zamba2-2.7B
Main
This PR
Mamba-Codestral-7B
Main
This PR
Unit test results
Passed after a fix: 182d4ad
Through some small experiments, I am aware that when chunked prefill is ON with Mamba2 models, the same input repeated across a batch can lead to varying generations under greedy decoding. However, when chunked prefill is OFF, the generations are consistent. Does this PR (or some other effort) plan to address this? Thanks!
@prannaykaul No, this PR does not attempt to fix what you described. However, the rerouting of the prefill and decode requests in the mamba2 layer may have an effect on that.
Using the above script, which should be a self-contained qualitative eval of this behaviour, and installing each branch [a928424, pr_mamba2_chunk_prefill_refactor, pr_mamba2_conv1d_refactor] (the first one containing none of your edits), I find the generations to be inconsistent in all 3 branches when chunked_prefill is enabled:
e.g. on pr_mamba2_conv1d_refactor:
whereas when chunked_prefill is disabled, the generations are consistent:
e.g. on pr_mamba2_conv1d_refactor:
Other models such as Bamba also demonstrate the same behavior but tend to require longer greedy generations to see the difference. In pure Mamba2 models (like the Codestral model used), the difference with chunked_prefill enabled tends to be immediate.
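For reference, a greedy-consistency check along these lines can be sketched as follows (this is not the original script; the model name, prompt, and batch size are illustrative assumptions):

# Minimal sketch of a greedy-consistency check under chunked prefill
# (illustrative; not the original eval script).
from vllm import LLM, SamplingParams

# Model name and settings are assumptions for the sketch; flip
# enable_chunked_prefill to False to compare the two modes.
llm = LLM(model="mistralai/Mamba-Codestral-7B-v0.1",
          enable_chunked_prefill=True)

prompt = "Write a short story about a robot learning to paint."
# Greedy decoding over the same prompt repeated across a batch should yield
# identical generations if the forward pass is batch-invariant.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate([prompt] * 8, params)

texts = [o.outputs[0].text for o in outputs]
print("consistent generations:", len(set(texts)) == 1)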
Thanks for the details @prannaykaul
Left a couple of small comments but LGTM
num_prefills = attn_metadata.num_prefills  # #requests
num_decodes = attn_metadata.num_decode_tokens  # #tokens==#requests
num_prefill_tokens = attn_metadata.num_prefill_tokens  # #tokens
Could you explain what is meant by the comments at the ends of these lines?
The comments indicate whether the corresponding variable counts "number of requests" or "number of tokens".
do you want me to change the comments?
Yeah, it would help clarity. Thanks!
Resolved
n_groups = self.n_groups // self.tp_size
A = self.A[:, None, ...][:, :, None].expand(
A_d = self.A[:, None, ...][:, :, None].expand(
What does the suffix _d mean in this code?
oh is it decode?
yes, _p means prefill and _d means decode
Signed-off-by: Chih-Chieh-Yang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
@tlrmchlsmth I think you meant #17146. Closing.
We found that when chunked prefill is enabled, performance is poor for benchmark_serving.py with ShareGPTv3. After some analysis, we identified that chunk_scan_fwd_kernel latency increases linearly with the number of "chunks": while the kernel can process prefill chunks efficiently, each decode request in the mixed batch gives it one full chunk of work to process, even though a decode request contains only a single token.
In this PR, we modify the Mamba2 SSD control flow assuming vLLM v0, where the mixed input batch has prefill chunks that come before decode requests. When processing the input, we split the input tensors at the prefill-decode boundary and invoke the SSD processing functions on each part separately. This way the prefill kernels don't deal with decode requests and can run more efficiently; a rough sketch of the idea is included below.
For V1, Mamba2 SSD will likely require reordering of the batch for this logic to work, and will need some rewriting.
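As a rough illustration of the control-flow change, here is a minimal sketch of the prefill/decode split (the function and variable names are illustrative stand-ins, not the exact vLLM code):

# Minimal sketch of splitting a mixed batch at the prefill/decode boundary
# (illustrative stand-ins, not the exact vLLM implementation).
import torch

def prefill_path(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the chunked-scan path (chunk_scan_fwd) over prefill tokens.
    return x

def decode_path(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the single-token state-update path over decode requests.
    return x

def mamba2_ssd_forward(hidden_states: torch.Tensor,
                       num_prefill_tokens: int,
                       num_decode_tokens: int) -> torch.Tensor:
    # In vLLM v0 the flattened batch lays out all prefill tokens before the
    # single-token decode requests, so one split index separates the two
    # paths and decode requests no longer cost the chunked-scan kernel a
    # full chunk each.
    hs_p, hs_d = torch.split(
        hidden_states, [num_prefill_tokens, num_decode_tokens], dim=0)
    return torch.cat([prefill_path(hs_p), decode_path(hs_d)], dim=0)

# Example: prefill requests totalling 10 tokens plus 3 decode requests.
x = torch.randn(10 + 3, 4096)
y = mamba2_ssd_forward(x, num_prefill_tokens=10, num_decode_tokens=3)
assert y.shape == x.shape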
Known issue: