[Model] Refactor Mamba2 SSD to improve chunked prefill performance #16942


@cyang49 (Contributor) commented Apr 21, 2025

We found that when chunked prefill is enabled, performance is poor when running benchmark_serving.py with ShareGPTv3. After some analysis, we identified that chunk_scan_fwd_kernel latency increases linearly with the number of "chunks". While the kernel can process prefill chunks efficiently, each decode request in a mixed batch gives it one full chunk of work, even though a decode request has only a single token.
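To illustrate the effect, here is a back-of-the-envelope sketch; the chunk size of 256 and the chunking rule are assumptions for illustration, not taken from this PR:

import math

def num_ssd_chunks(num_prefill_tokens: int, num_decodes: int,
                   chunk_size: int = 256) -> int:
    # Chunk boundaries cannot span requests, so every single-token decode
    # request still occupies its own chunk of kernel work.
    return math.ceil(num_prefill_tokens / chunk_size) + num_decodes

# e.g. 2048 prefill tokens + 64 decode requests:
# 8 densely-packed prefill chunks vs. 64 nearly-empty decode chunks.
print(num_ssd_chunks(2048, 64))  # -> 72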

In this PR, we modify the Mamba2 SSD control flow under the vLLM v0 assumption that prefill tokens come before decode requests in a mixed input batch. When processing the input, we split the input tensors at the prefill-decode boundary and invoke the SSD processing functions on each part separately. This way, the prefill kernels never see decode requests and can run more efficiently.
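Conceptually, the split looks like the following (a minimal sketch; the function and tensor names are illustrative, not the PR's actual code):

import torch

def split_prefill_decode(hidden_states: torch.Tensor,
                         num_prefill_tokens: int,
                         num_decode_tokens: int):
    # v0 ordering: all prefill tokens first, then one token per decode request.
    hs_prefill = hidden_states[:num_prefill_tokens]
    hs_decode = hidden_states[num_prefill_tokens:
                              num_prefill_tokens + num_decode_tokens]
    # hs_prefill goes to the chunked SSD prefill kernels;
    # hs_decode goes to the single-token state-update path.
    return hs_prefill, hs_decode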

For V1, the Mamba2 SSD logic will likely require reordering of the batch for this approach to work, and will need some rewriting.

Known issue:


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to the Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@cyang49 (Contributor, Author) commented Apr 22, 2025

Performance results

TL;DR: this PR gives a 1.53x throughput improvement when chunked prefill is ON.

# Server (H100 80GB HBM3), chunked prefill on
vllm serve --model ibm-ai-platform/Bamba-9B --port 9999
# Server with chunked prefill off
vllm serve --model ibm-ai-platform/Bamba-9B --port 9999 --enable-chunked-prefill=False --max-model-len 4096
# Client
python benchmarks/benchmark_serving.py --model ibm-ai-platform/Bamba-9B \
--dataset-name sharegpt --dataset-path ShareGPT_V3/ShareGPT_V3_unfiltered_cleaned_split.json \
--ignore-eos --port 9999

Main (d9ac9e3) with chunked prefill ON

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  187.13    
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              5.34      
Output token throughput (tok/s):         1059.92   
Total Token throughput (tok/s):          2209.93   
---------------Time to First Token----------------
Mean TTFT (ms):                          70934.30  
Median TTFT (ms):                        65586.01  
P99 TTFT (ms):                           172558.46 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          256.07    
Median TPOT (ms):                        272.59    
P99 TPOT (ms):                           421.32    
---------------Inter-token Latency----------------
Mean ITL (ms):                           221.10    
Median ITL (ms):                         331.62    
P99 ITL (ms):                            431.92    
==================================================

PR with chunked prefill ON

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  122.13
Total input tokens:                      215201
Total generated tokens:                  198343
Request throughput (req/s):              8.19
Output token throughput (tok/s):         1624.08
Total Token throughput (tok/s):          3386.19
---------------Time to First Token----------------
Mean TTFT (ms):                          49267.57
Median TTFT (ms):                        45663.25
P99 TTFT (ms):                           108850.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          157.74
Median TPOT (ms):                        156.98
P99 TPOT (ms):                           321.17
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.73
Median ITL (ms):                         166.11
P99 ITL (ms):                            261.08
==================================================

PR with chunked prefill OFF

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  55.00     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              18.18     
Output token throughput (tok/s):         3606.44   
Total Token throughput (tok/s):          7519.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          17530.92  
Median TTFT (ms):                        16364.44  
P99 TTFT (ms):                           41679.02  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.18     
Median TPOT (ms):                        65.49     
P99 TPOT (ms):                           261.05    
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.58     
Median ITL (ms):                         63.76     
P99 ITL (ms):                            126.87    
==================================================

@cyang49 cyang49 force-pushed the pr_mamba2_chunk_prefill_refactor branch from a7a4561 to 94639f0 Compare April 22, 2025 02:15
@cyang49 cyang49 marked this pull request as ready for review April 22, 2025 11:47
@cyang49 (Contributor, Author) commented Apr 22, 2025

Output quality

Bamba-9B

lm_eval --model vllm     --model_args pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9 --batch_size auto --trust_remote_code  --cache_requests true --tasks gsm8k

Main (d9ac9e3)

vllm (pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2487|±  |0.0119|
|     |       |strict-match    |     5|exact_match|↑  |0.3563|±  |0.0132|

This PR

It makes the results slightly better..? (The difference is within the reported stderr, so likely noise.)

vllm (pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2532|±  |0.0120|
|     |       |strict-match    |     5|exact_match|↑  |0.3586|±  |0.0132|

Zamba2-2.7B

lm_eval --model vllm     --model_args pretrained=Zyphra/Zamba2-2.7B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9 --batch_size auto --trust_remote_code  --cache_requests true --tasks gsm8k

Main

vllm (pretrained=Zyphra/Zamba2-2.7B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5292|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.5436|±  |0.0137|

This PR

vllm (pretrained=Zyphra/Zamba2-2.7B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5299|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.5436|±  |0.0137|

Mamba-Codestral-7B

lm_eval --model vllm     --model_args pretrained=mistralai/Mamba-Codestral-7B-v0.1,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9 --batch_size auto --trust_remote_code  --cache_requests true --tasks gsm8k

Main

vllm (pretrained=mistralai/Mamba-Codestral-7B-v0.1,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4761|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.4632|±  |0.0137|

This PR

vllm (pretrained=mistralai/Mamba-Codestral-7B-v0.1,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4761|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.4632|±  |0.0137|

@cyang49 (Contributor, Author) commented Apr 24, 2025

Unit test results

All passed after fix 182d4ad.

pytest -x tests/kernels/mamba
======================================================= test session starts ========================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /net/storage149/mnt/md0/ccyang/github.com/vllm
configfile: pyproject.toml
plugins: anyio-4.8.0
collected 1570 items                                                                                                               

tests/kernels/mamba/test_causal_conv1d.py .................................................................................. [  5%]
............................................................................................................................ [ 13%]
............................................................................................................................ [ 21%]
............................................................................................................................ [ 28%]
............................................................................................................................ [ 36%]
............................................................................................................................ [ 44%]
..........................................................                                                                   [ 48%]
tests/kernels/mamba/test_mamba_mixer2.py ....                                                                                [ 48%]
tests/kernels/mamba/test_mamba_ssm.py ...................................................................................... [ 54%]
............................................................................................s.......s....................... [ 62%]
............................................................................................................................ [ 69%]
............................................................................................................................ [ 77%]
......                                                                                                                       [ 78%]
tests/kernels/mamba/test_mamba_ssm_ssd.py .................................................................................. [ 83%]
............................................................................................................................ [ 91%]
............................................................................................................................ [ 99%]
............                                                                                                                 [100%]

===================================== 1568 passed, 2 skipped, 10 warnings in 762.75s (0:12:42) =====================================

@prannaykaul commented May 4, 2025

Through some small experiments, I have found that when chunked prefill is ON with Mamba2 models, the same input repeated across a batch can lead to varying generations under greedy decoding. When chunked prefill is OFF, however, the generations are consistent. Does this PR (or some other effort) plan to address this?

Thanks!

@cyang49 (Contributor, Author) commented May 5, 2025

Through some small experiments, I have found that when chunked prefill is ON with Mamba2 models, the same input repeated across a batch can lead to varying generations under greedy decoding. When chunked prefill is OFF, however, the generations are consistent. Does this PR (or some other effort) plan to address this?

Thanks!

@prannaykaul No, this PR does not attempt to fix what you described. However, the rerouting of the prefill and decode requests in the mamba2 layer may have an effect on that.

@prannaykaul

import sys

from vllm import LLM, SamplingParams

if __name__ == '__main__':
    chunked_prefill_toggle = bool(int(sys.argv[1]))

    model_name = "mistralai/Mamba-Codestral-7B-v0.1"
    llm = LLM(
        model=model_name,
        tensor_parallel_size=2,
        enable_chunked_prefill=chunked_prefill_toggle,
        max_model_len=4096,
        enforce_eager=True,
        seed=42,
    )

    # Define prompts to test
    prompts = [
        "A special magic number is hidden within the following text. Make sure to memorize it. I will quiz you about the number afterwards.\nOne of the special magic numbers for witty-transcript is: 8374199.\nOne of the special magic numbers for crabby-kielbasa is: 8864697.\nOne of the special magic numbers for dusty-act is: 2062365.\nOne of the special magic numbers for orange-carpeting is: 4608251.\nOne of the special magic numbers for equable-chives is: 8430600.\nOne of the special magic numbers for wistful-moai is: 4645184.\nOne of the special magic numbers for short-shelter is: 4123052.\nOne of the special magic numbers for adorable-coincidence is: 6466396.\nOne of the special magic numbers for easy-range is: 3894805.\nOne of the special magic numbers for detailed-square is: 4591976.\nOne of the special magic numbers for modern-crop is: 3872134.\nOne of the special magic numbers for chivalrous-osmosis is: 8165549.\nOne of the special magic numbers for imperfect-cliff is: 2264379.\nOne of the special magic numbers for terrible-napkin is: 6086827.\nOne of the special magic numbers for foamy-bathroom is: 3413787.\nOne of the special magic numbers for foamy-edger is: 3794477.\nOne of the special magic numbers for immense-sycamore is: 7101744.\nOne of the special magic numbers for wasteful-fridge is: 5102384.\nOne of the special magic numbers for smoggy-age is: 4408903.\nOne of the special magic numbers for idiotic-solitaire is: 6168412.\nOne of the special magic numbers for gorgeous-seaside is: 8010785.\nOne of the special magic numbers for crazy-accusation is: 9205790.\nOne of the special magic numbers for worthless-vessel is: 8768311.\nOne of the special magic numbers for pretty-menorah is: 4671962.\nOne of the special magic numbers for dirty-radio is: 7594818.\nOne of the special magic numbers for harsh-earring is: 5146242.\nOne of the special magic numbers for acceptable-collard is: 6580507.\nOne of the special magic numbers for gleaming-precipitation is: 5259672.\nOne of the special magic numbers for smoggy-alcohol is: 9357022.\nOne of the special magic numbers for subdued-microphone is: 2954191.\nOne of the special magic numbers for earthy-belly is: 8496508.\nOne of the special magic numbers for stereotyped-choice is: 8361775.\nOne of the special magic numbers for abandoned-excursion is: 4133895.\nOne of the special magic numbers for aspiring-particular is: 4047623.\nOne of the special magic numbers for skinny-thongs is: 3924781.\nOne of the special magic numbers for numerous-blast is: 2172433.\nOne of the special magic numbers for disagreeable-fringe is: 1930087.\nOne of the special magic numbers for incompetent-employ is: 5135405.\nOne of the special magic numbers for unadvised-cocoa is: 8860310.\nOne of the special magic numbers for repulsive-infancy is: 9350035.\nOne of the special magic numbers for woebegone-liberty is: 2887656.\nOne of the special magic numbers for fallacious-speakerphone is: 9226330.\nOne of the special magic numbers for wet-brassiere is: 6484082.\nOne of the special magic numbers for dapper-trove is: 9706450.\nOne of the special magic numbers for freezing-hearthside is: 3192250.\nOne of the special magic numbers for defiant-junk is: 1471076.\nOne of the special magic numbers for unsightly-pouch is: 2653502.\nOne of the special magic numbers for damaged-maelstrom is: 7112061.\nOne of the special magic numbers for maniacal-tract is: 1910845.\nOne of the special magic numbers for scandalous-goodnight is: 4161506.\nOne of the special magic numbers for 
drunk-commitment is: 1631781.\nOne of the special magic numbers for ancient-wall is: 5451189.\nOne of the special magic numbers for secretive-boogeyman is: 9203143.\nOne of the special magic numbers for stingy-speakerphone is: 8811505.\nOne of the special magic numbers for ratty-break is: 9446669.\nOne of the special magic numbers for quarrelsome-grace is: 8354627.\nOne of the special magic numbers for jumbled-singing is: 4533042.\nOne of the special magic numbers for volatile-baboon is: 7244360.\nOne of the special magic numbers for unadvised-velocity is: 9629101.\nOne of the special magic numbers for labored-rugby is: 4098017.\nOne of the special magic numbers for foamy-raiment is: 9975635.\nOne of the special magic numbers for makeshift-issue is: 2021413.\nOne of the special magic numbers for aloof-wood is: 4388917.\nOne of the special magic numbers for absurd-home is: 5666250.\nOne of the special magic numbers for historical-illegal is: 4964243.\nOne of the special magic numbers for macabre-icon is: 2375897.\nOne of the special magic numbers for slimy-timpani is: 6381025.\nOne of the special magic numbers for coherent-fate is: 8721122.\nOne of the special magic numbers for better-bob is: 6819805.\nOne of the special magic numbers for discreet-lasagna is: 4524077.\nOne of the special magic numbers for wry-survey is: 2569831.\nOne of the special magic numbers for acoustic-cutover is: 3503418.\nOne of the special magic numbers for thankful-response is: 3809849.\nOne of the special magic numbers for yielding-buck is: 6842624.\nOne of the special magic numbers for ugly-soil is: 1403155.\nOne of the special magic numbers for capable-equipment is: 6079984.\nOne of the special magic numbers for equable-shoe is: 9018984.\nOne of the special magic numbers for grumpy-peasant is: 9579549.\nOne of the special magic numbers for enchanting-gran is: 1951077.\nOne of the special magic numbers for subsequent-paperwork is: 4293914.\nOne of the special magic numbers for abhorrent-bead is: 7972349.\nOne of the special magic numbers for parched-burning is: 2386600.\nOne of the special magic numbers for faded-brown is: 9386347.\nOne of the special magic numbers for capable-suspect is: 2932234.\nOne of the special magic numbers for decorous-intervenor is: 7674303.\nOne of the special magic numbers for anxious-step is: 3532214.\nOne of the special magic numbers for youthful-mixture is: 3546569.\nOne of the special magic numbers for condemned-step is: 4477004.\nOne of the special magic numbers for didactic-mortgage is: 2981040.\nOne of the special magic numbers for vast-whistle is: 8723303.\nOne of the special magic numbers for miscreant-waist is: 8827457.\nOne of the special magic numbers for truculent-bonsai is: 3342687.\nOne of the special magic numbers for assorted-cation is: 2932163.\nOne of the special magic numbers for stimulating-tonight is: 2497141.\nOne of the special magic numbers for confused-epoch is: 2890953.\nOne of the special magic numbers for lowly-tune is: 7756122.\nOne of the special magic numbers for clean-commodity is: 3172691.\nOne of the special magic numbers for earthy-gesture is: 1702145.\nOne of the special magic numbers for crabby-loft is: 1535484.\nOne of the special magic numbers for obscene-turtle is: 3281479.\nOne of the special magic numbers for workable-retention is: 6753560.\nOne of the special magic numbers for typical-wraparound is: 9877751.\nOne of the special magic numbers for nosy-worshiper is: 5172054.\nOne of the special magic numbers for envious-proportion is: 
4591821.\nOne of the special magic numbers for alike-salon is: 5925995.\nOne of the special magic numbers for sleepy-uniformity is: 4228684.\nOne of the special magic numbers for nifty-feng is: 7380100.\nOne of the special magic numbers for instinctive-weather is: 4496403.\nOne of the special magic numbers for romantic-sampan is: 1224966.\nOne of the special magic numbers for truculent-driver is: 4000398.\nOne of the special magic numbers for onerous-switch is: 6057746.\nOne of the special magic numbers for receptive-anterior is: 2738615.\nOne of the special magic numbers for defiant-warrant is: 5731929.\nOne of the special magic numbers for dusty-rehospitalisation is: 2685416.\nOne of the special magic numbers for cagey-gator is: 5772286.\nOne of the special magic numbers for acoustic-bead is: 7623371.\nOne of the special magic numbers for tall-antigen is: 9804957.\nOne of the special magic numbers for dazzling-sneaker is: 6656532.\nOne of the special magic numbers for abnormal-silo is: 6480097.\nOne of the special magic numbers for damaged-hive is: 3518731.\nOne of the special magic numbers for squealing-chemical is: 5810245.\nOne of the special magic numbers for longing-document is: 3161029.\nOne of the special magic numbers for absurd-chipmunk is: 4248973.\nOne of the special magic numbers for warm-blinker is: 7152661.\nOne of the special magic numbers for jumpy-painter is: 7383657.\nOne of the special magic numbers for berserk-offence is: 1020737.\nOne of the special magic numbers for famous-elver is: 6792145.\nOne of the special magic numbers for cultured-jewelry is: 9330861.\nOne of the special magic numbers for absurd-camp is: 2630209.\nOne of the special magic numbers for flagrant-toll is: 8612398.\nOne of the special magic numbers for cooperative-plover is: 3380334.\nOne of the special magic numbers for grubby-lining is: 4200361.\nOne of the special magic numbers for nasty-volunteer is: 7023290.\nOne of the special magic numbers for scary-ligand is: 6480092.\nOne of the special magic numbers for royal-billing is: 7152047.\nOne of the special magic numbers for lyrical-step-father is: 6149685.\nOne of the special magic numbers for curved-bulk is: 4556981.\nOne of the special magic numbers for addicted-hint is: 5351292.\nOne of the special magic numbers for blushing-percentage is: 3620107.\nOne of the special magic numbers for adamant-protection is: 3363000.\nOne of the special magic numbers for placid-people is: 1902872.\nOne of the special magic numbers for narrow-cuff-link is: 3065490.\nOne of the special magic numbers for late-distributor is: 3717134.\nOne of the special magic numbers for phobic-legacy is: 4872295.\nOne of the special magic numbers for selfish-conifer is: 9302324.\nOne of the special magic numbers for guarded-missile is: 2011816.\nOne of the special magic numbers for distinct-pasta is: 2032110.\nOne of the special magic numbers for oafish-running is: 2432331.\nOne of the special magic numbers for nifty-bakeware is: 8150430.\nOne of the special magic numbers for spiritual-fennel is: 1375480.\nOne of the special magic numbers for modern-range is: 4828581.\nWhat is the special magic number for anxious-step mentioned in the provided text?\nThe special magic number for anxious-step mentioned in the provided text is",
    ]

    prompts *= 8

    print(f"BEGIN TESTING. Chunked prefill is {chunked_prefill_toggle}")

    # Configure greedy decoding by setting temperature to 0
    sampling_params = SamplingParams(temperature=0, max_tokens=100)

    # Generate completions
    outputs = llm.generate(prompts, sampling_params)

    # Print results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        partial_prompt = '\n'.join(prompt.split('\n')[-2:])
        print(f"Prompt: {partial_prompt}")
        print(f"Generated text: {generated_text}")
        print("-" * 50)

The above script should be a self-contained qualitative eval of this behaviour.

I installed each of the branches [a928424, pr_mamba2_chunk_prefill_refactor, pr_mamba2_conv1d_refactor], with the first one containing none of your edits, and I find the generations to be inconsistent in all 3 branches when chunked_prefill is enabled:

python script.py 1

e.g. on pr_mamba2_conv1d_refactor:

BEGIN TESTING. Chunked prefill is True
...
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  3532214.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  3532214.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------

whereas when chunked_prefill is disabled, the generations are consistent:

python script.py 0

e.g. on pr_mamba2_conv1d_refactor:

BEGIN TESTING. Chunked prefill is False
...
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------
Prompt: What is the special magic number for anxious-step mentioned in the provided text?
The special magic number for anxious-step mentioned in the provided text is
Generated text:  4964243.
--------------------------------------------------

Other models such as Bamba demonstrate the same behavior but tend to require longer greedy generations before the difference appears. In pure Mamba2 models (like the Codestral model used here), the difference with chunked_prefill enabled tends to show up immediately.

@cyang49 (Contributor, Author) commented May 5, 2025

Thanks for the details @prannaykaul.
Since there are so many layers involved, getting to the bottom of it could be quite difficult. I won't attempt to fix it in the performance optimization PRs I submitted, and as you've shown, the problem exists on the main branch and is not introduced by my changes. I'll bring this to some of the colleagues who work on the mamba2 implementation and see if it gets picked up. Also, maybe you can open an issue if you haven't?
cc @tlrmchlsmth

@tlrmchlsmth (Collaborator) left a comment

Left a couple of small comments but LGTM

Comment on lines 394 to 396
num_prefills = attn_metadata.num_prefills # #requests
num_decodes = attn_metadata.num_decode_tokens # #tokens==#requests
num_prefill_tokens = attn_metadata.num_prefill_tokens # #tokens
Collaborator:

Could you explain what is meant by the comments at the ends of these lines?

Contributor Author:

The comments indicate whether the corresponding variable counts "number of requests" or "number of tokens".
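For example, a clarified version might read something like this (hypothetical wording, not necessarily the comments that landed):

num_prefills = attn_metadata.num_prefills              # number of prefill requests
num_decodes = attn_metadata.num_decode_tokens          # number of decode tokens; equals the
                                                       # number of decode requests, since each
                                                       # decode step has exactly one token
num_prefill_tokens = attn_metadata.num_prefill_tokens  # total number of prefill tokens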

Contributor Author:

Do you want me to change the comments?

Collaborator:

Yeah, it would help clarity. Thanks!

Contributor Author:

Resolved

n_groups = self.n_groups // self.tp_size
- A = self.A[:, None, ...][:, :, None].expand(
+ A_d = self.A[:, None, ...][:, :, None].expand(
Collaborator:

What does the suffix _d mean in this code?

Collaborator:

oh is it decode?

Contributor Author:

Yes, _p means prefill and _d means decode.

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label May 5, 2025
@cyang49 cyang49 force-pushed the pr_mamba2_chunk_prefill_refactor branch from 0b19ed4 to bdf4e64 Compare May 6, 2025 00:38
cyang49 added 4 commits May 6, 2025 08:12
Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Chih-Chieh-Yang <[email protected]>
@cyang49 cyang49 force-pushed the pr_mamba2_chunk_prefill_refactor branch from bdf4e64 to 20452d3 Compare May 6, 2025 12:15

mergify bot commented May 7, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @cyang49.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 7, 2025
@tlrmchlsmth (Collaborator) left a comment

@cyang49 could you merge in latest main? #16942 landed so this one has some merge conflicts.

@cyang49 (Contributor, Author) commented May 7, 2025

@cyang49 could you merge in latest main? #16942 landed so this one has some merge conflicts.

@tlrmchlsmth I think you meant #17146.
In fact, #17146 branched from this one and adds other improvements. Since that one got merged first, this one should be closed.
I'll run a few tests on main to make sure things are working properly.

Closing

@cyang49 cyang49 closed this May 7, 2025