
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels #17146


Merged: 15 commits merged into vllm-project:main on May 7, 2025

Conversation

@cyang49 (Contributor) commented Apr 25, 2025

As a follow-up to PR #16942:

The CUDA causal conv1d kernel has a similar problem to the mamba2 SSD prefill kernels: it performs poorly when chunked prefill is turned ON, which causes input batches to contain a mix of prefill and decode requests. This PR splits the requests and dispatches each kind to its corresponding kernel, and we observe a large total-throughput improvement for benchmark_serving.py with the ShareGPT V3 workload.
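For intuition, here is a minimal PyTorch sketch (not the PR's CUDA kernels; the function names are made up for illustration) of why the two request types want different code paths: a prefill request runs a depthwise causal conv1d over its whole prompt, while a decode request only needs a single-token update against a small per-request sliding-window state.

```python
import torch
import torch.nn.functional as F

def causal_conv1d_prefill(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Prefill-style: depthwise causal conv over a full sequence.
    x: (channels, seq_len); weight: (channels, kernel_size)."""
    k = weight.shape[-1]
    x_padded = F.pad(x, (k - 1, 0))                      # left-pad so the conv stays causal
    return F.conv1d(x_padded.unsqueeze(0),               # (1, C, L + k - 1)
                    weight.unsqueeze(1),                 # (C, 1, k) depthwise filters
                    groups=weight.shape[0]).squeeze(0)   # (C, L)

def causal_conv1d_decode(x_t: torch.Tensor, state: torch.Tensor,
                         weight: torch.Tensor) -> torch.Tensor:
    """Decode-style: advance one token using the last k inputs kept in `state`.
    x_t: (channels,); state: (channels, kernel_size)."""
    state.copy_(torch.roll(state, shifts=-1, dims=-1))   # slide the window left
    state[:, -1] = x_t                                   # append the new token
    return (state * weight).sum(dim=-1)                  # one output per channel

channels, seq_len, kernel_size = 4, 8, 3
weight = torch.randn(channels, kernel_size)

# Prefill request: process the whole prompt at once.
prefill_out = causal_conv1d_prefill(torch.randn(channels, seq_len), weight)

# Decode request: one new token against the per-request conv state.
conv_state = torch.zeros(channels, kernel_size)
decode_out = causal_conv1d_decode(torch.randn(channels), conv_state, weight)
```

When a batch mixes both kinds of requests, splitting it lets each half take its faster path instead of forcing everything through the general prefill path.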

This PR, chunked prefill ON

vllm serve ibm-ai-platform/Bamba-9B --port 9999 
python benchmarks/benchmark_serving.py --model ibm-ai-platform/Bamba-9B  --dataset-name sharegpt     --dataset-path /net/storage149/mnt/md0/ccyang/github.com/ShareGPT_V3/ShareGPT_V3_unfiltered_cleaned_split.json --ignore-eos --port 9999 
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  47.63     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              21.00     
Output token throughput (tok/s):         4164.34   
Total Token throughput (tok/s):          8682.61   
---------------Time to First Token----------------
Mean TTFT (ms):                          15089.14  
Median TTFT (ms):                        13729.31  
P99 TTFT (ms):                           35227.37  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          56.06     
Median TPOT (ms):                        53.62     
P99 TPOT (ms):                           116.86    
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.57     
Median ITL (ms):                         48.07     
P99 ITL (ms):                            118.30    
==================================================

This PR, chunked prefill OFF

vllm serve ibm-ai-platform/Bamba-9B --port 9999 --no-enable-chunked-prefill --max_model_len=4096
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  52.72     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              18.97     
Output token throughput (tok/s):         3761.94   
Total Token throughput (tok/s):          7843.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          16965.86  
Median TTFT (ms):                        15540.24  
P99 TTFT (ms):                           40363.40  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.57     
Median TPOT (ms):                        63.19     
P99 TPOT (ms):                           259.66    
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.02     
Median ITL (ms):                         58.01     
P99 ITL (ms):                            121.75    
==================================================

Main (d9ac9e3), chunked prefill ON

vllm serve ibm-ai-platform/Bamba-9B --port 9999 
python benchmarks/benchmark_serving.py --model ibm-ai-platform/Bamba-9B  --dataset-name sharegpt     --dataset-path /net/storage149/mnt/md0/ccyang/github.com/ShareGPT_V3/ShareGPT_V3_unfiltered_cleaned_split.json --ignore-eos --port 9999 
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  183.82    
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              5.44      
Output token throughput (tok/s):         1079.03   
Total Token throughput (tok/s):          2249.77   
---------------Time to First Token----------------
Mean TTFT (ms):                          66936.01  
Median TTFT (ms):                        59901.93  
P99 TTFT (ms):                           169459.47 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          255.65    
Median TPOT (ms):                        269.80    
P99 TPOT (ms):                           395.92    
---------------Inter-token Latency----------------
Mean ITL (ms):                           220.89    
Median ITL (ms):                         333.07    
P99 ITL (ms):                            427.35    
==================================================

Main (d9ac9e3), chunked prefill OFF

vllm serve ibm-ai-platform/Bamba-9B --port 9999 --no-enable-chunked-prefill --max_model_len=4096
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  52.14     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              19.18     
Output token throughput (tok/s):         3803.80   
Total Token throughput (tok/s):          7930.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          16630.70  
Median TTFT (ms):                        15313.57  
P99 TTFT (ms):                           39832.65  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.94     
Median TPOT (ms):                        62.02     
P99 TPOT (ms):                           257.02    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.36     
Median ITL (ms):                         56.91     
P99 ITL (ms):                            121.58    
==================================================

## Output Quality

### Bamba-9B

vllm (pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.2623 | ± 0.0121 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.3700 | ± 0.0133 |

### Zamba2-2.7B

vllm (pretrained=Zyphra/Zamba2-2.7B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.5299 | ± 0.0137 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.5436 | ± 0.0137 |

### Mamba-Codestral-7B

vllm (pretrained=mistralai/Mamba-Codestral-7B-v0.1,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.4761 | ± 0.0138 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.4632 | ± 0.0137 |
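These tables are in the usual lm-evaluation-harness format with the vLLM backend. The exact command isn't shown here, but an invocation along these lines (standard lm_eval flags, listed only as an example) would produce this output for the Bamba-9B row:

lm_eval --model vllm --model_args pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True --tasks gsm8k --num_fewshot 5 --batch_size auto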


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@cyang49 (Contributor, Author) commented Apr 25, 2025

Unit tests

================================================== test session starts ===================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /net/storage149/mnt/md0/ccyang/github.com/vllm
configfile: pyproject.toml
plugins: anyio-4.8.0
collected 1570 items                                                                                                     

tests/kernels/mamba/test_causal_conv1d.py ........................................................................ [  4%]
.................................................................................................................. [ 11%]
.................................................................................................................. [ 19%]
.................................................................................................................. [ 26%]
.................................................................................................................. [ 33%]
.................................................................................................................. [ 40%]
.................................................................................................................. [ 48%]
....                                                                                                               [ 48%]
tests/kernels/mamba/test_mamba_mixer2.py ....                                                                      [ 48%]
tests/kernels/mamba/test_mamba_ssm.py ............................................................................ [ 53%]
......................................................................................................s.......s... [ 60%]
.................................................................................................................. [ 68%]
.................................................................................................................. [ 75%]
..............................................                                                                     [ 78%]
tests/kernels/mamba/test_mamba_ssm_ssd.py ........................................................................ [ 82%]
.................................................................................................................. [ 90%]
.................................................................................................................. [ 97%]
..........................................                                                                         [100%]
================================ 1568 passed, 2 skipped, 10 warnings in 881.36s (0:14:41) ================================
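(These were presumably run with something like pytest tests/kernels/mamba/; the exact invocation isn't shown above.)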

@cyang49 force-pushed the pr_mamba2_conv1d_refactor branch 2 times, most recently from 5fe065d to 0bbcced on May 5, 2025 19:24
@cyang49 marked this pull request as ready for review on May 5, 2025 19:25
@cyang49 force-pushed the pr_mamba2_conv1d_refactor branch from 0bbcced to 7376f68 on May 6, 2025 00:39
@tlrmchlsmth (Collaborator) left a comment


This PR looks good to me, nice work.

@tlrmchlsmth added the ready (ONLY add when PR is ready to merge/full CI is needed) label on May 6, 2025
@tlrmchlsmth enabled auto-merge (squash) on May 6, 2025 01:02
@cyang49 (Contributor, Author) commented May 6, 2025

@tlrmchlsmth I'm getting a seemingly unrelated test failure; I've seen it in other people's PRs as well:

[2025-05-06T03:45:25Z] =========================== short test summary info ============================
[2025-05-06T03:45:25Z] FAILED entrypoints/openai/test_audio.py::test_chat_streaming_audio[https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg-fixie-ai/ultravox-v0_5-llama-3_2-1b] - AssertionError: assert 'This audio a...t from a poem' == 'This audio a...a traditional'
[2025-05-06T03:45:25Z]
[2025-05-06T03:45:25Z]   - This audio appears to be a snippet from a traditional
[2025-05-06T03:45:25Z]   ?                                           ^^^^^^^ ^^^
[2025-05-06T03:45:25Z]   + This audio appears to be a snippet from a poem
[2025-05-06T03:45:25Z]   ?                                           ^ ^^
[2025-05-06T03:45:25Z] ====== 1 failed, 420 passed, 2 skipped, 36 warnings in 3645.43s (1:00:45) ======
[2025-05-06T03:45:28Z] 🚨 Error: The command exited with status 1
[2025-05-06T03:45:28Z] user command error: The plugin docker command hook exited with status 1


I'll try rebasing on main and force-pushing, since #17497 has been merged and I need to add new changes.

cyang49 added 15 commits May 6, 2025 08:12
Signed-off-by: Chih-Chieh-Yang <[email protected]> (all commits)
auto-merge was automatically disabled May 6, 2025 12:16

Head branch was pushed to by a user without write access

@cyang49 force-pushed the pr_mamba2_conv1d_refactor branch from 7376f68 to d03ead8 on May 6, 2025 12:16
@cyang49 (Contributor, Author) commented May 6, 2025

@tlrmchlsmth some tests still fail; they don't look related to my changes. Could you have a look? Thank you!

@simon-mo merged commit 18dd5e0 into vllm-project:main on May 7, 2025 (53 of 55 checks passed)
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…uests for Corresponding Kernels (vllm-project#17146)

Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)


3 participants