
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels #17146


Merged: 15 commits merged into vllm-project:main on May 7, 2025

Conversation

@cyang49 (Contributor) commented Apr 25, 2025

As a follow-up to PR #16942:

The CUDA causal conv1d kernel has a similar problem to the mamba2 SSD prefill kernels: it performs poorly when chunked prefill is turned ON, which causes input batches to contain a mix of prefill and decode requests. This PR splits the requests and dispatches each kind to its corresponding kernel, and we observe a large total-throughput improvement for benchmark_serving.py with the ShareGPT V3 workload.
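For intuition, here is a minimal PyTorch sketch (not the PR's CUDA kernels; the function names are made up for illustration) of why the two request types want different code paths: a prefill request runs a depthwise causal conv1d over its whole prompt, while a decode request only needs a single-token update against a small per-request sliding-window state.

```python
import torch
import torch.nn.functional as F

def causal_conv1d_prefill(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Prefill-style: depthwise causal conv over a full sequence.
    x: (channels, seq_len); weight: (channels, kernel_size)."""
    k = weight.shape[-1]
    x_padded = F.pad(x, (k - 1, 0))                      # left-pad so the conv stays causal
    return F.conv1d(x_padded.unsqueeze(0),               # (1, C, L + k - 1)
                    weight.unsqueeze(1),                 # (C, 1, k) depthwise filters
                    groups=weight.shape[0]).squeeze(0)   # (C, L)

def causal_conv1d_decode(x_t: torch.Tensor, state: torch.Tensor,
                         weight: torch.Tensor) -> torch.Tensor:
    """Decode-style: advance one token using the last k inputs kept in `state`.
    x_t: (channels,); state: (channels, kernel_size)."""
    state.copy_(torch.roll(state, shifts=-1, dims=-1))   # slide the window left
    state[:, -1] = x_t                                   # append the new token
    return (state * weight).sum(dim=-1)                  # one output per channel

channels, seq_len, kernel_size = 4, 8, 3
weight = torch.randn(channels, kernel_size)

# Prefill request: process the whole prompt at once.
prefill_out = causal_conv1d_prefill(torch.randn(channels, seq_len), weight)

# Decode request: one new token against the per-request conv state.
conv_state = torch.zeros(channels, kernel_size)
decode_out = causal_conv1d_decode(torch.randn(channels), conv_state, weight)
```

When a batch mixes both kinds of requests, splitting it lets each half take its faster path instead of forcing everything through the general prefill path.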

This PR, chunked prefill ON

vllm serve ibm-ai-platform/Bamba-9B --port 9999 
python benchmarks/benchmark_serving.py --model ibm-ai-platform/Bamba-9B  --dataset-name sharegpt     --dataset-path /net/storage149/mnt/md0/ccyang/github.com/ShareGPT_V3/ShareGPT_V3_unfiltered_cleaned_split.json --ignore-eos --port 9999 
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  47.63     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              21.00     
Output token throughput (tok/s):         4164.34   
Total Token throughput (tok/s):          8682.61   
---------------Time to First Token----------------
Mean TTFT (ms):                          15089.14  
Median TTFT (ms):                        13729.31  
P99 TTFT (ms):                           35227.37  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          56.06     
Median TPOT (ms):                        53.62     
P99 TPOT (ms):                           116.86    
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.57     
Median ITL (ms):                         48.07     
P99 ITL (ms):                            118.30    
==================================================

This PR, chunked prefill OFF

vllm serve ibm-ai-platform/Bamba-9B --port 9999 --no-enable-chunked-prefill --max_model_len=4096
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  52.72     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              18.97     
Output token throughput (tok/s):         3761.94   
Total Token throughput (tok/s):          7843.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          16965.86  
Median TTFT (ms):                        15540.24  
P99 TTFT (ms):                           40363.40  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.57     
Median TPOT (ms):                        63.19     
P99 TPOT (ms):                           259.66    
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.02     
Median ITL (ms):                         58.01     
P99 ITL (ms):                            121.75    
==================================================

Main (d9ac9e3), chunked prefill ON

vllm serve ibm-ai-platform/Bamba-9B --port 9999 
python benchmarks/benchmark_serving.py --model ibm-ai-platform/Bamba-9B  --dataset-name sharegpt     --dataset-path /net/storage149/mnt/md0/ccyang/github.com/ShareGPT_V3/ShareGPT_V3_unfiltered_cleaned_split.json --ignore-eos --port 9999 
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  183.82    
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              5.44      
Output token throughput (tok/s):         1079.03   
Total Token throughput (tok/s):          2249.77   
---------------Time to First Token----------------
Mean TTFT (ms):                          66936.01  
Median TTFT (ms):                        59901.93  
P99 TTFT (ms):                           169459.47 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          255.65    
Median TPOT (ms):                        269.80    
P99 TPOT (ms):                           395.92    
---------------Inter-token Latency----------------
Mean ITL (ms):                           220.89    
Median ITL (ms):                         333.07    
P99 ITL (ms):                            427.35    
==================================================

Main (d9ac9e3), chunked prefill OFF

vllm serve ibm-ai-platform/Bamba-9B --port 9999 --no-enable-chunked-prefill --max_model_len=4096
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  52.14     
Total input tokens:                      215201    
Total generated tokens:                  198343    
Request throughput (req/s):              19.18     
Output token throughput (tok/s):         3803.80   
Total Token throughput (tok/s):          7930.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          16630.70  
Median TTFT (ms):                        15313.57  
P99 TTFT (ms):                           39832.65  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.94     
Median TPOT (ms):                        62.02     
P99 TPOT (ms):                           257.02    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.36     
Median ITL (ms):                         56.91     
P99 ITL (ms):                            121.58    
==================================================

## Output Quality

### Bamba-9B

vllm (pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.2623 | ± 0.0121 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.3700 | ± 0.0133 |

### Zamba2-2.7B

vllm (pretrained=Zyphra/Zamba2-2.7B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.5299 | ± 0.0137 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.5436 | ± 0.0137 |

### Mamba-Codestral-7B

vllm (pretrained=mistralai/Mamba-Codestral-7B-v0.1,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.4761 | ± 0.0138 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.4632 | ± 0.0137 |
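These tables are in the usual lm-evaluation-harness format with the vLLM backend. The exact command isn't shown here, but an invocation along these lines (standard lm_eval flags, listed only as an example) would produce this output for the Bamba-9B row:

lm_eval --model vllm --model_args pretrained=ibm-ai-platform/Bamba-9B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True --tasks gsm8k --num_fewshot 5 --batch_size auto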


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@cyang49 (Contributor, Author) commented Apr 25, 2025

Unit tests

================================================== test session starts ===================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /net/storage149/mnt/md0/ccyang/github.com/vllm
configfile: pyproject.toml
plugins: anyio-4.8.0
collected 1570 items                                                                                                     

tests/kernels/mamba/test_causal_conv1d.py ........................................................................ [  4%]
.................................................................................................................. [ 11%]
.................................................................................................................. [ 19%]
.................................................................................................................. [ 26%]
.................................................................................................................. [ 33%]
.................................................................................................................. [ 40%]
.................................................................................................................. [ 48%]
....                                                                                                               [ 48%]
tests/kernels/mamba/test_mamba_mixer2.py ....                                                                      [ 48%]
tests/kernels/mamba/test_mamba_ssm.py ............................................................................ [ 53%]
......................................................................................................s.......s... [ 60%]
.................................................................................................................. [ 68%]
.................................................................................................................. [ 75%]
..............................................                                                                     [ 78%]
tests/kernels/mamba/test_mamba_ssm_ssd.py ........................................................................ [ 82%]
.................................................................................................................. [ 90%]
.................................................................................................................. [ 97%]
..........................................                                                                         [100%]
================================ 1568 passed, 2 skipped, 10 warnings in 881.36s (0:14:41) ================================
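(These were presumably run with something like pytest tests/kernels/mamba/; the exact invocation isn't shown above.)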

@cyang49 force-pushed the pr_mamba2_conv1d_refactor branch 2 times, most recently from 5fe065d to 0bbcced on May 5, 2025 19:24
@cyang49 marked this pull request as ready for review on May 5, 2025 19:25
@cyang49 force-pushed the pr_mamba2_conv1d_refactor branch from 0bbcced to 7376f68 on May 6, 2025 00:39
@tlrmchlsmth (Collaborator) left a comment


This PR looks good to me, nice work.

@tlrmchlsmth added the ready (ONLY add when PR is ready to merge/full CI is needed) label on May 6, 2025
@tlrmchlsmth enabled auto-merge (squash) on May 6, 2025 01:02
@cyang49 (Contributor, Author) commented May 6, 2025

@tlrmchlsmth I'm getting a seemingly unrelated test failure; I've seen it in other people's PRs as well:

[2025-05-06T03:45:25Z] =========================== short test summary info ============================
[2025-05-06T03:45:25Z] FAILED entrypoints/openai/test_audio.py::test_chat_streaming_audio[https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg-fixie-ai/ultravox-v0_5-llama-3_2-1b] - AssertionError: assert 'This audio a...t from a poem' == 'This audio a...a traditional'
[2025-05-06T03:45:25Z]
[2025-05-06T03:45:25Z]   - This audio appears to be a snippet from a traditional
[2025-05-06T03:45:25Z]   ?                                           ^^^^^^^ ^^^
[2025-05-06T03:45:25Z]   + This audio appears to be a snippet from a poem
[2025-05-06T03:45:25Z]   ?                                           ^ ^^
[2025-05-06T03:45:25Z] ====== 1 failed, 420 passed, 2 skipped, 36 warnings in 3645.43s (1:00:45) ======
[2025-05-06T03:45:28Z] 🚨 Error: The command exited with status 1
[2025-05-06T03:45:28Z] user command error: The plugin docker command hook exited with status 1


I'll try rebasing on main and force-pushing, since #17497 has been merged and I need to add new changes.

cyang49 added 15 commits May 6, 2025 08:12
Signed-off-by: Chih-Chieh-Yang <[email protected]> (all commits)
auto-merge was automatically disabled May 6, 2025 12:16

Head branch was pushed to by a user without write access

@cyang49 force-pushed the pr_mamba2_conv1d_refactor branch from 7376f68 to d03ead8 on May 6, 2025 12:16
@cyang49 (Contributor, Author) commented May 6, 2025

@tlrmchlsmth some tests still fail; they don't look related to my changes. Could you have a look? Thank you!

@simon-mo merged commit 18dd5e0 into vllm-project:main on May 7, 2025 (53 of 55 checks passed)
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…uests for Corresponding Kernels (vllm-project#17146)

Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)


3 participants