[Hardware][AMD] integrate aiter into vllm #17710

fsx950223 · 2025-05-06T10:25:34Z

CMD: VLLM_TORCH_PROFILER_DIR=/mnt/raid0/sixifang/vllm/vllm_profile HIP_VISIBLE_DEVICES=4,5,6,7 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/ --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --trust-remote-code --disable-log-requests --block-size 16 --max-model-len 32768 --dtype float16 --quantization fp8 --no-enable-prefix-caching --max-num-batched-tokens=8192

Accuracy (gsm8k via lm_eval) without aiter:

vllm (pretrained=/models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/,tensor_parallel_size=1,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7733|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7437|±  |0.0120|

Accuracy (gsm8k via lm_eval) with aiter:

vllm (pretrained=/models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/,tensor_parallel_size=1,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7612|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.7233|±  |0.0123|
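
To judge whether the gap between the two runs above is larger than evaluation noise, one rough check is to express the difference in exact_match in units of the combined standard error. The sketch below uses only the values and stderrs from the two tables; treating the runs as independent samples is a simplifying assumption, and the helper name `z_score` is illustrative.

```python
import math

# (value, stderr) pairs copied from the two gsm8k tables above.
baseline = {"flexible-extract": (0.7733, 0.0115), "strict-match": (0.7437, 0.0120)}
with_aiter = {"flexible-extract": (0.7612, 0.0117), "strict-match": (0.7233, 0.0123)}


def z_score(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Difference of two scores in units of their combined standard error.

    Assumes the two runs are independent, which is only an approximation
    because both runs score the same gsm8k test set.
    """
    (value_a, stderr_a), (value_b, stderr_b) = a, b
    return (value_a - value_b) / math.hypot(stderr_a, stderr_b)


for name in baseline:
    drop = baseline[name][0] - with_aiter[name][0]
    z = z_score(baseline[name], with_aiter[name])
    print(f"{name}: drop={drop:.4f}, z={z:.2f}")
```

Under that independence assumption, the drops of 0.0121 and 0.0204 are about 0.7 and 1.2 combined standard errors respectively.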

github-actions · 2025-05-06T10:25:42Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-05-06T10:26:12Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fsx950223.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-05-06T15:00:21Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fsx950223.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gshtras · 2025-05-07T15:12:22Z

What is the minimal AITER commit that has the required functionality?
Also, I think we need a separate flag to toggle this part of AITER on and off, like we have for the others

mergify · 2025-05-08T02:14:43Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fsx950223.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tjtanaa · 2025-05-08T12:01:29Z

vllm/platforms/rocm.py

-            logger.info("Using Triton Attention backend on V1 engine.")
-            return ("vllm.v1.attention.backends."
-                    "triton_attn.TritonAttentionBackend")
+            if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MHA:


tjtanaa (May 8, 2025): Should we add on_mi250_mi300() to the condition?

fsx950223 (May 9, 2025): MHA should be used on MI350 too, so I won't add the condition.

tjtanaa (May 9, 2025): LGTM.
It is fine to leave the condition out if we don't expect Radeon GPU users to use AITER.
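
As a rough illustration of the condition discussed in this review thread, the sketch below selects a backend from the two switches without any GPU-architecture check. It assumes the Triton backend remains the fallback; `select_v1_backend` and the `AITER_BACKEND` placeholder are hypothetical names, since the PR's actual AITER backend path is not shown in this excerpt.

```python
# Path taken from the removed lines of the diff above.
TRITON_BACKEND = "vllm.v1.attention.backends.triton_attn.TritonAttentionBackend"
# Hypothetical placeholder: the real AITER backend path is not shown in the thread.
AITER_BACKEND = "<aiter-attention-backend-path>"


def select_v1_backend(use_aiter: bool, use_aiter_mha: bool) -> str:
    """Pick the AITER backend only when both the main and MHA switches are on."""
    if use_aiter and use_aiter_mha:
        return AITER_BACKEND
    # Assumed fallback: keep using the Triton attention backend.
    return TRITON_BACKEND


print(select_v1_backend(use_aiter=True, use_aiter_mha=True))
print(select_v1_backend(use_aiter=True, use_aiter_mha=False))
```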

fsx950223 · 2025-05-08T13:35:33Z

> @fsx950223 could you run lm_eval for some of the models that use these new kernels and share the performance gain that comes with them?

How to run?

tjtanaa · 2025-05-08T14:50:54Z

> How to run?

@fsx950223 The steps are:

Install lm_eval

python3 -m pip install lm_eval

Example command

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-32B,tensor_parallel_size=1,max_model_len=10000 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto \
> pr_gsm8k-Qwen_Qwen3-32B.log 2>&1

Example output:

vllm (pretrained=Qwen/Qwen3-32B,tensor_parallel_size=1,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6255|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.7369|±  |0.0121|

Can you provide lm_eval of non-AITER as baseline reference as well?
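
To compare an AITER run against a non-AITER baseline, the results table that lm_eval writes to the log can be read back programmatically. The following is a minimal sketch that assumes the nine-column table layout shown in the example output; `parse_lm_eval_table` is an illustrative helper, not part of lm_eval.

```python
def parse_lm_eval_table(text: str) -> dict[tuple[str, str], tuple[float, float]]:
    """Map (task, filter) to (value, stderr) for each data row of the table."""
    results = {}
    task = ""
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 9:
            continue  # not a table row
        try:
            value, stderr = float(cells[6]), float(cells[8])
        except ValueError:
            continue  # header or separator row
        # Continuation rows leave the task column blank; reuse the last task.
        task = cells[0] or task
        results[(task, cells[2])] = (value, stderr)
    return results


sample = """\
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6255|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.7369|±  |0.0121|
"""
print(parse_lm_eval_table(sample))
```

Parsing both logs this way makes it straightforward to diff the baseline and AITER numbers per filter.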

gshtras · 2025-05-08T22:51:17Z

vllm/envs.py

+    # Whether to use aiter mha ops.
+    # By default is enabled.
+    "VLLM_ROCM_USE_AITER_MHA":
+    lambda: (os.getenv("VLLM_ROCM_USE_AITER_MHA", "True").lower() in


Do we want to override #16828 by default?

fsx950223 · 2025-05-09T05:46:48Z

I found that a paged-attention (PA) kernel is needed to optimize performance when the query length is 1. I will submit a commit.

fsx950223 · 2025-05-09T08:09:07Z

Done.

tdoublep · 2025-05-09T14:02:42Z

Wouldn't it make more sense to create a new v1 attention backend called aiter_attn for this rather than changing the flash_attn backend?

coderfeli · 2025-05-10T01:32:33Z

> Wouldn't it make more sense to create a new v1 attention backend called aiter_attn for this rather than changing the flash_attn backend?

@tdoublep Do you think this is a must? We can do it, but we need some extra time to reorganize the code. Since aiter_attn is very similar to flash_attn and has a big impact on performance, can we land the flash_attn change first and then refactor it into a new backend as a next step?

mergify · 2025-05-10T23:13:16Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fsx950223.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

fsx950223 · 2025-05-12T03:45:25Z

> Wouldn't it make more sense to create a new v1 attention backend called aiter_attn for this rather than changing the flash_attn backend?

Done.

houseroad · 2025-05-12T05:50:47Z

vllm/envs.py

@@ -80,6 +80,7 @@
    VLLM_ROCM_USE_AITER_MOE: bool = True
    VLLM_ROCM_USE_AITER_RMSNORM: bool = True
    VLLM_ROCM_USE_AITER_MLA: bool = True
+    VLLM_ROCM_USE_AITER_MHA: bool = True


houseroad (May 12, 2025): What is the difference between VLLM_ROCM_USE_AITER and VLLM_ROCM_USE_AITER_MHA?

fsx950223 (May 12, 2025): VLLM_ROCM_USE_AITER is the main switch; VLLM_ROCM_USE_AITER_MHA is the submodule switch.
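
The main-switch and submodule-switch relationship can be sketched as two environment flags that must both be on. This is an illustration only: the set of accepted truthy strings and the `False` default for the main switch are assumptions, while the `True` default for `VLLM_ROCM_USE_AITER_MHA` follows the diff above. The helper names `env_flag` and `aiter_mha_enabled` are hypothetical.

```python
import os

# Assumption: these strings count as "on"; the exact set vLLM accepts
# is not shown in this thread.
_TRUTHY = {"1", "true"}


def env_flag(name: str, default: str, environ=os.environ) -> bool:
    """Read a boolean-like environment variable."""
    return environ.get(name, default).strip().lower() in _TRUTHY


def aiter_mha_enabled(environ=os.environ) -> bool:
    """AITER MHA is active only if the main switch and the MHA switch are both on."""
    main_switch = env_flag("VLLM_ROCM_USE_AITER", "False", environ)  # default assumed
    mha_switch = env_flag("VLLM_ROCM_USE_AITER_MHA", "True", environ)  # default per diff
    return main_switch and mha_switch


print(aiter_mha_enabled({"VLLM_ROCM_USE_AITER": "1"}))
print(aiter_mha_enabled({"VLLM_ROCM_USE_AITER": "1", "VLLM_ROCM_USE_AITER_MHA": "0"}))
```

With this layout, setting only the main switch enables every submodule whose own flag defaults to on, and a single submodule can still be turned off individually.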

mergify · 2025-05-13T11:40:57Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fsx950223.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tjtanaa · 2025-05-15T09:18:53Z

vllm/model_executor/layers/layernorm.py

+        )
+        rocm_aiter_rms_norm = torch.ops.vllm.rocm_aiter_rms_norm
+
+    except AttributeError:


I think we don't need the try/except statement. The registration must work because vLLM is going to deprecate V0. If registration does not work when AITER is present in a ROCm environment, that likely indicates a bug.

An example unit test that checks whether the registration works: https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_rocm_aiter_topk.py
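
A registration check along the lines suggested here might look roughly like the sketch below. It is not copied from the linked test file; to stay runnable without torch or ROCm, it uses a `SimpleNamespace` as a stand-in for `torch.ops.vllm`, and `assert_op_registered` is a hypothetical helper.

```python
from types import SimpleNamespace


def assert_op_registered(namespace, op_name: str) -> None:
    """Fail loudly if a custom op is missing or not callable."""
    assert hasattr(namespace, op_name), f"{op_name} is not registered"
    assert callable(getattr(namespace, op_name)), f"{op_name} is not callable"


# Stand-in for torch.ops.vllm so the sketch runs without torch or ROCm.
fake_ops = SimpleNamespace(rocm_aiter_rms_norm=lambda x, weight, eps: x)
assert_op_registered(fake_ops, "rocm_aiter_rms_norm")
```

Checking registration in a test, rather than catching AttributeError at import time, surfaces a broken registration as a test failure instead of a silent fallback.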

tjtanaa · 2025-05-17T03:06:49Z

@fsx950223 Does this feature work with AITER commit c1debd87ce0391aa27438d9e07e76e4fea7c4b70?
We are trying to fix the AITER features when using this AITER commit. Now that PR #17912 has been merged, we are fixing the compatibility of the integrated AITER kernels (e.g. PR #18271).

If you could share the AITER commit that you are using, we could also validate whether other AITER kernels need to be fixed.

fsx950223 · 2025-05-17T03:29:44Z

I use the AITER main branch directly.

fsx950223 requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners May 6, 2025 10:25

mergify bot added the v1 label May 6, 2025

mergify bot added needs-rebase and removed needs-rebase labels May 6, 2025

fsx950223 force-pushed the fa_upstream branch from 7d46886 to 8171cc2 Compare May 8, 2025 02:13

fsx950223 requested review from tlrmchlsmth, youkaichao, mgoin, simon-mo and zhuohan123 as code owners May 8, 2025 02:13

mergify bot added documentation Improvements or additions to documentation ci/build tpu Related to Google TPUs labels May 8, 2025

mergify bot added tool-calling needs-rebase labels May 8, 2025

github-project-automation bot added this to Tool Calling May 8, 2025

fsx950223 force-pushed the fa_upstream branch from 8171cc2 to 5327872 Compare May 8, 2025 03:08

mergify bot removed the tpu Related to Google TPUs label May 8, 2025

tjtanaa reviewed May 8, 2025

tjtanaa mentioned this pull request May 8, 2025

[Feature] [ROCm]: AITER Kernel Integration #14964

Open

45 tasks

gshtras reviewed May 8, 2025

rename function

ae85e79

Signed-off-by: fsx950223 <[email protected]>

optimize kernels with small query lens

87ea0ba

Signed-off-by: fsx950223 <[email protected]>

change condition

db4bc55

Signed-off-by: fsx950223 <[email protected]>

mergify bot added the needs-rebase label May 10, 2025

Merge remote-tracking branch 'upstream/main' into fa_upstream

efe59bd

Signed-off-by: fsx950223 <[email protected]>

mergify bot removed the needs-rebase label May 12, 2025

add rocm aiter backend

40654e4

Signed-off-by: fsx950223 <[email protected]>

fsx950223 force-pushed the fa_upstream branch from c6e4ef2 to 40654e4 Compare May 12, 2025 03:43

houseroad reviewed May 12, 2025

mergify bot added the needs-rebase label May 13, 2025

fsx950223 added 2 commits May 14, 2025 03:37

us pa layout

3ff8565

Signed-off-by: fsx950223 <[email protected]>

Merge remote-tracking branch 'upstream/main' into fa_upstream2

dcbcd68

Signed-off-by: fsx950223 <[email protected]>

mergify bot removed the needs-rebase label May 14, 2025

tjtanaa reviewed May 15, 2025

remove try catch

bc2afe5

Signed-off-by: fsx950223 <[email protected]>

hvluu-8 mentioned this pull request May 17, 2025

I use aiter main branch directly. #18294

Closed
