[V1][PP] Support PP for MultiprocExecutor #14219
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thanks for adding this!
Please see the comments to align the v1 PP interface and design.
Meanwhile, do you have any benchmark numbers?
Also cc @ruisearch42
@@ -89,8 +89,8 @@ def detailed(
                  chunked_prefill=False),
          ],
          # only ray is supported for V1
-         distributed_backends=["mp", "ray", "ray"],
-         vllm_major_versions=["0", "0", "1"],
+         distributed_backends=["mp", "mp", "ray", "ray"],
Please update the comment above ("only ray is supported for V1"); it is no longer accurate.
     def collective_rpc(self,
                        method: Union[str, Callable],
                        timeout: Optional[float] = None,
                        args: tuple = (),
-                       kwargs: Optional[dict] = None) -> list[Any]:
+                       kwargs: Optional[dict] = None,
+                       non_block: bool = False) -> list[Any]:
Since @comaniac and @ruisearch42 do not want to add `non_block` at the `collective_rpc` level, you can have a `_run_workers` function with the ability to be non-blocking, and then, at the `execute_model` level of this executor, call `_run_workers` with non-blocking enabled if PP is used.
Good idea :)
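For readers outside the thread, here is a minimal sketch of the split being discussed: `_run_workers` owns the non-blocking behavior and only `execute_model` turns it on when PP > 1. The class and helper names, and the thread pool standing in for the worker message queues, are illustrative assumptions, not the actual vLLM code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any


class ExecutorSketch:
    """Illustrative only: the real executor dispatches RPCs to worker processes."""

    def __init__(self, world_size: int, pipeline_parallel_size: int) -> None:
        self.world_size = world_size
        self.pipeline_parallel_size = pipeline_parallel_size
        self._io_pool = ThreadPoolExecutor(max_workers=1)

    def _run_workers(self, method: str, args: tuple = (),
                     non_block: bool = False) -> list[Any]:
        def _call_rank(rank: int) -> str:
            # Stand-in for sending `method` to worker `rank` and reading its reply.
            return f"rank {rank}: {method}{args}"

        if non_block:
            # Return futures immediately so the caller is not blocked on the
            # in-flight batch; results are collected later from the futures.
            return [self._io_pool.submit(_call_rank, r)
                    for r in range(self.world_size)]
        return [_call_rank(r) for r in range(self.world_size)]

    def execute_model(self, scheduler_output: Any) -> list[Any]:
        # Only execute_model opts into non-blocking mode, and only when PP > 1.
        return self._run_workers("execute_model", (scheduler_output,),
                                 non_block=self.pipeline_parallel_size > 1)


executor = ExecutorSketch(world_size=4, pipeline_parallel_size=2)
futures = executor.execute_model(scheduler_output={"batch": 0})
print([f.result() for f in futures])
```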
        driver_worker_rank = (vllm_config.parallel_config.world_size -
                              vllm_config.parallel_config.tensor_parallel_size)
why is it calculated this way?
I thought the driver should be the first rank in the last TP group. The `is_driver` check is now changed to `self.rank % tp_size == 0`, matching the definition in V0.
Also cc @njhill and @tlrmchlsmth for the mp-related change.
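A tiny illustration of that check, assuming the flat rank layout rank = pp_rank * tp_size + tp_rank (the function name here is hypothetical):

```python
def is_driver(rank: int, tp_size: int) -> bool:
    # V0-style definition: the first rank of every TP group drives that group.
    return rank % tp_size == 0


# With TP=2, PP=2 (flat ranks 0-3, rank = pp_rank * tp_size + tp_rank),
# ranks 0 and 2 are the drivers of their TP groups.
print([rank for rank in range(4) if is_driver(rank, tp_size=2)])  # [0, 2]
```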
Thanks @comaniac @ruisearch42 @youkaichao for the comments; I have resolved them all. Here are some initial benchmark results on a 4xL40 platform:

# TP=2 PP=1
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=2 -pp=1 --max-model-len=8192
Throughput: 5.60 requests/s, 2314.67 total tokens/s, 1110.17 output tokens/s

# TP=1 PP=2
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=1 -pp=2 --max-model-len=8192
Throughput: 5.74 requests/s, 2374.38 total tokens/s, 1138.81 output tokens/s

# TP=4 PP=1
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=4 -pp=1 --max-model-len=8192
Throughput: 7.93 requests/s, 3277.62 total tokens/s, 1572.03 output tokens/s

# TP=1 PP=4
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=1 -pp=4 --max-model-len=8192
Throughput: 9.28 requests/s, 3838.89 total tokens/s, 1841.22 output tokens/s

# TP=2 PP=2
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=2 -pp=2 --max-model-len=8192
Throughput: 9.69 requests/s, 4005.70 total tokens/s, 1921.23 output tokens/s
    ) -> Union[ModelRunnerOutput, Future[ModelRunnerOutput]]:
        output = self._run_workers("execute_model",
                                   args=(scheduler_output, ),
                                   non_block=self.max_concurrent_batches > 1)
nit: We should be able to eliminate `non_block` here and just have:
def _run_workers(self, ...):
non_block = self.max_concurrent_batches > 1
...
Now `collective_rpc` is using `_run_workers`. I think this will make all `collective_rpc` calls return future objects when using PP 🤔
I see. It's a bit confusing though. Could we add more comments to help future development? For example:
def _run_workers(self,
                 method: Union[str, Callable],
                 timeout: Optional[float] = None,
                 args: tuple = (),
                 kwargs: Optional[dict] = None,
                 non_block: bool = False) -> list[Any]:
    """Run the method on workers of all ranks and get their responses.

    Note that when non_block=True (PP > 1), this function immediately
    returns future objects to unblock the pipeline.

    Args:
        ...

    Returns:
        A list of responses from workers of all ranks. The responses will
        be future objects if non_block=True.
    """
        # Note: only returns ModelRunnerOutput from the driver worker with the
        # last PP rank
        return output[self.world_size -
                      self.parallel_config.tensor_parallel_size]
This calculation may have issues when other parallelism is used (e.g., EP, DP attention)?
Suggested change:

-        # Note: only returns ModelRunnerOutput from the driver worker with the
-        # last PP rank
-        return output[self.world_size -
-                      self.parallel_config.tensor_parallel_size]
+        # Only returns ModelRunnerOutput from TP rank=0 and PP rank=-1
+        # (the first TP worker of the last PP stage).
+        return output[self.world_size -
+                      self.parallel_config.tensor_parallel_size]
`self.world_size - self.parallel_config.tensor_parallel_size` is TP rank 0 with PP rank = 1. I think it is still not correct.
Assuming TP=8, PP=4, the world size is 32. IIUC, in this case the PP ranks would be:
- ranks 0-7: PP rank 0
- ranks 8-15: PP rank 1
- ranks 16-23: PP rank 2
- ranks 24-31: PP rank 3

so world_size - tp_size = 32 - 8 = 24 should be PP rank = -1 (i.e. 3)?
EP uses the TP group, so it will not have this issue. DP would have an issue because the rank map shape is (DP, PP, TP), but right now the MP executor assumes there are only TP and PP and checks this during initialization.
Oh, my bad, this is indeed correct then. We should add more comments here, e.g. including the calculation from @comaniac; it is quite easy to mess things up here.
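A small sketch of that calculation, assuming the flat layout rank = pp_rank * tp_size + tp_rank (which holds when only TP and PP are present); the helper name is hypothetical:

```python
def output_rank(world_size: int, tp_size: int) -> int:
    # First TP worker (TP rank 0) of the last PP stage.
    return world_size - tp_size


# The TP=8, PP=4 example from above: world size 32, ranks 24-31 form the
# last PP stage, so the ModelRunnerOutput comes from flat rank 24.
tp_size, pp_size = 8, 4
world_size = tp_size * pp_size
rank = output_rank(world_size, tp_size)
pp_rank, tp_rank = divmod(rank, tp_size)
print(rank, pp_rank, tp_rank)  # 24 3 0
```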
Force-pushed from 64fc61e to 5ab24c7.
LGTM.
Since we are in the process of making v1 the default and this PR changes the executor, perhaps it's better to wait a few more days until other issues are resolved.
Also cc @njhill @robertgshaw2-redhat @WoosukKwon for awareness.
Hi @bigPYJ1151, can you please rebase the PR and resolve merge conflicts?
Force-pushed from 9b72ac2 to 44610ab.
@WoosukKwon Sure, updated, and I also verified the unit tests. Please take a look :)
@ruisearch42 @comaniac @youkaichao Could you please take a final look when you get a chance?
Otherwise LGTM. Thanks for the PR.
Would be nice if the MP backend owners could review too.
@@ -350,6 +349,11 @@ def _compare_tp(
         # Temporary. Currently when zeromq + SPMD is used, it does not properly
         # terminate because of a Ray Compiled Graph issue.
         common_args.append("--disable-frontend-multiprocessing")
+    elif distributed_backend == "mp":
We only support "ray" and "mp", so this should be an else branch that asserts the backend is "mp".
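A sketch of what the suggested control flow could look like (the branch bodies and variables here are placeholders, not the actual test code):

```python
# Sketch of the suggested structure; the argument lists are placeholders.
distributed_backend = "mp"
common_args: list[str] = []

if distributed_backend == "ray":
    common_args.append("--disable-frontend-multiprocessing")
else:
    # Only "ray" and "mp" are supported, so make the assumption explicit.
    assert distributed_backend == "mp", (
        f"Unexpected distributed backend: {distributed_backend}")
    # mp-specific arguments would go here.
```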
f"tensor_parallel_size ({tensor_parallel_size}) x pipeline" | ||
f" parallel_size ({pp_parallel_size}). ") |
nit: use a single word `pipeline_parallel_size` in the message.
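For example, the message could read as follows (the surrounding assertion and values are assumptions for illustration; only the wording is the point):

```python
# Assumed surrounding assertion; only the message wording is the point here.
tensor_parallel_size, pp_parallel_size, world_size = 2, 2, 4
assert world_size == tensor_parallel_size * pp_parallel_size, (
    f"world_size ({world_size}) must equal tensor_parallel_size "
    f"({tensor_parallel_size}) x pipeline_parallel_size ({pp_parallel_size}).")
```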
Please fix the other minor issue as well
@bigPYJ1151 I've just started the CI test. Will merge once it becomes green.
@WoosukKwon All required checks are green now; please help to merge, thanks :)
By offloading MQ reading to an I/O thread, `MultiprocExecutor` can receive multiple batches in flight and support V1 PP seamlessly.
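A minimal sketch of that pattern: a dedicated I/O thread drains the response message queue and completes futures in arrival order, so the executor can keep more than one batch in flight. The class, queue wiring, and FIFO matching below are illustrative assumptions, not the actual MultiprocExecutor implementation.

```python
import queue
import threading
from concurrent.futures import Future


class MQReaderSketch:
    """Background reader: lets the executor keep more than one batch in flight."""

    def __init__(self) -> None:
        self._mq: "queue.Queue[object]" = queue.Queue()       # stand-in for the worker MQ
        self._pending: "queue.Queue[Future]" = queue.Queue()  # futures awaiting outputs
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self) -> Future:
        # Called when a batch is dispatched; resolved when its output arrives.
        fut: Future = Future()
        self._pending.put(fut)
        return fut

    def deliver(self, output: object) -> None:
        # Stand-in for a worker process writing its output to the MQ.
        self._mq.put(output)

    def _drain(self) -> None:
        # I/O thread: block on the MQ and complete futures in FIFO order.
        while True:
            output = self._mq.get()
            self._pending.get().set_result(output)


reader = MQReaderSketch()
f1, f2 = reader.submit(), reader.submit()   # two batches in flight
reader.deliver("batch-0 output")
reader.deliver("batch-1 output")
print(f1.result(), "|", f2.result())
```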