[V1][PP] Support PP for MultiprocExecutor #14219
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thanks for adding this!
Please see the comments to align the v1 PP interface and design.
Meanwhile, do you have any benchmark numbers?
Also cc @ruisearch42
@@ -89,8 +89,8 @@ def detailed(
                  chunked_prefill=False),
          ],
          # only ray is supported for V1
-         distributed_backends=["mp", "ray", "ray"],
-         vllm_major_versions=["0", "0", "1"],
+         distributed_backends=["mp", "mp", "ray", "ray"],
Please update the comment above ("only ray is supported for V1"); it is no longer accurate.
     def collective_rpc(self,
                        method: Union[str, Callable],
                        timeout: Optional[float] = None,
                        args: tuple = (),
-                       kwargs: Optional[dict] = None) -> list[Any]:
+                       kwargs: Optional[dict] = None,
+                       non_block: bool = False) -> list[Any]:
Since @comaniac and @ruisearch42 do not want to add `non_block` at the `collective_rpc` level, you can have a `_run_workers` function with the ability to be non-blocking, and then, at the `execute_model` level of this executor, call `_run_workers` with non-blocking enabled if PP is used.
Good idea :)
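For readers outside the thread, here is a minimal sketch of the split being discussed: `_run_workers` owns the non-blocking behavior and only `execute_model` turns it on when PP > 1. The class and helper names, and the thread pool standing in for the worker message queues, are illustrative assumptions, not the actual vLLM code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any


class ExecutorSketch:
    """Illustrative only: the real executor dispatches RPCs to worker processes."""

    def __init__(self, world_size: int, pipeline_parallel_size: int) -> None:
        self.world_size = world_size
        self.pipeline_parallel_size = pipeline_parallel_size
        self._io_pool = ThreadPoolExecutor(max_workers=1)

    def _run_workers(self, method: str, args: tuple = (),
                     non_block: bool = False) -> list[Any]:
        def _call_rank(rank: int) -> str:
            # Stand-in for sending `method` to worker `rank` and reading its reply.
            return f"rank {rank}: {method}{args}"

        if non_block:
            # Return futures immediately so the caller is not blocked on the
            # in-flight batch; results are collected later from the futures.
            return [self._io_pool.submit(_call_rank, r)
                    for r in range(self.world_size)]
        return [_call_rank(r) for r in range(self.world_size)]

    def execute_model(self, scheduler_output: Any) -> list[Any]:
        # Only execute_model opts into non-blocking mode, and only when PP > 1.
        return self._run_workers("execute_model", (scheduler_output,),
                                 non_block=self.pipeline_parallel_size > 1)


executor = ExecutorSketch(world_size=4, pipeline_parallel_size=2)
futures = executor.execute_model(scheduler_output={"batch": 0})
print([f.result() for f in futures])
```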
        driver_worker_rank = (vllm_config.parallel_config.world_size -
                              vllm_config.parallel_config.tensor_parallel_size)
why is it calculated this way?
I thought the driver should be the first rank in the last TP group. The `is_driver` check is now changed to `self.rank % tp_size == 0`, matching the definition in V0.
Also cc @njhill and @tlrmchlsmth for the mp-related change.
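A tiny illustration of that check, assuming the flat rank layout rank = pp_rank * tp_size + tp_rank (the function name here is hypothetical):

```python
def is_driver(rank: int, tp_size: int) -> bool:
    # V0-style definition: the first rank of every TP group drives that group.
    return rank % tp_size == 0


# With TP=2, PP=2 (flat ranks 0-3, rank = pp_rank * tp_size + tp_rank),
# ranks 0 and 2 are the drivers of their TP groups.
print([rank for rank in range(4) if is_driver(rank, tp_size=2)])  # [0, 2]
```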
Thanks @comaniac @ruisearch42 @youkaichao for the comments; I have resolved them all. Here are some initial benchmark results on a 4xL40 platform:

# TP=2 PP=1
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=2 -pp=1 --max-model-len=8192
Throughput: 5.60 requests/s, 2314.67 total tokens/s, 1110.17 output tokens/s

# TP=1 PP=2
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=1 -pp=2 --max-model-len=8192
Throughput: 5.74 requests/s, 2374.38 total tokens/s, 1138.81 output tokens/s

# TP=4 PP=1
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=4 -pp=1 --max-model-len=8192
Throughput: 7.93 requests/s, 3277.62 total tokens/s, 1572.03 output tokens/s

# TP=1 PP=4
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=1 -pp=4 --max-model-len=8192
Throughput: 9.28 requests/s, 3838.89 total tokens/s, 1841.22 output tokens/s

# TP=2 PP=2
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=2 -pp=2 --max-model-len=8192
Throughput: 9.69 requests/s, 4005.70 total tokens/s, 1921.23 output tokens/s
    ) -> Union[ModelRunnerOutput, Future[ModelRunnerOutput]]:
        output = self._run_workers("execute_model",
                                   args=(scheduler_output, ),
                                   non_block=self.max_concurrent_batches > 1)
nit: We should be able to eliminate `non_block` here and just have:
def _run_workers(self, ...):
non_block = self.max_concurrent_batches > 1
...
Now `collective_rpc` is using `_run_workers`. I think this will make all `collective_rpc` calls return future objects when using PP 🤔
I see. It's a bit confusing though. Could we add more comments to help future development? For example:
def _run_workers(self,
                 method: Union[str, Callable],
                 timeout: Optional[float] = None,
                 args: tuple = (),
                 kwargs: Optional[dict] = None,
                 non_block: bool = False) -> list[Any]:
    """Run the method on workers of all ranks and get their responses.

    Note that when non_block=True (PP > 1), this function immediately
    returns future objects to unblock the pipeline.

    Args:
        ...

    Returns:
        A list of responses from workers of all ranks. The responses will
        be future objects if non_block=True.
    """
        # Note: only returns ModelRunnerOutput from the driver worker with the
        # last PP rank
        return output[self.world_size -
                      self.parallel_config.tensor_parallel_size]
This calculation may have issues when other parallelism is used (e.g., EP, DP attention)?
Suggested change:

-        # Note: only returns ModelRunnerOutput from the driver worker with the
-        # last PP rank
-        return output[self.world_size -
-                      self.parallel_config.tensor_parallel_size]
+        # Only returns ModelRunnerOutput from TP rank=0 and PP rank=-1
+        # (the first TP worker of the last PP stage).
+        return output[self.world_size -
+                      self.parallel_config.tensor_parallel_size]
`self.world_size - self.parallel_config.tensor_parallel_size` is TP rank 0 with PP rank = 1. I think it is still not correct.
Assuming TP=8, PP=4, the world size is 32. IIUC, in this case the PP ranks would be:
- ranks 0-7: PP rank 0
- ranks 8-15: PP rank 1
- ranks 16-23: PP rank 2
- ranks 24-31: PP rank 3

so world_size - tp_size = 32 - 8 = 24 should be PP rank = -1 (i.e. 3)?
EP uses the TP group, so it will not have this issue. DP would have an issue because the rank map shape is (DP, PP, TP), but right now the MP executor assumes there are only TP and PP and checks this during initialization.
Oh, my bad, this is indeed correct then. We should add more comments here, e.g. including the calculation from @comaniac; it is quite easy to mess things up here.
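A small sketch of that calculation, assuming the flat layout rank = pp_rank * tp_size + tp_rank (which holds when only TP and PP are present); the helper name is hypothetical:

```python
def output_rank(world_size: int, tp_size: int) -> int:
    # First TP worker (TP rank 0) of the last PP stage.
    return world_size - tp_size


# The TP=8, PP=4 example from above: world size 32, ranks 24-31 form the
# last PP stage, so the ModelRunnerOutput comes from flat rank 24.
tp_size, pp_size = 8, 4
world_size = tp_size * pp_size
rank = output_rank(world_size, tp_size)
pp_rank, tp_rank = divmod(rank, tp_size)
print(rank, pp_rank, tp_rank)  # 24 3 0
```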
Force-pushed from 64fc61e to 5ab24c7.
LGTM.
Since we are in the process of making v1 the default and this PR changes the executor, perhaps it's better to wait a few more days until other issues are resolved.
Also cc @njhill @robertgshaw2-redhat @WoosukKwon for awareness.
Hi @bigPYJ1151, can you please rebase the PR and resolve merge conflicts?
Force-pushed from 9b72ac2 to 44610ab.
@WoosukKwon Sure, updated, and I also verified the unit tests. Please take a look :)
@ruisearch42 @comaniac @youkaichao Could you please take a final look when you get a chance?
Otherwise LGTM. Thanks for the PR.
Would be nice if the MP backend owners could review too.
@@ -350,6 +349,11 @@ def _compare_tp(
         # Temporary. Currently when zeromq + SPMD is used, it does not properly
         # terminate because of a Ray Compiled Graph issue.
         common_args.append("--disable-frontend-multiprocessing")
+    elif distributed_backend == "mp":
We only support "ray" and "mp", so this should be an else branch that asserts the backend is "mp".
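A sketch of what the suggested control flow could look like (the branch bodies and variables here are placeholders, not the actual test code):

```python
# Sketch of the suggested structure; the argument lists are placeholders.
distributed_backend = "mp"
common_args: list[str] = []

if distributed_backend == "ray":
    common_args.append("--disable-frontend-multiprocessing")
else:
    # Only "ray" and "mp" are supported, so make the assumption explicit.
    assert distributed_backend == "mp", (
        f"Unexpected distributed backend: {distributed_backend}")
    # mp-specific arguments would go here.
```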
f"tensor_parallel_size ({tensor_parallel_size}) x pipeline" | ||
f" parallel_size ({pp_parallel_size}). ") |
nit: use a single word `pipeline_parallel_size` in the message.
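For example, the message could read as follows (the surrounding assertion and values are assumptions for illustration; only the wording is the point):

```python
# Assumed surrounding assertion; only the message wording is the point here.
tensor_parallel_size, pp_parallel_size, world_size = 2, 2, 4
assert world_size == tensor_parallel_size * pp_parallel_size, (
    f"world_size ({world_size}) must equal tensor_parallel_size "
    f"({tensor_parallel_size}) x pipeline_parallel_size ({pp_parallel_size}).")
```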
Please fix the other minor issue as well
@bigPYJ1151 I've just started the CI test. Will merge once it becomes green.
@WoosukKwon All required checks are green now; please help to merge, thanks :)
By offloading MQ reading to an I/O thread, `MultiprocExecutor` can receive multiple batches in flight and support V1 PP seamlessly.
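A minimal sketch of that pattern: a dedicated I/O thread drains the response message queue and completes futures in arrival order, so the executor can keep more than one batch in flight. The class, queue wiring, and FIFO matching below are illustrative assumptions, not the actual MultiprocExecutor implementation.

```python
import queue
import threading
from concurrent.futures import Future


class MQReaderSketch:
    """Background reader: lets the executor keep more than one batch in flight."""

    def __init__(self) -> None:
        self._mq: "queue.Queue[object]" = queue.Queue()       # stand-in for the worker MQ
        self._pending: "queue.Queue[Future]" = queue.Queue()  # futures awaiting outputs
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self) -> Future:
        # Called when a batch is dispatched; resolved when its output arrives.
        fut: Future = Future()
        self._pending.put(fut)
        return fut

    def deliver(self, output: object) -> None:
        # Stand-in for a worker process writing its output to the MQ.
        self._mq.put(output)

    def _drain(self) -> None:
        # I/O thread: block on the MQ and complete futures in FIFO order.
        while True:
            output = self._mq.get()
            self._pending.get().set_result(output)


reader = MQReaderSketch()
f1, f2 = reader.submit(), reader.submit()   # two batches in flight
reader.deliver("batch-0 output")
reader.deliver("batch-1 output")
print(f1.result(), "|", f2.result())
```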