
[Bugfix] Enable torch.compile for 2 parts of model #14913


Closed
wants to merge 2 commits

Conversation

vadiklyutiy
Contributor

@vadiklyutiy vadiklyutiy commented Mar 17, 2025

Before this PR

Prior to this PR, we were unable to wrap two separate parts of a model with @support_torch_compile.

For instance, the example below fails:

import torch
import torch.nn as nn

from vllm.compilation.decorators import support_torch_compile


@support_torch_compile
class FirstLinear(nn.Module):
    def __init__(self, input_size=10, output_size=20, *, vllm_config=None, prefix='', **kwargs):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

class ActivationLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.activation = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(x)

@support_torch_compile
class SecondLinear(nn.Module):
    def __init__(self, input_size=20, output_size=5, *, vllm_config=None, prefix='', **kwargs):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

class SimpleModel(nn.Module):
    def __init__(self, input_size=10, hidden_size=20, output_size=5, *, vllm_config=None, prefix=''):
        super().__init__()
        self.first_linear = FirstLinear(
            input_size=input_size,
            output_size=hidden_size,
            vllm_config=vllm_config,
            prefix=f"{prefix}first_linear."
        )
        self.activation = ActivationLayer()
        self.second_linear = SecondLinear(
            input_size=hidden_size,
            output_size=output_size,
            vllm_config=vllm_config,
            prefix=f"{prefix}second_linear."
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.first_linear(x)
        x = self.activation(x)
        x = self.second_linear(x)
        return x

What This PR Fixes

This PR resolves several bugs, which enables wrapping multiple parts of a model with @support_torch_compile.


Why It's Needed

Consider a model with three parts:

part1()
part2()
part3()

part2() might not be traceable by Dynamo. If we attempt to run it with Dynamo, it will fail. In such cases, we can apply @support_torch_compile to part1() or part3(), but not both.
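
To make this concrete, below is a minimal pure-PyTorch sketch of the pattern (it uses torch.compile directly as a stand-in for @support_torch_compile, and the module names are illustrative): the middle part has data-dependent output shapes and stays in eager mode, while the parts around it are compiled separately.

import torch
import torch.nn as nn

class Part2(nn.Module):
    # Data-dependent output shape: how many rows survive depends on the
    # values in x, which full-graph Dynamo tracing cannot handle well.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = x.sum(dim=-1) > 0
        return x[mask]

part1 = torch.compile(nn.Linear(16, 16))   # compiled
part2 = Part2()                            # left in eager mode
part3 = torch.compile(nn.Linear(16, 4))    # compiled separately

out = part3(part2(part1(torch.randn(8, 16))))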

…ph and parameter shapes into the cache directory naming

Signed-off-by: Vadim Gimpelson <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Vadim Gimpelson <[email protected]>
@youkaichao
Member

what is the use case here?

@vadiklyutiy
Contributor Author

what is the use case here?

The main motivation is models that contain part(s) that cannot be traced by Dynamo.

If you are asking about a specific example, the motivating case is Qwen2.5-vl. It contains two parts: vision and language, and we spend more or less equal time in both. Right now we use @support_torch_compile only for the language part, so we compile only about half of the execution time. If we try to wrap the whole model with torch.compile, Dynamo tracing fails due to data dependence (the shapes of one tensor depend on the data in another tensor). This happens in the rotary embedding part. Rotary embedding is not performance critical (very little time is spent there), and it sits at the very beginning of the vision part, so we can apply @support_torch_compile a bit later/deeper in the call stack while still covering 99% of the vision part's execution time.

But right now we get failures if @support_torch_compile is used twice (not nested). This PR fixes that bug.

I added a simple test that shows the error caused by the lack of support for several @support_torch_compile usages.
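
For illustration only (this is not Qwen2.5-vl's actual rotary embedding code), here is a minimal sketch of the kind of data dependence mentioned above: the position indices are built from sequence lengths stored in a tensor, so the result's shape depends on tensor values rather than on static input shapes.

import torch

# Per-image sequence lengths live in a tensor; the shape of `positions`
# depends on these values, not on any static input shape.
seq_lens = torch.tensor([4, 7, 5])
positions = torch.cat([torch.arange(int(n)) for n in seq_lens])

# Wrapping code like this with torch.compile(fullgraph=True) typically fails
# to trace, because int(n) forces tensor data to be read at trace time.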

@youkaichao
Member

for Qwen2.5-vl, we do plan to compile two parts separately. we need to spend some time to design the config though. right now, the compilation config is only for text model, and you need to have another compilation config for the vision part.

@vadiklyutiy
Contributor Author

for Qwen2.5-vl, we do plan to compile two parts separately. we need to spend some time to design the config though. right now, the compilation config is only for text model, and you need to have another compilation config for the vision part.

Ok, let's assume we need a standalone compilation config for Qwen2.5-vl. But to "compile two parts separately", I think we still need the 2 changes made in this PR, no?

In the next comments I will provide some context for the code changes.

# Add fxgraph to cache path to avoid conflict with other
# @support_torch_compile caches
import hashlib
graph_code = graph.graph.python_code(root_module="self").src
Member


the text model can have different models, too, e.g. in the pipeline parallel case, but we want them to share the same cache directory.

i think we need to be explicit here, say let the upper level caller indicate a tag for the compilation, like compilation_config.tag = "text_tower"/"vision_tower"
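
A rough sketch of that idea (hypothetical: a compilation_config.tag field does not exist in this PR, and the class and function below are stand-ins, not vLLM's real API):

import os
from dataclasses import dataclass

@dataclass
class CompilationConfig:              # stand-in for vLLM's real CompilationConfig
    cache_dir: str = "/tmp/vllm_compile_cache"
    tag: str = ""                     # e.g. "text_tower" or "vision_tower"

def resolve_cache_dir(config: CompilationConfig) -> str:
    # All ranks/stages compiling the same tower pass the same tag and share a
    # cache directory; different towers get different directories.
    return os.path.join(config.cache_dir, config.tag) if config.tag else config.cache_dir

print(resolve_cache_dir(CompilationConfig(tag="vision_tower")))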

Contributor Author


the text model can have different models, too, e.g. in the pipeline parallel case, but we want them to share the same cache directory.

I think if the fxgraph IRs (graph.python_code().src) are different, then Inductor will produce different code to execute each fxgraph.

i think we need to be explicit here, say let the upper level caller indicate a tag for the compilation, like compilation_config.tag = "text_tower"/"vision_tower"

Did I understand correctly that you propose adding an explicit argument to @support_torch_compile that specifies a suffix for the cache (more generally, some identifier of the compiled piece that could potentially be used for other purposes as well)?
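
For reference, a small standalone example of the IR text being discussed, i.e. graph.python_code(root_module="self").src on a symbolically traced module; structurally different graphs yield different source strings.

import torch
import torch.fx

class Small(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

gm = torch.fx.symbolic_trace(Small())
# Prints the Python source generated for this FX graph; hashing this string
# distinguishes structurally different graphs.
print(gm.graph.python_code(root_module="self").src)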

Contributor Author

@vadiklyutiy vadiklyutiy left a comment


Below are my comments on the changes.

if isinstance(input_arg, torch.nn.parameter.Parameter):
    graph_code += f"\n{str(input_arg.shape)}"
graph_hash = hashlib.md5(graph_code.encode()).hexdigest()
cache_dir = os.path.join(cache_dir, f"fxgraph_{graph_hash}")
Contributor Author


If we "plan to compile two parts separately" we need to add fxgraph to hash. Otherwise 2 both fxgraph will access same computation_graph.py and the second will fail.

@@ -345,7 +345,6 @@ def configure_post_pass(self):
 # Config should automatically wrap all inductor passes
 assert isinstance(inductor_config[PASS_KEY], InductorPass)
 self.post_grad_pass_manager.add(inductor_config[PASS_KEY])
-inductor_config[PASS_KEY] = self.post_grad_pass_manager
Contributor Author


This code is not correct, even formally: self.post_grad_pass_manager is not an instance of InductorPass, which is what is checked 2 lines above.

I checked the source and, to the best of my understanding, we only need the initially passed value of inductor_config[PASS_KEY]. After we call self.post_grad_pass_manager.add(...) here, there is no need to update inductor_config[PASS_KEY].
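
A toy illustration of the inconsistency (stand-in classes, not vLLM's real ones; the key name is illustrative): the assert checks the originally passed value, while the removed line then replaced it with an object of a different type.

class InductorPass:                       # stand-in
    pass

class PostGradPassManager:                # stand-in; not an InductorPass here
    def __init__(self):
        self.passes = []
    def add(self, p):
        self.passes.append(p)

PASS_KEY = "post_grad_custom_post_pass"   # illustrative key name
inductor_config = {PASS_KEY: InductorPass()}
manager = PostGradPassManager()

assert isinstance(inductor_config[PASS_KEY], InductorPass)  # holds for the original value
manager.add(inductor_config[PASS_KEY])                      # manager now owns the user's pass
# The removed line would do the following, leaving a value that no longer
# satisfies the isinstance check above:
# inductor_config[PASS_KEY] = manager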

@vadiklyutiy
Contributor Author

@youkaichao a kind reminder about this PR.

@vadiklyutiy
Contributor Author

Support for torch.compile of several models in the same run was implemented in #17211.
