A case for performant portable ops #10886
Replies: 4 comments 5 replies
-
Thanks for compiling the numbers on this. I'm excited about it: in addition to the benefits for HF, it will also benefit sequence-to-sequence tasks, ASR, and potentially simplify enablement for emerging architectures.
-
Here is a "good first issue" to reinplace slice_copy with slice: #10917
-
@kimishpatel I think we can probably register a new cache impl with custom ops to perform in-place cache updates in Transformers here: https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py
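To make the idea concrete, here is a minimal sketch of such a cache, assuming a transformers-style object with an `update()` hook. The class and method shapes are illustrative rather than the real `cache_utils` API, and the custom op name mentioned in the comment is assumed.

```python
# Sketch of a preallocated KV cache that is updated in place instead of via
# torch.cat. An ExecuTorch integration could route the in-place writes through
# a registered custom op (e.g. torch.ops.llama.update_cache, name assumed)
# instead of the index_copy_ used here.
import torch


class InPlaceStaticCache:
    """Preallocated KV cache updated in place; not the real transformers API."""

    def __init__(self, num_layers: int, batch: int, heads: int, max_seq_len: int, head_dim: int):
        shape = (batch, heads, max_seq_len, head_dim)
        self.key_cache = [torch.zeros(shape) for _ in range(num_layers)]
        self.value_cache = [torch.zeros(shape) for _ in range(num_layers)]

    def update(self, key_states, value_states, layer_idx, cache_position):
        # Write the new entries at cache_position along the sequence dim.
        k, v = self.key_cache[layer_idx], self.value_cache[layer_idx]
        k.index_copy_(2, cache_position, key_states)
        v.index_copy_(2, cache_position, value_states)
        return k, v


# Usage: write 2 new tokens at positions 5 and 6 of layer 0.
cache = InPlaceStaticCache(num_layers=1, batch=1, heads=4, max_seq_len=128, head_dim=16)
k, v = cache.update(torch.randn(1, 4, 2, 16), torch.randn(1, 4, 2, 16),
                    layer_idx=0, cache_position=torch.tensor([5, 6]))
```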
-
optimum-executorch is needed to enable transformer-based models on ExecuTorch (eventually targeting non-LLMs as well) with decent performance, while etLLM focuses on more targeted transforms that enable the best performance across backends.
As part of this effort we enabled custom_sdpa, both via a graph transform (PR) and via HF's attention customization API (PR), which improved out-of-the-box performance significantly. However, there is still a significant gap compared to etLLM. The optimizations left out were the ones that are hard to apply in optimum-executorch, namely the custom KV cache. That module uses a custom op, update_cache, to mutate the cache in place without incurring slicing and indexing costs. We wanted to understand the impact of these and possibly other portable ops, so we can prioritize work on improving portable operator performance.
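For readers unfamiliar with that op, here is a rough reference in plain PyTorch of what an update_cache-style op computes. The [B, seq, H, D] layout and argument order are assumptions for illustration; the actual custom kernel registered in ExecuTorch may differ.

```python
# Reference semantics only: write the new K or V entries into a preallocated
# mutable cache at start_pos, in place. This is NOT the registered custom
# kernel, just an illustration of what it computes.
import torch


def update_cache_reference(value: torch.Tensor, cache: torch.Tensor, start_pos: int) -> None:
    # value: [B, new_seq_len, H, D]; cache: [B, max_seq_len, H, D], mutated in place.
    seq_len = value.shape[1]
    # One strided in-place copy: no slice_copy of the existing entries, no
    # out-of-place index_put, and no copy of the whole buffer afterwards.
    cache.narrow(1, start_pos, seq_len).copy_(value)


# Usage: append 2 tokens starting at position 16.
cache = torch.zeros(1, 128, 8, 64)
update_cache_reference(torch.randn(1, 2, 8, 64), cache, start_pos=16)
```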
To do this we ran four models, using optimum-executorch, on an Ubuntu CI machine. Job details can be found here. Similar profiling on an Android device is under way. The four models were:
Gemma3 1B
Qwen3 0.6B
SmolLM2 135M
Llama 3.2 1B
Following is the operator-level breakdown, where DELEGATE is the XNNPACK delegate lowering 4-bit quantized linear layers.
Qwen3 0.6B:
SmolLM2 135M:
Llama3.2 1B:
Note how copy and index_put take up a significant portion of the runtime, especially in Llama3.2 1B, SmolLM2 and Qwen3 0.6B; the smaller the model, the worse it is. This is because of a) functionalization, which results in a full copy of the data, and b) the lack of mutation, which means we have to copy the entire mutable buffer state back into its original storage. On top of that, index_put is a notoriously hard op to implement in an ATen-compliant manner, so it is really slow.
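A quick eager-mode illustration of (a) and (b), using core ATen ops rather than ExecuTorch portable kernels; the shapes are arbitrary and the timings are only meant to show the shape of the problem.

```python
# Compare the functionalized pattern (out-of-place index_put over the whole
# cache, then copy the result back into the mutable buffer) with a plain
# in-place write that only touches the new rows.
import time
import torch

B, H, S, D = 1, 8, 2048, 64
cache = torch.zeros(B, H, S, D)
k_new = torch.randn(B, H, 1, D)
pos = torch.tensor([100])


def functionalized_update():
    # Roughly what the exported graph does after functionalization.
    updated = torch.ops.aten.index_put(cache, [None, None, pos], k_new)
    cache.copy_(updated)


def in_place_update():
    # Roughly what an update_cache-style custom op amounts to.
    cache.index_copy_(2, pos, k_new)


for name, fn in [("index_put + copy_", functionalized_update), ("in-place", in_place_update)]:
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        fn()
    print(f"{name}: {(time.perf_counter() - t0) * 1e3:.1f} ms / 100 iters")
```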
What can we do?
Three things:
1. Reverse functionalization: no more copy_ of the mutable buffer back into its original storage.
2. Implement index_put_, the in-place variant, so the update does not materialize a full-size result.
3. Implement a fast path for index_put where index updates along a single dimension reduce to a bunch of memcpy calls (see the sketch after this list).
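Here is a Python sketch of what the fast path in item 3 amounts to; a real implementation would live in the C++ portable kernel, and the function name and layout below are only illustrative.

```python
# Assumed common KV-cache case: a single index tensor selects positions along
# one dimension and the remaining dimensions are written wholesale. A portable
# kernel could implement the inner copy as memcpy over each dense block instead
# of going through the generic advanced-indexing machinery.
import torch


def index_put_single_dim_fast_path(out: torch.Tensor, dim: int, index: torch.Tensor,
                                    values: torch.Tensor) -> None:
    # Writes values into `out` in place; equivalent to
    # out.index_copy_(dim, index, values) for unique indices.
    for i, idx in enumerate(index.tolist()):
        # Each selected slice is one dense block (or a few, when leading dims
        # are > 1), so this copy maps to plain memcpy calls in a kernel.
        out.select(dim, idx).copy_(values.select(dim, i))


# Usage: write 3 new positions into a [B, S, H, D] cache along the seq dim.
cache = torch.zeros(1, 128, 8, 64)
index_put_single_dim_fast_path(cache, dim=1, index=torch.tensor([10, 11, 12]),
                               values=torch.randn(1, 3, 8, 64))
```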
On Gemma3 1B:
Note the stark difference in Gemma3, where indexing itself is a significant chunk of the issue. Why? Because Gemma3 has local attention that uses a sliding window, and the sliding-window implementation literally slices out the last N entries from the cache and moves them up. See here. There isn't a good way to handle this cleanly. We have a ring buffer implementation that does the sliding window in a more efficient way, but it requires a module swap at the moment. I think the best thing to do would be to upstream our implementation to HF so everyone benefits.
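For context, a minimal sketch of the two update strategies, assuming a [B, window, H, D] layout; neither snippet is the actual HF or ExecuTorch implementation.

```python
# sliding_window_shift mimics the shift-based sliding window (slice out the
# oldest entry and move everything up on every decode step). ring_buffer_write
# is the ring-buffer alternative: write in place at pos % window and rely on
# position-aware masking instead of physical order, which is why it currently
# needs a module swap.
import torch

WINDOW = 4
cache = torch.zeros(1, WINDOW, 8, 16)  # [B, window, H, D]


def sliding_window_shift(cache: torch.Tensor, new_kv: torch.Tensor) -> torch.Tensor:
    # Window-sized slice + copy on every step, then append the new entry.
    return torch.cat([cache[:, 1:], new_kv.unsqueeze(1)], dim=1)


def ring_buffer_write(cache: torch.Tensor, new_kv: torch.Tensor, pos: int) -> None:
    # In-place write; nothing else in the cache moves.
    cache[:, pos % WINDOW] = new_kv


# Usage: one decode step at absolute position 7 with new_kv of shape [B, H, D].
new_kv = torch.randn(1, 8, 16)
cache = sliding_window_shift(cache, new_kv)   # out-of-place shift
ring_buffer_write(cache, new_kv, pos=7)       # in-place write
```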