
[Performance]: Worse prefilling with unified triton attention #18152


Closed

haochengxia opened this issue May 14, 2025 · 0 comments
Labels
performance Performance-related issues

Comments

haochengxia (Contributor) commented May 14, 2025

Report of performance regression

After switching to the kernel updated in #16828, I observed a large prefill (TTFT) performance drop on an A100 (40 GB).

For an 8k-token sequence, the old chunked-prefill kernel takes ~600 ms, while the new one takes ~1500 ms, roughly a 2.5x slowdown.
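
For reproduction, here is a minimal sketch of one way to measure TTFT, not my exact harness; the model name and prompt construction are placeholders:

```python
import time

from vllm import LLM, SamplingParams

# Placeholders: any long-context model and an ~8k-token prompt will do.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
prompt = "word " * 8000  # rough stand-in for an ~8k-token prompt

# Generating a single output token makes end-to-end latency ~= TTFT,
# since almost all of the time is spent in the prefill.
params = SamplingParams(max_tokens=1)

start = time.perf_counter()
llm.generate([prompt], params)
print(f"approx TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
```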

Is there any configuration I should set for this kernel?
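
The closest knob I'm aware of is the `VLLM_ATTENTION_BACKEND` environment variable, which can at least force a different backend for an A/B comparison. A sketch follows; the backend name below is an assumed value, and the valid names vary by vLLM version:

```python
import os

# Must be set before vllm is imported for the override to take effect.
# Valid values differ across vLLM versions; FLASH_ATTN is an assumed example.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM  # imported after the override

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
```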

Thanks!

haochengxia added the performance label on May 14, 2025