
[Performance]: Worse prefilling with unified triton attention #18152


Closed

haochengxia opened this issue May 14, 2025 · 0 comments
Labels
performance Performance-related issues

Comments

haochengxia (Contributor) commented May 14, 2025

Report of performance regression

After switching to the kernel updated in #16828, I observed a large prefill (TTFT) performance drop on an A100 (40 GB).

For an 8k-token sequence, the old chunked-prefill kernel takes ~600 ms, while the new one takes ~1500 ms, roughly a 2.5x slowdown.
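
For reproduction, here is a minimal sketch of one way to measure TTFT, not my exact harness; the model name and prompt construction are placeholders:

```python
import time

from vllm import LLM, SamplingParams

# Placeholders: any long-context model and an ~8k-token prompt will do.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
prompt = "word " * 8000  # rough stand-in for an ~8k-token prompt

# Generating a single output token makes end-to-end latency ~= TTFT,
# since almost all of the time is spent in the prefill.
params = SamplingParams(max_tokens=1)

start = time.perf_counter()
llm.generate([prompt], params)
print(f"approx TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
```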

Is there any configuration I should set for this kernel?
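
The closest knob I'm aware of is the `VLLM_ATTENTION_BACKEND` environment variable, which can at least force a different backend for an A/B comparison. A sketch follows; the backend name below is an assumed value, and the valid names vary by vLLM version:

```python
import os

# Must be set before vllm is imported for the override to take effect.
# Valid values differ across vLLM versions; FLASH_ATTN is an assumed example.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM  # imported after the override

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
```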

Thanks!

haochengxia added the performance label on May 14, 2025