
SYCL: Add non contiguous support in RMS_NORM and NORM kernels #13611


Open
qnixsynapse wants to merge 6 commits into master from sycl/non_cont_norms

Conversation

qnixsynapse (Collaborator) commented May 18, 2025

Added non-contiguous support to the RMS_NORM and NORM kernels. test-backend-ops seems to pass with this change.

Edit: restored the logic for handling multiple subgroups correctly, which was not covered by test-backend-ops.
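For context on what non-contiguous support involves: a ggml tensor carries byte strides `nb[]` next to its element counts `ne[]`, and a non-contiguous input (e.g. a view or a permutation) is handled by addressing each row through those strides instead of assuming a densely packed layout. A minimal sketch of that addressing in plain C++ (illustrative only, not the actual SYCL kernel from this PR; the function names are placeholders):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Address of row (i1, i2, i3) of a possibly non-contiguous f32 tensor:
// ne[] holds element counts, nb[] holds byte strides, as in ggml.
static const float * row_ptr(const void * data, const size_t nb[4],
                             int64_t i1, int64_t i2, int64_t i3) {
    return (const float *) ((const char *) data + i1*nb[1] + i2*nb[2] + i3*nb[3]);
}

// RMS norm of one row, assuming the innermost dimension itself is packed
// (elements within a row are contiguous even when the rows are not).
static void rms_norm_row(const float * x, float * dst, int64_t ne0, float eps) {
    float sum = 0.0f;
    for (int64_t i0 = 0; i0 < ne0; ++i0) {
        sum += x[i0] * x[i0];
    }
    const float scale = 1.0f / std::sqrt(sum / ne0 + eps);
    for (int64_t i0 = 0; i0 < ne0; ++i0) {
        dst[i0] = x[i0] * scale;
    }
}
```

The real kernel parallelizes the per-row reduction across a SYCL work-group; the strided row addressing above is the part that changes for non-contiguous inputs.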

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 18, 2025
@qnixsynapse qnixsynapse deleted the sycl/non_cont_norms branch May 18, 2025 06:42
@qnixsynapse qnixsynapse restored the sycl/non_cont_norms branch May 18, 2025 06:56
@qnixsynapse qnixsynapse reopened this May 18, 2025
@qnixsynapse qnixsynapse marked this pull request as draft May 18, 2025 07:01
qnixsynapse (Collaborator, Author) commented

It now seems to pass with ne[0] = 1920:

  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000000): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000000): OK
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000001): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000001): OK
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000100): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000100): OK
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.100000): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.100000): OK
  5535/5535 tests passed
  Backend SYCL0: OK
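The ne[0] = 1920 cases exercise the multi-subgroup path mentioned in the edit above: a 1920-element row is reduced by several subgroups, each summing its own slice, with the partial sums combined before the scale is applied. A serial C++ illustration of that two-stage reduction (purely illustrative; the kernel does this with SYCL subgroup/work-group primitives, and the subgroup width of 32 here is just an example):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Two-stage sum-of-squares reduction: each "subgroup" of width sg_width reduces
// its own slice of the row, then the partial sums are combined into the row total.
static float sum_of_squares(const std::vector<float> & row, size_t sg_width) {
    std::vector<float> partial((row.size() + sg_width - 1) / sg_width, 0.0f);
    for (size_t i = 0; i < row.size(); ++i) {
        partial[i / sg_width] += row[i] * row[i];   // stage 1: per-subgroup partial sums
    }
    float total = 0.0f;
    for (float p : partial) {
        total += p;                                  // stage 2: combine the partials
    }
    return total;
}

int main() {
    std::vector<float> row(1920, 0.5f);              // one row with ne[0] = 1920
    const float eps = 1e-6f;
    const float scale = 1.0f / std::sqrt(sum_of_squares(row, 32) / row.size() + eps);
    std::printf("rms scale = %f\n", scale);
}
```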

@qnixsynapse qnixsynapse marked this pull request as ready for review May 19, 2025 06:16
Rbiessy (Collaborator) left a comment

Were you able to measure the impact on performance for this change? If it has one we may want to introduce different paths for contiguous and non-contiguous cases.

I'll try to check on my side for some relevant sizes at some point.
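If a split did turn out to be worthwhile, the dispatch could hinge on the existing ggml_is_contiguous() helper, along these lines (a hypothetical sketch, not code from this PR; the two kernel names are placeholders):

```cpp
#include <cstdio>

// Placeholder stand-ins for a packed-layout kernel and a strided fallback.
static void norm_f32_contiguous() { std::puts("fast path: packed rows, linear indexing"); }
static void norm_f32_strided()    { std::puts("general path: rows addressed via nb[] byte strides"); }

// Hypothetical dispatch, mirroring what a check on ggml_is_contiguous(src0)
// would decide in the real backend code.
static void launch_norm(bool src_is_contiguous) {
    if (src_is_contiguous) {
        norm_f32_contiguous();
    } else {
        norm_f32_strided();
    }
}

int main() {
    launch_norm(true);
    launch_norm(false);
    return 0;
}
```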

@qnixsynapse qnixsynapse force-pushed the sycl/non_cont_norms branch from 4f9b1bc to d5d39b5 May 19, 2025 11:46
qnixsynapse (Collaborator, Author) commented

> Were you able to measure the impact on performance for this change? If it has one we may want to introduce different paths for contiguous and non-contiguous cases.
>
> I'll try to check on my side for some relevant sizes at some point.

Doesn't seem much different from master:

[llama.cpp][master]$ build/bin/llama-bench -ngl 99 -m ~/Downloads/Weights/pythia-1.4b-q4_0.gguf

| model             | size       | params | backend | ngl | test  | t/s            |
| ----------------- | ---------: | -----: | ------- | --: | ----: | -------------: |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | pp512 | 4202.02 ± 7.96 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | tg128 | 46.34 ± 0.25   |

build: 92ecdcc (5423)

[llama.cpp][sycl/non_cont_norms]$ build/bin/llama-bench -ngl 99 -m ~/Downloads/Weights/pythia-1.4b-q4_0.gguf

| model             | size       | params | backend | ngl | test  | t/s            |
| ----------------- | ---------: | -----: | ------- | --: | ----: | -------------: |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | pp512 | 4202.28 ± 8.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | tg128 | 46.25 ± 0.26   |

build: This PR
