
SYCL: Add non contiguous support in RMS_NORM and NORM kernels #13611


Open
qnixsynapse wants to merge 6 commits into master from sycl/non_cont_norms

Conversation

qnixsynapse (Collaborator) commented May 18, 2025

Added non-contiguous support to the RMS_NORM and NORM kernels. test-backend-ops seems to pass with this change.

Edit: restored the logic for handling multiple subgroups correctly, which was not covered by test-backend-ops.
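For context on what non-contiguous support involves: a ggml tensor carries byte strides `nb[]` next to its element counts `ne[]`, and a non-contiguous input (e.g. a view or a permutation) is handled by addressing each row through those strides instead of assuming a densely packed layout. A minimal sketch of that addressing in plain C++ (illustrative only, not the actual SYCL kernel from this PR; the function names are placeholders):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Address of row (i1, i2, i3) of a possibly non-contiguous f32 tensor:
// ne[] holds element counts, nb[] holds byte strides, as in ggml.
static const float * row_ptr(const void * data, const size_t nb[4],
                             int64_t i1, int64_t i2, int64_t i3) {
    return (const float *) ((const char *) data + i1*nb[1] + i2*nb[2] + i3*nb[3]);
}

// RMS norm of one row, assuming the innermost dimension itself is packed
// (elements within a row are contiguous even when the rows are not).
static void rms_norm_row(const float * x, float * dst, int64_t ne0, float eps) {
    float sum = 0.0f;
    for (int64_t i0 = 0; i0 < ne0; ++i0) {
        sum += x[i0] * x[i0];
    }
    const float scale = 1.0f / std::sqrt(sum / ne0 + eps);
    for (int64_t i0 = 0; i0 < ne0; ++i0) {
        dst[i0] = x[i0] * scale;
    }
}
```

The real kernel parallelizes the per-row reduction across a SYCL work-group; the strided row addressing above is the part that changes for non-contiguous inputs.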

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 18, 2025
@qnixsynapse qnixsynapse deleted the sycl/non_cont_norms branch May 18, 2025 06:42
@qnixsynapse qnixsynapse restored the sycl/non_cont_norms branch May 18, 2025 06:56
@qnixsynapse qnixsynapse reopened this May 18, 2025
@qnixsynapse qnixsynapse marked this pull request as draft May 18, 2025 07:01
qnixsynapse (Collaborator, Author) commented

It now seems to pass with ne[0] = 1920:

  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000000): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000000): OK
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000001): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000001): OK
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000100): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000100): OK
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.100000): OK
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000): OK
  NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000001): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.000100): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=0,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000): OK
  RMS_NORM(type=f32,ne=[1920,5,4,3],v=1,eps=0.100000): OK
  5535/5535 tests passed
  Backend SYCL0: OK
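The ne[0] = 1920 cases exercise the multi-subgroup path mentioned in the edit above: a 1920-element row is reduced by several subgroups, each summing its own slice, with the partial sums combined before the scale is applied. A serial C++ illustration of that two-stage reduction (purely illustrative; the kernel does this with SYCL subgroup/work-group primitives, and the subgroup width of 32 here is just an example):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Two-stage sum-of-squares reduction: each "subgroup" of width sg_width reduces
// its own slice of the row, then the partial sums are combined into the row total.
static float sum_of_squares(const std::vector<float> & row, size_t sg_width) {
    std::vector<float> partial((row.size() + sg_width - 1) / sg_width, 0.0f);
    for (size_t i = 0; i < row.size(); ++i) {
        partial[i / sg_width] += row[i] * row[i];   // stage 1: per-subgroup partial sums
    }
    float total = 0.0f;
    for (float p : partial) {
        total += p;                                  // stage 2: combine the partials
    }
    return total;
}

int main() {
    std::vector<float> row(1920, 0.5f);              // one row with ne[0] = 1920
    const float eps = 1e-6f;
    const float scale = 1.0f / std::sqrt(sum_of_squares(row, 32) / row.size() + eps);
    std::printf("rms scale = %f\n", scale);
}
```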

@qnixsynapse qnixsynapse marked this pull request as ready for review May 19, 2025 06:16
Rbiessy (Collaborator) left a comment

Were you able to measure the impact on performance for this change? If it has one we may want to introduce different paths for contiguous and non-contiguous cases.

I'll try to check on my side for some relevant sizes at some point.
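If a split did turn out to be worthwhile, the dispatch could hinge on the existing ggml_is_contiguous() helper, along these lines (a hypothetical sketch, not code from this PR; the two kernel names are placeholders):

```cpp
#include <cstdio>

// Placeholder stand-ins for a packed-layout kernel and a strided fallback.
static void norm_f32_contiguous() { std::puts("fast path: packed rows, linear indexing"); }
static void norm_f32_strided()    { std::puts("general path: rows addressed via nb[] byte strides"); }

// Hypothetical dispatch, mirroring what a check on ggml_is_contiguous(src0)
// would decide in the real backend code.
static void launch_norm(bool src_is_contiguous) {
    if (src_is_contiguous) {
        norm_f32_contiguous();
    } else {
        norm_f32_strided();
    }
}

int main() {
    launch_norm(true);
    launch_norm(false);
    return 0;
}
```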

@qnixsynapse qnixsynapse force-pushed the sycl/non_cont_norms branch from 4f9b1bc to d5d39b5 May 19, 2025 11:46
qnixsynapse (Collaborator, Author) commented

> Were you able to measure the impact on performance for this change? If it has one we may want to introduce different paths for contiguous and non-contiguous cases.
>
> I'll try to check on my side for some relevant sizes at some point.

Doesn't seem much different from master:

[llama.cpp][master]$ build/bin/llama-bench -ngl 99 -m ~/Downloads/Weights/pythia-1.4b-q4_0.gguf

| model             | size       | params | backend | ngl | test  | t/s            |
| ----------------- | ---------: | -----: | ------- | --: | ----: | -------------: |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | pp512 | 4202.02 ± 7.96 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | tg128 | 46.34 ± 0.25   |

build: 92ecdcc (5423)

[llama.cpp][sycl/non_cont_norms]$ build/bin/llama-bench -ngl 99 -m ~/Downloads/Weights/pythia-1.4b-q4_0.gguf

| model             | size       | params | backend | ngl | test  | t/s            |
| ----------------- | ---------: | -----: | ------- | --: | ----: | -------------: |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | pp512 | 4202.28 ± 8.47 |
| gptneox 1.4B Q4_0 | 786.31 MiB | 1.41 B | SYCL    |  99 | tg128 | 46.25 ± 0.26   |

build: This PR
