Add roll kernels #1380
Conversation
`copy_usm_ndarray_for_reshape` accepted a `shift` parameter, which allowed it to double as the implementation of the `roll` function. That was suboptimal, though: for `roll` the source and destination arrays have the same shape, so stride simplification applies. It also makes sense to create a dedicated kernel implementing `roll` for contiguous inputs, making computations measurably faster.

This PR removes support for the `shift` parameter from `_tensor_impl._copy_usm_ndarray_for_reshape` and introduces `_tensor_impl._copy_usm_ndarray_for_roll`. The latter function ensures both arrays have the same shape, applies stride simplification, and dispatches to specialized kernels for contiguous inputs. Even for strided inputs, less metadata needs to be copied for the kernel to use (the shape is common to both arrays, unlike in reshape).

The result of this change is that `_copy_usm_ndarray_for_roll` runs about 4.5x faster for an input array with about a million elements than the previous call to `_copy_usm_ndarray_for_reshape` with the `shift` parameter set:

```
In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: a = np.ones((3,4,5,6,7,8))

In [3]: b = dpt.ones((3,4,5,6,7,8))

In [4]: w = dpt.empty_like(b)

In [5]: import dpctl.tensor._tensor_impl as ti

In [6]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
161 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: b.size
Out[7]: 20160

In [8]: b = dpt.ones((30,40,5,6,7,80))

In [9]: w = dpt.empty_like(b)

In [10]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
4.91 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: a = np.ones(b.shape)

In [12]: %timeit np.roll(a,2)
23 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Previously:

```
In [8]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
20.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
19.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=0, sycl_queue=b.sycl_queue)[0].wait()
19.7 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: b.shape
Out[11]: (30, 40, 5, 6, 7, 80)
```
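For context, here is a minimal correctness sketch (not part of this PR; it assumes a default SYCL device is available) showing that the flattened roll exposed through the public `dpt.roll` matches NumPy's `np.roll` semantics:

```python
import numpy as np
import dpctl.tensor as dpt

# flattened roll (axis=None) on a multi-dimensional array should match NumPy
a = np.arange(3 * 4 * 5).reshape(3, 4, 5)
b = dpt.asarray(a)

rolled = dpt.roll(b, 7)      # roll the flattened array by 7, keep the shape
expected = np.roll(a, 7)

assert np.array_equal(dpt.asnumpy(rolled), expected)
```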
Removed use of the `shift=0` argument to `_copy_usm_ndarray_for_reshape` in `_reshape.py`. Used `_copy_usm_ndarray_for_roll` in the `roll` implementation.
This is used to compute the displacement for `a[(i0 - shifts[0]) % shape[0], (i1 - shifts[1]) % shape[1], ...]`.
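As a plain-Python illustration of that index arithmetic (a reference sketch, not the kernel code), the per-element displacement can be expressed as:

```python
import itertools

import numpy as np


def roll_nd_reference(a, shifts):
    """Reference n-d roll: out[i0, i1, ...] = a[(i0 - shifts[0]) % shape[0], ...]."""
    out = np.empty_like(a)
    for idx in itertools.product(*(range(n) for n in a.shape)):
        src_idx = tuple((i - s) % n for i, s, n in zip(idx, shifts, a.shape))
        out[idx] = a[src_idx]
    return out


a = np.arange(12).reshape(3, 4)
assert np.array_equal(roll_nd_reference(a, (1, 2)), np.roll(a, (1, 2), axis=(0, 1)))
```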
The function for flattened rolling is renamed: `_copy_usm_ndarray_for_roll` -> `_copy_usm_ndarray_for_roll_1d`. `_copy_usm_ndarray_for_roll_1d` keeps the same signature: `_copy_usm_ndarray_for_roll_1d(src: usm_ndarray, dst: usm_ndarray, shift: Int, sycl_queue: dpctl.SyclQueue) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]`.

Introduced `_copy_usm_ndarray_for_roll_nd(src: usm_ndarray, dst: usm_ndarray, shifts: Tuple[Int], sycl_queue: dpctl.SyclQueue) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]`.

The length of the `shifts` tuple must equal the dimensionality of the `src` and `dst` arrays, which must have the same shape and the same data type.
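A minimal usage sketch based on the signatures above; it assumes a default SYCL device is available and follows the `[0].wait()` synchronization pattern used in the benchmarks earlier in this PR:

```python
import dpctl.tensor as dpt
import dpctl.tensor._tensor_impl as ti

x = dpt.reshape(dpt.arange(24), (2, 3, 4))

# flattened roll (axis=None): a single integer shift
y = dpt.empty_like(x)
ev = ti._copy_usm_ndarray_for_roll_1d(
    src=x, dst=y, shift=5, sycl_queue=x.sycl_queue
)
ev[0].wait()

# multi-axis roll: one shift per axis, len(shifts) == x.ndim
z = dpt.empty_like(x)
ev = ti._copy_usm_ndarray_for_roll_nd(
    src=x, dst=z, shifts=(1, 0, 2), sycl_queue=x.sycl_queue
)
ev[0].wait()
```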
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1380/index.html
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_42 ran successfully.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_62 ran successfully.
- Will cover lines missed by the test suite
@oleksandr-pavlyk
I've added a test just to cover a couple of lines in the coverage CI.
I tested out the branch and reviewed the code; it looks good to me.
With the changes to the indexers, we may want to go back and comb through other kernels to make sure there isn't some unnecessary casting. But maybe that can wait for when we replace `py::ssize_t` with `ssize_t`.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_63 ran successfully.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
This PR supersedes gh-1341.
This PR removes the `shift` parameter from `_copy_usm_ndarray_for_reshape` and introduces `_copy_usm_ndarray_for_roll_1d` (for `axis=None`) and `_copy_usm_ndarray_for_roll_nd`, whose `shifts` argument is expected to be a tuple specifying the shift size for every axis. The dedicated function `_copy_usm_ndarray_for_roll_nd` is used to implement the multi-axis `roll` function instead of a sequence of concurrent copy operations.

This PR also makes a call like `dpt.roll(x, shift=(1,1))` raise a `TypeError` exception, since the array API spec mandates that when `shift` is a tuple, `axis` must also be a tuple.
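For illustration, a short sketch of the resulting behavior (assuming the array API-conforming validation described above and a default SYCL device):

```python
import dpctl.tensor as dpt

x = dpt.reshape(dpt.arange(12), (3, 4))

# valid: shift and axis are both tuples of the same length
y = dpt.roll(x, shift=(1, 1), axis=(0, 1))

# invalid per the array API spec: shift is a tuple but axis is None
try:
    dpt.roll(x, shift=(1, 1))
except TypeError as e:
    print("raised TypeError as expected:", e)
```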