Add roll kernels #1380
Conversation
`copy_usm_ndarray_for_reshape` accepted a `shift` parameter, which allowed it to double as the implementation of the `roll` function. That was suboptimal, though: for `roll` the source and destination arrays have the same shape, so stride simplification applies. It also makes sense to create a dedicated kernel implementing `roll` for contiguous inputs, making computations measurably faster.

This PR removes support for the `shift` parameter from `_tensor_impl._copy_usm_ndarray_for_reshape` and introduces `_tensor_impl._copy_usm_ndarray_for_roll`. The latter function ensures both arrays have the same shape, applies stride simplification, and dispatches to specialized kernels for contiguous inputs. Even for strided inputs, less metadata needs to be copied for the kernel to use (the shape is common to both arrays, unlike in reshape).

The result of this change is that `_copy_usm_ndarray_for_roll` runs about 4.5x faster for an input array with about a million elements than the previous call to `_copy_usm_ndarray_for_reshape` with the `shift` parameter set:

```
In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: a = np.ones((3,4,5,6,7,8))

In [3]: b = dpt.ones((3,4,5,6,7,8))

In [4]: w = dpt.empty_like(b)

In [5]: import dpctl.tensor._tensor_impl as ti

In [6]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
161 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: b.size
Out[7]: 20160

In [8]: b = dpt.ones((30,40,5,6,7,80))

In [9]: w = dpt.empty_like(b)

In [10]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
4.91 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: a = np.ones(b.shape)

In [12]: %timeit np.roll(a,2)
23 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Previously:

```
In [8]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
20.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
19.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=0, sycl_queue=b.sycl_queue)[0].wait()
19.7 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: b.shape
Out[11]: (30, 40, 5, 6, 7, 80)
```
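For context, here is a minimal correctness sketch (not part of this PR; it assumes a default SYCL device is available) showing that the flattened roll exposed through the public `dpt.roll` matches NumPy's `np.roll` semantics:

```python
import numpy as np
import dpctl.tensor as dpt

# flattened roll (axis=None) on a multi-dimensional array should match NumPy
a = np.arange(3 * 4 * 5).reshape(3, 4, 5)
b = dpt.asarray(a)

rolled = dpt.roll(b, 7)      # roll the flattened array by 7, keep the shape
expected = np.roll(a, 7)

assert np.array_equal(dpt.asnumpy(rolled), expected)
```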
Removed use of the `shift=0` argument to `_copy_usm_ndarray_for_reshape` in `_reshape.py`. Used `_copy_usm_ndarray_for_roll` in the `roll` implementation.
This is used to compute the displacement for `a[(i0 - shifts[0]) % shape[0], (i1 - shifts[1]) % shape[1], ...]`.
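As a plain-Python illustration of that index arithmetic (a reference sketch, not the kernel code), the per-element displacement can be expressed as:

```python
import itertools

import numpy as np


def roll_nd_reference(a, shifts):
    """Reference n-d roll: out[i0, i1, ...] = a[(i0 - shifts[0]) % shape[0], ...]."""
    out = np.empty_like(a)
    for idx in itertools.product(*(range(n) for n in a.shape)):
        src_idx = tuple((i - s) % n for i, s, n in zip(idx, shifts, a.shape))
        out[idx] = a[src_idx]
    return out


a = np.arange(12).reshape(3, 4)
assert np.array_equal(roll_nd_reference(a, (1, 2)), np.roll(a, (1, 2), axis=(0, 1)))
```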
The function for flattened rolling is renamed: `_copy_usm_ndarray_for_roll` -> `_copy_usm_ndarray_for_roll_1d`. `_copy_usm_ndarray_for_roll_1d` keeps the same signature: `_copy_usm_ndarray_for_roll_1d(src: usm_ndarray, dst: usm_ndarray, shift: Int, sycl_queue: dpctl.SyclQueue) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]`.

Introduced `_copy_usm_ndarray_for_roll_nd(src: usm_ndarray, dst: usm_ndarray, shifts: Tuple[Int], sycl_queue: dpctl.SyclQueue) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]`.

The length of the `shifts` tuple must equal the dimensionality of the `src` and `dst` arrays, which must have the same shape and the same data type.
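A minimal usage sketch based on the signatures above; it assumes a default SYCL device is available and follows the `[0].wait()` synchronization pattern used in the benchmarks earlier in this PR:

```python
import dpctl.tensor as dpt
import dpctl.tensor._tensor_impl as ti

x = dpt.reshape(dpt.arange(24), (2, 3, 4))

# flattened roll (axis=None): a single integer shift
y = dpt.empty_like(x)
ev = ti._copy_usm_ndarray_for_roll_1d(
    src=x, dst=y, shift=5, sycl_queue=x.sycl_queue
)
ev[0].wait()

# multi-axis roll: one shift per axis, len(shifts) == x.ndim
z = dpt.empty_like(x)
ev = ti._copy_usm_ndarray_for_roll_nd(
    src=x, dst=z, shifts=(1, 0, 2), sycl_queue=x.sycl_queue
)
ev[0].wait()
```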
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1380/index.html
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_42 ran successfully.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_62 ran successfully.
- Will cover lines missed by the test suite
@oleksandr-pavlyk
I've added a test just to cover a couple of lines in the coverage CI.
I tested out the branch and reviewed the code; it looks good to me.
With the changes to the indexers, we may want to go back and comb through other kernels to make sure there isn't some unnecessary casting. But maybe that can wait for when we replace `py::ssize_t` with `ssize_t`.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_63 ran successfully.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
This PR supersedes gh-1341.
This PR removes the `shift` parameter from `_copy_usm_ndarray_for_reshape` and introduces `_copy_usm_ndarray_for_roll_1d` (for `axis=None`) and `_copy_usm_ndarray_for_roll_nd`, whose `shifts` argument is expected to be a tuple specifying the shift size for every axis. The dedicated function `_copy_usm_ndarray_for_roll_nd` is used to implement the multi-axis `roll` function instead of a sequence of concurrent copy operations.

This PR also makes a call like `dpt.roll(x, shift=(1,1))` raise a `TypeError` exception, since the array API spec mandates that when `shift` is a tuple, `axis` must also be a tuple.
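For illustration, a short sketch of the resulting behavior (assuming the array API-conforming validation described above and a default SYCL device):

```python
import dpctl.tensor as dpt

x = dpt.reshape(dpt.arange(12), (3, 4))

# valid: shift and axis are both tuples of the same length
y = dpt.roll(x, shift=(1, 1), axis=(0, 1))

# invalid per the array API spec: shift is a tuple but axis is None
try:
    dpt.roll(x, shift=(1, 1))
except TypeError as e:
    print("raised TypeError as expected:", e)
```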