Add roll kernels #1380

Merged
merged 9 commits into from
Sep 8, 2023

@oleksandr-pavlyk oleksandr-pavlyk commented Aug 30, 2023

This PR supersedes gh-1341.

This PR removes the shift parameter from _copy_usm_ndarray_for_reshape and introduces the functions _copy_usm_ndarray_for_roll_1d (for axis=None) and _copy_usm_ndarray_for_roll_nd, where the shifts argument is expected to be a tuple specifying the shift size for every axis.

This dedicated function _copy_usm_ndarray_for_roll_nd is used to implement multi-axis roll function instead of a sequence of concurrent copy operations.

With this PR, the call dpt.roll(x, shift=(1,1)) raises a TypeError exception, since the array API spec mandates that when shift is a tuple, axis must also be a tuple.
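The argument rule above can be sketched in plain Python. The helper below is hypothetical (it is not part of dpctl, and the name `normalize_roll_args` is invented for illustration); it only shows one way to turn a `shift`/`axis` pair into a per-axis shifts tuple of the kind `_copy_usm_ndarray_for_roll_nd` is described as expecting. The choice of exception types other than the `TypeError` mentioned above is an assumption.

```python
import operator


def normalize_roll_args(shift, axis, ndim):
    """Hypothetical helper: validate (shift, axis) and expand to per-axis shifts.

    Returns an int when axis is None (flattened roll), otherwise a tuple of
    length ndim holding the total shift for every axis.
    """
    if axis is None:
        if isinstance(shift, tuple):
            # Tuple shift without a tuple axis is rejected, as described above.
            raise TypeError("shift must be an integer when axis is None")
        return operator.index(shift)

    if isinstance(shift, tuple):
        if not isinstance(axis, tuple):
            raise TypeError("axis must be a tuple when shift is a tuple")
        if len(shift) != len(axis):
            raise ValueError("shift and axis must have the same length")
        pairs = list(zip(shift, axis))
    else:
        axes = axis if isinstance(axis, tuple) else (axis,)
        # A scalar shift is applied to every listed axis.
        pairs = [(shift, ax) for ax in axes]

    shifts = [0] * ndim
    for sh, ax in pairs:
        ax = operator.index(ax)
        if not -ndim <= ax < ndim:
            raise ValueError("axis is out of range")
        # Repeated axes accumulate their shifts.
        shifts[ax % ndim] += operator.index(sh)
    return tuple(shifts)
```

For a three-dimensional input, `normalize_roll_args((1, 1), (0, 1), 3)` gives `(1, 1, 0)`, while `normalize_roll_args((1, 1), None, 3)` raises `TypeError`.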

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

`copy_usm_ndarray_for_reshape` accepted a `shift` parameter, which allowed
it to double as the implementation of the `roll` function.

This was suboptimal though, since for `roll` both the source and destination
arrays have the same shape, and stride simplification applies. It also
makes sense to create a dedicated kernel to implement `roll` for contiguous
inputs, making computations measurably faster.
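To illustrate why a common shape helps, here is a minimal sketch of stride simplification for two arrays that share a shape. It is not dpctl's implementation: the function name is hypothetical, strides are assumed to be given in elements, the arrays are assumed non-empty, and only merging of adjacent axes in C order is shown (a real implementation may also reorder axes or handle negative strides).

```python
def simplify_common_shape(shape, src_strides, dst_strides):
    """Hypothetical sketch: merge adjacent axes that are laid out back to
    back in BOTH arrays, so fewer indices are needed per element."""
    out_shape, out_src, out_dst = [], [], []
    for n, s, d in zip(shape, src_strides, dst_strides):
        if n == 1:
            # Unit-length axes never contribute to the element offset.
            continue
        if out_shape and out_src[-1] == n * s and out_dst[-1] == n * d:
            # The previous axis steps over exactly one full run of this
            # axis in both arrays, so the two axes can be fused.
            out_shape[-1] *= n
            out_src[-1] = s
            out_dst[-1] = d
        else:
            out_shape.append(n)
            out_src.append(s)
            out_dst.append(d)
    if not out_shape:
        # Every axis had length 1: a single element remains.
        return (1,), (1,), (1,)
    return tuple(out_shape), tuple(out_src), tuple(out_dst)
```

Two C-contiguous arrays of shape `(3, 4, 5)` collapse to a single axis of length 60 with unit strides, whereas a C-contiguous source paired with an F-contiguous destination cannot be merged at all.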

This PR removes support for `shift` parameter from
_tensor_impl._copy_usm_ndarray_for_reshape and introduces
_tensor_impl._copy_usm_ndarray_for_roll.

This latter function requires both arrays to have the same shape, applies
stride simplification and dispatches to specialized kernels for contiguous
inputs. Even for strided inputs, less metadata needs to be copied for the
kernel to use (the shape is common, unlike in reshape).
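For the contiguous case, the flattened roll amounts to moving two contiguous chunks. The NumPy sketch below illustrates that idea only; it is not the SYCL kernel added in this PR, and `roll_flat_contiguous` is a hypothetical name.

```python
import numpy as np


def roll_flat_contiguous(src, shift):
    """Hypothetical sketch: roll of the flattened array done as two
    contiguous block copies (the axis=None case)."""
    flat = np.ascontiguousarray(src).reshape(-1)
    n = flat.size
    dst = np.empty_like(flat)
    if n == 0:
        return dst.reshape(src.shape)
    k = shift % n  # effective displacement in [0, n)
    dst[k:] = flat[: n - k]  # head of src moves k positions forward
    dst[:k] = flat[n - k :]  # tail of src wraps around to the front
    return dst.reshape(src.shape)
```

The result agrees with `np.roll(src, shift)` for any integer shift, including negative shifts and shifts larger than the array size.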

The result of this change is that _copy_usm_ndarray_for_roll runs about
4.5x faster on an input array with about a million elements than the
previous call to _copy_usm_ndarray_for_reshape with the shift parameter set:

```
In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: a = np.ones((3,4,5,6,7,8))

In [3]: b = dpt.ones((3,4,5,6,7,8))

In [4]: w = dpt.empty_like(b)

In [5]: import dpctl.tensor._tensor_impl as ti

In [6]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
161 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: b.size
Out[7]: 20160

In [8]: b = dpt.ones((30,40,5,6,7,80))

In [9]: w = dpt.empty_like(b)

In [10]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
4.91 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: a = np.ones(b.shape)

In [12]: %timeit np.roll(a,2)
23 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Previously:

```
In [8]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
20.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
19.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=0, sycl_queue=b.sycl_queue)[0].wait()
19.7 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: b.shape
Out[11]: (30, 40, 5, 6, 7, 80)
```
Remove use of `shift=0` argument to `_copy_usm_ndarray_for_reshape`
in _reshape.py

Used `_copy_usm_ndarray_for_roll` in `roll` implementation.
This is used to compute displacement for
   a[(i0 - shifts[0]) % shape[0],
     (i1 - shifts[1]) % shape[1], ... ]
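
The displacement expression above can be checked against NumPy with a slow reference loop. This is only a sketch for verifying the index arithmetic, not the kernel itself; `roll_nd_reference` is a hypothetical name.

```python
import itertools

import numpy as np


def roll_nd_reference(src, shifts):
    """Hypothetical reference: dst[i0, i1, ...] is read from
    src[(i0 - shifts[0]) % shape[0], (i1 - shifts[1]) % shape[1], ...]."""
    if len(shifts) != src.ndim:
        raise ValueError("need one shift per axis")
    dst = np.empty_like(src)
    for idx in itertools.product(*(range(n) for n in src.shape)):
        src_idx = tuple(
            (i - s) % n for i, s, n in zip(idx, shifts, src.shape)
        )
        dst[idx] = src[src_idx]
    return dst
```

For any integer shifts, the output matches `np.roll(src, shift=shifts, axis=tuple(range(src.ndim)))`.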
Function for flattened rolling is renamed:
   _copy_usm_ndarray_for_roll -> _copy_usm_ndarray_for_roll_1d

_copy_usm_ndarray_for_roll_1d has the same signature:
   _copy_usm_ndarray_for_roll_1d(
         src : usm_ndarray,
         dst : usm_ndarray,
         shift: Int,
         sycl_queue: dpctl.SyclQueue) ->
              Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

Introduced
   _copy_usm_ndarray_for_roll_nd(
         src : usm_ndarray,
         dst : usm_ndarray,
         shifts: Tuple[Int],
         sycl_queue: dpctl.SyclQueue) ->
              Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

The length of shifts tuple must be the same as the dimensionality
of src and dst arrays, which are supposed to have the same shape
and the same data type.
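
The preconditions stated above can be written down as a small check. The sketch uses NumPy arrays as stand-ins for usm_ndarray, the helper name is hypothetical, and the exception types are assumptions rather than what dpctl actually raises.

```python
import numpy as np


def check_roll_nd_args(src, dst, shifts):
    """Hypothetical precondition check for an n-d roll copy."""
    if src.shape != dst.shape:
        raise ValueError("src and dst must have the same shape")
    if src.dtype != dst.dtype:
        raise TypeError("src and dst must have the same data type")
    if len(shifts) != src.ndim:
        raise ValueError("shifts must contain one entry per axis")
    # Normalize to plain Python ints, one per axis.
    return tuple(int(s) for s in shifts)
```

A call such as `check_roll_nd_args(np.ones((3, 4)), np.empty((3, 4)), (1, 2))` passes and returns `(1, 2)`; a shifts tuple of the wrong length is rejected.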
@coveralls
Collaborator

Coverage Status

coverage: 85.618% (-0.02%) from 85.635% when pulling 64fa95e on dedicated-roll-kernel into 9f98baf on master.

@github-actions

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_42 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk marked this pull request as ready for review September 5, 2023 15:49
@github-actions

github-actions bot commented Sep 7, 2023

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_62 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

- Will cover lines missed by test suite
Collaborator

@ndgrigorian ndgrigorian left a comment

@oleksandr-pavlyk
I've added a test just to cover a couple of lines in the coverage CI.

I tested out the branch and reviewed the code; it looks good to me.
With the changes to the indexers, we may want to go back and comb through other kernels to make sure there isn't any unnecessary casting. But maybe that can wait until we replace py::ssize_t with ssize_t.

@github-actions

github-actions bot commented Sep 8, 2023

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_63 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk merged commit a6cb5db into master Sep 8, 2023
@oleksandr-pavlyk oleksandr-pavlyk deleted the dedicated-roll-kernel branch September 8, 2023 11:52
@github-actions

github-actions bot commented Sep 8, 2023

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
