Merge 0.14.6dev5 into gold/2021 #1396


Merged: 66 commits merged into gold/2021 on Sep 11, 2023

Conversation

oleksandr-pavlyk
Contributor

This PR contains the following changes:

ndgrigorian and others added 30 commits August 19, 2023 18:29
* Implements matrix_transpose
- Function wrapper for call to dpctl.tensor.usm_ndarray.mT attribute

* Add arg validation tests for matrix_transpose

* Added a test for matrix_transpose for coverage
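A minimal usage sketch of the new wrapper (assuming a standard dpctl install; the function simply forwards to the array's `mT` attribute):

```
import dpctl.tensor as dpt

x = dpt.reshape(dpt.arange(6), (2, 3))
y = dpt.matrix_transpose(x)   # equivalent to x.mT
print(y.shape)                # (3, 2)
```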
- these properties were setting the flags of the output to the flags of the input, which is incorrect, as the output is almost never contiguous
- added tests for this behavior
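The commit does not name the affected properties; as a general illustration of why copying the input's flags is wrong, the transpose of a C-contiguous array is itself not C-contiguous:

```
import dpctl.tensor as dpt

x = dpt.ones((2, 3))
print(x.flags.c_contiguous)     # True
print(x.mT.flags.c_contiguous)  # False: the transpose is a strided view
```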
Removes deprecated DPCTLDevice_GetMaxWorkItemSizes.

Added Null_DRef tests for DPCTLDevice_GetMaxWorkItemSizes1d,
DPCTLDevice_GetMaxWorkItemSizes2d, DPCTLDevice_GetMaxWorkItemSizes3d
* Implements ``types`` property for elementwise functions
- Output corresponds with NumPy's: a list with an arrow marking the domain-to-range type map

* Added tests for behavior of types property
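A hedged illustration of the new property; the exact lists depend on the build and device, but entries follow the `in->out` convention described above:

```
import dpctl.tensor as dpt

print(dpt.add.types)    # e.g. ['bb->b', ..., 'ff->f', 'dd->d']
print(dpt.sqrt.types)   # unary case, e.g. ['f->f', 'd->d']
```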
…rkItemSizes

Removes deprecated DPCTLDevice_GetMaxWorkItemSizes
Improved message text in two exceptions
Since numba-dpex has dropped Python 3.8 from its build matrix, we can now
do the same, with dpnp to follow.
Also install ninja from pip instead of apt.
Do not build dpctl for Python 3.8
`copy_usm_ndarray_for_reshape` accepted a `shift` parameter, which allowed
it to double as the implementation of the `roll` function.

This was suboptimal, though: for `roll`, the source and destination
arrays have the same shape, so stride simplification applies. It also
makes sense to create a dedicated kernel implementing `roll` for contiguous
inputs, making computations measurably faster.

This PR removes support for the `shift` parameter from
`_tensor_impl._copy_usm_ndarray_for_reshape` and introduces
`_tensor_impl._copy_usm_ndarray_for_roll`.

The latter function checks that the shapes match, applies stride
simplification, and dispatches to specialized kernels for contiguous
inputs. Even for strided inputs, less metadata needs to be copied for the
kernel to use (the shape is shared, unlike in reshape).

As a result of this change, `_copy_usm_ndarray_for_roll` runs about
4.5x faster on an input array of about two million elements than the
previous call to `_copy_usm_ndarray_for_reshape` with the shift parameter set:

```
In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: a = np.ones((3,4,5,6,7,8))

In [3]: b = dpt.ones((3,4,5,6,7,8))

In [4]: w = dpt.empty_like(b)

In [5]: import dpctl.tensor._tensor_impl as ti

In [6]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
161 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: b.size
Out[7]: 20160

In [8]: b = dpt.ones((30,40,5,6,7,80))

In [9]: w = dpt.empty_like(b)

In [10]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
4.91 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: a = np.ones(b.shape)

In [12]: %timeit np.roll(a,2)
23 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Previously:

```
In [8]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
20.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
19.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=0, sycl_queue=b.sycl_queue)[0].wait()
19.7 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: b.shape
Out[11]: (30, 40, 5, 6, 7, 80)
```
Remove use of `shift=0` argument to `_copy_usm_ndarray_for_reshape`
in _reshape.py

Used `_copy_usm_ndarray_for_roll` in `roll` implementation.
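For reference, a sketch of the public API now backed by the dedicated kernel; semantics follow numpy.roll:

```
import dpctl.tensor as dpt

x = dpt.arange(10)
y = dpt.roll(x, 2)        # flattened roll, served by _copy_usm_ndarray_for_roll
print(dpt.asnumpy(y))     # [8 9 0 1 2 3 4 5 6 7]
```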
We previously used the `sycl::multi_ptr` constructor instead of
`sycl::address_space_cast`, and the change in
KhronosGroup/SYCL-Docs#432
introduced `sycl::access::decorated::legacy` as the default, which is
deprecated in the SYCL 2020 standard; this is what highlighted the problem.

When using `sycl::address_space_cast`, we specify
`sycl::access::decorated::yes`.
```
In file included from ~/dpctl/dpctl/tensor/libtensor/source/elementwise_functions.cpp:56:
~/dpctl/dpctl/tensor/libtensor/include/kernels/elementwise_functions/expm1.hpp:118:42: warning: 'sincos' is deprecated: SYCL builtin functions with raw pointer arguments have been deprecated. Please use multi_ptr. [-Wdeprecated-declarations]
  118 |             const realT sinY_val = sycl::sincos(y, &cosY_val);
```

The resolution is to convert the raw pointer to a multi-pointer using `sycl::address_space_cast`.
…020-standard

Conversion from raw to multi_ptr should be done with address_space_cast
Ensure that the ndarray into which we convert a single-element usm_ndarray
instance is 0d before calling __int__, __float__, __complex__, __index__.
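A sketch of the conversion path being guarded, assuming single-element conversion is supported as the commit implies:

```
import dpctl.tensor as dpt

x = dpt.asarray([42])            # single-element usm_ndarray
print(int(x))                    # converted via a 0d ndarray before __int__
print(float(dpt.asarray(2.5)))   # same path for __float__
```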
Ensured that `create_property_list` always returns a non-null unique
pointer by creating a default-constructed property_list for the
fall-through case.

With this change we no longer need two branches for the call to the
sycl::queue constructor, since propList is always available.
This is to address NumPy 1.25 deprecation warnings.
…arning

Resolve SYCL-2020 deprecation warning
…ired_zero_dim_ndarray

Address NumPy 1.25 deprecation warnings
Moved the assertion about comparison with generic types before the
other assertions.

This change addresses a Coverity scan issue.
This resolves two Coverity reported issues.
Turn comparison call into assertion in test_usm_ndarray_ctor::test_flags
Also adds a step to output the array-api-tests summary into the log
(a step which works for PRs regardless of whether they are opened from
a fork or from a branch in this repo).
Addressed an issue with lowercase `order` values in tensor `copy` and `astype`
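Illustrative of the fix, assuming lowercase flags are now accepted wherever uppercase ones are:

```
import dpctl.tensor as dpt

x = dpt.ones((2, 3))
y = dpt.astype(x, "f4", order="c")   # lowercase 'c' now treated like 'C'
z = dpt.copy(x, order="f")
```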
oleksandr-pavlyk and others added 27 commits August 29, 2023 04:27
Made changes similar to those made in the kernels for atomic
reduction. Work-groups' locations now change fastest along the iteration
dimension (previously they changed fastest along the reduction dimension).

Due to this change, reduction performance increases 7-8x:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms

In [5]: quit
```

While in the main branch:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms

In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms

In [6]: quit
```
This is used to compute displacement for
   a[(i0 - shifts[0]) % shape[0],
     (i1 - shifts[1]) % shape[1], ... ]
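A minimal NumPy sketch of that displacement rule (illustrative only, not the dpctl kernel):

```
import numpy as np

def roll_nd(src, shifts):
    # dst[i0, i1, ...] = src[(i0 - shifts[0]) % shape[0], (i1 - shifts[1]) % shape[1], ...]
    dst = np.empty_like(src)
    for dst_idx in np.ndindex(*src.shape):
        src_idx = tuple((i - s) % n for i, s, n in zip(dst_idx, shifts, src.shape))
        dst[dst_idx] = src[src_idx]
    return dst

a = np.arange(12).reshape(3, 4)
assert np.array_equal(roll_nd(a, (1, 2)), np.roll(a, (1, 2), axis=(0, 1)))
```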
Function for flattened rolling is renamed:
   _copy_usm_ndarray_for_roll -> _copy_usm_ndarray_for_roll_1d

_copy_usm_ndarray_for_roll_1d has the same signature:
   _copy_usm_ndarray_for_roll_1d(
       src: usm_ndarray,
       dst: usm_ndarray,
       shift: Int,
       sycl_queue: dpctl.SyclQueue
   ) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

Introduced
   _copy_usm_ndarray_for_roll_nd(
       src: usm_ndarray,
       dst: usm_ndarray,
       shifts: Tuple[Int],
       sycl_queue: dpctl.SyclQueue
   ) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

The length of the shifts tuple must equal the dimensionality of the src
and dst arrays, which are expected to have the same shape and the same
data type.
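A hedged call sketch based on the signature above (a private API, shown for illustration only):

```
import dpctl.tensor as dpt
import dpctl.tensor._tensor_impl as ti

x = dpt.reshape(dpt.arange(24), (4, 6))
y = dpt.empty_like(x)
ht_ev, ev = ti._copy_usm_ndarray_for_roll_nd(
    src=x, dst=y, shifts=(1, 2), sycl_queue=x.sycl_queue
)
ht_ev.wait()   # as in the benchmarks above, wait on the first returned event
```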
Changed run_test files to output a verbose listing of the platform config.
To create the multi_ptr from a local variable (in private memory) we should be
using address_space::private_space.
…hon_libs

Remove deprecated FindPythonLibs
The kernel is applicable if both inputs are F-contiguous, or
if the first input is F-contiguous and we are reducing to a
1d C-contiguous array.

Closes gh-1391
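Illustrative of the case the kernel targets, assuming an `order="F"` input:

```
import dpctl.tensor as dpt

x = dpt.asarray([[1.0, 2.0], [3.0, 4.0]], order="F")  # F-contiguous input
s = dpt.sum(x, axis=0)   # 1d C-contiguous result; previously wrong (gh-1391)
print(dpt.asnumpy(s))    # [4. 6.]
```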
…dress-space-cast

Correct address_space when casting ref of local variable to multi_ptr
Fixed constexpr significant bits value for double
Fix for incorrect result in reduction over axis=0
- Will cover lines missed by test suite
@oleksandr-pavlyk oleksandr-pavlyk merged commit d63f650 into gold/2021 Sep 11, 2023
@github-actions

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
