Merge 0.14.6dev5 into gold/2021 #1396


Merged: 66 commits merged into gold/2021 on Sep 11, 2023

Conversation

oleksandr-pavlyk
Contributor

This PR contains the following changes:

ndgrigorian and others added 30 commits August 19, 2023 18:29
* Implements matrix_transpose
- Function wrapper for call to dpctl.tensor.usm_ndarray.mT attribute

* Add arg validation tests for matrix_transpose

* Added a test for matrix_transpose for coverage
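A minimal usage sketch of the new wrapper (assuming a standard dpctl install; the function simply forwards to the array's `mT` attribute):

```
import dpctl.tensor as dpt

x = dpt.reshape(dpt.arange(6), (2, 3))
y = dpt.matrix_transpose(x)   # equivalent to x.mT
print(y.shape)                # (3, 2)
```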
- these properties were setting the flags of the output to the flags of the input, which is incorrect, as the output is almost never contiguous
- added tests for this behavior
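The commit does not name the affected properties; as a general illustration of why copying the input's flags is wrong, the transpose of a C-contiguous array is itself not C-contiguous:

```
import dpctl.tensor as dpt

x = dpt.ones((2, 3))
print(x.flags.c_contiguous)     # True
print(x.mT.flags.c_contiguous)  # False: the transpose is a strided view
```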
Removes deprecated DPCTLDevice_GetMaxWorkItemSizes.

Added Null_DRef tests for DPCTLDevice_GetMaxWorkItemSizes1d,
DPCTLDevice_GetMaxWorkItemSizes2d, DPCTLDevice_GetMaxWorkItemSizes3d
* Implements ``types`` property for elementwise functions
- Output corresponds with NumPy's: a list with an arrow marking the domain-to-range type map

* Added tests for behavior of types property
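A hedged illustration of the new property; the exact lists depend on the build and device, but entries follow the `in->out` convention described above:

```
import dpctl.tensor as dpt

print(dpt.add.types)    # e.g. ['bb->b', ..., 'ff->f', 'dd->d']
print(dpt.sqrt.types)   # unary case, e.g. ['f->f', 'd->d']
```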
…rkItemSizes

Removes deprecated DPCTLDevice_GetMaxWorkItemSizes
Improved message text in two exceptions
Since numba-dpex has dropped Python 3.8 from its build matrix, we can now
do the same, with dpnp to follow.
Also install ninja from pip instead of apt.
Do not build dpctl for Python 3.8
`copy_usm_ndarray_for_reshape` accepted a `shift` parameter, which allowed
it to double as the implementation of the `roll` function.

This was suboptimal, though: for `roll`, the source and destination
arrays have the same shape, so stride simplification applies. It also
makes sense to create a dedicated kernel implementing `roll` for contiguous
inputs, making computations measurably faster.

This PR removes support for the `shift` parameter from
`_tensor_impl._copy_usm_ndarray_for_reshape` and introduces
`_tensor_impl._copy_usm_ndarray_for_roll`.

The latter function checks that the shapes match, applies stride
simplification, and dispatches to specialized kernels for contiguous
inputs. Even for strided inputs, less metadata needs to be copied for the
kernel to use (the shape is shared, unlike in reshape).

As a result of this change, `_copy_usm_ndarray_for_roll` runs about
4.5x faster on an input array of about two million elements than the
previous call to `_copy_usm_ndarray_for_reshape` with the shift parameter set:

```
In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: a = np.ones((3,4,5,6,7,8))

In [3]: b = dpt.ones((3,4,5,6,7,8))

In [4]: w = dpt.empty_like(b)

In [5]: import dpctl.tensor._tensor_impl as ti

In [6]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
161 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: b.size
Out[7]: 20160

In [8]: b = dpt.ones((30,40,5,6,7,80))

In [9]: w = dpt.empty_like(b)

In [10]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
4.91 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: a = np.ones(b.shape)

In [12]: %timeit np.roll(a,2)
23 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Previously:

```
In [8]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
20.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
19.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=0, sycl_queue=b.sycl_queue)[0].wait()
19.7 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: b.shape
Out[11]: (30, 40, 5, 6, 7, 80)
```
Remove use of `shift=0` argument to `_copy_usm_ndarray_for_reshape`
in _reshape.py

Used `_copy_usm_ndarray_for_roll` in `roll` implementation.
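For reference, a sketch of the public API now backed by the dedicated kernel; semantics follow numpy.roll:

```
import dpctl.tensor as dpt

x = dpt.arange(10)
y = dpt.roll(x, 2)        # flattened roll, served by _copy_usm_ndarray_for_roll
print(dpt.asnumpy(y))     # [8 9 0 1 2 3 4 5 6 7]
```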
We previously used the `sycl::multi_ptr` constructor instead of
`sycl::address_space_cast`, and the change in
KhronosGroup/SYCL-Docs#432
introduced `sycl::access::decorated::legacy` as the default, which is
deprecated in the SYCL 2020 standard; this is what highlighted the problem.

When using `sycl::address_space_cast`, we specify
`sycl::access::decorated::yes`.
```
In file included from ~/dpctl/dpctl/tensor/libtensor/source/elementwise_functions.cpp:56:
~/dpctl/dpctl/tensor/libtensor/include/kernels/elementwise_functions/expm1.hpp:118:42: warning: 'sincos' is deprecated: SYCL builtin functions with raw pointer arguments have been deprecated. Please use multi_ptr. [-Wdeprecated-declarations]
  118 |             const realT sinY_val = sycl::sincos(y, &cosY_val);
```

The resolution is to convert the raw pointer to a multi-pointer using `sycl::address_space_cast`.
…020-standard

Conversion from raw to multi_ptr should be done with address_space_cast
Ensure that the ndarray into which we convert a single-element usm_ndarray
instance is 0d before calling __int__, __float__, __complex__, __index__.
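A sketch of the conversion path being guarded, assuming single-element conversion is supported as the commit implies:

```
import dpctl.tensor as dpt

x = dpt.asarray([42])            # single-element usm_ndarray
print(int(x))                    # converted via a 0d ndarray before __int__
print(float(dpt.asarray(2.5)))   # same path for __float__
```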
Ensured that `create_property_list` always returns a non-null unique
pointer by creating a default-constructed property_list for the
fall-through case.

With this change we no longer need two branches for the call to the
sycl::queue constructor, since propList is always available.
This is to address NumPy 1.25 deprecation warnings.
…arning

Resolve SYCL-2020 deprecation warning
…ired_zero_dim_ndarray

Address NumPy 1.25 deprecation warnings
Moved the assertion about comparison with generic types before the
other assertions.

This change addresses a Coverity scan issue.
This resolves two Coverity reported issues.
Turn comparison call into assertion in test_usm_ndarray_ctor::test_flags
Also adds a step to output the array-api-tests summary into the log
(a step which works for PRs regardless of whether they are opened from
a fork or from a branch in this repo).
Addressed an issue with lowercase `order` values in tensor `copy` and `astype`
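Illustrative of the fix, assuming lowercase flags are now accepted wherever uppercase ones are:

```
import dpctl.tensor as dpt

x = dpt.ones((2, 3))
y = dpt.astype(x, "f4", order="c")   # lowercase 'c' now treated like 'C'
z = dpt.copy(x, order="f")
```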
oleksandr-pavlyk and others added 27 commits August 29, 2023 04:27
Made changes similar to those made in the kernels for atomic
reduction. Work-groups' locations now change fastest along the iteration
dimension (previously they changed fastest along the reduction dimension).

Due to this change, reduction performance increases 7-8x:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms

In [5]: quit
```

While in the main branch:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms

In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms

In [6]: quit
```
This is used to compute displacement for
   a[(i0 - shifts[0]) % shape[0],
     (i1 - shifts[1]) % shape[1], ... ]
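A minimal NumPy sketch of that displacement rule (illustrative only, not the dpctl kernel):

```
import numpy as np

def roll_nd(src, shifts):
    # dst[i0, i1, ...] = src[(i0 - shifts[0]) % shape[0], (i1 - shifts[1]) % shape[1], ...]
    dst = np.empty_like(src)
    for dst_idx in np.ndindex(*src.shape):
        src_idx = tuple((i - s) % n for i, s, n in zip(dst_idx, shifts, src.shape))
        dst[dst_idx] = src[src_idx]
    return dst

a = np.arange(12).reshape(3, 4)
assert np.array_equal(roll_nd(a, (1, 2)), np.roll(a, (1, 2), axis=(0, 1)))
```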
Function for flattened rolling is renamed:
   _copy_usm_ndarray_for_roll -> _copy_usm_ndarray_for_roll_1d

_copy_usm_ndarray_for_roll_1d has the same signature:
   _copy_usm_ndarray_for_roll_1d(
       src: usm_ndarray,
       dst: usm_ndarray,
       shift: Int,
       sycl_queue: dpctl.SyclQueue
   ) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

Introduced
   _copy_usm_ndarray_for_roll_nd(
       src: usm_ndarray,
       dst: usm_ndarray,
       shifts: Tuple[Int],
       sycl_queue: dpctl.SyclQueue
   ) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

The length of the shifts tuple must equal the dimensionality of the src
and dst arrays, which are expected to have the same shape and the same
data type.
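A hedged call sketch based on the signature above (a private API, shown for illustration only):

```
import dpctl.tensor as dpt
import dpctl.tensor._tensor_impl as ti

x = dpt.reshape(dpt.arange(24), (4, 6))
y = dpt.empty_like(x)
ht_ev, ev = ti._copy_usm_ndarray_for_roll_nd(
    src=x, dst=y, shifts=(1, 2), sycl_queue=x.sycl_queue
)
ht_ev.wait()   # as in the benchmarks above, wait on the first returned event
```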
Changed run_test files to output a verbose listing of the platform config.
To create the multi_ptr from a local variable (in private memory) we should be
using address_space::private_space.
…hon_libs

Remove deprecated FindPythonLibs
The kernel is applicable if both inputs are F-contiguous, or
if the first input is F-contiguous and we are reducing to a
1d C-contiguous array.

Closes gh-1391
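Illustrative of the case the kernel targets, assuming an `order="F"` input:

```
import dpctl.tensor as dpt

x = dpt.asarray([[1.0, 2.0], [3.0, 4.0]], order="F")  # F-contiguous input
s = dpt.sum(x, axis=0)   # 1d C-contiguous result; previously wrong (gh-1391)
print(dpt.asnumpy(s))    # [4. 6.]
```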
…dress-space-cast

Correct address_space when casting ref of local variable to multi_ptr
Fixed constexpr significant bits value for double
Fix for incorrect result in reduction over axis=0
- Will cover lines missed by test suite
@oleksandr-pavlyk oleksandr-pavlyk merged commit d63f650 into gold/2021 Sep 11, 2023
@github-actions

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
