Merge 0.14.6dev5 into gold/2021 #1396
Merged
* Implements matrix_transpose
  - Function wrapper for call to dpctl.tensor.usm_ndarray.mT attribute
* Add arg validation tests for matrix_transpose
* Added a test for matrix_transpose for coverage
- these properties were setting the flags of the output to the flags of the input, which is incorrect, as the output is almost never contiguous
- added tests for this behavior
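Why a real or imaginary view of a contiguous complex array is usually not contiguous can be seen from strides alone. The sketch below is plain Python rather than dpctl code; `is_c_contiguous` is a hypothetical helper, and strides are assumed to be counted in elements of the view's own data type.

```python
def is_c_contiguous(shape, strides):
    """Return True if (shape, strides) describes a C-contiguous layout.

    Strides are in elements, not bytes. Hypothetical helper for illustration.
    """
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim == 1:
            continue  # the stride of a unit-length axis does not matter
        if stride != expected:
            return False
        expected *= dim
    return True


# A C-contiguous complex array of shape (3, 4); strides in complex elements.
complex_shape, complex_strides = (3, 4), (4, 1)

# Its real part views the same memory with a real-valued data type. Each
# complex element spans two reals, so strides double when counted in reals.
real_shape = complex_shape
real_strides = tuple(2 * s for s in complex_strides)

print(is_c_contiguous(complex_shape, complex_strides))
print(is_c_contiguous(real_shape, real_strides))
```

Under these assumptions the complex parent reports as contiguous while the real view does not, which is why copying the parent's flags onto the view gives the wrong answer.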
* Implements ``types`` property for elementwise functions
  - Output corresponds with NumPy's: a list with an arrow marking the domain to range type map
* Added tests for behavior of types property
…rkItemSizes Removes deprecated DPCTLDevice_GetMaxWorkItemSizes
Improved message text in two exceptions
Also install ninja from pip instead of apt.
Do not build dpctl for Python 3.8
`copy_usm_ndarray_for_reshape` accepted a `shift` parameter, which allowed it to double as the implementation of the `roll` function. That was suboptimal, though: for `roll` the source and destination arrays have the same shape, so stride simplification applies. It also makes sense to create a dedicated kernel implementing `roll` for contiguous inputs, making computations measurably faster.

This PR removes support for the `shift` parameter from `_tensor_impl._copy_usm_ndarray_for_reshape` and introduces `_tensor_impl._copy_usm_ndarray_for_roll`. The latter function ensures that the shapes are the same, applies stride simplification, and dispatches to specialized kernels for contiguous inputs. Even for strided inputs, less metadata needs to be copied for the kernel to use (the shape is common, unlike in reshape).

The result of this change is that `_copy_usm_ndarray_for_roll` runs roughly 4x faster on an input array with about 20 million elements (4.91 ms versus about 20 ms in the timings below) than the previous call to `_copy_usm_ndarray_for_reshape` with the `shift` parameter set:

```
In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: a = np.ones((3,4,5,6,7,8))

In [3]: b = dpt.ones((3,4,5,6,7,8))

In [4]: w = dpt.empty_like(b)

In [5]: import dpctl.tensor._tensor_impl as ti

In [6]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
161 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: b.size
Out[7]: 20160

In [8]: b = dpt.ones((30,40,5,6,7,80))

In [9]: w = dpt.empty_like(b)

In [10]: %timeit ti._copy_usm_ndarray_for_roll(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
4.91 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: a = np.ones(b.shape)

In [12]: %timeit np.roll(a,2)
23 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Previously:

```
In [8]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
20.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=2, sycl_queue=b.sycl_queue)[0].wait()
19.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit ti._copy_usm_ndarray_for_reshape(src=b, dst=w, shift=0, sycl_queue=b.sycl_queue)[0].wait()
19.7 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: b.shape
Out[11]: (30, 40, 5, 6, 7, 80)
```
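For contiguous data, rolling the flattened array by `shift` amounts to two block copies, which is one way to see why a dedicated kernel can be simpler than a generic reshape-style copy. The following is a minimal pure-Python sketch of that idea only; `roll_flat` is a hypothetical name and this is not the dpctl kernel.

```python
def roll_flat(src, shift):
    """Roll a flat sequence by `shift` positions, as np.roll does for 1d input.

    Hypothetical illustration: the result is assembled from two contiguous
    block copies instead of computing a wrapped index per element.
    """
    n = len(src)
    if n == 0:
        return []
    k = shift % n           # normalize, so negative shifts also work
    dst = [None] * n
    dst[k:] = src[: n - k]  # leading block moves right by k
    dst[:k] = src[n - k :]  # trailing block wraps around to the front
    return dst


print(roll_flat([0, 1, 2, 3, 4], 2))   # [3, 4, 0, 1, 2]
print(roll_flat([0, 1, 2, 3, 4], -1))  # [1, 2, 3, 4, 0]
```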
Remove use of `shift=0` argument to `_copy_usm_ndarray_for_reshape` in _reshape.py
Used `_copy_usm_ndarray_for_roll` in `roll` implementation.
We previously used the `sycl::multi_ptr` constructor instead of `sycl::address_space_cast`. The change in KhronosGroup/SYCL-Docs#432 made `sycl::access::decorated::legacy` the default, and since `legacy` is deprecated in the SYCL 2020 standard, this highlighted the problem. When using `sycl::address_space_cast` we specify `sycl::access::decorated::yes`.
```
In file included from ~/dpctl/dpctl/tensor/libtensor/source/elementwise_functions.cpp:56:
~/dpctl/dpctl/tensor/libtensor/include/kernels/elementwise_functions/expm1.hpp:118:42: warning: 'sincos' is deprecated: SYCL builtin functions with raw pointer arguments have been deprecated. Please use multi_ptr. [-Wdeprecated-declarations]
  118 |             const realT sinY_val = sycl::sincos(y, &cosY_val);
```
The resolution is to convert the raw pointer to a multi-pointer using `sycl::address_space_cast`.
…020-standard Conversion from raw to multi_ptr should be done with address_space_cast
Ensure that the ndarray into which we converted the single-element usm_ndarray instance is 0d before calling __int__, __float__, __complex__, __index__.
Ensured that `create_property_list` always returns a non-null unique pointer by creating a default-constructed property_list for the fall-through. With this change we no longer need separate branches for the call to the sycl::queue constructor, since propList is always available.
This is to address NumPy 1.25 deprecation warnings.
…arning Resolve SYCL-2020 deprecation warning
…ired_zero_dim_ndarray Address NumPy 1.25 deprecation warnings
Moved the assertion about comparison with generic types before the other assertions. This change was made in response to a Coverity scan issue.
This resolves two Coverity reported issues.
Turn comparison call into assertion in test_usm_ndarray_ctor::test_flags
Also adds a step to output the array-api-test summary into the log (a step which works for PRs regardless of whether they are opened from a fork or from a branch in this repo).
Addressed issue with lowercase order value in tensor copy and astype
Made changes similar to those made in kernels for atomic reduction. The work-group's location now changes fastest along the iteration dimension (previously it changed fastest along the reduction dimension). Due to this change, reduction performance increases 7-8x:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms

In [5]: quit
```

While in the main branch:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms

In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms

In [6]: quit
```
This is used to compute displacement for a[(i0 - shifts[0]) % shape[0], (i1 - shifts[1]) % shape[1], ... ]
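The displacement formula can be checked with a small pure-Python sketch over a flat, row-major list. `flat_index` and `roll_nd` are hypothetical helpers written for illustration; they are not the dpctl kernels and make no attempt at stride simplification.

```python
from itertools import product


def flat_index(idx, shape):
    """Row-major (C-order) offset of multi-index `idx` in an array of `shape`."""
    off = 0
    for i, n in zip(idx, shape):
        off = off * n + i
    return off


def roll_nd(a, shape, shifts):
    """Roll flat row-major data `a` of the given `shape` by per-axis `shifts`.

    Implements out[i0, i1, ...] = a[(i0 - shifts[0]) % shape[0], ...].
    """
    if len(shifts) != len(shape):
        raise ValueError("need exactly one shift per axis")
    out = [None] * len(a)
    for idx in product(*(range(n) for n in shape)):
        src_idx = tuple((i - s) % n for i, s, n in zip(idx, shifts, shape))
        out[flat_index(idx, shape)] = a[flat_index(src_idx, shape)]
    return out


# [[0, 1, 2], [3, 4, 5]] rolled by one position along each axis.
print(roll_nd(list(range(6)), (2, 3), (1, 1)))  # [5, 3, 4, 2, 0, 1]
```

The printed result corresponds to `[[5, 3, 4], [2, 0, 1]]`, which is what rolling rows and then columns by one position produces.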
Function for flattened rolling is renamed:

    _copy_usm_ndarray_for_roll -> _copy_usm_ndarray_for_roll_1d

_copy_usm_ndarray_for_roll_1d has the same signature:

    _copy_usm_ndarray_for_roll_1d(
        src: usm_ndarray,
        dst: usm_ndarray,
        shift: Int,
        sycl_queue: dpctl.SyclQueue
    ) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

Introduced

    _copy_usm_ndarray_for_roll_nd(
        src: usm_ndarray,
        dst: usm_ndarray,
        shifts: Tuple[Int],
        sycl_queue: dpctl.SyclQueue
    ) -> Tuple[dpctl.SyclEvent, dpctl.SyclEvent]

The length of the shifts tuple must equal the dimensionality of the src and dst arrays, which are expected to have the same shape and the same data type.
run_test files to output verbose listing of platform config
Co-authored-by: ndgrigorian <[email protected]>
Reduction performance
To create the multi_ptr from a local variable (in private memory) we should be using address_space::private_space.
Renamed variable for clarity.
…hon_libs Remove deprecated FindPythonLibs
The kernel is applicable if both inputs are F-contiguous, or if the first input is F-contiguous and we are reducing to a 1d C-contiguous array. Closes gh-1391
…dress-space-cast Correct address_space when casting ref of local variable to multi_ptr
Fixed constexpr significant bits value for double
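For reference, IEEE 754 binary64 (double) has 53 significant bits: 52 stored bits plus one implicit leading bit. The commit title does not show which constant was corrected, so the snippet below is only a quick Python check of that standard value, assuming the platform's `float` is binary64; it is not a reproduction of the fix.

```python
import sys

# Number of significant bits (precision) of Python's float, which is a
# C double on common platforms.
print(sys.float_info.mant_dig)

# With 53 significant bits, 2**53 - 1 is exactly representable,
# while 2**53 + 1 rounds back to 2**53.
largest_exact = 2.0**53
print(largest_exact - 1.0 != largest_exact)
print(largest_exact + 1.0 == largest_exact)
```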
Fix for incorrect result in reduction over axis=0
- Will cover lines missed by test suite
Add roll kernels
Fixed qualifier name typo
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1396/index.html
This PR contains
Fixed qualifier name typo #1394
Add roll kernels #1380
Fix for incorrect result in reduction over axis=0 #1392
Fixed constexpr significant bits value for double #1393
Correct address_space when casting ref of local variable to multi_ptr #1390
Remove deprecated FindPythonLibs #1389
Reduction performance #1364
Spelling fixes pointed out by codespell #1386
run_test files to output verbose listing of platform config #1385
Fix fork workflows #1377
Addressed Coverity issue about handling bools in floor_divide #1374
Addressed "Very high" Coverity issue: dereferencing null pointer #1369
Fixes boolean indexing for strided masks #1370
Addressed issue with lowercase order value in tensor copy and astype #1376
Fix for Coverity highlighted issue in scripts #1373
Turn comparison call into assertion in test_usm_ndarray_ctor::test_flags #1371
Address NumPy 1.25 deprecation warnings #1368
Resolve SYCL-2020 deprecation warning #1367
Conversion from raw to multi_ptr should be done with address_space_cast #1366
Do not build dpctl for Python 3.8 #1363
Improved message text in two exceptions #1362
Removes deprecated DPCTLDevice_GetMaxWorkItemSizes #1359
Implements `types` property for elementwise functions #1361
usm_ndarray.real and usm_ndarray.imag now set flags correctly #1355
Implements dpctl.tensor.matrix_transpose #1356
Have you provided a meaningful PR description?
Have you added a test, reproducer or referred to an issue with a reproducer?
Have you tested your changes locally for CPU and GPU devices?
Have you made sure that new changes do not introduce compiler warnings?
Have you checked performance impact of proposed changes?
If this PR is a work in progress, are you opening the PR as a draft?