REF: handle 2D in tslibs.vectorized #46886

jbrockmendel · 2022-04-27T21:31:15Z

closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jbrockmendel · 2022-04-27T21:31:41Z

pandas/_libs/tslibs/vectorized.pyx

+        #  like we do for the other functions because result is known C-contiguous
+        #  and is the first argument to PyArray_MultiIterNew2.  The usual pattern
+        #  does not seem to work with object dtype.
+        res_flat[i] = res_val


@seberg is there a more idiomatic way to do this __setitem__?

While I want to replace it (it is not quite ideal, NumPy uses a different functionality internally now). This looks like it should be fine to use PyArray_SETITEM(arr, pointer, res_val) as the "idiomatic" way.

Thanks. To clarify though, I was hoping for something that didn't rely on setting into res_flat, as that requires knowledge about the inner workings of the cnp.broadcast object that i only kinda-get, and that only because you and bashtage explained it in the comments to one of his PRs.

You are not trying to avoid Cython memory-views due to the buffer export overhead for smaller arrays, right?

In that case, I would suggest using (untested):

object res_flat[::1] = PyArray_BYTES(arr) # if that works, or the ravel()

That way, cython can deal with the correct assignment, and you know that object is right here. (You can avoid the memoryview, but I think you may have to Py_INCREF manually then.)

There could be faster/better ways to iterate the other array as well, but I suppose since this works with Python objects, it won't matter much. However, I think you can still replace the MultiIter with a single array iterator in any case (just make sure that one iterates in C order). There is no reason to iteraterate res, since that is effectively unused.

It would be really nice to have a "ufunc-like" pattern for this type of thing, but I would need more dedicated time to figure that out.
(The point being, that you would define a single "strided loop" for 1-D arrays, and than add machinery/code generation to enable to for the N-D case)

You are not trying to avoid Cython memory-views due to the buffer export overhead for smaller arrays, right?

Never crossed my mind.

The question may be clarified by describing the failure mode I am looking to avoid:

In 6 months another contributor goes to make the analogous change to another function, uses this pattern as a template, but instead of mi = cnp.PyArray_MultiIterNew2(result, stamps) reverses the arguments as mi = cnp.PyArray_MultiIterNew2(stamps, result), so setting res_flat[i] = res_val is no longer necessarily correct.

i.e. using res_flat at all is implicitly relying on knowledge of MultiIter internals that AFAIK is not documented. It just seems like a code smell. Is that clearer?

but I think you may have to Py_INCREF manually then

Definitely don't want to be in that business.

You are only relying on the C-order iteration I think. Which I believe MultiIter guarantees (but I am not sure). In any case, you do not need MultiIter, you only need PyArray_IterNew with a comment that it iterates in C-order and is thus compatible with the res_flat.
Then you can use object res_flat[::1] = res.ravel() for assignment, relying on the fact that the typed memoryview is typed correctly.

OK, I think I was a bit too fuzzy about what I meant, so here is the working diff in the details.
So long the result is allocated as C-order, the same pattern could also be used elsewhere for small speedups ("iterating" one instead of two arrays), but I think here it may be particularly worthwhile, because I am not certain that cython can optimize the item assignment otherwise.

It would be cool if cython had a way to use a more "ufunc-like" pattern for these iterations, but I guess that that would be a pretty big new thing...

diff --git a/pandas/_libs/tslibs/vectorized.pyx b/pandas/_libs/tslibs/vectorized.pyx index 3f0f06d874..6764127552 100644 --- a/pandas/_libs/tslibs/vectorized.pyx +++ b/pandas/_libs/tslibs/vectorized.pyx @@ -135,9 +135,11 @@ def ints_to_pydatetime( bint use_date = False, use_time = False, use_ts = False, use_pydt = False object res_val + # Note that `result` (and thus `result_flat`) is C-order and + # `stamps_iter` iterates C-order as well, so the iteration matches: ndarray result = cnp.PyArray_EMPTY(stamps.ndim, stamps.shape, cnp.NPY_OBJECT, 0) - ndarray res_flat = result.ravel() # should NOT be a copy - cnp.broadcast mi = cnp.PyArray_MultiIterNew2(result, stamps) + object[::1] res_flat = result.ravel() # should NOT be a copy + cnp.flatiter stamps_iter = cnp.PyArray_IterNew(stamps) if box == "date": assert (tz is None), "tz should be None when converting to date" @@ -155,7 +157,7 @@ def ints_to_pydatetime( for i in range(n): # Analogous to: utc_val = stamps[i] - utc_val = (<int64_t*>cnp.PyArray_MultiIter_DATA(mi, 1))[0] + utc_val = (<int64_t*>cnp.PyArray_ITER_DATA(stamps_iter))[0] new_tz = tz @@ -183,15 +185,11 @@ def ints_to_pydatetime( else: res_val = time(dts.hour, dts.min, dts.sec, dts.us, new_tz, fold=fold) - # Note: we can index result directly instead of using PyArray_MultiIter_DATA - # like we do for the other functions because result is known C-contiguous - # and is the first argument to PyArray_MultiIterNew2. The usual pattern - # does not seem to work with object dtype. - # See discussion at - # github.com/pandas-dev/pandas/pull/46886#discussion_r860261305 + # Note: We use a typed memory-view here to use cython for handling + # need for object ownership handling (reference counting). res_flat[i] = res_val - cnp.PyArray_MultiIter_NEXT(mi) + cnp.PyArray_ITER_NEXT(stamps_iter) return result

you only need PyArray_IterNew with a comment that it iterates in C-order

huh i guess i assumed it iterated in a cheapest-available order, so would be F-order if the array is F-contiguous. Good to know, thanks!

jbrockmendel · 2022-05-02T21:04:05Z

Added a comment pointing back to the discussion here; i think that's the best we'll be able to do for the forseeable future. cc @mroeschke

REF: handle 2D in tslibs.vectorized

552d420

jbrockmendel commented Apr 27, 2022

View reviewed changes

jreback added Refactor Internal refactoring of code Datetime Datetime data dtype labels Apr 30, 2022

Merge branch 'main' into ref-tzconversion-7

bab9d7e

mroeschke approved these changes May 2, 2022

View reviewed changes

mroeschke added this to the 1.5 milestone May 2, 2022

jbrockmendel added 2 commits May 4, 2022 13:19

Merge branch 'main' into ref-tzconversion-7

52c5e2e

Use flatiter

403f3b0

jreback merged commit d86e200 into pandas-dev:main May 6, 2022

jbrockmendel deleted the ref-tzconversion-7 branch May 6, 2022 22:25

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

REF: handle 2D in tslibs.vectorized (pandas-dev#46886)

029c893

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REF: handle 2D in tslibs.vectorized #46886

REF: handle 2D in tslibs.vectorized #46886

jbrockmendel commented Apr 27, 2022

jbrockmendel Apr 27, 2022

seberg Apr 28, 2022 •

edited

Loading

jbrockmendel Apr 28, 2022

seberg Apr 29, 2022

jbrockmendel Apr 30, 2022

seberg Apr 30, 2022

seberg May 3, 2022

jbrockmendel May 4, 2022

jbrockmendel commented May 2, 2022

REF: handle 2D in tslibs.vectorized #46886

REF: handle 2D in tslibs.vectorized #46886

Conversation

jbrockmendel commented Apr 27, 2022

jbrockmendel Apr 27, 2022

Choose a reason for hiding this comment

seberg Apr 28, 2022 • edited Loading

Choose a reason for hiding this comment

jbrockmendel Apr 28, 2022

Choose a reason for hiding this comment

seberg Apr 29, 2022

Choose a reason for hiding this comment

jbrockmendel Apr 30, 2022

Choose a reason for hiding this comment

seberg Apr 30, 2022

Choose a reason for hiding this comment

seberg May 3, 2022

Choose a reason for hiding this comment

jbrockmendel May 4, 2022

Choose a reason for hiding this comment

jbrockmendel commented May 2, 2022

seberg Apr 28, 2022 •

edited

Loading