Skip to content

REF: Implement core._algos #32767

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 19, 2020
Merged

Conversation

jbrockmendel
Copy link
Member

ATM core.algorithms and core.nanops are a mish-mash in terms of what inputs they expect. This implements core._algos directory intended for guaranteed-ndarray/EA-only implementations.

For the first function to move I de-duplicated a shift method. Need suggestions for what to call this module.

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Mar 17, 2020
@@ -0,0 +1,9 @@
"""
core._algos is for algorithms that operate on ndarray and ExtensionArray.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use algos instead of _algos here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent on this, as I do like the non-private name, but we have a bunch of import algorithms as algos and i want to avoid ambiguity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then let’s just create core.algorithms as a subdir

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a good point. The underscore is still only a small difference though. So maybe we should think of yet another name (or actually make algorithms.py into a module and rethink it more broader)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes, similar as what @jreback said.

But Brock, you wanted to better distinguish pure array algos, that don't need to deal with series/dataframe etc (which I think is a good goal). But then making algorithms.py into a submodule (and use that for the function in this PR) defeats that purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then making algorithms.py into a submodule (and use that for the function in this PR) defeats that purpose?

That's my thought too.

So maybe we should think of yet another name

Suggestions?

- Assume that any Index, Series, or DataFrame objects have already been unwrapped.
- Assume that any list arguments have already been cast to ndarray/EA.
- Not depend on Index, Series, or DataFrame, nor import any of these.
- May dispatch to ExtensionArray methods, but should not import from core.arrays.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what else you are planning to put here, but it could also be pure-numpy algos? That would also have value in being very clear.
Eg the shift you moved here now is strictly numpy arrays.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, I have no problem with some functions in here being explicitly ndarray-only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't speaking about some functions, but the full file.

But so, what other functions do you envision to put here? (as for a single function it's not worth creating a module I think)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't speaking about some functions, but the full file.

That may end up being a reasonable organization, but it isn't obvious to me. e.g. if we end up with a shift that dispatches to shift_ea and shift_ndarray, would we want those to all be in the same file or split across multiple files?

@@ -39,6 +39,7 @@
from pandas.core.dtypes.missing import is_valid_nat_for_dtype, isna

from pandas.core import missing, nanops, ops
from pandas.core._algos.transforms import shift
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you import from "pandas.core._algos" instead?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see
comments

@jbrockmendel
Copy link
Member Author

what other functions do you envision to put here?

  1. EA-compatible variants of np.broadcast_to, np.tile, (maybe np.putmask?), etc will likely have their own file.
  2. Things like shift where we can pull the meat of the method out of internals.
  3. Some parts of algorithms and nanops that are array-specific
    • Some of the functions in these files seem like they are doing Index/Series unwrapping haphazardly. By separating out the array-specific parts, we can ensure we do those unwrappings just once.

@jbrockmendel
Copy link
Member Author

Renamed "_algos" -> "array_algos"

@jreback jreback added this to the 1.1 milestone Mar 19, 2020
@jreback jreback added the Refactor Internal refactoring of code label Mar 19, 2020
@jreback jreback merged commit 91512a8 into pandas-dev:master Mar 19, 2020
@jreback
Copy link
Contributor

jreback commented Mar 19, 2020

thanks @jbrockmendel the above sounds like a good plan

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
@jbrockmendel jbrockmendel deleted the ref-algos-2 branch March 23, 2020 16:13
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020
@jbrockmendel
Copy link
Member Author

@jorisvandenbossche another candidate for array_algos would be most of what is currently in dtypes.concat, which you and i agreed was a weird place for those functions

@jorisvandenbossche
Copy link
Member

Yep, now another option for those concat things would be to move each of them to their respective array implementation (but that might depend a bit on how we rework the EA interface regarding concatting)

@jbrockmendel
Copy link
Member Author

move each of them to their respective array implementation

I like that idea quite a bit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants