Skip to content

API: Add pandas.api.typing #48577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rhshadrach opened this issue Sep 16, 2022 · 25 comments · Fixed by #48578
Closed

API: Add pandas.api.typing #48577

rhshadrach opened this issue Sep 16, 2022 · 25 comments · Fixed by #48578
Assignees
Labels
API Design Enhancement Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@rhshadrach
Copy link
Member

rhshadrach commented Sep 16, 2022

Revision 2:

Currently many intermediate classes can only be found in pandas.core, for example DataFrameGroupBy and SeriesGroupBy. These classes should not be instantiated directly, but may be necessary to access for users who use type-hinting. I think we should add these classes to a new API submodule, pandas.api.aux as they are auxiliary classes. Users shouldn't need to know e.g. they are defined in pandas.core.groupby.generic.

Somewhat related: #6944, #19302

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage for feedback.

Original Currently one cannot run `from pandas import DataFrameGroupBy, SeriesGroupBy`. We should add these classes to the top level as they are particularly important for type-hinting in user code. Users shouldn't need to know they are defined in `pandas.core.groupby.generic`.
Revision 1 Currently one cannot run `from pandas import DataFrameGroupBy, SeriesGroupBy`. We should add these classes to a new API submodule, `pandas.api.trying` as they are particularly important for type-hinting in user code. Users shouldn't need to know they are defined in `pandas.core.groupby.generic`.
@jbrockmendel
Copy link
Member

makes sense for typing purposes, but does it make sense for users to use these directly?

@rhshadrach
Copy link
Member Author

rhshadrach commented Sep 17, 2022

I would say no - they should only be invoked via DataFrame.groupby or Series.groupby. We could introduce code that raises when third party code calls their __init__ by inspecting the stack, but I would prefer adding the classes to the API docs with the warning that they should not be instantiated directly.

@topper-123
Copy link
Contributor

I agree this is useful for typing purposes but otherwise not, so I'm -1 on putting them is the main namespace. I suggest they go into pd.api.types namespace, either directly or in a is_groupby function.

@rhshadrach
Copy link
Member Author

Thanks @topper-123. I don't believe is_groupby is sufficient for users when it comes to type-hinting, e.g. def foo(gb: DataFrameGroupBy) -> pd.DataFrame: .... The only downside I see to putting them in pd.api.types is it seems harder for users to discover, but perhaps that will change overtime. I also don't see the benefit of putting them here as opposed to top-level, can you comment on that @topper-123?

@rhshadrach rhshadrach added the Needs Discussion Requires discussion from core team before further action label Sep 19, 2022
@rhshadrach rhshadrach removed this from the 1.6 milestone Sep 19, 2022
@topper-123
Copy link
Contributor

Hi @rhshadrach. I'm positive to adding the groupby classes to the public namespace and my only hesitance is whether to put them in the main namespace or not.

My feeling is that the main namespace is very crowded already and so should be added to only sparingly, and these two classes will not be instantiated directly from the main namespace, so it may be confusing having them there. pd.api.types seems like that kind where we keep specialized typing functionality, so could fit there.

@rhshadrach
Copy link
Member Author

rhshadrach commented Sep 24, 2022

I think the idea of having the main namespace be "objects / functions users should create / use" is a meaningful one, so +1 on putting this somewhere else. Looking through the current entries, the only potential class I see that a user should likely not be constructing themselves is Flags.

Currently, pd.api.types contains classes and functions for working with dtypes. This would be an expansion of that focus which makes me hesitant here. I am wondering if it would be more appropriate to introduce pd.api.typing for type-hinting related objects, namely classes that should not be directly instantiated and aliases.

@rhshadrach rhshadrach added the Typing type annotations, mypy/pyright type checking label Sep 24, 2022
@topper-123
Copy link
Contributor

+1 on somewhere in pd.api, I don't have a very strong preference on pd.api.types or pd.api.typing. Maybe just make update the PR with your preferred solution (pd.api.typing?), remove the draft marker and see the response?

There are probably a few other classes that could be exposed publicly, e..g. Resampler is one I can think of, but there are probably others also.

@rhshadrach rhshadrach changed the title API: Add DataFrameGroupBy and SeriesGroupBy to the top level API API: Add pandas.api.typing Sep 28, 2022
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 29, 2022

We should be careful here with respect to what we want to do in the pandas code and encouraging people to use pandas-stubs.

The following code works and type checks just fine if you have pandas-stubs installed:

import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy


def myfun(dfgb: DataFrameGroupBy):
    return dfgb.sum()


df = pd.DataFrame(
    {
        "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "Max Speed": [380.0, 370.0, 24.0, 26.0],
    }
)
dfg: DataFrameGroupBy = df.groupby("Animal")
print(myfun(dfg))

So isn't this just an issue of making sure that types like DataFrameGroupby and SeriesGroupby that show up in the docs are properly documented and then people know how to do the imports?

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

I don't think this is a typing issue. I think this is more about that we have docs that return classes that are not fully documented, and if you wanted to use those classes in your own type declarations, then you have to know where to import them from.

@rhshadrach
Copy link
Member Author

rhshadrach commented Sep 29, 2022

So isn't this just an issue of making sure that types like DataFrameGroupby and SeriesGroupby that show up in the docs are properly documented and then people know how to do the imports?

While this works, I do not think the current state is a good solution.

  • It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.
  • By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from pandas.api.typing gives stability.
  • It is not clear to users which objects in e.g. pandas.core are public. In particular, if we ever want deprecate pandas.core in favor of pandas._core, the current location of these classes is not viable.

@rhshadrach
Copy link
Member Author

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy from pandas.api.typing rather than pandas.core.groupby.generic?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 29, 2022

While this works, I do not think the current state is a good solution.

  • It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.
  • By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from pandas.api.typing gives stability.
  • It is not clear to users which objects in e.g. pandas.core are public. In particular, if we ever want deprecate pandas.core in favor of pandas._core, the current location of these classes is not viable.

I agree with you here, except I'm not convinced that this is only needed for typing, so I'm not sure that calling it pandas.api.typing is correct. Maybe pandas.api.auxclasses (or something like that) to indicate that these are auxiliary classes meant to hold some kind of intermediate results.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 29, 2022

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy from pandas.api.typing rather than pandas.core.groupby.generic?

I don't think so, but if you put these classes under pandas.api.typing, there might be an expectation that other classes in the stubs should go there as well. I think it's fine that we have classes in the stubs that are not in the pandas source. I just don't think you should be putting these in a typing category, because it might cause confusion.

@rhshadrach
Copy link
Member Author

Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux rather than your suggested auxclasses as it seemed more concise. How does that sound?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 28, 2022

Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux rather than your suggested auxclasses as it seemed more concise. How does that sound?

Works for me.

@rhshadrach rhshadrach changed the title API: Add pandas.api.typing API: Add pandas.api.aux Oct 28, 2022
@rhshadrach rhshadrach removed Groupby Typing type annotations, mypy/pyright type checking labels Oct 28, 2022
@rhshadrach
Copy link
Member Author

rhshadrach commented Nov 9, 2022

I wrote a script to take the classes in pandas that occur in the documentation but aren't in the public API. There are likely some missing classes as well as false positives here, I've crossed off the ones that I can identify and put the ones I'm not familiar enough with in bold.

In addition to the top-level and pandas.api I'm considering the following as already being public: pandas.arrays, pandas.tseries, pandas.tseries.offesets.

Code
import pandas as pd
import glob
import os
import importlib

print(pd.__file__)

public_objects = (
    set(dir(pd))
    | set(dir(pd.api.extensions))
    | set(dir(pd.api.indexers))
    | set(dir(pd.api.interchange))
    | set(dir(pd.api.types))
    | set(dir(pd.arrays))
    | set(dir(pd.tseries))
    | set(dir(pd.tseries.offsets))
    | set(dir(pd.errors))
)

directory = os.path.dirname(pd.__file__)
files = glob.glob(os.path.join(directory, '**/*.py'), recursive=True)
files = [e[len(directory):] for e in files if not e.startswith('/home/richard/dev/pandas/pandas/tests/')]
files = ['pandas' + e.replace('/', '.') for e in files]
files = [e[:-len('.py')] for e in files]

def get_classes(path):
    module = importlib.import_module(path)
    names = [e for e in dir(module) if not e.startswith('_') and not e.startswith('ABC')]
    
    result = []
    for name in names:
        obj = getattr(module, name)
        if getattr(obj, '__module__', None) is None:
            continue
        if not obj.__module__.startswith('pandas'):
            continue
        if not isinstance(obj, type):
            continue
        result.append(name)
    return result

objects = set()
for file in files:
    objects |= set(get_classes(file))

nonpublic = sorted(objects - public_objects)

directory = os.path.abspath(os.path.dirname(os.path.join(pd.__file__, '../../')))
doc_files = glob.glob(os.path.join(directory, 'doc/**/*.rst'), recursive=True)
docs = ''
for filename in doc_files:
    if '/whatsnew/' in filename:
        continue
    docs += '\n'.join(open(filename, 'r').readlines())

for obj in nonpublic:
    if f' {str(obj)} ' in docs or f' {str(obj)}.' in docs:
        print('-', str(obj))
  • AbstractHolidayCalendar; this is in pandas.tseries.holiday
  • Apply
  • BlockManager
  • Column
  • DataFrameGroupBy
  • Expanding
  • ExponentialMovingWindow
  • GroupBy
  • JsonReader
  • Parser This doesn't appear to be returned by any methods; in particular it doesn't appear in pandas-stubs
  • Properties Same as line above
  • Resampler
  • Rolling
  • SeriesGroupBy
  • StataReader
  • StataWriter
  • Styler
  • TimeGrouper
  • Version
  • Window
  • XlsxWriter
  • disallow

@mroeschke
Copy link
Member

ParserError
ParserWarning

These are documented in doc/source/reference/testing.rst and exposed in pandas/errors/__init__.py

Column

This is an object from the dataframe interchange protocol that I think should be an implementation detail.

@jorisvandenbossche
Copy link
Member

You can also consider pandas.errors as public, which should remove ParserError/Warning.

For XlsxWriter, pd.ExcelWriter base class is already public, is that sufficient?

And pandas.io is maybe also semi public (we have some usages of that in the docs)

@rhshadrach
Copy link
Member Author

Thanks @mroeschke / @jorisvandenbossche - I've updated the script to include pandas.errors and remove the ParserError/Warning.

For XlsxWriter, pd.ExcelWriter base class is already public, is that sufficient?

If we agree that subclasses of ExcelWriter should not be allowed to add any public methods, then it would be sufficient. That looks to currently be the case, and I think that is a reasonable restriction on the subclasses. I've crossed out XlsxWriter and opened #49602.

And pandas.io is maybe also semi public (we have some usages of that in the docs)

The original motivation of this issue is to start making more clear to users what is / isn't public, so unless pandas.io is fully public, I think it would be helpful to keep those items on the list above. This doesn't necessarily mean they get added to pandas.api.aux or pandas.api.typing, perhaps there is a different way to handle them.

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 15, 2022

I've edited the list above, crossing off a few classes that are purely internal.

From the November dev call, I recall there was support for pandas.api.typing with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in #48577 (comment)).

My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing. However I'm also okay with pandas.api.aux.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 15, 2022

From the November dev call, I recall there was support for pandas.api.typing with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in #48577 (comment)).

unfortunately I missed the November dev call. I disagree with the comparison to numpy.typing because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.

My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing. However I'm also okay with pandas.api.aux.

I’m concerned about setting a precedent if we use pandas.api.typing.

@bashtage
Copy link
Contributor

If the purpose is really only for typing, than it seems most natural to have it at either pd.typing or pd.api.typing. This intended use doesn't really fit the definition of auxiliary, which sounds like a place for some helper routines.

@topper-123
Copy link
Contributor

A thought: I never really liked that the type checking functions are located in pd.api.types. Could an idea be to:

  • move all the type checking functions (is_*/infer_dtype) to a new pandas.api.type_checkers
  • Add the proposes types in this issue to pandas.api.types and use that for actual types

The functions is pandas.api.types would of course go through a deprecation process, but we could hide them in a __dir__ function in the pandas.api.types module, to make having deprecated functions there less annoying.

@rhshadrach
Copy link
Member Author

I disagree with the comparison to numpy.typing because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.

I would differentiate the two main (at least, in my opinion) purposes of type-hinting: (a) is increasing readability and (b) static type checking. You can accomplish (a) without e.g. pandas-stubs. In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 16, 2022

In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.

I see your point. As long as we document it that way, I won't stand in the way of calling it pandas.api.typing .

@rhshadrach rhshadrach changed the title API: Add pandas.api.aux API: Add pandas.api.typing Jan 23, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Feb 26, 2023
@simonjayhawkins
Copy link
Member

@TomAugspurger commented #27522 (comment) that if we're disciplined
about making private modules private with a leading underscore,
there's no need for a .api namespace.

Now if #48578 is merged we have effectively removed some barriers to making pandas.core private. #27522 and I think in general there is support to do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants