API: Add pandas.api.typing #48577

rhshadrach · 2022-09-16T01:26:53Z

Revision 2:

Currently many intermediate classes can only be found in pandas.core, for example DataFrameGroupBy and SeriesGroupBy. These classes should not be instantiated directly, but may be necessary to access for users who use type-hinting. I think we should add these classes to a new API submodule, pandas.api.aux as they are auxiliary classes. Users shouldn't need to know e.g. they are defined in pandas.core.groupby.generic.

Somewhat related: #6944, #19302

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage for feedback.

Original

Currently one cannot run `from pandas import DataFrameGroupBy, SeriesGroupBy`. We should add these classes to the top level as they are particularly important for type-hinting in user code. Users shouldn't need to know they are defined in `pandas.core.groupby.generic`.

Revision 1

Currently one cannot run `from pandas import DataFrameGroupBy, SeriesGroupBy`. We should add these classes to a new API submodule, `pandas.api.trying` as they are particularly important for type-hinting in user code. Users shouldn't need to know they are defined in `pandas.core.groupby.generic`.

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2022-09-16T16:39:13Z

makes sense for typing purposes, but does it make sense for users to use these directly?

rhshadrach · 2022-09-17T01:05:25Z

I would say no - they should only be invoked via DataFrame.groupby or Series.groupby. We could introduce code that raises when third party code calls their __init__ by inspecting the stack, but I would prefer adding the classes to the API docs with the warning that they should not be instantiated directly.

topper-123 · 2022-09-18T05:54:52Z

I agree this is useful for typing purposes but otherwise not, so I'm -1 on putting them is the main namespace. I suggest they go into pd.api.types namespace, either directly or in a is_groupby function.

rhshadrach · 2022-09-18T21:47:32Z

Thanks @topper-123. I don't believe is_groupby is sufficient for users when it comes to type-hinting, e.g. def foo(gb: DataFrameGroupBy) -> pd.DataFrame: .... The only downside I see to putting them in pd.api.types is it seems harder for users to discover, but perhaps that will change overtime. I also don't see the benefit of putting them here as opposed to top-level, can you comment on that @topper-123?

topper-123 · 2022-09-20T19:48:17Z

Hi @rhshadrach. I'm positive to adding the groupby classes to the public namespace and my only hesitance is whether to put them in the main namespace or not.

My feeling is that the main namespace is very crowded already and so should be added to only sparingly, and these two classes will not be instantiated directly from the main namespace, so it may be confusing having them there. pd.api.types seems like that kind where we keep specialized typing functionality, so could fit there.

rhshadrach · 2022-09-24T14:30:17Z

I think the idea of having the main namespace be "objects / functions users should create / use" is a meaningful one, so +1 on putting this somewhere else. Looking through the current entries, the only potential class I see that a user should likely not be constructing themselves is Flags.

Currently, pd.api.types contains classes and functions for working with dtypes. This would be an expansion of that focus which makes me hesitant here. I am wondering if it would be more appropriate to introduce pd.api.typing for type-hinting related objects, namely classes that should not be directly instantiated and aliases.

topper-123 · 2022-09-25T18:43:22Z

+1 on somewhere in pd.api, I don't have a very strong preference on pd.api.types or pd.api.typing. Maybe just make update the PR with your preferred solution (pd.api.typing?), remove the draft marker and see the response?

There are probably a few other classes that could be exposed publicly, e..g. Resampler is one I can think of, but there are probably others also.

Dr-Irv · 2022-09-29T18:24:12Z

We should be careful here with respect to what we want to do in the pandas code and encouraging people to use pandas-stubs.

The following code works and type checks just fine if you have pandas-stubs installed:

import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy


def myfun(dfgb: DataFrameGroupBy):
    return dfgb.sum()


df = pd.DataFrame(
    {
        "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "Max Speed": [380.0, 370.0, 24.0, 26.0],
    }
)
dfg: DataFrameGroupBy = df.groupby("Animal")
print(myfun(dfg))

So isn't this just an issue of making sure that types like DataFrameGroupby and SeriesGroupby that show up in the docs are properly documented and then people know how to do the imports?

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

I don't think this is a typing issue. I think this is more about that we have docs that return classes that are not fully documented, and if you wanted to use those classes in your own type declarations, then you have to know where to import them from.

rhshadrach · 2022-09-29T21:18:37Z

So isn't this just an issue of making sure that types like DataFrameGroupby and SeriesGroupby that show up in the docs are properly documented and then people know how to do the imports?

While this works, I do not think the current state is a good solution.

It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.
By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from pandas.api.typing gives stability.
It is not clear to users which objects in e.g. pandas.core are public. In particular, if we ever want deprecate pandas.core in favor of pandas._core, the current location of these classes is not viable.

rhshadrach · 2022-09-29T21:22:56Z

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy from pandas.api.typing rather than pandas.core.groupby.generic?

Dr-Irv · 2022-09-29T21:42:03Z

While this works, I do not think the current state is a good solution.

It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.

By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from pandas.api.typing gives stability.

It is not clear to users which objects in e.g. pandas.core are public. In particular, if we ever want deprecate pandas.core in favor of pandas._core, the current location of these classes is not viable.

I agree with you here, except I'm not convinced that this is only needed for typing, so I'm not sure that calling it pandas.api.typing is correct. Maybe pandas.api.auxclasses (or something like that) to indicate that these are auxiliary classes meant to hold some kind of intermediate results.

Dr-Irv · 2022-09-29T21:43:35Z

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy from pandas.api.typing rather than pandas.core.groupby.generic?

I don't think so, but if you put these classes under pandas.api.typing, there might be an expectation that other classes in the stubs should go there as well. I think it's fine that we have classes in the stubs that are not in the pandas source. I just don't think you should be putting these in a typing category, because it might cause confusion.

rhshadrach · 2022-10-23T14:12:29Z

Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux rather than your suggested auxclasses as it seemed more concise. How does that sound?

Dr-Irv · 2022-10-28T19:13:44Z

Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux rather than your suggested auxclasses as it seemed more concise. How does that sound?

Works for me.

rhshadrach · 2022-11-09T19:39:36Z

I wrote a script to take the classes in pandas that occur in the documentation but aren't in the public API. There are likely some missing classes as well as false positives here, I've crossed off the ones that I can identify and put the ones I'm not familiar enough with in bold.

In addition to the top-level and pandas.api I'm considering the following as already being public: pandas.arrays, pandas.tseries, pandas.tseries.offesets.

Code

import pandas as pd
import glob
import os
import importlib

print(pd.__file__)

public_objects = (
    set(dir(pd))
    | set(dir(pd.api.extensions))
    | set(dir(pd.api.indexers))
    | set(dir(pd.api.interchange))
    | set(dir(pd.api.types))
    | set(dir(pd.arrays))
    | set(dir(pd.tseries))
    | set(dir(pd.tseries.offsets))
    | set(dir(pd.errors))
)

directory = os.path.dirname(pd.__file__)
files = glob.glob(os.path.join(directory, '**/*.py'), recursive=True)
files = [e[len(directory):] for e in files if not e.startswith('/home/richard/dev/pandas/pandas/tests/')]
files = ['pandas' + e.replace('/', '.') for e in files]
files = [e[:-len('.py')] for e in files]

def get_classes(path):
    module = importlib.import_module(path)
    names = [e for e in dir(module) if not e.startswith('_') and not e.startswith('ABC')]
    
    result = []
    for name in names:
        obj = getattr(module, name)
        if getattr(obj, '__module__', None) is None:
            continue
        if not obj.__module__.startswith('pandas'):
            continue
        if not isinstance(obj, type):
            continue
        result.append(name)
    return result

objects = set()
for file in files:
    objects |= set(get_classes(file))

nonpublic = sorted(objects - public_objects)

directory = os.path.abspath(os.path.dirname(os.path.join(pd.__file__, '../../')))
doc_files = glob.glob(os.path.join(directory, 'doc/**/*.rst'), recursive=True)
docs = ''
for filename in doc_files:
    if '/whatsnew/' in filename:
        continue
    docs += '\n'.join(open(filename, 'r').readlines())

for obj in nonpublic:
    if f' {str(obj)} ' in docs or f' {str(obj)}.' in docs:
        print('-', str(obj))

~~AbstractHolidayCalendar~~; this is in pandas.tseries.holiday
~~Apply~~
~~BlockManager~~
~~Column~~
DataFrameGroupBy
Expanding
ExponentialMovingWindow
~~GroupBy~~
JsonReader
~~Parser~~ This doesn't appear to be returned by any methods; in particular it doesn't appear in pandas-stubs
~~Properties~~ Same as line above
Resampler
Rolling
SeriesGroupBy
StataReader
StataWriter
Styler
TimeGrouper
~~Version~~
Window
~~XlsxWriter~~
~~disallow~~

mroeschke · 2022-11-09T19:48:02Z

ParserError
ParserWarning

These are documented in doc/source/reference/testing.rst and exposed in pandas/errors/__init__.py

Column

This is an object from the dataframe interchange protocol that I think should be an implementation detail.

jorisvandenbossche · 2022-11-09T19:48:07Z

You can also consider pandas.errors as public, which should remove ParserError/Warning.

For XlsxWriter, pd.ExcelWriter base class is already public, is that sufficient?

And pandas.io is maybe also semi public (we have some usages of that in the docs)

rhshadrach · 2022-11-09T20:02:46Z

Thanks @mroeschke / @jorisvandenbossche - I've updated the script to include pandas.errors and remove the ParserError/Warning.

For XlsxWriter, pd.ExcelWriter base class is already public, is that sufficient?

If we agree that subclasses of ExcelWriter should not be allowed to add any public methods, then it would be sufficient. That looks to currently be the case, and I think that is a reasonable restriction on the subclasses. I've crossed out XlsxWriter and opened #49602.

And pandas.io is maybe also semi public (we have some usages of that in the docs)

The original motivation of this issue is to start making more clear to users what is / isn't public, so unless pandas.io is fully public, I think it would be helpful to keep those items on the list above. This doesn't necessarily mean they get added to pandas.api.aux or pandas.api.typing, perhaps there is a different way to handle them.

rhshadrach · 2022-12-15T02:28:04Z

I've edited the list above, crossing off a few classes that are purely internal.

From the November dev call, I recall there was support for pandas.api.typing with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in #48577 (comment)).

My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing. However I'm also okay with pandas.api.aux.

Dr-Irv · 2022-12-15T14:58:52Z

From the November dev call, I recall there was support for pandas.api.typing with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in #48577 (comment)).

unfortunately I missed the November dev call. I disagree with the comparison to numpy.typing because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.

My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing. However I'm also okay with pandas.api.aux.

I’m concerned about setting a precedent if we use pandas.api.typing.

bashtage · 2022-12-15T17:38:53Z

If the purpose is really only for typing, than it seems most natural to have it at either pd.typing or pd.api.typing. This intended use doesn't really fit the definition of auxiliary, which sounds like a place for some helper routines.

topper-123 · 2022-12-15T22:50:40Z

A thought: I never really liked that the type checking functions are located in pd.api.types. Could an idea be to:

move all the type checking functions (is_*/infer_dtype) to a new pandas.api.type_checkers
Add the proposes types in this issue to pandas.api.types and use that for actual types

The functions is pandas.api.types would of course go through a deprecation process, but we could hide them in a __dir__ function in the pandas.api.types module, to make having deprecated functions there less annoying.

rhshadrach · 2022-12-16T03:07:32Z

I disagree with the comparison to numpy.typing because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.

I would differentiate the two main (at least, in my opinion) purposes of type-hinting: (a) is increasing readability and (b) static type checking. You can accomplish (a) without e.g. pandas-stubs. In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.

Dr-Irv · 2022-12-16T19:56:04Z

In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.

I see your point. As long as we document it that way, I won't stand in the way of calling it pandas.api.typing .

simonjayhawkins · 2023-03-08T13:56:17Z

@TomAugspurger commented #27522 (comment) that if we're disciplined
about making private modules private with a leading underscore,
there's no need for a .api namespace.

Now if #48578 is merged we have effectively removed some barriers to making pandas.core private. #27522 and I think in general there is support to do that.

rhshadrach added Enhancement Groupby API Design labels Sep 16, 2022

rhshadrach self-assigned this Sep 16, 2022

rhshadrach mentioned this issue Sep 16, 2022

API: Add pandas.api.typing #48578

Merged

5 tasks

rhshadrach added this to the 1.6 milestone Sep 16, 2022

rhshadrach added the Needs Discussion Requires discussion from core team before further action label Sep 19, 2022

rhshadrach removed this from the 1.6 milestone Sep 19, 2022

rhshadrach added the Typing type annotations, mypy/pyright type checking label Sep 24, 2022

rhshadrach changed the title ~~API: Add DataFrameGroupBy and SeriesGroupBy to the top level API~~ API: Add pandas.api.typing Sep 28, 2022

rhshadrach changed the title ~~API: Add pandas.api.typing~~ API: Add pandas.api.aux Oct 28, 2022

rhshadrach removed Groupby Typing type annotations, mypy/pyright type checking labels Oct 28, 2022

rhshadrach mentioned this issue Nov 9, 2022

TST: Subclasses of ExcelWriter should not be allowed to add new public attributes #49602

Closed

rhshadrach changed the title ~~API: Add pandas.api.aux~~ API: Add pandas.api.typing Jan 23, 2023

mroeschke mentioned this issue Feb 1, 2023

Avoid pd.core import in dask utils dask/dask#9907

Merged

3 tasks

rhshadrach added this to the 2.1 milestone Feb 26, 2023

mroeschke closed this as completed in #48578 Apr 17, 2023

twoertwein mentioned this issue Aug 5, 2023

ENH: Expose pandas._libs.missing.NAType #48388

Closed

3 tasks

twoertwein mentioned this issue Sep 22, 2023

ENH: Export (a subset of?) pandas._typing for type checking #55231

Open

3 tasks

rhshadrach mentioned this issue Dec 14, 2023

DOC: Provide a public place for users to link to our documentation #55632

Open

1 task

lucbouge mentioned this issue Aug 10, 2024

pyright cannot resolve pandas.api.typing microsoft/pyright#8732

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Add pandas.api.typing #48577

API: Add pandas.api.typing #48577

rhshadrach commented Sep 16, 2022 •

edited

Loading

jbrockmendel commented Sep 16, 2022

rhshadrach commented Sep 17, 2022 •

edited

Loading

topper-123 commented Sep 18, 2022

rhshadrach commented Sep 18, 2022

topper-123 commented Sep 20, 2022

rhshadrach commented Sep 24, 2022 •

edited

Loading

topper-123 commented Sep 25, 2022

Dr-Irv commented Sep 29, 2022

rhshadrach commented Sep 29, 2022 •

edited

Loading

rhshadrach commented Sep 29, 2022

Dr-Irv commented Sep 29, 2022

Dr-Irv commented Sep 29, 2022

rhshadrach commented Oct 23, 2022

Dr-Irv commented Oct 28, 2022

rhshadrach commented Nov 9, 2022 •

edited

Loading

mroeschke commented Nov 9, 2022

jorisvandenbossche commented Nov 9, 2022

rhshadrach commented Nov 9, 2022

rhshadrach commented Dec 15, 2022 •

edited

Loading

Dr-Irv commented Dec 15, 2022

bashtage commented Dec 15, 2022

topper-123 commented Dec 15, 2022

rhshadrach commented Dec 16, 2022

Dr-Irv commented Dec 16, 2022

simonjayhawkins commented Mar 8, 2023

API: Add pandas.api.typing #48577

API: Add pandas.api.typing #48577

Comments

rhshadrach commented Sep 16, 2022 • edited Loading

jbrockmendel commented Sep 16, 2022

rhshadrach commented Sep 17, 2022 • edited Loading

topper-123 commented Sep 18, 2022

rhshadrach commented Sep 18, 2022

topper-123 commented Sep 20, 2022

rhshadrach commented Sep 24, 2022 • edited Loading

topper-123 commented Sep 25, 2022

Dr-Irv commented Sep 29, 2022

rhshadrach commented Sep 29, 2022 • edited Loading

rhshadrach commented Sep 29, 2022

Dr-Irv commented Sep 29, 2022

Dr-Irv commented Sep 29, 2022

rhshadrach commented Oct 23, 2022

Dr-Irv commented Oct 28, 2022

rhshadrach commented Nov 9, 2022 • edited Loading

mroeschke commented Nov 9, 2022

jorisvandenbossche commented Nov 9, 2022

rhshadrach commented Nov 9, 2022

rhshadrach commented Dec 15, 2022 • edited Loading

Dr-Irv commented Dec 15, 2022

bashtage commented Dec 15, 2022

topper-123 commented Dec 15, 2022

rhshadrach commented Dec 16, 2022

Dr-Irv commented Dec 16, 2022

simonjayhawkins commented Mar 8, 2023

rhshadrach commented Sep 16, 2022 •

edited

Loading

rhshadrach commented Sep 17, 2022 •

edited

Loading

rhshadrach commented Sep 24, 2022 •

edited

Loading

rhshadrach commented Sep 29, 2022 •

edited

Loading

rhshadrach commented Nov 9, 2022 •

edited

Loading

rhshadrach commented Dec 15, 2022 •

edited

Loading