ENH: Specify how pandas infers dtype on objects #41848

Open
mocquin opened this issue Jun 7, 2021 · 3 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

mocquin commented Jun 7, 2021

Hello there

Is your feature request related to a problem?

Context: I am creating a package to handle physical units (yes, another one), and I have started implementing its pandas interface. I looked at the pandas extension page, as well as at what pint did with pint-pandas. I am pretty satisfied with the result, except for one thing: when creating pandas objects (Series or DataFrame), I have to explicitly specify the dtype (QuantityDtype, the extension dtype for my "Quantity" class) that pandas should use to cast my Quantity object to the corresponding extension array. Categorical objects exhibit a similar problem:

import pandas as pd

# with an explicit dtype, this is indeed a Categorical Series
s = pd.Series(["a", "b", "c", "a"], dtype="category")
# without it, "object" is used as dtype
pd.Series(["a", "b", "c", "a"])

from physipy import m  # import the "meter" object
from physipy import QuantityDtype  # import the extension dtype for Quantity objects
# with an explicit dtype, this is indeed a QuantityDtype Series
s = pd.Series([1, 2, 3]*m, dtype=QuantityDtype)

# without it, the values are cast to integers and the unit is dropped
# (pandas bypasses my object by accessing its "array" value directly)
pd.Series([1, 2, 3]*m)

Now, I understand that for the Categorical example, it is not obvious what kind of dtype pandas should use, but for my custom class, I would like to be able to tell pandas how to behave.

Describe the solution you'd like

I would expect some interface like this :

import pandas as pd
from physipy import m, Quantity, QuantityDtype

# tell pandas to use QuantityDtype when a Quantity object is passed
# (pd.dtype_lut is the proposed interface; it does not exist in pandas)
pd.dtype_lut[Quantity] = QuantityDtype

# then a series can be created directly
my_quantity_object = [1, 2, 3]*m  # this is a Quantity object
s = pd.Series(my_quantity_object)  # note the absence of dtype specification

Here, pandas recognizes that it does not know the type of the passed object, and so checks its dtype_lut for a corresponding dtype.
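To make the proposed lookup concrete, here is a minimal pure-Python sketch of a type-keyed table. Everything in it is hypothetical: `dtype_lut` and `lookup_dtype` are illustrative names, not pandas APIs, and `Quantity` and `QuantityDtype` are bare stand-ins for the physipy classes. Walking the method resolution order is one possible design choice so that subclasses of a registered type are covered too.

```python
from typing import Any, Optional


class Quantity:
    """Stand-in for a unit-aware container (hypothetical, not physipy's class)."""

    def __init__(self, values, unit):
        self.values = list(values)
        self.unit = unit


class QuantityDtype:
    """Stand-in for the matching extension dtype (hypothetical)."""

    name = "quantity"


# Hypothetical look-up table: Python type -> dtype class that should hold it.
dtype_lut: dict = {}


def lookup_dtype(obj: Any) -> Optional[type]:
    """Return the dtype registered for obj's class or its nearest registered base."""
    for klass in type(obj).__mro__:
        if klass in dtype_lut:
            return dtype_lut[klass]
    return None  # nothing registered: the caller falls back to normal inference


# One-time registration, done by the library author.
dtype_lut[Quantity] = QuantityDtype

q = Quantity([1, 2, 3], "m")
registered = lookup_dtype(q)            # QuantityDtype
unregistered = lookup_dtype([1, 2, 3])  # None
```

With such a table, the user-facing call would need no dtype argument; only the registration line would have to run once at import time.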

Another interface would be to add a method with a pandas-specific name to Quantity that plays the role of this look-up table:

# in my Quantity class
class Quantity:
    ...

    def pd_dtype(self):
        return QuantityDtype

so that when pandas encounters an unknown object type, it first tries to get its dtype using "obj.pd_dtype()".
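The method-based variant could be sketched as below. This is only an illustration of the idea: `pd_dtype` is the hook name proposed in this issue, `dtype_from_hook` is a hypothetical helper, and neither exists in pandas. The pattern is similar in spirit to how NumPy looks for an `__array__` method on objects it does not recognize.

```python
from typing import Any, Optional


class QuantityDtype:
    """Stand-in for the extension dtype (hypothetical)."""

    name = "quantity"


class Quantity:
    """Stand-in for the unit-aware container (hypothetical)."""

    def __init__(self, values, unit):
        self.values = list(values)
        self.unit = unit

    def pd_dtype(self):
        # Proposed hook: the object itself says which dtype should hold it.
        return QuantityDtype


def dtype_from_hook(obj: Any) -> Optional[type]:
    """Ask obj for its dtype through the proposed hook, if it defines one."""
    hook = getattr(obj, "pd_dtype", None)
    if callable(hook):
        return hook()
    return None  # no hook: the caller falls back to normal inference


from_quantity = dtype_from_hook(Quantity([1, 2, 3], "m"))  # QuantityDtype
from_list = dtype_from_hook([1, 2, 3])                     # None
```

Compared with a look-up table, this keeps the mapping next to the class definition and needs no registration step, at the cost of reserving a method name on every participating class.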

Cheers

mocquin added the Enhancement and Needs Triage labels Jun 7, 2021

Dr-Irv (Contributor) commented Jun 7, 2021

The supported way to do this is to place your objects in your ExtensionArray class and pass that to Series. So if you create an ExtensionArray subclass called QuantityArray, then you would do:

from physipy import m
s = pd.Series(QuantityArray([1, 2, 3]*m))

pandas will then set the dtype of your series to QuantityDtype, since that is the dtype of QuantityArray.
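The same behaviour can be observed with extension arrays that ship with pandas, which may help show what "the dtype of the array" means here. This sketch uses the built-in nullable integer array and `Categorical` instead of the hypothetical `QuantityArray`; no dtype argument is passed to `Series` in either case.

```python
import pandas as pd

# A built-in extension array: nullable integers.
int_array = pd.array([1, 2, None], dtype="Int64")
int_series = pd.Series(int_array)  # dtype is taken from the array
print(int_series.dtype)            # Int64

# Another built-in extension array: Categorical.
cat_array = pd.Categorical(["a", "b", "c", "a"])
cat_series = pd.Series(cat_array)  # dtype is taken from the array
print(cat_series.dtype)            # category
```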

Will this work for you?

Dr-Irv added the Dtype Conversions and ExtensionArray labels and removed the Needs Triage label Jun 7, 2021

mocquin (Author) commented Jun 7, 2021

This does work, but I find it heavier than what I hoped for: semantically, my quantity object [1, 2, 3]*m is basically a 1D array of values with metadata, so the need for QuantityArray feels questionable from the user's point of view. I understand that the purpose of the QuantityArray class is to wrap an object so that it exhibits proper behavior with the pandas interface. But I was hoping for a lighter way to create the Series: with one setup somewhere that maps the 3 objects Quantity <-> QuantityDtype <-> QuantityArray.

For reference, matplotlib's unit interface does something like this. Basically, you define the conversion interface (kind of privately, on the developer side), then register the class with its interface:

# Finally we register our object type with the Matplotlib units registry.
units.registry[datetime.date] = DateConverter()

Then any user can use the plotting interface with a datetime.date, without worrying about the conversion interface. To be even more explicit about what I mean, you can do:

plt.plot(datetime.date.today())

as opposed to the heavier:

plt.plot(ConversionInterfaceDatetime(datetime.date.today()))

For example, would it be possible to extend infer_dtype to create the proper mapping between a Quantity object and the corresponding QuantityArray/QuantityDtype?

Maybe my problem is that I don't see (yet, probably) the added value of the "wrapper" ExtensionArray (it feels like my base object plus the ExtensionDtype would suffice), but I definitely don't have a broad view of the subject. I should say that my base object, like [1, 2, 3]*m, already behaves like a numpy array/Series in several respects, so the additional explicit Array wrapper feels heavy. See this notebook.

jbrockmendel (Member) commented

xref #27462 for the analogous issue for isna.

The solution here is going to involve having a check in infer_dtype along the lines of

# is_unambiguous_scalar is a proposed hook; it does not exist on ExtensionDtype yet
for dtype in pd.core.dtypes.base._registry.dtypes:
    if dtype.is_unambiguous_scalar(obj):
        return str(dtype)
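A self-contained sketch of that registry scan, written without pandas so it can run on its own, might look like the following. The names are assumptions made for illustration: `registered_dtypes` stands in for the list of dtype classes held by the pandas registry, `infer_registered_dtype` is a hypothetical helper, and `is_unambiguous_scalar` is the proposed classmethod, here returning True only for objects the dtype is certain to own.

```python
from typing import Any, List, Optional


class Quantity:
    """Stand-in for a unit-aware scalar or container (hypothetical)."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit


class QuantityDtype:
    """Stand-in for an extension dtype that implements the proposed hook."""

    name = "quantity"

    @classmethod
    def is_unambiguous_scalar(cls, obj: Any) -> bool:
        # Proposed hook: claim only objects that can belong to no other dtype.
        return isinstance(obj, Quantity)


# Stand-in for the list of registered extension dtype classes.
registered_dtypes: List[type] = [QuantityDtype]


def infer_registered_dtype(obj: Any) -> Optional[str]:
    """Return the name of the first registered dtype that claims obj, else None."""
    for dtype in registered_dtypes:
        if dtype.is_unambiguous_scalar(obj):
            return dtype.name
    return None  # no dtype claims obj: continue with the built-in inference


claimed = infer_registered_dtype(Quantity(1.0, "m"))  # "quantity"
unclaimed = infer_registered_dtype(3.14)              # None
```

The sketch returns `dtype.name` rather than `str(dtype)` because the loop variable here is a class, not a dtype instance; which of the two a real implementation would use is left open.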
