BUG: Inconsistent MultiIndex level dtypes when merging non-monotonic index with category dtype #45317
Closed
2 of 3 tasks
Labels: Bug; Categorical (Categorical Data Type); Duplicate Report (Duplicate issue or pull request); MultiIndex; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode); setops (union, intersection, difference, symmetric_difference)
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
When merging DataFrames whose MultiIndexes contain categorical levels, the resulting DataFrame sometimes keeps the categorical levels and sometimes does not. Specifically, if the input indexes are monotonic, the categorical dtype appears in the output; otherwise it does not.
In the provided example we merge two DataFrames, once where the indexes are monotonic and once where they are not. When they are monotonic, the output gives "category" for the dtype at level 1. However, if they are not monotonic, the output gives "int64".
I'm not sure which of these is correct, but it is very surprising that the result depends on the order of the input DataFrames' rows.
I think the problem is similar to, but not the same as, #38502. Note that I tested on 1.3.5, which already includes the fix for that bug.
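The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's original code: the helper names (`make_frame`, `merged_level_dtype`), the level names `l0`/`l1`, the sample values, and the choice of an outer merge on the index are all assumptions. The dtype printed for each case may differ between pandas versions, so no expected output is shown.

```python
import pandas as pd


def make_frame(pairs, column):
    """Build a one-column frame indexed by (l0, l1); l1 is a categorical level.

    Hypothetical helper for illustration only.
    """
    l0 = [p[0] for p in pairs]
    l1 = pd.Categorical([p[1] for p in pairs], categories=[10, 20, 30])
    index = pd.MultiIndex.from_arrays([l0, l1], names=["l0", "l1"])
    return pd.DataFrame({column: range(len(pairs))}, index=index)


def merged_level_dtype(left, right):
    """Outer-merge on the index and return the dtype name of level 1."""
    merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")
    return str(merged.index.get_level_values(1).dtype)


# Same rows in each pair of inputs; only the row order differs.
sorted_left = [(1, 10), (1, 20), (2, 30)]
sorted_right = [(1, 20), (2, 30), (3, 10)]
shuffled_left = [(2, 30), (1, 10), (1, 20)]
shuffled_right = [(3, 10), (1, 20), (2, 30)]

results = {}
for label, lp, rp in [
    ("monotonic", sorted_left, sorted_right),
    ("non-monotonic", shuffled_left, shuffled_right),
]:
    left, right = make_frame(lp, "x"), make_frame(rp, "y")
    results[label] = merged_level_dtype(left, right)
    print(label, left.index.is_monotonic_increasing, results[label])
```

If the two printed dtypes differ, the result depends on row order in the way this report describes.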
Expected Behavior
The dtypes in the result should not depend on the order of the input rows. Either the categorical dtype should always be preserved or it should never be, probably.
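Until the behavior is consistent, one possible workaround is to cast the affected level back to the categorical dtype after merging, so the result no longer depends on row order. The sketch below is only an illustration under assumptions: `restore_level_dtype` and `make_frame` are hypothetical helpers, not pandas API, and the sample data is made up.

```python
import pandas as pd


def restore_level_dtype(frame, level, dtype):
    """Return a copy of `frame` with index level number `level` cast to `dtype`.

    Hypothetical helper: rebuilds the MultiIndex from its level values.
    """
    idx = frame.index
    arrays = [
        idx.get_level_values(i).astype(dtype) if i == level else idx.get_level_values(i)
        for i in range(idx.nlevels)
    ]
    out = frame.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=idx.names)
    return out


cat_dtype = pd.CategoricalDtype(categories=[10, 20, 30])


def make_frame(pairs, column):
    """Build a one-column frame whose second index level is categorical."""
    l0 = [p[0] for p in pairs]
    l1 = pd.Categorical([p[1] for p in pairs], dtype=cat_dtype)
    index = pd.MultiIndex.from_arrays([l0, l1], names=["l0", "l1"])
    return pd.DataFrame({column: range(len(pairs))}, index=index)


# Deliberately non-monotonic inputs.
left = make_frame([(2, 30), (1, 10), (1, 20)], "x")
right = make_frame([(3, 10), (1, 20), (2, 30)], "y")

merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")
fixed = restore_level_dtype(merged, level=1, dtype=cat_dtype)
print(fixed.index.get_level_values(1).dtype)
```

The cast is applied unconditionally, so it is harmless when the merge already preserved the categorical level.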
Installed Versions
INSTALLED VERSIONS
commit : 66e3805
python : 3.10.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None