BUG: Incomplete join with categorical MultiIndex #38502
Labels: Bug; Categorical (Categorical Data Type); Regression (Functionality that used to work in a prior pandas version); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
When we join two dataframes with similar MultiIndexes, where one level is a CategoricalDtype, some of the rows are not properly joined.
Code Sample
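The snippet below is a hypothetical sketch of the kind of setup described in this report, not the original code sample. Only the level name minor and the category order ['Y', 'X'] are taken from the report; the level name major, the column names, the values, and the helper make_frame are assumptions made for illustration. Whether the join misbehaves depends on the installed pandas version, so no output is shown.

```python
import pandas as pd

# Assumed category order, mirroring the ['Y', 'X'] mentioned in the report.
minor_dtype = pd.CategoricalDtype(["Y", "X"])


def make_frame(majors, minors, column, values):
    """Build a frame indexed by (major, minor) with a categorical minor level.

    Hypothetical helper; the level name "major" is an assumption.
    """
    index = pd.MultiIndex.from_arrays(
        [majors, pd.Categorical(minors, dtype=minor_dtype)],
        names=["major", "minor"],
    )
    return pd.DataFrame({column: values}, index=index)


# Two rows on the left, three on the right; the last right-hand row has no
# partner on the left and is dropped by a left join.
df1 = make_frame(["A", "A"], ["Y", "X"], "left", [1, 2])
df2 = make_frame(["A", "A", "B"], ["Y", "X", "X"], "right", [10, 20, 30])

# Index-on-index left join, the operation this report is about.
joined = df1.join(df2, how="left")
print(joined)
```

A missing value in the right column for a key that exists in both frames would correspond to the behaviour described here.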
Problem description
The code sample above gives us df1:
and df2:
If we left-join the two dataframes index-on-index, we expect the first two rows of df2 to be matched against the two rows of df1 (see Expected Output below). Instead, we find that the second row of df2 is ignored:
Clues
I am way out of my depth here, but these are some things I noticed while trying to find a minimal example:
- [...] minor index level in alphabetical order (i.e. dtype=pd.CategoricalDtype(['X', 'Y']) instead of ['Y', 'X']).
- [...] ordered parameter of CategoricalDtype has no influence.
- [...] df2, although dropped during the left-join, is crucial. Joining df1 and df2.iloc[:2] yields the expected output.
- [...] outer join
Expected Output
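One way to obtain a reference result for comparison is to repeat the join with the categorical level cast to plain object dtype. This is a minimal sketch under the assumption that the problem is specific to the categorical level; the data, the level name major, and the helper with_object_level are hypothetical and not part of the original report or of pandas.

```python
import pandas as pd

minor_dtype = pd.CategoricalDtype(["Y", "X"])  # category order from the report


def with_object_level(frame, level):
    """Return a copy of frame whose index level `level` has object dtype.

    Hypothetical helper, not a pandas API.
    """
    names = list(frame.index.names)
    flat = frame.reset_index()
    flat[level] = flat[level].astype(object)
    return flat.set_index(names)


# Assumed example data with a categorical "minor" level.
df1 = pd.DataFrame(
    {
        "major": ["A", "A"],
        "minor": pd.Series(["Y", "X"], dtype=minor_dtype),
        "left": [1, 2],
    }
).set_index(["major", "minor"])
df2 = pd.DataFrame(
    {
        "major": ["A", "A", "B"],
        "minor": pd.Series(["Y", "X", "X"], dtype=minor_dtype),
        "right": [10, 20, 30],
    }
).set_index(["major", "minor"])

# Same left join, but on object-dtype levels instead of categorical ones.
reference = with_object_level(df1, "minor").join(
    with_object_level(df2, "minor"), how="left"
)
print(reference)
```

Comparing this frame with the result of the direct categorical join shows which rows, if any, lost their match.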
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee
python : 3.9.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.9.13-arch1-1
Version : #1 SMP PREEMPT Tue, 08 Dec 2020 12:09:55 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.utf8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.19.4
pytz : 2019.3
dateutil : 2.8.1
pip : 20.3.1
setuptools : 51.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None