[SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior #37217

Closed. Wants to merge 1 commit.
5 changes: 2 additions & 3 deletions python/pyspark/pandas/namespace.py
@@ -2621,9 +2621,8 @@ def resolve_func(psdf, this_column_labels, that_column_labels):

assert len(merged_columns) > 0

-    # If sort is True, always sort when there are more than two Series,
-    # and if there is only one Series, never sort to follow pandas 1.4+ behavior.
-    if sort and num_series != 1:
+    # If sort is True, always sort
+    if sort:
         # FIXME: better ordering
         merged_columns = sorted(merged_columns, key=name_like_string)

20 changes: 11 additions & 9 deletions python/pyspark/pandas/tests/test_namespace.py
@@ -334,19 +334,21 @@ def test_concat_index_axis(self):
([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]),
([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]),
([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]),
# only one Series
([psdf, psdf["C"]], [pdf, pdf["C"]]),
([psdf["C"], psdf], [pdf["C"], pdf]),
Review comment (Member):

nit. I believe we can keep the test coverage to prevent a future regression at Series.concat.

Reply (Member, Author):

Thanks for the review.

Yes, actually I also moved this to L347-L348, which means we will always check all cases with the latest pandas, to avoid regression. I will also bump the infra pandas version to 1.4.3 after all fixes are complete.

For pandas < 1.4.3, these two cases would fail because pandas on Spark only follows the latest pandas behavior, so I just skip them.

If you have any other concerns, feel free to comment. Thanks!

Reply (Member):

Okay, fair enough.

Reply (Member):

Perfect. Thanks.
# more than two Series
([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]),
]

if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
# more than two Series
psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]])
for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
# See also https://github.com/pandas-dev/pandas/issues/47127
if (join, sort) != ("outer", True):
# See also https://github.com/pandas-dev/pandas/issues/47127
if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"):
series_objs = [
# more than two Series
([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]),
# only one Series
([psdf, psdf["C"]], [pdf, pdf["C"]]),
([psdf["C"], psdf], [pdf["C"], pdf]),
]
for psdfs, pdfs in series_objs:
for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
self.assert_eq(
ps.concat(psdfs, ignore_index=ignore_index, join=join, sort=sort),
pd.concat(pdfs, ignore_index=ignore_index, join=join, sort=sort),
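The `LooseVersion` gate used in the test above can be sketched standalone (a sketch: the real test imports `LooseVersion` from the deprecated `distutils`; a tuple comparison behaves the same way for plain "X.Y.Z" version strings, and `should_run_series_concat_cases` is a hypothetical name introduced here for illustration).

```python
# Minimal stand-in for the version gate in the test above.
def version_tuple(version: str) -> tuple:
    # Split "X.Y.Z" into a tuple of ints so comparisons are numeric,
    # not lexicographic ("1.10" > "1.4").
    return tuple(int(part) for part in version.split("."))

def should_run_series_concat_cases(pandas_version: str) -> bool:
    # GH#47127 (concat ignoring the sort parameter for Series inputs)
    # is fixed starting with pandas 1.4.3, so older versions are skipped.
    return version_tuple(pandas_version) >= version_tuple("1.4.3")

print(should_run_series_concat_cases("1.4.2"))  # False
print(should_run_series_concat_cases("1.4.3"))  # True
```

The same pattern gates the single-Series and multi-Series cases in the test, so coverage is kept on any CI image whose pandas already contains the fix.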