Commit dcccbf4

Yikun authored and HyukjinKwon committed
[SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
### What changes were proposed in this pull request?

Respect the Series.concat sort parameter when `num_series == 1` to follow pandas 1.4.3 behavior.

### Why are the changes needed?

In #36711, we followed the pandas 1.4.2 behavior and respected the Series.concat sort parameter except in the `num_series == 1` case. [pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixed the issue pandas-dev/pandas#47127, which also fixed the `num_series == 1` bug, so this PR follows the pandas 1.4.3 behavior.

### Does this PR introduce _any_ user-facing change?

Yes, we already cover this case in: https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst

```
In Spark 3.4, the Series.concat sort parameter will be respected to follow pandas 1.4 behaviors.
```

### How was this patch tested?

- CI passed
- test_concat_index_axis passed with pandas 1.3.5, 1.4.2, 1.4.3.

Closes #37217 from Yikun/SPARK-39807.

Authored-by: Yikun Jiang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
1 parent 88380b6 commit dcccbf4

2 files changed: +13 -12 lines changed

python/pyspark/pandas/namespace.py

Lines changed: 2 additions & 3 deletions
@@ -2621,9 +2621,8 @@ def resolve_func(psdf, this_column_labels, that_column_labels):
 
         assert len(merged_columns) > 0
 
-        # If sort is True, always sort when there are more than two Series,
-        # and if there is only one Series, never sort to follow pandas 1.4+ behavior.
-        if sort and num_series != 1:
+        # If sort is True, always sort
+        if sort:
             # FIXME: better ordering
             merged_columns = sorted(merged_columns, key=name_like_string)
 
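The effect of dropping the `num_series != 1` guard can be sketched in plain Python. This is a minimal illustration, not the pandas-on-Spark implementation: `merge_column_labels` and `label_key` are hypothetical names, `label_key` is only a stand-in for `name_like_string`, and the label-merging loop is an assumption about the code that precedes this hunk.

```python
from typing import List, Sequence, Tuple

# Column labels are modelled as tuples of strings (an assumption of this sketch).
Label = Tuple[str, ...]


def label_key(label: Label) -> str:
    """Hypothetical stand-in for the name_like_string sort key used in the diff."""
    return ".".join(label)


def merge_column_labels(labels_per_object: Sequence[Sequence[Label]], sort: bool) -> List[Label]:
    """Union the labels in first-seen order, then sort them whenever sort is True."""
    merged: List[Label] = []
    for labels in labels_per_object:
        for label in labels:
            if label not in merged:
                merged.append(label)

    assert len(merged) > 0

    # After this commit there is no special case for a single Series.
    if sort:
        merged = sorted(merged, key=label_key)
    return merged


# One Series named "C" followed by a DataFrame with columns A, B, C.
objects = [[("C",)], [("A",), ("B",), ("C",)]]
unsorted_labels = merge_column_labels(objects, sort=False)
sorted_labels = merge_column_labels(objects, sort=True)
```

With `sort=False` the labels keep their first-seen order (C, A, B). With `sort=True` they come back as A, B, C even though only one Series is involved, which is the case the removed guard used to exclude.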

python/pyspark/pandas/tests/test_namespace.py

Lines changed: 11 additions & 9 deletions
@@ -334,19 +334,21 @@ def test_concat_index_axis(self):
             ([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]),
             ([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]),
             ([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]),
-            # only one Series
-            ([psdf, psdf["C"]], [pdf, pdf["C"]]),
-            ([psdf["C"], psdf], [pdf["C"], pdf]),
             # more than two Series
             ([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]),
         ]
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
-            # more than two Series
-            psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]])
-            for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
-                # See also https://github.com/pandas-dev/pandas/issues/47127
-                if (join, sort) != ("outer", True):
+        # See also https://github.com/pandas-dev/pandas/issues/47127
+        if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"):
+            series_objs = [
+                # more than two Series
+                ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]),
+                # only one Series
+                ([psdf, psdf["C"]], [pdf, pdf["C"]]),
+                ([psdf["C"], psdf], [pdf["C"], pdf]),
+            ]
+            for psdfs, pdfs in series_objs:
+                for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
                     self.assert_eq(
                         ps.concat(psdfs, ignore_index=ignore_index, join=join, sort=sort),
                         pd.concat(pdfs, ignore_index=ignore_index, join=join, sort=sort),
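The new loop structure in this test can be sketched without pandas to show which combinations it covers. The values of `ignore_indexes`, `joins`, and `sorts` are assumptions here, since the diff does not show how they are defined, and the strings in `series_objs` are placeholders for the real `(psdfs, pdfs)` pairs.

```python
import itertools

# Assumed parameter values; the diff does not show how these lists are defined.
ignore_indexes = [True, False]
joins = ["inner", "outer"]
sorts = [True, False]

# Placeholder names standing in for the three (psdfs, pdfs) pairs in series_objs.
series_objs = ["frame+C+A", "frame+C", "C+frame"]

# Every object list is checked against every parameter combination.
cases = [
    (objs, ignore_index, join, sort)
    for objs in series_objs
    for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts)
]
```

Under these assumptions the loop produces 3 x 8 = 24 comparisons, and the `("outer", True)` combination that the old test skipped is now included for every object list.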
