[SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior

Yikun · HyukjinKwon · commit dcccbf4f9ddd · 2022-07-19T09:34:32.000+09:00
### What changes were proposed in this pull request?

Respect the Series.concat sort parameter when `num_series == 1` to follow pandas 1.4.3 behavior.

### Why are the changes needed?

In #36711, we followed pandas 1.4.2 behavior and respected the Series.concat sort parameter, except in the `num_series == 1` case. [pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixes the issue pandas-dev/pandas#47127, which also fixes the `num_series == 1` bug, so this PR follows pandas 1.4.3 behavior.

### Does this PR introduce _any_ user-facing change?

Yes. This case is already covered in https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst:

```
In Spark 3.4, the Series.concat sort parameter will be respected to follow pandas 1.4 behaviors.
```

### How was this patch tested?

- CI passed
- test_concat_index_axis passed with pandas 1.3.5, 1.4.2, and 1.4.3.

Closes #37217 from Yikun/SPARK-39807.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
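
As a rough illustration of the condition change described above, here is a minimal sketch in plain Python. The helper names `merged_columns_before` and `merged_columns_after` are hypothetical and do not exist in `namespace.py`, and `str` stands in for the `name_like_string` sort key used in the real code; only the `if` conditions mirror the diff.

```python
def merged_columns_before(columns, sort, num_series):
    # Pre-patch condition: sorting is skipped when exactly one Series is involved.
    # (Hypothetical helper; `str` is a stand-in for `name_like_string`.)
    if sort and num_series != 1:
        return sorted(columns, key=str)
    return list(columns)


def merged_columns_after(columns, sort):
    # Post-patch condition: sort whenever sort=True, regardless of num_series.
    if sort:
        return sorted(columns, key=str)
    return list(columns)


columns = ["C", "A", "B"]
print(merged_columns_before(columns, sort=True, num_series=1))  # order left as-is
print(merged_columns_after(columns, sort=True))                 # sorted by label
```

With `sort=False`, both helpers return the columns in their original order, so only the `sort=True`, single-Series case changes.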
diff --git a/python/pyspark/pandas/namespace.py b/python/pyspark/pandas/namespace.py
@@ -2621,9 +2621,8 @@ def resolve_func(psdf, this_column_labels, that_column_labels):
 
             assert len(merged_columns) > 0
 
-            # If sort is True, always sort when there are more than two Series,
-            # and if there is only one Series, never sort to follow pandas 1.4+ behavior.
-            if sort and num_series != 1:
+            # If sort is True, always sort
+            if sort:
                 # FIXME: better ordering
                 merged_columns = sorted(merged_columns, key=name_like_string)
 
diff --git a/python/pyspark/pandas/tests/test_namespace.py b/python/pyspark/pandas/tests/test_namespace.py
@@ -334,19 +334,21 @@ def test_concat_index_axis(self):
             ([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]),
             ([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]),
             ([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]),
-            # only one Series
-            ([psdf, psdf["C"]], [pdf, pdf["C"]]),
-            ([psdf["C"], psdf], [pdf["C"], pdf]),
             # more than two Series
             ([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]),
         ]
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
-            # more than two Series
-            psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]])
-            for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
-                # See also https://github.com/pandas-dev/pandas/issues/47127
-                if (join, sort) != ("outer", True):
+        # See also https://github.com/pandas-dev/pandas/issues/47127
+        if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"):
+            series_objs = [
+                # more than two Series
+                ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]),
+                # only one Series
+                ([psdf, psdf["C"]], [pdf, pdf["C"]]),
+                ([psdf["C"], psdf], [pdf["C"], pdf]),
+            ]
+            for psdfs, pdfs in series_objs:
+                for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
                     self.assert_eq(
                         ps.concat(psdfs, ignore_index=ignore_index, join=join, sort=sort),
                         pd.concat(pdfs, ignore_index=ignore_index, join=join, sort=sort),