
[SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns #28743


Closed
wants to merge 8 commits into from

Conversation

moskvax

@moskvax moskvax commented Jun 6, 2020

What changes were proposed in this pull request?

  1. Use pa.infer_type over pa.Schema.from_pandas to infer Arrow types for conversion, as it handles pandas extension types and can ignore pd.NA values,
  2. Check for the implementation of __arrow_array__ in series' backing arrays and, if present, use it to convert pandas DataFrame columns to Arrow arrays during serialisation (see the sketch below).
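
A minimal sketch of the serialisation-side check, for illustration only (the helper name below is hypothetical; the actual change is to create_array in the Arrow serializer):

import pyarrow as pa

def _series_to_arrow_array(s, arrow_type=None):
    # Hypothetical helper sketching the proposed logic: if the series' backing
    # array implements the __arrow_array__ protocol, let it build the Arrow
    # array directly; otherwise fall back to pa.Array.from_pandas, which
    # expects NumPy-backed data.
    backing_array = getattr(s, 'array', s.values)
    if hasattr(backing_array, '__arrow_array__'):
        return backing_array.__arrow_array__(type=arrow_type)
    return pa.Array.from_pandas(s, mask=s.isnull(), type=arrow_type)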

Why are the changes needed?

These changes allow pandas DataFrames containing ExtensionDtype columns backed by arrays that implement __arrow_array__ to be converted. Such DataFrames are produced when an ExtensionDtype-based pandas type is specified via the dtype parameter at construction time, and can also be created by calling convert_dtypes on an existing DataFrame, as in the example below.
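
For example (a minimal illustration, assuming pandas >= 1.0), both of the following produce extension-dtype columns backed by __arrow_array__-implementing arrays:

import pandas as pd

# Nullable extension dtype requested via the dtype parameter at construction time.
pdf1 = pd.DataFrame({'A': [1, 2, None]}, dtype='Int64')

# Convert an existing DataFrame's columns to the nullable extension dtypes.
pdf2 = pd.DataFrame({'A': [1, 2, None], 'B': ['x', 'y', None]}).convert_dtypes()

print(pdf1.dtypes)  # A is Int64
print(pdf2.dtypes)  # A is Int64, B is string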

Does this PR introduce any user-facing change?

Yes. Users will be able to convert a wider variety of pandas DataFrames into Spark DataFrames using any currently released pyarrow version > 0.15.1. Prior to this fix, neither the Arrow conversion path nor the fallback path would work with these DataFrames.

How was this patch tested?

Tests were added to cover conversion from pandas DataFrames with IntegerArray- and StringArray-backed columns. A typo was also fixed in a recently added test.

@moskvax moskvax changed the title [SPARK-31920] Fix pandas conversion using Arrow with __arrow_array__ columns [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns Jun 6, 2020
@maropu
Member

maropu commented Jun 7, 2020

ok to test

@maropu
Member

maropu commented Jun 7, 2020

cc: @HyukjinKwon @viirya

@SparkQA

SparkQA commented Jun 7, 2020

Test build #123593 has finished for PR 28743 at commit 04a15f6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 7, 2020

Thanks for your work, @moskvax! The failures look valid, so could you fix them first?

@moskvax moskvax changed the title [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns [SPARK-31920][PYTHON][WIP] Fix pandas conversion using Arrow with __arrow_array__ columns Jun 8, 2020
@moskvax moskvax marked this pull request as draft June 8, 2020 13:21
* Use infer_type over Schema.from_pandas for arrow type inference, as it can better handle extension types and pd.NA values
* Call __arrow_array__ directly if it is present to exit create_array early in _create_batch
* Add pandas version checks where required for tests
* Add tests covering pd.NA and BooleanDtype conversion
@moskvax moskvax marked this pull request as ready for review June 8, 2020 16:03
@moskvax moskvax changed the title [SPARK-31920][PYTHON][WIP] Fix pandas conversion using Arrow with __arrow_array__ columns [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns Jun 8, 2020
@SparkQA

SparkQA commented Jun 8, 2020

Test build #123640 has finished for PR 28743 at commit e60e2d4.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2020

Test build #123641 has finished for PR 28743 at commit 4476771.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2020

Test build #123642 has finished for PR 28743 at commit 406347d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@moskvax
Author

moskvax commented Jun 10, 2020

@HyukjinKwon @viirya Please review when you've got a moment. Thank you.

-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)
Member

Why don't we follow nullability anymore?

Author

infer_type only returns a type, not a field, which would supposedly have nullability information. But it appears that in the implementation of Schema.from_pandas (link), inferring nullability was not actually done and the default nullable=True would always be returned. So this change is just following the existing behaviour of Schema.from_pandas.

Member

Let's add a comment here to explain it?

Author

Sounds good, will update with a comment.

Alternatively, any(s.isna()) could be checked if we wanted to actively infer nullability here. This would change existing behavior as well as being inconsistent with the non-Arrow path, though, which similarly defaults to inferred types being nullable:

fields = [StructField(k, _infer_type(v), True) for k, v in items]

@HyukjinKwon
Member

cc @BryanCutler FYI

Comment on lines 157 to 162
-        elif type(s.dtype) == pd.CategoricalDtype:
+        elif is_categorical_dtype(s.dtype):
Author

By the way, this change was made because CategoricalDtype is only imported into the root pandas namespace after pandas 0.24.0, which was causing an AttributeError when testing with earlier versions.
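
For reference, a minimal illustration of the version-safe check:

import pandas as pd
from pandas.api.types import is_categorical_dtype

s = pd.Series(['a', 'b', 'a'], dtype='category')

# Works on pandas versions where CategoricalDtype is not exposed in the root
# pandas namespace, unlike the previous type(s.dtype) == pd.CategoricalDtype check.
assert is_categorical_dtype(s.dtype)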

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123725 has finished for PR 28743 at commit 403f579.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):

         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
Member

So without this change, pa.Schema.from_pandas cannot handle pandas extension types and pd.NA values?

Author

pyarrow < 0.17.0 cannot handle either (ARROW-8159). pyarrow 0.17.x works as long as the columns that contain pd.NA values are not object-dtyped, which is the case by default as of pandas 1.0.4 (cf pandas-dev/pandas#32931). pa.infer_type can take a mask and thus avoids trying to infer the type of pd.NA values, which is what causes pa.Schema.from_pandas to fail here.

pa.Schema.from_pandas returns different types from pa.infer_type in two cases:

  1. Categorical arrays
    • pa.Schema.from_pandas returns a DictionaryType
    • pa.infer_type returns the value_type of the DictionaryType, which is what is already used to determine the Spark type of the resulting column
  2. __arrow_array__-implementing arrays which return a specialised Arrow type (IntervalArray, PeriodArray)
    • pa.Schema.from_pandas returns the type of the array returned by __arrow_array__
    • pa.infer_type does not check for __arrow_array__ and thus fails with these arrays; however, these types cannot currently be converted to Spark types anyway

Neither of these cases causes a regression, which is why I propose replacing pa.Schema.from_pandas with pa.infer_type here (see the example below).
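
For illustration, a minimal example of the masked inference (assuming pandas >= 1.0, so that the column carries pd.NA, and a pyarrow version in the supported range):

import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, None], dtype='Int64')  # the missing value becomes pd.NA

# Masking the missing values means pyarrow never has to interpret pd.NA itself.
inferred = pa.infer_type(s, mask=s.isna(), from_pandas=True)
print(inferred)  # int64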

Member

For the second case above, does pa.Schema.from_pandas return the correct types? And when pa.infer_type is applied to those specialised array types, does it throw an error or return a wrong array type?

Member

pa.Schema.from_pandas will return a type that is a subclass of pa.ExtensionType. That instance defines a storage_type, which could then be checked as a Spark-supported type. This assumes the pandas extension array implements __arrow_array__, which is recommended; see https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol.
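
For illustration, a sketch of this suggestion, assuming a period-dtype column whose __arrow_array__ produces an Arrow extension type with int64 storage:

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({'p': pd.period_range('2020-01-01', freq='M', periods=3)})
schema = pa.Schema.from_pandas(pdf, preserve_index=False)

for field in schema:
    if isinstance(field.type, pa.ExtensionType):
        # The extension type wraps a primitive storage type (int64 for periods),
        # which is what would be checked against the Spark-supported types.
        print(field.name, field.type.storage_type)  # p int64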

Author

For the second case above, does pa.Schema.from_pandas return the correct types? And when pa.infer_type is applied to those specialised array types, does it throw an error or return a wrong array type?

pa.infer_type will throw an error for these arrays.

Comment on lines 166 to 167
mask = s.isnull()
# pass _ndarray_values to avoid potential failed type checks from pandas array types
Member

Is there any test case for this?

@moskvax moskvax Jun 10, 2020

This is a workaround for IntegerArray in pre-1.0.0 pandas, which did not yet implement __arrow_array__, so pyarrow expects it to be a NumPy array:

>>> import pandas as pd
>>> import pyarrow as pa
>>> print(pd.__version__, pa.__version__)
0.25.0 0.17.1
>>> s = pd.Series(range(3), dtype=pd.Int64Dtype())
>>> pa.Array.from_pandas(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/types.pxi", line 76, in pyarrow.lib._datatype_to_pep3118
  File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
  File "pyarrow/error.pxi", line 108, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
>>> pa.Array.from_pandas(s, type=pa.int64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Input object was not a NumPy array
>>> pa.Array.from_pandas(s._ndarray_values, type=pa.int64())
<pyarrow.lib.Int64Array object at 0x7fb88007a980>
[
  0,
  1,
  2
]
>>>

I'll update the comment to mention this.

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123762 has finished for PR 28743 at commit 07d7f2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler left a comment

Thanks @moskvax, adding support for extension types would be great! I'm not sure using pa.infer_type is the way to go, though; I think it's better to handle these cases explicitly by getting the pa.ExtensionType from pa.Schema.from_pandas and then extracting the storage_type from there. Would that be possible?

@moskvax
Author

moskvax commented Jun 11, 2020

Thanks @moskvax, adding support for extension types would be great! I'm not sure using pa.infer_type is the way to go, though; I think it's better to handle these cases explicitly by getting the pa.ExtensionType from pa.Schema.from_pandas and then extracting the storage_type from there. Would that be possible?

The goal of this PR was to allow conversion for __arrow_array__-implementing arrays of ExtensionDtype values where the underlying type can be directly converted to primitive Arrow and Spark types, so I wasn't focusing on this case at first, but I've looked into it today following the approach you described.

The storage_type of the pa.ExtensionType of PeriodArray is int64, which can be converted to a Spark column using the PeriodArray's _ndarray_values. However, without the PeriodDtype.freq, the period information cannot be reconstructed and the result in Spark is an arbitrary-looking sequence of integers:

>>> periods = pd.period_range('2020-01-01', freq='M', periods=6)
>>> pdf = pd.DataFrame({'A': pd.Series(periods)})
>>> pdf
         A
0  2020-01
1  2020-02
2  2020-03
3  2020-04
4  2020-05
5  2020-06
>>> pdf.dtypes
A    period[M]
dtype: object
>>> df = spark.createDataFrame(pdf)
>>> df.show()
+---+
|  A|
+---+
|600|
|601|
|602|
|603|
|604|
|605|
+---+

>>> df.schema
StructType(List(StructField(A,LongType,true)))

IntervalArray has an Arrow extension type with a storage_type of StructType(struct<left: timestamp[ns], right: timestamp[ns]>), which could be converted to a Spark StructType column if StructType conversion were supported by the Arrow conversion path; however, the closed information would still be missing from this schema.

So, in the cases where it is possible to convert using the storage_type, I think there should be a warning that the results may be unexpected as any type metadata that may be required to meaningfully interpret the type values is being discarded. Additionally, the round-trip back to pandas won't be possible for these types.

As for pa.Schema.from_pandas, it's most useful over pa.infer_type for the purposes of Spark conversion when the array being processed implements __arrow_array__ and can thus immediately and unambiguously return its own Arrow type. I've updated the PR to first try __arrow_array__ to determine a type and then fall back on pa.infer_type, roughly as sketched below. What do you think of this approach?
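
A simplified sketch of the updated inference (the helper name here is hypothetical; the real change lives in _create_from_pandas_with_arrow):

import pyarrow as pa

def _infer_spark_compatible_arrow_type(s):
    # Hypothetical helper illustrating the updated approach: let an
    # __arrow_array__-implementing backing array report its own Arrow type,
    # and only fall back to masked pa.infer_type otherwise.
    backing_array = getattr(s, 'array', None)
    if backing_array is not None and hasattr(backing_array, '__arrow_array__'):
        return backing_array.__arrow_array__().type
    return pa.infer_type(s, mask=s.isna(), from_pandas=True)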

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123852 has finished for PR 28743 at commit 01fb6a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TimestampFormatterHelper extends TimeZoneAwareExpression
  • case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger
  • case class ContinuousTrigger(intervalMs: Long) extends Trigger

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 29, 2020
@github-actions github-actions bot closed this Oct 30, 2020
@Pverheijen

Can this be pulled?

@careyhay

Any way this can be revived and pulled?!

@howardcornwell

Started hitting this issue today. Can this be reviewed and pulled?
