[SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns #28743
Conversation
ok to test
cc: @HyukjinKwon @viirya
Test build #123593 has finished for PR 28743 at commit
Thanks for your work, @moskvax! The failures look valid, so could you fix them first?
* Use infer_type over Schema.from_pandas for arrow type inference, as it can better handle extension types and pd.NA values
* Call __arrow_array__ directly if it is present to exit create_array early in _create_batch
* Add pandas version checks where required for tests
* Add tests covering pd.NA and BooleanDtype conversion
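For reference, a minimal sketch of the create_array early exit described above (hypothetical code, not the exact patch):

```python
import pyarrow as pa

def create_array(s, t):
    # If the Series' backing array implements __arrow_array__, let it build the
    # Arrow array itself and exit early.
    if hasattr(s.array, "__arrow_array__"):
        return s.array.__arrow_array__(type=t)
    # Otherwise fall back to the usual conversion through pyarrow.
    mask = s.isnull()
    return pa.Array.from_pandas(s, mask=mask, type=t)
```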
Test build #123640 has finished for PR 28743 at commit
Test build #123641 has finished for PR 28743 at commit
Test build #123642 has finished for PR 28743 at commit
@HyukjinKwon @viirya Please review when you've got a moment. Thank you.
- for name, field in zip(schema, arrow_schema):
-     struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+ for name, t in zip(schema, inferred_types):
+     struct.add(name, from_arrow_type(t), nullable=True)
Why don't we follow nullability anymore?
infer_type only returns a type, not a field, which would supposedly have nullability information. But it appears that in the implementation of Schema.from_pandas (link), inferring nullability was not actually done and the default nullable=True would always be returned. So this change is just following the existing behaviour of Schema.from_pandas.
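For illustration only (not code from the PR), this is the behaviour described above, assuming a pyarrow version in the range this PR targets:

```python
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"a": [1, 2, 3]})  # no missing values at all

# Schema.from_pandas returns Fields, but nullable is always left at the default True
field = pa.Schema.from_pandas(pdf, preserve_index=False)[0]
print(field.type, field.nullable)                 # int64 True

# infer_type only returns a DataType, with no nullability information attached
print(pa.infer_type(pdf["a"], from_pandas=True))  # int64
```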
Let's add a comment here to explain it?
Sounds good, will update with a comment.
Alternatively, any(s.isna()) could be checked if we wanted to actively infer nullability here. This would change existing behavior as well as being inconsistent with the non-Arrow path, though, which similarly defaults to inferred types being nullable:
spark/python/pyspark/sql/types.py, line 1069 (43063e2):
fields = [StructField(k, _infer_type(v), True) for k, v in items]
cc @BryanCutler FYI
- elif type(s.dtype) == pd.CategoricalDtype:
+ elif is_categorical_dtype(s.dtype):
By the way, this change was made as CategoricalDtype is only imported into the root pandas namespace after pandas 0.24.0, which was causing AttributeError when testing with earlier versions.
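A small illustration of the difference (not taken from the patch):

```python
import pandas as pd
from pandas.api.types import is_categorical_dtype  # also available on older pandas

s = pd.Series(["a", "b", "a"], dtype="category")

print(is_categorical_dtype(s.dtype))    # True, works across pandas versions
# type(s.dtype) == pd.CategoricalDtype  # AttributeError on pandas versions where
#                                       # CategoricalDtype is not in the root namespace
```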
Test build #123725 has finished for PR 28743 at commit
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):

         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
So without this change, pa.Schema.from_pandas cannot handle pandas extension types and pd.NA values?
pyarrow < 0.17.0 cannot handle either (ARROW-8159). pyarrow 0.17.x works as long as the columns that contain pd.NA values are not object-dtyped, which is the case by default as of pandas 1.0.4 (cf pandas-dev/pandas#32931). pa.infer_type can take a mask and thus avoids trying to infer the type of pd.NA values, which is what causes pa.Schema.from_pandas to fail here.

pa.Schema.from_pandas returns different types from pa.infer_type in two cases:

* Categorical arrays:
  * pa.Schema.from_pandas returns a DictionaryType
  * pa.infer_type returns the value_type of the DictionaryType, which is what is already used to determine the Spark type of the resulting column
* __arrow_array__-implementing arrays which return a specialised Arrow type (IntervalArray, PeriodArray):
  * pa.Schema.from_pandas returns the type of the array returned by __arrow_array__
  * pa.infer_type does not check for __arrow_array__ and thus fails with these arrays; however, these types cannot currently be converted to Spark types anyway

Neither of these cases causes regressions, which is why I propose replacing pa.Schema.from_pandas with pa.infer_type here.
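For the Categorical case, a rough illustration of the difference (not part of the PR; exact reprs vary across pyarrow versions):

```python
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"c": pd.Series(["x", "y", "x"], dtype="category")})

# Schema.from_pandas keeps the dictionary encoding of the Categorical column
print(pa.Schema.from_pandas(pdf, preserve_index=False)[0].type)         # dictionary<values=string, ...>

# infer_type only sees the category values themselves
print(pa.infer_type(pdf["c"], mask=pdf["c"].isna(), from_pandas=True))  # string
```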
For the second case above, does pa.Schema.from_pandas return the correct types? When pa.infer_type infers these array types, will it just throw an error or return a wrong array type?
pa.Schema.from_pandas will return a type that is a subclass of pa.ExtensionType. From that instance, there is a storage_type that is defined, which could then be checked as a Spark supported type. This assumes the Pandas extension array implements __arrow_array__, which is recommended; see https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol.
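A rough sketch of that suggestion (hypothetical helper, not code from this PR):

```python
import pyarrow as pa

def storage_backed_type(arrow_type):
    # If Schema.from_pandas produced an extension type, fall back to its
    # storage type before mapping it to a Spark type.
    if isinstance(arrow_type, pa.ExtensionType):
        return arrow_type.storage_type
    return arrow_type
```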
> For the second case above, does pa.Schema.from_pandas return the correct types? When pa.infer_type infers these array types, will it just throw an error or return a wrong array type?

pa.infer_type will throw an error for these arrays.
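Illustrative only (not from the PR): on the pyarrow versions discussed here, inference over a period-backed column fails because __arrow_array__ is never consulted.

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(pd.period_range("2020-01-01", freq="M", periods=3))

try:
    pa.infer_type(s, mask=s.isna(), from_pandas=True)
except Exception as exc:  # exact exception type and message depend on the pyarrow version
    print("type inference failed:", exc)
```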
mask = s.isnull()
# pass _ndarray_values to avoid potential failed type checks from pandas array types
Is there any test case for this?
This is a workaround for IntegerArray in pre-1.0.0 pandas, which did not yet implement __arrow_array__, so pyarrow expects it to be a NumPy array:
>>> import pandas as pd
>>> import pyarrow as pa
>>> print(pd.__version__, pa.__version__)
0.25.0 0.17.1
>>> s = pd.Series(range(3), dtype=pd.Int64Dtype())
>>> pa.Array.from_pandas(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/types.pxi", line 76, in pyarrow.lib._datatype_to_pep3118
File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
File "pyarrow/error.pxi", line 108, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
>>> pa.Array.from_pandas(s, type=pa.int64())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Input object was not a NumPy array
>>> pa.Array.from_pandas(s._ndarray_values, type=pa.int64())
<pyarrow.lib.Int64Array object at 0x7fb88007a980>
[
0,
1,
2
]
>>>
I'll update the comment to mention this.
thanks @viirya
Test build #123762 has finished for PR 28743 at commit
Thanks @moskvax, adding support for extension types would be great! I'm not sure using pa.infer_type is the way to go though; I think it's better to handle these cases explicitly by getting the pa.ExtensionType from pa.Schema.from_pandas and then extracting the storage_type from there. Would that be possible?
# Conflicts:
#	python/pyspark/sql/pandas/serializers.py
The goal of this PR was to allow conversion for columns backed by arrays that implement __arrow_array__, such as IntegerArray and StringArray. The storage values of other extension arrays are not necessarily meaningful as Spark values; for example, converting a DataFrame with a period[M] column gives the underlying integer ordinals:

>>> periods = pd.period_range('2020-01-01', freq='M', periods=6)
>>> pdf = pd.DataFrame({'A': pd.Series(periods)})
>>> pdf
A
0 2020-01
1 2020-02
2 2020-03
3 2020-04
4 2020-05
5 2020-06
>>> pdf.dtypes
A period[M]
dtype: object
>>> df = spark.createDataFrame(pdf)
>>> df.show()
+---+
| A|
+---+
|600|
|601|
|602|
|603|
|604|
|605|
+---+
>>> df.schema
StructType(List(StructField(A,LongType,true)))
So, in the cases where it is possible to convert using the storage_type, the resulting values may not be meaningful to the user.
Test build #123852 has finished for PR 28743 at commit
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Can this be pulled?
Any way this can be revived and pulled?!
Started hitting this issue today. Can this be reviewed and pulled?
What changes were proposed in this pull request?
* Use pa.infer_type over pa.Schema.from_pandas to infer Arrow types for conversion, as it handles pandas extension types and can ignore pd.NA values.
* Check for __arrow_array__ in series' backing arrays, and if present, use it to convert pandas DataFrame columns to Arrow arrays during serialisation.
Why are the changes needed?
These changes allow usage of pandas DataFrames which contain ExtensionDtype columns that are backed by arrays that implement __arrow_array__. DataFrames containing such columns will be returned when specifying an ExtensionDtype-extending pandas type in the dtype parameter when constructed, and can also be created via calling convert_dtypes on an existing DataFrame.
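Both ways of obtaining such a DataFrame, for illustration (not taken from the PR or its tests):

```python
import pandas as pd

# Constructed with an extension dtype passed through the dtype parameter...
pdf1 = pd.DataFrame({"x": [1, 2, None]}, dtype="Int64")

# ...or obtained by calling convert_dtypes on an existing DataFrame.
pdf2 = pd.DataFrame({"x": [1, 2, None], "y": ["a", None, "c"]}).convert_dtypes()

print(pdf1.dtypes)  # x: Int64   (IntegerArray-backed)
print(pdf2.dtypes)  # x: Int64, y: string   (StringArray-backed)
```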
Does this PR introduce any user-facing change?
Yes. Users will be able to convert a wider variety of pandas DataFrames into Spark DataFrames using any currently released pyarrow version > 0.15.1. Prior to this fix, neither the Arrow conversion path nor the fallback path would work with these DataFrames.
How was this patch tested?
Tests were added to cover the cases of converting from pandas DataFrames with IntegerArray- and StringArray-backed columns. A typo was also fixed in a recently added test.