[SPARK-34521][PYTHON][SQL] Fix spark.createDataFrame when using pandas with StringDtype #34509
Conversation
… a string dtype column (0ce7f97 to dc16643)
LGTM
Not a big deal, but could you add a simple example to the PR description showing what issue is resolved, with before/after? e.g.

Before:
>>> spark.createDataFrame(pd.DataFrame({"A": ['a', 'b', 'c']}, dtype="string"))
.../spark/python/pyspark/sql/pandas/conversion.py:402: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
DataFrame[A: string]

After:
>>> spark.createDataFrame(pd.DataFrame({"A": ['a', 'b', 'c']}, dtype="string"))
DataFrame[A: string]
@itholic Thanks. I've updated the description to reflect the change.
Hey @itholic, can we merge this?
It's LGTM, but let's wait for other members to verify this change since I don't have permission to merge 😅 cc @HyukjinKwon @ueshin, could you take a look at this one when you find some time?
@@ -169,6 +169,8 @@ def create_array(s, t):
        elif is_categorical_dtype(s.dtype):
            # Note: This can be removed once minimum pyarrow version is >= 0.16.1
            s = s.astype(s.dtypes.categories.dtype)
        elif t is not None and pa.types.is_string(t):
            s = s.astype(str)
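For context, a minimal standalone sketch of what this workaround does (illustration only; it assumes a column without missing values, and the imports are mine):

import pandas as pd
import pyarrow as pa

# A StringDtype column is backed by pandas' StringArray extension array.
s = pd.Series(["a", "b", "c"], dtype="string")

# The workaround casts it back to plain Python str objects (object dtype),
# so the existing mask-based Arrow conversion path can be used.
s_obj = s.astype(str)
print(s_obj.dtype)  # object

arr = pa.Array.from_pandas(s_obj, mask=s_obj.isnull(), type=pa.string())
print(arr.type)     # string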
So are you saying that the string type from pandas can produce a different type in Arrow? cc @BryanCutler
While I understand that we can work around it:

"Pandas stores string columns in two different ways: using a numpy ndarray or using a custom StringArray. The StringArray version is used when specifying dtype=string. When that happens, spark cannot serialize the column to arrow."

This sounds like an issue on the Arrow side.
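For illustration, a minimal sketch of the two storage mechanisms described above (assuming a recent pandas; exact class names vary across pandas versions):

import pandas as pd

# Default: strings are stored in a numpy ndarray of Python objects.
s_obj = pd.Series(["a", "b", "c"])
print(s_obj.dtype)           # object
print(type(s_obj.values))    # numpy.ndarray

# With dtype="string": strings are stored in pandas' StringArray extension array.
s_ext = pd.Series(["a", "b", "c"], dtype="string")
print(s_ext.dtype)           # string
print(type(s_ext.array))     # pandas StringArray

# The extension array implements the __arrow_array__ protocol, which is what
# trips up the mask-based conversion in the serializer.
print(hasattr(s_ext.array, "__arrow_array__"))  # True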
Now that I think about it again, adding support for StringArray in Arrow is the real solution; this change is just a workaround. I don't know much about the Arrow internals or whether they can add some metadata to support two different string types. If you feel this is better done in Arrow upstream, feel free to close the PR and I'll investigate it from the Arrow side. Otherwise, we can keep this patch with a comment added and keep investigating in Arrow.
I'll leave it to @BryanCutler (who's a maintainer for both PySpark and PyArrow).
I'm afraid I don't think this is the right fix.
Actually, this happens with Int64, Int32, or other extension dtypes as well:
>>> spark.createDataFrame(pd.DataFrame({"A": [1, 2, 3]}, dtype="Int64"))
.../python/pyspark/sql/pandas/conversion.py:422: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
DataFrame[A: bigint]
The error message says "Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.", so we should follow the message.
How about at line 163:

if hasattr(s.values, '__arrow_array__'):
    mask = None
else:
    mask = s.isnull()
cc @BryanCutler
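For illustration, a minimal standalone reproduction of the failure described above and the effect of dropping the mask (a sketch only; the exact exception type may vary across pyarrow versions):

import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, None], dtype="Int64")  # extension dtype (IntegerArray)

# Passing a mask together with an object that implements __arrow_array__ fails.
try:
    pa.Array.from_pandas(s, mask=s.isnull())
except Exception as e:  # typically a ValueError
    print(e)  # Cannot specify a mask or a size when passing an object that is
              # converted with the __arrow_array__ protocol.

# With no mask, pyarrow uses the protocol directly and nulls are still preserved.
print(pa.Array.from_pandas(s))  # [1, 2, null]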
I agree with @ueshin; the correct thing to do would be to continue to use pyarrow to convert the column in this case with mask = None, so that the implementation is sure to do the conversion most efficiently. I think for this case it should then produce a standard Arrow UTF-8 array that PySpark can handle.
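A short sketch of that expectation (illustrative only): converting a StringDtype column without a mask yields a plain Arrow utf8 array.

import pandas as pd
import pyarrow as pa

s = pd.Series(["a", "b", None], dtype="string")
arr = pa.Array.from_pandas(s, mask=None, type=pa.string())
print(arr.type)  # string (standard UTF-8 array)
print(arr)       # ["a", "b", null]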
@ueshin Makes a lot of sense. I wasn't aware this was a problem for other dtypes. Removing the mask works, but writing a quick test:
with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": True}):
    pandas_df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
    df = self.spark.createDataFrame(pandas_df)
    assert_frame_equal(pandas_df, df.toPandas())
fails with:
Attribute "dtype" are different
[left]: Int64
[right]: float64
I'm going to take a look at where the conversion is failing.
I think it's fine as long as the df's data type is LongType for dtype="Int64", because df.toPandas() won't keep the extension dtypes. So the dtype of df.toPandas() will be int64, or float64 if the column contains None.
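For illustration, a sketch of the round trip being discussed (assumes an active SparkSession named spark with Arrow enabled):

import pandas as pd

pdf = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
sdf = spark.createDataFrame(pdf)

print(sdf.dtypes)              # [('col', 'bigint')]  -> LongType in Spark
roundtrip = sdf.toPandas()
# The extension dtype is not preserved: with a null present, the column comes
# back as float64; otherwise as plain int64.
print(roundtrip["col"].dtype)  # float64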
ok to test
LGTM, except for the lint-python error.
@nicolasazrak Could you run ./dev/reformat-python to fix the linter error?
@BryanCutler Could you double-check the fix is right? I'd leave it to you.
Test build #145964 has finished for PR 34509 at commit …
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145966 has finished for PR 34509 at commit …
Kubernetes integration test starting
Kubernetes integration test status failure
@@ -215,7 +215,10 @@ def _create_batch(self, series):
        series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)

        def create_array(s, t):
            mask = s.isnull()
            if hasattr(s.values, "__arrow_array__"):
This actually reminds me of #28743. How are the two related?
Looks really similar. I've added my tests to that branch and they passed. (However, a lot of other tests failed, but probably for other reasons.)
I think it's preferred to use s.array instead of s.values, would you mind giving that a try?
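For illustration, the difference between the two accessors (a sketch; the wrapper class for numpy-backed columns is named differently across pandas versions):

import pandas as pd

ext = pd.Series(["a", "b"], dtype="string")
plain = pd.Series([1, 2])

# .values returns a numpy ndarray for numpy-backed columns, but the extension
# array for extension dtypes; .array always returns the backing array object.
print(type(plain.values))  # numpy.ndarray
print(type(plain.array))   # numpy-backed wrapper (PandasArray / NumpyExtensionArray)
print(type(ext.array))     # StringArray

# Either accessor exposes __arrow_array__ for the string extension column.
print(hasattr(ext.array, "__arrow_array__"))  # True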
The difference with #28743 is that it was trying to deal with pyarrow extension types. For a pandas extension type the … This PR is a step in the right direction, so I think it's ok to merge. It will add support for any pandas extension types that are backed by a standard arrow array, although I don't think it will be able to convert it back to pandas as the original extension type. To fully support pandas/pyarrow extension types we would need to propagate the extension type info through Spark so that when it is worked on again in Python, the extension part can be loaded back up. I'm not exactly sure how difficult that might be to do.
Looks fine to me, I just had a minor request to use s.array instead.
Test build #146194 has finished for PR 34509 at commit …
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #146197 has finished for PR 34509 at commit …
Kubernetes integration test starting
Kubernetes integration test status failure
LGTM
Thanks @BryanCutler!
Merged to master, thanks @nicolasazrak. Do you have an apache.org account so I could assign the issue to you?
@BryanCutler Yes, my username is …
What changes were proposed in this pull request?
This change fixes SPARK-34521. It allows creating a Spark DataFrame from a pandas DataFrame that uses a StringDtype column when the Arrow PySpark optimization is enabled.

Why are the changes needed?
Pandas stores string columns in two different ways: using a numpy ndarray or using a custom StringArray. The StringArray version is used when specifying dtype=string. When that happens, Spark cannot serialize the column to Arrow. Converting the Series before fixes this problem.

However, due to the different ways of handling string columns, spark.createDataFrame(pandas_dataframe).toPandas() might not be equal to pandas_dataframe; the column dtype could be different.

More info: https://pandas.pydata.org/docs/user_guide/text.html
Does this PR introduce any user-facing change?
Trying to create a Spark DataFrame from a pandas DataFrame using a string dtype with "spark.sql.execution.arrow.pyspark.enabled" now doesn't throw an exception and returns the expected dataframe.

Before:
spark.createDataFrame(pd.DataFrame({"A": ['a', 'b', 'c']}, dtype="string"))
After:
spark.createDataFrame(pd.DataFrame({"A": ['a', 'b', 'c']}, dtype="string"))
How was this patch tested?
Using the test_createDataFrame_with_string_dtype test.
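For reference, a minimal sketch of what such a test could look like (this is not the exact test from the PR; the column values and assertions are assumptions):

def test_createDataFrame_with_string_dtype(self):
    with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": True}):
        pandas_df = pd.DataFrame({"col": ["a", "b", "c", None]}, dtype="string")
        df = self.spark.createDataFrame(pandas_df)
        # The column should map to a Spark string column and keep all rows.
        self.assertEqual(df.dtypes, [("col", "string")])
        self.assertEqual(df.count(), 4)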