[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct #31738

Conversation
CC @WeichenXu123 and @BryanCutler.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
ok to test
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #135793 has finished for PR 31738 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #135815 has finished for PR 31738 at commit
This option is experimental, and some operations may fail on the resulting Pandas dataframe due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Newer versions of Pandas may fix these errors by improving support for such cases.
Additionally, this conversion may be slower because it is single-threaded.
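For context, a minimal sketch of how this option would be enabled and how to check whether the resulting columns ended up on read-only backing arrays. The config names are the ones documented in this PR; the DataFrame used here is illustrative, and whether a given column is converted zero-copy depends on its type and on the presence of nulls:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Arrow-based conversion must be on for selfDestruct to matter;
# selfDestruct.enabled is the experimental option documented by this PR.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

pdf = spark.range(1_000_000).toPandas()

# Columns converted zero-copy may sit on immutable buffers; Pandas operations
# whose implementation asks for a writable buffer can then raise
# "ValueError: buffer source array is read-only".
print(pdf["id"].values.flags.writeable)
```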
Could we explicitly say which pandas version will trigger the bug?
Currently my tests show that pandas versions > 1.0.5 trigger the bug.
I think I haven't fully explained the nature of this: it's not any single issue in Pandas, nor is it specific to any particular version. Instead, depending on how each Pandas operation is implemented underneath, it may or may not have been declared to accept an immutable backing array, independently of whether that operation could be implemented on an immutable array. So whether you see this depends on what exactly you do with the dataframe, and there's no one version range we can list or one issue we can link to. Indeed, you could see this error even without this Arrow option enabled; it's just much less likely, since there are few cases where Arrow can perform a zero-copy conversion in that case.
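To make the above concrete, here is an illustrative, PyArrow-level sketch (not part of this PR) of how a zero-copy conversion leaves Pandas columns on read-only buffers; ``split_blocks`` and ``self_destruct`` are existing options of ``pyarrow.Table.to_pandas``:

```python
import pyarrow as pa

table = pa.table({"x": list(range(10))})

# Numeric columns without nulls can be converted zero-copy; with
# self_destruct the Arrow buffers are released as they are consumed.
pdf = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # the table must not be used after a self_destruct conversion

arr = pdf["x"].values
print(arr.flags.writeable)  # typically False for the zero-copy case

# Whether a given Pandas operation then raises
# "ValueError: buffer source array is read-only" depends on whether its
# underlying implementation requests a writable buffer, not on one
# specific Pandas version.
```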
Just two questions.
- When can we remove this Experimental tag?
- Can we hold this PR until we make a branch for Apache Spark 3.2.0?
It's hard to say, but once it sees some usage, we can see how many such cases in Pandas need fixing. It might be the case that most Pandas operations work; even the one in the linked issue is already fixed upstream.
No objections here.
LGTM, just a minor suggestion to maybe include a workaround in the doc. I'll try to keep an eye out for the 3.2.0 branch and then merge if not done already.
Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas dataframe via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas dataframe.
This option is experimental, and some operations may fail on the resulting Pandas dataframe due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Would it be good to say a workaround is to make a copy of the column(s) used in the operation? I suppose they could just disable the setting in most cases, though.
Probably, but still worth a brief mention.
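For reference, a sketch of the workaround suggested above, with hypothetical DataFrame and column names; copying the affected data should give it a fresh, writable backing array:

```python
# Copy just the column(s) involved in the failing operation ...
pdf["value"] = pdf["value"].copy()

# ... or copy the whole DataFrame if many columns are affected.
pdf = pdf.copy()

# Alternatively, disable the experimental option entirely:
# spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "false")
```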
Test build #136261 has started for PR 31738 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
I am okay with this too.
Co-authored-by: Hyukjin Kwon <[email protected]>
Test build #136650 has finished for PR 31738 at commit
Merged to master.
Thank you both for the review!
What changes were proposed in this pull request?
As a follow-up for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include:
- Some operations may fail on the resulting Pandas dataframe due to immutable backing arrays, typically with ``ValueError: buffer source array is read-only``.
- The conversion may be slower because it is single-threaded.
Why are the changes needed?
This will hopefully reduce user confusion, as seen in SPARK-34463.
Does this PR introduce any user-facing change?
Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental.
How was this patch tested?
This is a documentation-only change.