Skip to content

[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct #31738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Mar 4, 2021

What changes were proposed in this pull request?

As a followup for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include:

  • toPandas() may be slower;
  • the resulting dataframe may not support some Pandas operations due to immutable backing arrays.

Why are the changes needed?

This will hopefully reduce user confusion as with SPARK-34463.

Does this PR introduce any user-facing change?

Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental.

How was this patch tested?

This is a documentation-only change.

@lidavidm
Copy link
Member Author

lidavidm commented Mar 4, 2021

CC @WeichenXu123 and @BryanCutler.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40375/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40375/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Test build #135793 has finished for PR 31738 at commit af26e25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40397/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40397/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Test build #135815 has finished for PR 31738 at commit b231ac6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This option is experimental, and some operations may fail on the resulting Pandas dataframe due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Newer versions of Pandas may fix these errors by improving support for such cases.
Additionally, this conversion may be slower because it is single-threaded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we explicitly say which version pandas will trigger the bug ?

Currently my test show that pandas version > 1.0.5 will trigger the bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I haven't fully explained the nature of this - it's not any single issue in Pandas, nor is it specific to any particular version. Instead, it's just that depending on how each Pandas operation was implemented underneath, it may or may not have been declared to accept an immutable backing array, independently of whether that operation could be implemented on an immutable array. So whether you see this will depend on what exactly you do with the dataframe, and there's no one version range we can list or one issue we can link to. And indeed, you could see this error see this without this Arrow option enabled; it's just much less likely, since there will be few cases that Arrow can perform a zero-copy conversion in that case.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two questions.

  • When can we remove this Experimental tag?
  • Can we hold on this PR until we make a branch for Apache Spark 3.2.0?

@lidavidm
Copy link
Member Author

Just two questions.

  • When can we remove this Experimental tag?

It's hard to say, but once it sees some usage, we can see how many such cases in Pandas need fixing. It might be the case that most Pandas operations work; even the one in the linked issue is already fixed upstream.

  • Can we hold on this PR until we make a branch for Apache Spark 3.2.0?

No objections here.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor suggestion to maybe include a workaround in the doc. I'll try to keep an eye out for the 3.2.0 branch and then merge if not done already.


Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas dataframe via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas dataframe.
This option is experimental, and some operations may fail on the resulting Pandas dataframe due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to say a workaround is to make a copy of the column(s) used in the operation? I suppose they could just disable the setting is most cases though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but still worth a brief mention.

@SparkQA
Copy link

SparkQA commented Mar 19, 2021

Test build #136261 has started for PR 31738 at commit 19183a0.

@SparkQA
Copy link

SparkQA commented Mar 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40843/

@SparkQA
Copy link

SparkQA commented Mar 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40843/

@HyukjinKwon
Copy link
Member

I am okay with this too.

@SparkQA
Copy link

SparkQA commented Mar 29, 2021

Test build #136650 has finished for PR 31738 at commit b0115e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@lidavidm
Copy link
Member Author

Thank you both for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants