[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct #31738

lidavidm · 2021-03-04T16:37:07Z

What changes were proposed in this pull request?

As a followup for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include:

toPandas() may be slower;
the resulting dataframe may not support some Pandas operations due to immutable backing arrays.

Why are the changes needed?

This will hopefully reduce user confusion of the kind reported in SPARK-34463.

Does this PR introduce any user-facing change?

Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental.

How was this patch tested?

This is a documentation-only change.

lidavidm · 2021-03-04T16:38:00Z

CC @WeichenXu123 and @BryanCutler.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

python/docs/source/user_guide/arrow_pandas.rst

HyukjinKwon · 2021-03-05T04:55:12Z

ok to test

SparkQA · 2021-03-05T05:50:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40375/

SparkQA · 2021-03-05T05:59:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40375/

SparkQA · 2021-03-05T09:30:49Z

Test build #135793 has finished for PR 31738 at commit af26e25.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-05T16:12:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40397/

SparkQA · 2021-03-05T16:47:30Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40397/

SparkQA · 2021-03-05T19:51:44Z

Test build #135815 has finished for PR 31738 at commit b231ac6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2021-03-08T09:43:49Z

python/docs/source/user_guide/arrow_pandas.rst

+This option is experimental, and some operations may fail on the resulting Pandas dataframe due to immutable backing arrays.
+Typically, you would see the error ``ValueError: buffer source array is read-only``.
+Newer versions of Pandas may fix these errors by improving support for such cases.
+Additionally, this conversion may be slower because it is single-threaded.


Could we explicitly say which versions of pandas trigger the bug?

Currently my tests show that pandas versions > 1.0.5 trigger the bug.

I think I haven't fully explained the nature of this: it's not any single issue in Pandas, nor is it specific to any particular version. Instead, depending on how each Pandas operation was implemented underneath, it may or may not have been declared to accept an immutable backing array, independently of whether that operation could be implemented on an immutable array. So whether you see this will depend on what exactly you do with the dataframe, and there's no one version range we can list or one issue we can link to. Indeed, you could see this error without this Arrow option enabled; it's just much less likely, since there are few cases where Arrow can perform a zero-copy conversion in that case.
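To make the point above concrete, here is a minimal pure-Python analogy, not Pandas or Arrow code. The function `column_sum` is hypothetical: it only reads its input, yet it insists on a writable buffer, which mimics an operation that was never declared to accept an immutable backing array. The error text is raised by hand and simply reuses the message quoted in this PR.

```python
from array import array


def column_sum(values):
    """Sum a buffer of doubles, but insist on a writable buffer.

    The writability check is deliberately unnecessary: summing only reads.
    It stands in for an operation that was never declared to accept
    read-only input (hypothetical helper, for illustration only).
    """
    view = memoryview(values)
    if view.readonly:
        # Message copied from the error quoted in this PR; raised by hand here.
        raise ValueError("buffer source array is read-only")
    return sum(view)


# A writable column of doubles, then a read-only view over the same memory,
# standing in for a zero-copy, immutable backing array.
column = array("d", [1.0, 2.0, 3.0])
frozen = memoryview(column).toreadonly()

try:
    column_sum(frozen)
    failure = None
except ValueError as exc:
    failure = str(exc)

# Copying the data into a fresh buffer makes it writable again.
writable_copy = array("d", frozen.tolist())
total = column_sum(writable_copy)

print(failure)
print(total)
```

Whether a real Pandas operation behaves like `column_sum` depends on how that operation is implemented, which is why no single version range can be listed.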

dongjoon-hyun

Just two questions.

When can we remove this Experimental tag?
Can we hold on this PR until we make a branch for Apache Spark 3.2.0?

lidavidm · 2021-03-12T18:13:51Z

Just two questions.

When can we remove this Experimental tag?

It's hard to say, but once it sees some usage, we can see how many such cases in Pandas need fixing. It might be the case that most Pandas operations work; even the one in the linked issue is already fixed upstream.

Can we hold on this PR until we make a branch for Apache Spark 3.2.0?

No objections here.

BryanCutler

LGTM, just a minor suggestion to maybe include a workaround in the doc. I'll try to keep an eye out for the 3.2.0 branch and then merge if not done already.

BryanCutler · 2021-03-18T21:05:32Z

python/docs/source/user_guide/arrow_pandas.rst

+
+Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas dataframe via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas dataframe.
+This option is experimental, and some operations may fail on the resulting Pandas dataframe due to immutable backing arrays.
+Typically, you would see the error ``ValueError: buffer source array is read-only``.


Would it be good to say a workaround is to make a copy of the column(s) used in the operation? I suppose they could just disable the setting in most cases though.

Probably, but still worth a brief mention.
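The two workarounds mentioned in this thread could be sketched roughly as follows. This is an illustration under assumptions, not code from the PR: the helper names and the `copy_columns` parameter are hypothetical, `spark` is assumed to be a `SparkSession`, and `sdf` a Spark DataFrame. The only Spark specifics used are the `selfDestruct` key quoted above and `spark.sql.execution.arrow.pyspark.enabled`, which turns on Arrow-based conversion.

```python
ARROW_ENABLED = "spark.sql.execution.arrow.pyspark.enabled"
SELF_DESTRUCT = "spark.sql.execution.arrow.pyspark.selfDestruct.enabled"


def to_pandas_self_destruct(spark, sdf, copy_columns=()):
    """Convert `sdf` with selfDestruct on, copying selected columns.

    Hypothetical helper: `spark` is assumed to be a SparkSession and
    `sdf` a Spark DataFrame.
    """
    spark.conf.set(ARROW_ENABLED, "true")
    spark.conf.set(SELF_DESTRUCT, "true")
    pdf = sdf.toPandas()
    # Workaround 1: replace each listed column with a copy, so later
    # operations work on a fresh array rather than the immutable one.
    for name in copy_columns:
        pdf[name] = pdf[name].copy()
    return pdf


def to_pandas_default(spark, sdf):
    """Workaround 2: disable selfDestruct and convert as usual."""
    spark.conf.set(SELF_DESTRUCT, "false")
    return sdf.toPandas()
```

A call such as `to_pandas_self_destruct(spark, sdf, copy_columns=["value"])` would copy only the column that a failing operation touches. Copying gives back part of the memory saving, so disabling the option may be simpler when many columns are affected.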

SparkQA · 2021-03-19T15:03:03Z

Test build #136261 has started for PR 31738 at commit 19183a0.

SparkQA · 2021-03-19T15:52:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40843/

SparkQA · 2021-03-19T16:04:50Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40843/

python/docs/source/user_guide/arrow_pandas.rst

HyukjinKwon · 2021-03-29T08:16:03Z

I am okay with this too.

Co-authored-by: Hyukjin Kwon <[email protected]>

SparkQA · 2021-03-29T17:58:00Z

Test build #136650 has finished for PR 31738 at commit b0115e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-03-30T04:30:20Z

Merged to master.

lidavidm · 2021-03-30T11:45:29Z

Thank you both for the review!

github-actions bot added PYTHON SQL labels Mar 4, 2021

HyukjinKwon reviewed Mar 5, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved)

HyukjinKwon reviewed Mar 5, 2021

python/docs/source/user_guide/arrow_pandas.rst (resolved)

HyukjinKwon reviewed Mar 5, 2021

python/docs/source/user_guide/arrow_pandas.rst (outdated, resolved)

lidavidm force-pushed the spark-34463 branch from af26e25 to b231ac6 Compare March 5, 2021 14:59

WeichenXu123 reviewed Mar 8, 2021

dongjoon-hyun reviewed Mar 11, 2021

BryanCutler approved these changes Mar 18, 2021

[SPARK-34463][PYSPARK][DOCS][MINOR] Document caveats of Arrow selfDestruct

19183a0

lidavidm force-pushed the spark-34463 branch from b231ac6 to 19183a0 Compare March 19, 2021 15:00

HyukjinKwon reviewed Mar 29, 2021

python/docs/source/user_guide/arrow_pandas.rst (outdated, resolved)

HyukjinKwon reviewed Mar 29, 2021

python/docs/source/user_guide/arrow_pandas.rst (outdated, resolved)

Apply suggestions from code review

b0115e5

Co-authored-by: Hyukjin Kwon <[email protected]>

HyukjinKwon approved these changes Mar 29, 2021

HyukjinKwon closed this in 1237124 Mar 30, 2021
