
Commit 1237124

lidavidm authored and HyukjinKwon committed
[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct
### What changes were proposed in this pull request?

As a follow-up to #29818, document the caveats of using the Arrow selfDestruct option in toPandas, which include:

- toPandas() may be slower;
- the resulting dataframe may not support some Pandas operations due to immutable backing arrays.

### Why are the changes needed?

This will hopefully reduce user confusion, as seen in SPARK-34463.

### Does this PR introduce _any_ user-facing change?

Yes - the documentation is updated and a config setting description is updated to clearly indicate the config is experimental.

### How was this patch tested?

This is a documentation-only change.

Closes #31738 from lidavidm/spark-34463.

Authored-by: David Li <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent 7158e7f commit 1237124


2 files changed: +11 -2 lines changed


python/docs/source/user_guide/arrow_pandas.rst

Lines changed: 9 additions & 0 deletions
@@ -410,3 +410,12 @@ described in `SPARK-29367 <https://issues.apache.org/jira/browse/SPARK-29367>`_
 ``pandas_udf``\s or :meth:`DataFrame.toPandas` with Arrow enabled. More information about the Arrow IPC change can
 be read on the Arrow 0.15.0 release `blog <https://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140>`_.
 
+Setting Arrow ``self_destruct`` for memory savings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas DataFrame.
+This option is experimental, and some operations may fail on the resulting Pandas DataFrame due to immutable backing arrays.
+Typically, you would see the error ``ValueError: buffer source array is read-only``.
+Newer versions of Pandas may fix these errors by improving support for such cases.
+You can work around this error by copying the column(s) beforehand.
+Additionally, this conversion may be slower because it is single-threaded.
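For context, a minimal PySpark sketch (not part of this patch) of what the documented behavior looks like in practice; the session settings, row count, and column name are illustrative assumptions:

    # Sketch only: assumes Spark 3.2+ with PyArrow available.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .config("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")  # experimental
        .getOrCreate()
    )

    # Arrow-allocated memory is released while the Pandas DataFrame is being built.
    pdf = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled").toPandas()

    # Operations on the result may raise "ValueError: buffer source array is read-only";
    # copying the affected column(s) beforehand works around it.
    pdf["doubled"] = pdf["doubled"].copy()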

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Lines changed: 2 additions & 2 deletions
@@ -2049,8 +2049,8 @@ object SQLConf {
 
   val ARROW_PYSPARK_SELF_DESTRUCT_ENABLED =
     buildConf("spark.sql.execution.arrow.pyspark.selfDestruct.enabled")
-      .doc("When true, make use of Apache Arrow's self-destruct and split-blocks options " +
-        "for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
+      .doc("(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks " +
+        "options for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
         "This reduces memory usage at the cost of some CPU time. " +
         "This optimization applies to: pyspark.sql.DataFrame.toPandas " +
         "when 'spark.sql.execution.arrow.pyspark.enabled' is set.")
