
Commit 1237124

lidavidm authored and HyukjinKwon committed
[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct
### What changes were proposed in this pull request?

As a follow-up to #29818, document the caveats of using the Arrow selfDestruct option in toPandas, which include:

- toPandas() may be slower;
- the resulting dataframe may not support some Pandas operations due to immutable backing arrays.

### Why are the changes needed?

This will hopefully reduce user confusion, as seen in SPARK-34463.

### Does this PR introduce _any_ user-facing change?

Yes - the documentation is updated and a config setting description is updated to clearly indicate the config is experimental.

### How was this patch tested?

This is a documentation-only change.

Closes #31738 from lidavidm/spark-34463.

Authored-by: David Li <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent 7158e7f commit 1237124


2 files changed: +11 -2 lines changed


python/docs/source/user_guide/arrow_pandas.rst

Lines changed: 9 additions & 0 deletions
@@ -410,3 +410,12 @@ described in `SPARK-29367 <https://issues.apache.org/jira/browse/SPARK-29367>`_
 ``pandas_udf``\s or :meth:`DataFrame.toPandas` with Arrow enabled. More information about the Arrow IPC change can
 be read on the Arrow 0.15.0 release `blog <https://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140>`_.
 
+Setting Arrow ``self_destruct`` for memory savings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas DataFrame.
+This option is experimental, and some operations may fail on the resulting Pandas DataFrame due to immutable backing arrays.
+Typically, you would see the error ``ValueError: buffer source array is read-only``.
+Newer versions of Pandas may fix these errors by improving support for such cases.
+You can work around this error by copying the column(s) beforehand.
+Additionally, this conversion may be slower because it is single-threaded.
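For context, a minimal PySpark sketch (not part of this patch) of what the documented behavior looks like in practice; the session settings, row count, and column name are illustrative assumptions:

    # Sketch only: assumes Spark 3.2+ with PyArrow available.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .config("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")  # experimental
        .getOrCreate()
    )

    # Arrow-allocated memory is released while the Pandas DataFrame is being built.
    pdf = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled").toPandas()

    # Operations on the result may raise "ValueError: buffer source array is read-only";
    # copying the affected column(s) beforehand works around it.
    pdf["doubled"] = pdf["doubled"].copy()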

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Lines changed: 2 additions & 2 deletions
@@ -2049,8 +2049,8 @@ object SQLConf {
 
   val ARROW_PYSPARK_SELF_DESTRUCT_ENABLED =
     buildConf("spark.sql.execution.arrow.pyspark.selfDestruct.enabled")
-      .doc("When true, make use of Apache Arrow's self-destruct and split-blocks options " +
-        "for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
+      .doc("(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks " +
+        "options for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
         "This reduces memory usage at the cost of some CPU time. " +
         "This optimization applies to: pyspark.sql.DataFrame.toPandas " +
         "when 'spark.sql.execution.arrow.pyspark.enabled' is set.")
