|
| 1 | +.. Licensed to the Apache Software Foundation (ASF) under one |
| 2 | +.. or more contributor license agreements. See the NOTICE file |
| 3 | +.. distributed with this work for additional information |
| 4 | +.. regarding copyright ownership. The ASF licenses this file |
| 5 | +.. to you under the Apache License, Version 2.0 (the |
| 6 | +.. "License"); you may not use this file except in compliance |
| 7 | +.. with the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | +.. http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +.. Unless required by applicable law or agreed to in writing, |
| 12 | +.. software distributed under the License is distributed on an |
| 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 14 | +.. KIND, either express or implied. See the License for the |
| 15 | +.. specific language governing permissions and limitations |
| 16 | +.. under the License. |
| 17 | +
|
| 18 | +Python Extensions |
| 19 | +================= |
| 20 | + |
| 21 | +The DataFusion in Python project is designed to allow users to extend its functionality in a few core |
| 22 | +areas. Many users would like to package their extensions as a Python package and easily
| 23 | +integrate that package with this project. This page describes some of the challenges we face
| 24 | +when doing these integrations and the approach our project uses.
| 25 | + |
| 26 | +The Primary Issue |
| 27 | +----------------- |
| 28 | + |
| 29 | +Suppose you wish to use DataFusion and you have a custom data source that can produce tables that |
| 30 | +can then be queried, similar to how you can register a :ref:`CSV <io_csv>` or
| 31 | +:ref:`Parquet <io_parquet>` file. In DataFusion terminology, you likely want to implement a |
| 32 | +:ref:`Custom Table Provider <io_custom_table_provider>`. In an effort to make your data source |
| 33 | +as performant as possible and to utilize the features of DataFusion, you may decide to write |
| 34 | +your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library. |
| 35 | + |
| 36 | +At first glance, it may appear the best way to do this is to add the ``datafusion-python`` |
| 37 | +crate as a dependency, provide a ``PyTable``, and then register it with the
| 38 | +``SessionContext``. Unfortunately, this will not work. |
| 39 | + |
| 40 | +When you produce your code as a Python library and it needs to interact with the DataFusion |
| 41 | +library, at the lowest level they communicate through an Application Binary Interface (ABI). |
| 42 | +The acronym sounds similar to API (Application Programming Interface), but it is distinctly |
| 43 | +different. |
| 44 | + |
| 45 | +The ABI sets the standard for how these libraries can share data and functions with each
| 46 | +other. One key difference between Rust and languages such as C is that Rust
| 47 | +does not have a stable ABI. This means that if you compile a Rust library
| 48 | +with one version of the ``rustc`` compiler and I compile another library to interface with it
| 49 | +but I use a different version of the compiler, there is no guarantee the interface will be
| 50 | +the same.
| 51 | + |
| 52 | +In practice, this means that a Python library built with ``datafusion-python`` as a Rust |
| 53 | +dependency will generally **not** be compatible with the DataFusion Python package, even |
| 54 | +if they reference the same version of ``datafusion-python``. If you attempt to do this, it may |
| 55 | +work on your local computer if you have built both packages with the same optimizations. |
| 56 | +This can sometimes lead to a false expectation that the code will work, but it frequently |
| 57 | +breaks the moment you try to use your package against the released packages. |
| 58 | + |
| 59 | +You can find more information about the Rust ABI in the
| 60 | +`Rust Reference <https://doc.rust-lang.org/reference/abi.html>`_.
| 61 | + |
| 62 | +The FFI Approach |
| 63 | +---------------- |
| 64 | + |
| 65 | +Rust supports interacting with other programming languages through its Foreign Function
| 66 | +Interface (FFI). The advantage of using the FFI is that it enables you to write data structures
| 67 | +and functions that have a stable ABI. This allows you to use Rust code with C, Python, and
| 68 | +other languages. In fact, the `PyO3 <https://pyo3.rs>`_ library uses the FFI to share data |
| 69 | +and functions between Python and Rust. |
| 70 | + |
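As a minimal sketch of what "a stable ABI" looks like in Rust source, the example below
marks a struct with ``#[repr(C)]`` and a function with ``extern "C"``. The names ``Point``
and ``point_norm`` are hypothetical and are not part of DataFusion; they only illustrate
the two annotations.

.. code-block:: rust

    /// `#[repr(C)]` lays the fields out using the platform's C rules, so the
    /// layout does not depend on the `rustc` version that compiled the crate.
    #[repr(C)]
    #[derive(Debug, Clone, Copy)]
    pub struct Point {
        pub x: f64,
        pub y: f64,
    }

    /// `extern "C"` selects the C calling convention for this function, so a
    /// caller built by a different compiler can still invoke it correctly.
    pub extern "C" fn point_norm(p: Point) -> f64 {
        (p.x * p.x + p.y * p.y).sqrt()
    }

Without these annotations the compiler is free to reorder fields and choose its own
calling convention, which is why two separately compiled Rust libraries cannot safely
exchange plain Rust types.
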
| 71 | +The approach we are taking in the DataFusion in Python project is to incrementally expose |
| 72 | +more portions of the DataFusion project via FFI interfaces. This allows users to write Rust |
| 73 | +code that does **not** require the ``datafusion-python`` crate as a dependency, expose their |
| 74 | +code in Python via PyO3, and have it interact with the DataFusion Python package. |
| 75 | + |
| 76 | +Early adopters of this approach include `delta-rs <https://delta-io.github.io/delta-rs/>`_,
| 77 | +which has adapted its Table Provider for use in ``datafusion-python`` with only a few lines
| 78 | +of code. Also, the DataFusion Python project uses the existing definitions from the
| 79 | +`Apache Arrow C Stream Interface <https://arrow.apache.org/docs/format/CStreamInterface.html>`_
| 80 | +to support importing **and** exporting tables. Any Python package that supports reading |
| 81 | +the Arrow C Stream interface can work with DataFusion Python out of the box! You can read |
| 82 | +more about working with Arrow sources in the :ref:`Data Sources <user_guide_data_sources>` |
| 83 | +page. |
| 84 | + |
| 85 | +To learn more about the Foreign Function Interface in Rust, the |
| 86 | +`Rustonomicon <https://doc.rust-lang.org/nomicon/ffi.html>`_ is a good resource. |
| 87 | + |
| 88 | +Inspiration from Arrow |
| 89 | +---------------------- |
| 90 | + |
| 91 | +DataFusion is built upon `Apache Arrow <https://arrow.apache.org/>`_. The canonical Python |
| 92 | +Arrow implementation, `pyarrow <https://arrow.apache.org/docs/python/index.html>`_, provides
| 93 | +an excellent way to share Arrow data between Python projects without performing any copy
| 94 | +operations on the data. They do this by using a well-defined set of interfaces. You can
| 95 | +find the details about their stream interface |
| 96 | +`here <https://arrow.apache.org/docs/format/CStreamInterface.html>`_. The |
| 97 | +`Rust Arrow Implementation <https://github.com/apache/arrow-rs>`_ also supports these |
| 98 | +``C`` style definitions via the Foreign Function Interface. |
| 99 | + |
| 100 | +In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow`` |
| 101 | +goes one step further to make sharing the interfaces easier in Python. They do this |
| 102 | +by exposing PyCapsules that contain the expected functionality. |
| 103 | + |
| 104 | +You can learn more about PyCapsules from the official |
| 105 | +`Python online documentation <https://docs.python.org/3/c-api/capsule.html>`_. PyCapsules |
| 106 | +have excellent support in PyO3 already. The |
| 107 | +`PyO3 online documentation <https://pyo3.rs/main/doc/pyo3/types/struct.pycapsule>`_ is a good source |
| 108 | +for more details on using PyCapsules in Rust. |
| 109 | + |
| 110 | +Two lessons we leverage from the Arrow project in DataFusion Python are: |
| 111 | + |
| 112 | +- We reuse the existing Arrow FFI functionality wherever possible. |
| 113 | +- We expose PyCapsules that contain an FFI-stable struct.
| 114 | + |
| 115 | +Implementation Details |
| 116 | +---------------------- |
| 117 | + |
| 118 | +The bulk of the code necessary to perform our FFI operations is in the upstream |
| 119 | +`DataFusion <https://datafusion.apache.org/>`_ core repository. You can review the code and |
| 120 | +documentation in the `datafusion-ffi`_ crate. |
| 121 | + |
| 122 | +Our FFI implementation is narrowly focused on sharing data and functions with Rust-backed
| 123 | +libraries. This allows us to use the `abi_stable crate <https://crates.io/crates/abi_stable>`_.
| 124 | +This is an excellent crate that allows for easy conversion between Rust native types
| 125 | +and FFI-safe alternatives. For example, if you need to pass a ``Vec<String>`` via FFI,
| 126 | +you can simply convert it to an ``RVec<RString>`` in an intuitive manner. It also supports
| 127 | +types like ``RResult`` and ``ROption`` that do not have an obvious translation to a
| 128 | +C equivalent.
| 129 | + |
| 130 | +The `datafusion-ffi`_ crate has been designed to make it easy to convert from DataFusion |
| 131 | +traits into their FFI counterparts. For example, if you have defined a custom |
| 132 | +`TableProvider <https://docs.rs/datafusion/45.0.0/datafusion/catalog/trait.TableProvider.html>`_ |
| 133 | +and you want to create a sharable FFI counterpart, you could write: |
| 134 | + |
| 135 | +.. code-block:: rust |
| 136 | +
|
| 137 | + let my_provider = MyTableProvider::default(); |
| 138 | + let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None); |
| 139 | +
|
| 140 | +If you were interfacing with a library that provided the above ``FFI_TableProvider`` and
| 141 | +you needed to turn it back into a ``TableProvider``, you can turn it into a
| 142 | +``ForeignTableProvider``, which implements the ``TableProvider`` trait.
| 143 | + |
| 144 | +.. code-block:: rust |
| 145 | +
|
| 146 | + let foreign_provider: ForeignTableProvider = (&ffi_provider).into();
| 147 | +
|
| 148 | +If you review the code in `datafusion-ffi`_ you will find that each of the traits we share |
| 149 | +across the boundary has two portions, one with an ``FFI_`` prefix and one with a ``Foreign``
| 150 | +prefix. This is used to distinguish which side of the FFI boundary that struct is
| 151 | +designed to be used on. The structures with the ``FFI_`` prefix are to be used on the
| 152 | +**provider** side. In the example we're showing, this means the code that
| 153 | +implements the underlying ``TableProvider`` to access your custom data source.
| 154 | +The structures with the ``Foreign`` prefix are to be used by the receiver. In this case, |
| 155 | +it is the ``datafusion-python`` library. |
| 156 | + |
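The two-sided pattern can be sketched with only the Rust standard library. Everything in
this example is hypothetical (``RowCounter``, ``FFI_RowCounter``, ``ForeignRowCounter``,
and ``FixedRows`` do not exist in `datafusion-ffi`_, which uses ``abi_stable`` types and
richer structs); it only illustrates how a provider-side struct of function pointers can
be wrapped on the receiving side so that it implements the original trait again.

.. code-block:: rust

    use std::ffi::c_void;

    /// Stand-in for a trait such as `TableProvider`.
    pub trait RowCounter {
        fn row_count(&self) -> u64;
    }

    /// Provider side: a C-layout struct holding function pointers and an
    /// opaque pointer to the real implementation.
    #[repr(C)]
    #[allow(non_camel_case_types)]
    pub struct FFI_RowCounter {
        row_count: unsafe extern "C" fn(this: &FFI_RowCounter) -> u64,
        release: unsafe extern "C" fn(this: &mut FFI_RowCounter),
        private_data: *mut c_void,
    }

    unsafe extern "C" fn row_count_wrapper(this: &FFI_RowCounter) -> u64 {
        // Recover the boxed trait object stored by `FFI_RowCounter::new`.
        let inner = unsafe { &*(this.private_data as *const Box<dyn RowCounter>) };
        inner.row_count()
    }

    unsafe extern "C" fn release_wrapper(this: &mut FFI_RowCounter) {
        if !this.private_data.is_null() {
            // Free the implementation in the library that allocated it.
            drop(unsafe { Box::from_raw(this.private_data as *mut Box<dyn RowCounter>) });
            this.private_data = std::ptr::null_mut();
        }
    }

    impl FFI_RowCounter {
        pub fn new(inner: Box<dyn RowCounter>) -> Self {
            Self {
                row_count: row_count_wrapper,
                release: release_wrapper,
                // Double boxing yields a thin pointer that fits in `*mut c_void`.
                private_data: Box::into_raw(Box::new(inner)) as *mut c_void,
            }
        }
    }

    impl Drop for FFI_RowCounter {
        fn drop(&mut self) {
            unsafe { (self.release)(self) }
        }
    }

    /// Receiver side: wraps the FFI struct and implements the trait again.
    pub struct ForeignRowCounter(FFI_RowCounter);

    impl From<FFI_RowCounter> for ForeignRowCounter {
        fn from(ffi: FFI_RowCounter) -> Self {
            Self(ffi)
        }
    }

    impl RowCounter for ForeignRowCounter {
        fn row_count(&self) -> u64 {
            // Every call goes through the provider's function pointer.
            unsafe { (self.0.row_count)(&self.0) }
        }
    }

    /// A trivial implementation used to exercise the sketch.
    pub struct FixedRows(pub u64);

    impl RowCounter for FixedRows {
        fn row_count(&self) -> u64 {
            self.0
        }
    }

In this sketch a provider library would build ``FFI_RowCounter::new(Box::new(FixedRows(42)))``
and hand it across the boundary. The receiver converts it with ``.into()`` and then uses
the ordinary ``RowCounter`` trait without knowing how the provider was compiled.
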
| 157 | +In order to share these FFI structures, we need to wrap them in some kind of Python object |
| 158 | +that can be used to interface from one package to another. As described in the above |
| 159 | +section on our inspiration from Arrow, we use ``PyCapsule``. We can create a ``PyCapsule`` |
| 160 | +for our provider as follows:
| 161 | + |
| 162 | +.. code-block:: rust |
| 163 | +
|
| 164 | + let name = CString::new("datafusion_table_provider")?; |
| 165 | + let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?; |
| 166 | +
|
| 167 | +On the receiving side, turn this ``PyCapsule`` object into the ``FFI_TableProvider``, which
| 168 | +can then be turned into a ``ForeignTableProvider``. The associated code is:
| 169 | + |
| 170 | +.. code-block:: rust |
| 171 | +
|
| 172 | + let capsule = capsule.downcast::<PyCapsule>()?; |
| 173 | + let provider = unsafe { capsule.reference::<FFI_TableProvider>() }; |
| 174 | +
|
| 175 | +By convention the ``datafusion-python`` library expects a Python object that has a |
| 176 | +``TableProvider`` PyCapsule to have this capsule accessible by calling a method named
| 177 | +``__datafusion_table_provider__``. You can see a complete working example of how to
| 178 | +share a ``TableProvider`` from one Python library to DataFusion Python in the
| 179 | +`repository examples folder <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_. |
| 180 | + |
| 181 | +This section has been written using ``TableProvider`` as an example. It is the first |
| 182 | +extension that has been written using this approach and the most thoroughly implemented. |
| 183 | +As we continue to expose more of the DataFusion features, we intend to follow this same |
| 184 | +design pattern. |
| 185 | + |
| 186 | +Alternative Approach |
| 187 | +-------------------- |
| 188 | + |
| 189 | +Suppose you needed to expose some other features of DataFusion and you could not wait |
| 190 | +for the upstream repository to implement the FFI approach we describe. In this case,
| 191 | +you might decide to depend directly on the ``datafusion-python`` crate instead.
| 192 | + |
| 193 | +As we discussed, this is not guaranteed to work across different compiler versions and |
| 194 | +optimization levels. If you wish to go down this route, there are two approaches we |
| 195 | +have identified you can use. |
| 196 | + |
| 197 | +#. Re-export all of ``datafusion-python`` yourself with your extensions built in. |
| 198 | +#. Carefully synchronize your software releases with the ``datafusion-python`` CI build
| 199 | + system so that your libraries use the exact same compiler, features, and |
| 200 | + optimization level. |
| 201 | + |
| 202 | +We currently do not recommend either of these approaches as they are difficult to |
| 203 | +maintain over a long period. Additionally, they require a tight version coupling |
| 204 | +between libraries. |
| 205 | + |
| 206 | +Status of Work |
| 207 | +-------------- |
| 208 | + |
| 209 | +At the time of this writing, the FFI features are under active development. To see |
| 210 | +the latest status, we recommend reviewing the code in the `datafusion-ffi`_ crate. |
| 211 | + |
| 212 | +.. _datafusion-ffi: https://crates.io/crates/datafusion-ffi |