Commit e6f6e66

Add user documentation for the FFI approach (#1031)
* Initial commit for FFI user documentation
* Update readme to point to the online documentation. Fix a small typo.
* Small text adjustments for clarity and formatting
1 parent 3584bec commit e6f6e66

3 files changed

+220
-4
lines changed

README.md

Lines changed: 7 additions & 4 deletions
@@ -30,10 +30,8 @@ DataFusion's Python bindings can be used as a foundation for building new data s
3030
planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
3131
- [DataFusion Ballista](https://github.com/apache/datafusion-ballista) is a distributed SQL query engine that extends
3232
DataFusion's Python bindings for distributed use cases.
33-
34-
It is also possible to use these Python bindings directly for DataFrame and SQL operations, but you may find that
35-
[Polars](http://pola.rs/) and [DuckDB](http://www.duckdb.org/) are more suitable for this use case, since they have
36-
more of an end-user focus and are more actively maintained than these Python bindings.
33+
- [DataFusion Ray](https://github.com/apache/datafusion-ray) is another distributed query engine that uses
34+
DataFusion's Python bindings.
3735

3836
## Features
3937

@@ -114,6 +112,11 @@ Printing the context will show the current configuration settings.
114112
print(ctx)
115113
```
116114

115+
## Extensions
116+
117+
For information about how to extend DataFusion Python, please see the extensions page of the
118+
[online documentation](https://datafusion.apache.org/python/).
119+
117120
## More Examples
118121

119122
See [examples](examples/README.md) for more information.

docs/source/contributor-guide/ffi.rst

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
Python Extensions
19+
=================
20+
21+
The DataFusion in Python project is designed to allow users to extend its functionality in a few core
22+
areas. Ideally many users would like to package their extensions as a Python package and easily
23+
integrate that package with this project. This page serves to describe some of the challenges we face
24+
when doing these integrations and the approach our project uses.
25+
26+
The Primary Issue
27+
-----------------
28+
29+
Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
30+
can then be queried against, similar to how you can register a :ref:`CSV <io_csv>` or
31+
:ref:`Parquet <io_parquet>` file. In DataFusion terminology, you likely want to implement a
32+
:ref:`Custom Table Provider <io_custom_table_provider>`. In an effort to make your data source
33+
as performant as possible and to utilize the features of DataFusion, you may decide to write
34+
your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.
35+
36+
At first glance, it may appear the best way to do this is to add the ``datafusion-python``
37+
crate as a dependency, provide a ``PyTable``, and then to register it with the
38+
``SessionContext``. Unfortunately, this will not work.
39+
40+
When you produce your code as a Python library and it needs to interact with the DataFusion
41+
library, at the lowest level they communicate through an Application Binary Interface (ABI).
42+
The acronym sounds similar to API (Application Programming Interface), but it is distinctly
43+
different.
44+
45+
The ABI sets the standard for how these libraries can share data and functions between each
46+
other. One of the key differences between Rust and other programming languages is that Rust
47+
does not have a stable ABI. What this means in practice is that if you compile a Rust library
48+
with one version of the ``rustc`` compiler and I compile another library to interface with it
49+
but I use a different version of the compiler, there is no guarantee the interface will be
50+
the same.
51+
52+
In practice, this means that a Python library built with ``datafusion-python`` as a Rust
53+
dependency will generally **not** be compatible with the DataFusion Python package, even
54+
if they reference the same version of ``datafusion-python``. If you attempt to do this, it may
55+
work on your local computer if you have built both packages with the same optimizations.
56+
This can sometimes lead to a false expectation that the code will work, but it frequently
57+
breaks the moment you try to use your package against the released packages.
58+
59+
You can find more information about the Rust ABI in the Rust project's
60+
`online documentation <https://doc.rust-lang.org/reference/abi.html>`_.
61+
62+
The FFI Approach
63+
----------------
64+
65+
Rust supports interacting with other programming languages through its Foreign Function
66+
Interface (FFI). The advantage of using the FFI is that it enables you to write data structures
67+
and functions that have a stable ABI. This allows you to use Rust code with C, Python, and
68+
other languages. In fact, the `PyO3 <https://pyo3.rs>`_ library uses the FFI to share data
69+
and functions between Python and Rust.
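
To make the difference concrete, here is a minimal sketch that uses only the Rust standard library. The names ``RustLayout``, ``CLayout``, ``c_layout_sum``, and ``c_layout_size_and_align`` are hypothetical and exist only for this illustration; they are not part of DataFusion or PyO3. The sketch assumes a common target where ``u32`` is 4-byte aligned.

```rust
use std::mem::{align_of, size_of};

/// Default Rust layout: the compiler is free to reorder and pad these fields,
/// and that choice is not guaranteed to match across compiler versions.
#[allow(dead_code)]
pub struct RustLayout {
    pub a: u8,
    pub b: u32,
    pub c: u16,
}

/// `#[repr(C)]` pins the layout to the platform's C rules: fields stay in
/// declaration order with C-style padding, so two separately compiled
/// libraries agree on where each field lives.
#[repr(C)]
pub struct CLayout {
    pub a: u8,
    pub b: u32,
    pub c: u16,
}

/// `extern "C"` pins the calling convention as well, so the function can be
/// called from code built by a different compiler or language.
///
/// # Safety
/// `value` must be null or point to a valid `CLayout`.
pub unsafe extern "C" fn c_layout_sum(value: *const CLayout) -> u64 {
    if value.is_null() {
        return 0;
    }
    let v = unsafe { &*value };
    u64::from(v.a) + u64::from(v.b) + u64::from(v.c)
}

/// Size and alignment of the C-compatible struct on the current target.
pub fn c_layout_size_and_align() -> (usize, usize) {
    (size_of::<CLayout>(), align_of::<CLayout>())
}
```

With C layout rules the three fields sit at offsets 0, 4, and 8, so ``CLayout`` occupies 12 bytes on such a target. Nothing comparable can be relied upon for ``RustLayout``.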
70+
71+
The approach we are taking in the DataFusion in Python project is to incrementally expose
72+
more portions of the DataFusion project via FFI interfaces. This allows users to write Rust
73+
code that does **not** require the ``datafusion-python`` crate as a dependency, expose their
74+
code in Python via PyO3, and have it interact with the DataFusion Python package.
75+
76+
Early adopters of this approach include `delta-rs <https://delta-io.github.io/delta-rs/>`_
77+
which has adapted its Table Provider for use in ``datafusion-python`` with only a few lines
78+
of code. Also, the DataFusion Python project uses the existing definitions from
79+
the `Apache Arrow C Stream Interface <https://arrow.apache.org/docs/format/CStreamInterface.html>`_
80+
to support importing **and** exporting tables. Any Python package that supports reading
81+
the Arrow C Stream interface can work with DataFusion Python out of the box! You can read
82+
more about working with Arrow sources in the :ref:`Data Sources <user_guide_data_sources>`
83+
page.
84+
85+
To learn more about the Foreign Function Interface in Rust, the
86+
`Rustonomicon <https://doc.rust-lang.org/nomicon/ffi.html>`_ is a good resource.
87+
88+
Inspiration from Arrow
89+
----------------------
90+
91+
DataFusion is built upon `Apache Arrow <https://arrow.apache.org/>`_. The canonical Python
92+
Arrow implementation, `pyarrow <https://arrow.apache.org/docs/python/index.html>`_, provides
93+
an excellent way to share Arrow data between Python projects without performing any copy
94+
operations on the data. They do this by using a well defined set of interfaces. You can
95+
find the details about their stream interface
96+
`here <https://arrow.apache.org/docs/format/CStreamInterface.html>`_. The
97+
`Rust Arrow Implementation <https://github.com/apache/arrow-rs>`_ also supports these
98+
``C`` style definitions via the Foreign Function Interface.
99+
100+
In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow``
101+
goes one step further to make sharing the interfaces easier in Python. They do this
102+
by exposing PyCapsules that contain the expected functionality.
103+
104+
You can learn more about PyCapsules from the official
105+
`Python online documentation <https://docs.python.org/3/c-api/capsule.html>`_. PyCapsules
106+
have excellent support in PyO3 already. The
107+
`PyO3 online documentation <https://pyo3.rs/main/doc/pyo3/types/struct.pycapsule>`_ is a good source
108+
for more details on using PyCapsules in Rust.
109+
110+
Two lessons we leverage from the Arrow project in DataFusion Python are:
111+
112+
- We reuse the existing Arrow FFI functionality wherever possible.
113+
- We expose PyCapsules that contain an FFI-stable struct.
114+
115+
Implementation Details
116+
----------------------
117+
118+
The bulk of the code necessary to perform our FFI operations is in the upstream
119+
`DataFusion <https://datafusion.apache.org/>`_ core repository. You can review the code and
120+
documentation in the `datafusion-ffi`_ crate.
121+
122+
Our FFI implementation is narrowly focused on sharing data and functions with Rust-backed
123+
libraries. This allows us to use the `abi_stable crate <https://crates.io/crates/abi_stable>`_.
124+
This is an excellent crate that allows for easy conversion between Rust native types
125+
and FFI-safe alternatives. For example, if you needed to pass a ``Vec<String>`` via FFI,
126+
you can simply convert it to an ``RVec<RString>`` in an intuitive manner. It also supports
127+
features like ``RResult`` and ``ROption`` that do not have an obvious translation to a
128+
C equivalent.
129+
130+
The `datafusion-ffi`_ crate has been designed to make it easy to convert from DataFusion
131+
traits into their FFI counterparts. For example, if you have defined a custom
132+
`TableProvider <https://docs.rs/datafusion/45.0.0/datafusion/catalog/trait.TableProvider.html>`_
133+
and you want to create a sharable FFI counterpart, you could write:
134+
135+
.. code-block:: rust

    let my_provider = MyTableProvider::default();
    let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
139+
140+
If you were interfacing with a library that provided the above ``FFI_TableProvider`` and
141+
you needed to turn it back into a ``TableProvider``, you can convert it into a
142+
``ForeignTableProvider``, which implements the ``TableProvider`` trait.
143+
144+
.. code-block:: rust

    let foreign_provider: ForeignTableProvider = ffi_provider.into();
147+
148+
If you review the code in `datafusion-ffi`_ you will find that each of the traits we share
149+
across the boundary has two portions, one with a ``FFI_`` prefix and one with a ``Foreign``
150+
prefix. This is used to distinguish which side of the FFI boundary that struct is
151+
designed to be used on. The structures with the ``FFI_`` prefix are to be used by the
152+
**provider** of the structure. In the example we're showing, this means the code that has
153+
written the underlying ``TableProvider`` implementation to access your custom data source.
154+
The structures with the ``Foreign`` prefix are to be used by the receiver. In this case,
155+
it is the ``datafusion-python`` library.
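
The split between the two prefixes can be sketched with the standard library alone. Everything below is a simplified, hypothetical stand-in: ``RowCounter`` plays the role of a DataFusion trait, and ``FFI_RowCounter`` and ``ForeignRowCounter`` only mimic the naming convention. The real structures in `datafusion-ffi`_ carry many more function pointers and use ``abi_stable`` types, so treat this as an illustration of the idea rather than the crate's actual code.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in for a DataFusion trait such as `TableProvider`.
pub trait RowCounter {
    fn row_count(&self) -> u64;
}

/// Provider side: a C-layout struct holding function pointers and opaque data.
#[repr(C)]
#[allow(non_camel_case_types)]
pub struct FFI_RowCounter {
    row_count: unsafe extern "C" fn(this: &FFI_RowCounter) -> u64,
    release: unsafe extern "C" fn(this: &mut FFI_RowCounter),
    private_data: *mut c_void,
}

unsafe extern "C" fn row_count_wrapper(this: &FFI_RowCounter) -> u64 {
    // Only the provider's own code interprets `private_data`.
    let inner = this.private_data as *const Box<dyn RowCounter>;
    unsafe { (*inner).row_count() }
}

unsafe extern "C" fn release_wrapper(this: &mut FFI_RowCounter) {
    if !this.private_data.is_null() {
        // Reclaim the allocation made in `FFI_RowCounter::new`.
        drop(unsafe { Box::from_raw(this.private_data as *mut Box<dyn RowCounter>) });
        this.private_data = std::ptr::null_mut();
    }
}

impl FFI_RowCounter {
    pub fn new(inner: Box<dyn RowCounter>) -> Self {
        Self {
            row_count: row_count_wrapper,
            release: release_wrapper,
            // Double boxing yields a thin pointer that fits in `*mut c_void`.
            private_data: Box::into_raw(Box::new(inner)) as *mut c_void,
        }
    }
}

impl Drop for FFI_RowCounter {
    fn drop(&mut self) {
        let release = self.release;
        unsafe { release(self) }
    }
}

/// Receiver side: wraps the FFI struct and implements the trait again by
/// calling through the function pointers, never touching `private_data`.
pub struct ForeignRowCounter(FFI_RowCounter);

impl From<FFI_RowCounter> for ForeignRowCounter {
    fn from(ffi: FFI_RowCounter) -> Self {
        Self(ffi)
    }
}

impl RowCounter for ForeignRowCounter {
    fn row_count(&self) -> u64 {
        unsafe { (self.0.row_count)(&self.0) }
    }
}

/// A toy implementation owned by the provider library.
struct MyCounter {
    rows: u64,
}

impl RowCounter for MyCounter {
    fn row_count(&self) -> u64 {
        self.rows
    }
}

/// Provider wraps its implementation; receiver converts it and uses the trait.
pub fn round_trip(rows: u64) -> u64 {
    let ffi = FFI_RowCounter::new(Box::new(MyCounter { rows }));
    let foreign: ForeignRowCounter = ffi.into();
    foreign.row_count()
}
```

In this sketch the receiver depends only on the ``#[repr(C)]`` struct layout and the ``extern "C"`` function pointers, never on the concrete ``MyCounter`` type. That is why the two sides can be compiled separately.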
156+
157+
In order to share these FFI structures, we need to wrap them in some kind of Python object
158+
that can be used to interface from one package to another. As described in the above
159+
section on our inspiration from Arrow, we use ``PyCapsule``. We can create a ``PyCapsule``
160+
for our provider as follows:
161+
162+
.. code-block:: rust

    let name = CString::new("datafusion_table_provider")?;
    let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?;
166+
167+
On the receiving side, turn this ``PyCapsule`` object into the ``FFI_TableProvider``, which
168+
can then be turned into a ``ForeignTableProvider``. The associated code is:
169+
170+
.. code-block:: rust

    let capsule = capsule.downcast::<PyCapsule>()?;
    let provider = unsafe { capsule.reference::<FFI_TableProvider>() };
174+
175+
By convention the ``datafusion-python`` library expects a Python object that has a
176+
``TableProvider`` PyCapsule to have this capsule accessible by calling a function named
177+
``__datafusion_table_provider__``. You can see a complete working example of how to
178+
share a ``TableProvider`` from one Python library to DataFusion Python in the
179+
`repository examples folder <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_.
180+
181+
This section has been written using ``TableProvider`` as an example. It is the first
182+
extension that has been written using this approach and the most thoroughly implemented.
183+
As we continue to expose more of the DataFusion features, we intend to follow this same
184+
design pattern.
185+
186+
Alternative Approach
187+
--------------------
188+
189+
Suppose you needed to expose some other features of DataFusion and you could not wait
190+
for the upstream repository to implement the FFI approach we describe. In this case
191+
you might decide to depend directly on the ``datafusion-python`` crate instead.
192+
193+
As we discussed, this is not guaranteed to work across different compiler versions and
194+
optimization levels. If you wish to go down this route, there are two approaches we
195+
have identified you can use.
196+
197+
#. Re-export all of ``datafusion-python`` yourself with your extensions built in.
198+
#. Carefully synchronize your software releases with the ``datafusion-python`` CI build
199+
system so that your libraries use the exact same compiler, features, and
200+
optimization level.
201+
202+
We currently do not recommend either of these approaches as they are difficult to
203+
maintain over a long period. Additionally, they require a tight version coupling
204+
between libraries.
205+
206+
Status of Work
207+
--------------
208+
209+
At the time of this writing, the FFI features are under active development. To see
210+
the latest status, we recommend reviewing the code in the `datafusion-ffi`_ crate.
211+
212+
.. _datafusion-ffi: https://crates.io/crates/datafusion-ffi

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@ Example
8585
:caption: CONTRIBUTOR GUIDE
8686

8787
contributor-guide/introduction
88+
contributor-guide/ffi
8889

8990
.. _toc.api:
9091
.. toctree::
