[PoC] Allow JIT compilation with an internal API #6

Open
wants to merge 35 commits into
base: main

datapythonista
Owner

The approach here is to add a jit parameter to any pandas function where JIT compilation could make sense (DataFrame.apply, Series.map, SeriesGroupBy.transform...) and have it delegate 100% of the logic to the JIT compiler (Numba or Bodo).

The final user API would look like:

df.apply(lambda x: x.A + x.B, axis=1, jit=bodo.jit(parallel=True))

I think this is very simple and intuitive. At the same time, it makes users import numba and bodo themselves, which creates the right impression: they are using those libraries to JIT compile, and JIT compilation is not something provided by pandas. At least that's my expectation; maybe others disagree.

I think this approach is very convenient for the pandas team, as maintaining the changes in pandas is trivial. It should also be very convenient for Bodo, which would not depend on reviews and decisions from pandas, since Bodo would maintain all the logic. Also, Bodo can probably release much faster than pandas does, speeding up the delivery of new features and bug fixes.

The exact internal API (the __pandas_udf__ function in this PR) can probably be improved by Bodo (and Numba). But it is probably better to first discuss whether this is the approach we want to implement, and then discuss the details of the exact API.
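
To make the delegation idea concrete, here is a rough, pandas-free sketch of how a jit argument could hand the whole operation to an engine through a __pandas_udf__ hook. Only the name __pandas_udf__ comes from this PR; its signature here, the FakeJit class, and the apply_rows helper are hypothetical stand-ins invented for illustration, not the actual pandas, Numba or Bodo code.

```python
# Hypothetical sketch. Only the hook name __pandas_udf__ comes from the PR;
# its signature, FakeJit, and apply_rows are assumptions for illustration.


class FakeJit:
    """Stand-in for a JIT decorator object (e.g. what bodo.jit(...) might return)."""

    def __init__(self):
        self.calls = 0

    def __pandas_udf__(self, method_name, data, func, args, kwargs):
        # A real engine would compile `func` and run the whole operation itself.
        # Here we run it in plain Python just to show the control flow.
        self.calls += 1
        if method_name == "apply_rows":
            return [func(row, *args, **kwargs) for row in data]
        raise NotImplementedError(method_name)


def apply_rows(rows, func, jit=None, args=(), **kwargs):
    """Toy stand-in for DataFrame.apply(axis=1) over a list of dict rows."""
    if jit is not None:
        if not hasattr(jit, "__pandas_udf__"):
            raise ValueError("jit object must implement __pandas_udf__")
        # Delegate 100% of the logic to the engine.
        return jit.__pandas_udf__("apply_rows", rows, func, args, kwargs)
    # Default path: plain Python, no JIT involved.
    return [func(row, *args, **kwargs) for row in rows]


rows = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
engine = FakeJit()
result = apply_rows(rows, lambda x: x["A"] + x["B"], jit=engine)
```

The point of the sketch is that the pandas side stays tiny: it only checks for the hook and forwards the call, while everything else lives in the engine.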

datapythonista and others added 30 commits March 2, 2025 17:43
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.9.4 → v0.9.9](astral-sh/ruff-pre-commit@v0.9.4...v0.9.9)
- [github.com/PyCQA/isort: 6.0.0 → 6.0.1](PyCQA/isort@6.0.0...6.0.1)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* ENH: Add HalfYear offsets

* Add entry to whatsnew

* Resolve cython typing issue
* test_datetimes.py: fix literal string

* fix test

* fix repeated whitespace

* add whatsnew entry

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix Styler.to_latex to be in Writer column
changed "normalise" to "normalize"
* Updated set_index doc with a warning

* Updated set_index parameter append along with an example

* Updated set_index example for append

* Updated set_index example
* BUG: Recognize chained fsspec URLs

* Add whatsnew note

* Rename regex variable appropriately and allow more complex chaining

* Fix pre-commit
Remove bogus syntax highlighting on LICENSE in overview.rst
)

* Add inference type info to apply

* DOC: Add inference type information to Dataframe Apply
* DOC: Add link description

Also remove errant space

* fix line too long

* Undo space removal
* Modify an existing test to cover the issue with na_pos > 128.

* Change na_position type from int8_t and int64_t consistently to Py_ssize_t.

* Add What's New entry.

* Sort whatsnew entries alphabetically

* Improve the whatsnew entry.

* Move whatsnew entry from v2.3.0.rst to v3.0.0.rst.

* Update doc/source/whatsnew/v3.0.0.rst

Co-authored-by: Matthew Roeschke <[email protected]>

* Undo remove '-'.

* Sort whatsnew entries alphabetically.

---------

Co-authored-by: avm19 <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
…ev#60983)

* modified the files according to bug#60237

* Update doc/source/whatsnew/v3.0.0.rst

Co-authored-by: Matthew Roeschke <[email protected]>

* moved test case to frame and serier folders

* fix pyarrow import error

* inconsistent issue fix

* added test cases and fixed old pr test cases

* added rst and small changes in tests file

* fixed column name issue for column wise concat

* fixed text case for concat

* fix test cases issue

* Trigger redeployment

* fixed reviewed changes and added extra test cases

* removed duplicate test case

---------

Co-authored-by: Matthew Roeschke <[email protected]>
)

* BUG: Fix OverflowError in lib.maybe_indices_to_slice()

This fixes this error when slicing massive dataframes:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4093, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4155, in _getitem_bool_array
    return self._take_with_is_copy(indexer, axis=0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4153, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4133, in take
    new_data = self._mgr.take(
               ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 893, in take
    new_labels = self.axes[axis].take(indexer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/datetimelike.py", line 839, in take
    maybe_slice = lib.maybe_indices_to_slice(indices, len(self))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 522, in pandas._libs.lib.maybe_indices_to_slice
OverflowError: value too large to convert to int

* Sort whatsnew entries

* Set type hint back to int

---------

Co-authored-by: benjamindonnachie <[email protected]>
datapythonista and others added 5 commits March 10, 2025 23:43
* ENH: Add Rolling.nunique()

* Add docstring for Expanding.nunique()

* Add a test for float precision issues
* DOC: Add doc for half year offsets

* Fix freq strings

* Fix docstring error

* Fix more docstring errors