Bitmask Backed MaskedArray #54506

Closed
wants to merge 142 commits into from
142 commits
1832617
initial build
WillAyd Aug 11, 2023
b69c00f
removed cpplint
WillAyd Aug 11, 2023
64b0f01
checkpoint
WillAyd Aug 11, 2023
e5238d9
Passing test suite
WillAyd Aug 11, 2023
b63b671
revert modifications to nanoarrow
WillAyd Aug 11, 2023
fe31993
force vendor
WillAyd Aug 11, 2023
a39581b
more to_numpy adds
WillAyd Aug 11, 2023
cb1b274
Revert "more to_numpy adds"
WillAyd Aug 12, 2023
dabe1b6
implement __or__
WillAyd Aug 12, 2023
28f7ab1
checkpoint
WillAyd Aug 12, 2023
43f3cbc
more cleanups
WillAyd Aug 12, 2023
902cef9
groupby support
WillAyd Aug 12, 2023
4d4ebfe
prep for 2d
WillAyd Aug 13, 2023
2898bb1
support 2D
WillAyd Aug 13, 2023
108a86c
fix numeric
WillAyd Aug 13, 2023
8decf2a
temp pass for CI
WillAyd Aug 13, 2023
3da7aa2
fixed negative indexing
WillAyd Aug 13, 2023
11467c7
fixed copying
WillAyd Aug 13, 2023
d91fb8e
Working
WillAyd Aug 13, 2023
757605c
cleanups
WillAyd Aug 13, 2023
6fbbad8
fix
WillAyd Aug 13, 2023
b9723ab
cleanups and some performance boosts
WillAyd Aug 14, 2023
3f60cd0
perf boost
WillAyd Aug 14, 2023
3b8921a
perf boost
WillAyd Aug 14, 2023
999e743
more performance
WillAyd Aug 14, 2023
2b764ce
better perf
WillAyd Aug 14, 2023
74548e8
code and typing cleanups
WillAyd Aug 14, 2023
8496e03
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 14, 2023
dce8002
refactor and lower level invert/or implementation
WillAyd Aug 14, 2023
e641fed
Mass append nanoarrow for buffer performance
WillAyd Aug 14, 2023
109dd57
delete duplicative struct members
WillAyd Aug 14, 2023
35f3b9c
fix pickling
WillAyd Aug 14, 2023
25e3c51
nanoarrow typo fixups
WillAyd Aug 14, 2023
82e082e
vectorized to_numpy()
WillAyd Aug 17, 2023
c140af4
sum impl
WillAyd Aug 17, 2023
86ce656
any impl
WillAyd Aug 17, 2023
03b1661
updated cython typing
WillAyd Aug 17, 2023
e9d4da4
remove bad __or__ impl
WillAyd Aug 17, 2023
1993e96
fix __or__
WillAyd Aug 17, 2023
5b5faa3
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 17, 2023
10ce5ca
removed faulty inversion
WillAyd Aug 17, 2023
e8b7819
more performant bit unpacking
WillAyd Aug 17, 2023
9fdb652
try non-shift nanoarrow packing
WillAyd Aug 18, 2023
17059cb
Remove to_numpy + copy chains
WillAyd Aug 18, 2023
c5a3584
higher performance dunders
WillAyd Aug 18, 2023
28b589f
updated typing
WillAyd Aug 18, 2023
e3618fb
consolidated to_numpy()
WillAyd Aug 18, 2023
4c82771
fixups
WillAyd Aug 18, 2023
633935d
deferred to_numpy() calls in boolean
WillAyd Aug 20, 2023
6ed2c55
test fix
WillAyd Aug 20, 2023
d8e715d
take and copy implementations
WillAyd Aug 20, 2023
37ccec3
small optimization
WillAyd Aug 20, 2023
5436b04
simplified buf passing and fixed bugs
WillAyd Aug 20, 2023
b4aa12d
setitem fastpaths
WillAyd Aug 20, 2023
8c5cd15
cython < 3 compat
WillAyd Aug 21, 2023
e904e18
Revert "simplified buf passing and fixed bugs"
WillAyd Aug 21, 2023
c218e51
implemented all
WillAyd Aug 21, 2023
9cf54f9
faster any
WillAyd Aug 21, 2023
4f6d035
faster all implementation
WillAyd Aug 21, 2023
dca1c65
faster reshape
WillAyd Aug 21, 2023
946c892
Faster is_null_slice implementation
WillAyd Aug 21, 2023
6c2d590
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 21, 2023
1eb0e01
revert troublesome __getitem__ enhancements
WillAyd Aug 21, 2023
07594d6
typo fixup
WillAyd Aug 21, 2023
d30b613
finish revert
WillAyd Aug 21, 2023
34ac613
reshape fast path
WillAyd Aug 21, 2023
44aae25
fix is_null_slice
WillAyd Aug 21, 2023
8b72d09
fix indexer perf boost
WillAyd Aug 21, 2023
45d1cf0
less to_numpy()
WillAyd Aug 21, 2023
68b7191
make bitmaskarray iterable
WillAyd Aug 21, 2023
685f481
typing cleanups
WillAyd Aug 21, 2023
82826e9
boolean fixes
WillAyd Aug 21, 2023
78e4245
perf in take
WillAyd Aug 22, 2023
5f26ff1
Merge branch 'main' into bitmask-backed
WillAyd Aug 22, 2023
69c51c2
fixed typing
WillAyd Aug 22, 2023
404268f
rework pickling
WillAyd Aug 22, 2023
28dd82d
fixed attribute lookup
WillAyd Aug 22, 2023
f0bc4a2
More efficient invert
WillAyd Aug 22, 2023
b6ae9bb
doc fix
WillAyd Aug 23, 2023
e1825ae
Have invert return BitMaskArray
WillAyd Aug 23, 2023
5211e2e
Implemented Bitmask Concatenate
WillAyd Aug 23, 2023
cfa3b93
bitmask_any moved to algorithms
WillAyd Aug 23, 2023
6df2930
more algorithms
WillAyd Aug 23, 2023
06f3b01
C-implemented take / putmask
WillAyd Aug 23, 2023
9a61874
clean up calling conventions
WillAyd Aug 23, 2023
cd27943
fix off by one
WillAyd Aug 23, 2023
dc54ca0
make mypy happy
WillAyd Aug 23, 2023
3794ec5
fix bug moving cursor when crossing byte boundary
WillAyd Aug 23, 2023
274a7b5
pedantic cleanups
WillAyd Aug 23, 2023
4b06038
off by one fix
WillAyd Aug 23, 2023
6265784
Revert "fix bug moving cursor when crossing byte boundary"
WillAyd Aug 23, 2023
cea82a5
concatenate bug fix
WillAyd Aug 24, 2023
0d529e8
fixed bounds issues
WillAyd Aug 25, 2023
e80e709
Revert "fixed bounds issues"
WillAyd Aug 25, 2023
8689c99
faster impl
WillAyd Aug 25, 2023
4ed1875
move condition out of loop
WillAyd Aug 25, 2023
24c3814
memory benchmark
WillAyd Aug 25, 2023
5ad8964
use c standard malloc/free
WillAyd Aug 25, 2023
96200e3
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 25, 2023
b64ba05
added repr for bitmaskarray
WillAyd Aug 25, 2023
a51dfe9
more tests and better repr
WillAyd Aug 25, 2023
34d4ffc
BitMask -> bitmask
WillAyd Aug 25, 2023
0d78ac3
fix error type
WillAyd Aug 25, 2023
d40a1d8
less to_numpy
WillAyd Aug 25, 2023
e81dcc1
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 25, 2023
35da3f6
licenses
WillAyd Aug 25, 2023
5b7d0c2
typing fixes
WillAyd Aug 25, 2023
fa6f6cc
Merge branch 'main' into bitmask-backed
WillAyd Aug 26, 2023
e08a647
buffer protocol implementation for BitmaskArray
WillAyd Aug 26, 2023
e987daf
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 28, 2023
a0d538a
fixups
WillAyd Aug 28, 2023
9a97677
getitem fastpath for slice
WillAyd Aug 28, 2023
96f080d
mypy fix
WillAyd Aug 28, 2023
e35b769
fix OOB memcpy
WillAyd Aug 28, 2023
8149e03
fix slicing issue with memview
WillAyd Aug 28, 2023
202de07
fixups
WillAyd Aug 28, 2023
73f438c
fixed memory issues with getitem fastpath
WillAyd Aug 28, 2023
e09743f
fix copy
WillAyd Aug 28, 2023
ddcdc94
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 29, 2023
3303be7
win/32bit support
WillAyd Aug 29, 2023
29873e4
NumPy compat
WillAyd Aug 29, 2023
173b4cb
test restructure
WillAyd Aug 29, 2023
a1278a9
more performance
WillAyd Aug 30, 2023
bc772c3
bugfix with all refactor
WillAyd Aug 30, 2023
1c637a1
less to_numpy()
WillAyd Aug 30, 2023
3dfe668
Error message cleanups
WillAyd Aug 30, 2023
5e9f08c
re-enable cpplint
WillAyd Aug 30, 2023
97da641
updated pre-commit
WillAyd Aug 30, 2023
a3dca8a
Fix typing issues
WillAyd Aug 30, 2023
1f77d9a
more cleanups
WillAyd Aug 31, 2023
22148de
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 31, 2023
afef21e
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Aug 31, 2023
d1bd251
Merge branch 'main' into bitmask-backed
WillAyd Sep 5, 2023
7b4810d
Merge branch 'main' into bitmask-backed
WillAyd Sep 6, 2023
6a56ec1
remove cast
WillAyd Sep 6, 2023
23fb76d
less diff
WillAyd Sep 6, 2023
3eeaa12
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Sep 10, 2023
dfd2b57
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Sep 15, 2023
3fb26ec
reverted cythonized is_null_slice
WillAyd Sep 15, 2023
541de2e
remove xfail of test
WillAyd Sep 15, 2023
34bc194
change assert to ignore
WillAyd Sep 15, 2023
c6abf22
Merge remote-tracking branch 'upstream/main' into bitmask-backed
WillAyd Oct 3, 2023
10 changes: 9 additions & 1 deletion .pre-commit-config.yaml
@@ -46,6 +46,10 @@ repos:
   - id: codespell
     types_or: [python, rst, markdown, cython, c]
     additional_dependencies: [tomli]
+    exclude: |
+      (?x)
+      ^pandas/_libs/include/pandas/vendored/nanoarrow.h
+      |pandas/_libs/src/vendored/nanoarrow.c
 - repo: https://github.com/MarcoGorelli/cython-lint
   rev: v0.15.0
   hooks:
@@ -74,7 +78,11 @@ repos:
   rev: 1.6.1
   hooks:
   - id: cpplint
-    exclude: ^pandas/_libs/include/pandas/vendored/klib
+    exclude: |
+      (?x)
+      ^pandas/_libs/include/pandas/vendored/klib
+      |pandas/_libs/include/pandas/vendored/nanoarrow.h
+      |pandas/_libs/src/vendored/nanoarrow.c
     args: [
       --quiet,
       '--extensions=c,h',
11 changes: 11 additions & 0 deletions asv_bench/benchmarks/array.py
@@ -28,6 +28,17 @@ def time_from_float_array(self):
         pd.array(self.values_float, dtype="boolean")


+class BooleanArrayMem:
+    def setup_cache(self):
+        N = 250_000
+        data = np.array([True] * N)
+        mask = np.array([False] * N)
+        return [pd.arrays.BooleanArray(data, mask)] * 500
+
+    def peakmem_array(self, arrays):
+        return [~x for x in arrays]
+
+
 class IntegerArray:
     def setup(self):
         N = 250_000
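
For a sense of scale in the `BooleanArrayMem` benchmark above, here is a rough sketch of how much memory one mask of `N = 250_000` entries needs when stored one byte per entry versus one bit per entry. It uses only NumPy's `packbits`; the variable names are illustrative and do not come from the PR, and the PR's own bitmask storage may differ in detail.

```python
import numpy as np

N = 250_000  # same element count as the benchmark

# One byte per entry: how NumPy stores a boolean array.
byte_mask = np.zeros(N, dtype=bool)

# One bit per entry: eight mask entries packed into each byte.
bit_mask = np.packbits(byte_mask)

byte_mask_nbytes = byte_mask.nbytes
bit_mask_nbytes = bit_mask.nbytes

print(byte_mask_nbytes, bit_mask_nbytes)
```

Under these assumptions the packed mask takes one eighth of the space, which is the kind of difference a peak-memory benchmark over 500 inverted arrays would be expected to surface.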
35 changes: 35 additions & 0 deletions pandas/_libs/arrays.pyi
@@ -3,10 +3,13 @@ from typing import Sequence
 import numpy as np

 from pandas._typing import (
+    ArrayLike,
     AxisInt,
     DtypeObj,
+    PositionalIndexer,
     Self,
     Shape,
+    type_t,
 )

 class NDArrayBacked:
@@ -38,3 +41,35 @@ class NDArrayBacked:
     def _concat_same_type(
         cls, to_concat: Sequence[Self], axis: AxisInt = ...
     ) -> Self: ...
+
+class BitmaskArray:
+    parent: Self
+    def __init__(self, data: np.ndarray | Self) -> None: ...
+    def __len__(self) -> int: ...
+    def __setitem__(self, key: PositionalIndexer, value: ArrayLike | bool) -> None: ...
+    def __getitem__(self, key: PositionalIndexer) -> bool: ...
+    def __invert__(self) -> Self: ...
+    def __and__(self, other: np.ndarray | Self | bool) -> np.ndarray: ...
+    def __or__(self, other: np.ndarray | Self | bool) -> np.ndarray: ...
+    def __xor__(self, other: np.ndarray | Self | bool) -> np.ndarray: ...
+    def __getstate__(self) -> dict: ...
+    def __setstate__(self, other: dict) -> None: ...
+    def __iter__(self): ...
+    @classmethod
+    def concatenate(cls, objs: list[Self], axis: int) -> Self: ...
+    @property
+    def size(self) -> int: ...
+    @property
+    def nbytes(self) -> int: ...
+    @property
+    def bytes(self) -> bytes: ...
+    @property
+    def shape(self) -> tuple[int, ...]: ...
+    @property
+    def dtype(self) -> type_t[bool]: ...
+    def any(self) -> bool: ...
+    def all(self) -> bool: ...
+    def sum(self) -> int: ...
+    def take_1d(self, indices: np.ndarray, axis: int) -> Self: ...
+    def copy(self) -> Self: ...
+    def to_numpy(self) -> np.ndarray: ...
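
To illustrate the idea behind a few of the methods in the stub above, here is a toy, pure-NumPy sketch of a one-dimensional bit-packed mask. `PackedMask` is a hypothetical name and is not the PR's `BitmaskArray`, which is implemented in Cython/C on top of nanoarrow. The least-significant-bit-first packing is an assumption made for the example.

```python
import numpy as np


class PackedMask:
    """Toy 1-D bit-packed boolean mask (hypothetical; not the PR's class)."""

    def __init__(self, data: np.ndarray) -> None:
        data = np.asarray(data, dtype=bool)
        self._length = data.size
        # Pack eight entries per byte, least significant bit first (assumed).
        self._bits = np.packbits(data, bitorder="little")

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, key: int) -> bool:
        if key < 0:
            key += self._length
        if not 0 <= key < self._length:
            raise IndexError("index out of bounds")
        # Locate the byte, then the bit within that byte.
        byte, bit = divmod(key, 8)
        return bool((self._bits[byte] >> bit) & 1)

    @property
    def nbytes(self) -> int:
        return self._bits.nbytes

    def any(self) -> bool:
        # Padding bits are zero, so they cannot cause a false positive.
        return bool(self._bits.any())

    def sum(self) -> int:
        return int(self.to_numpy().sum())

    def to_numpy(self) -> np.ndarray:
        # Unpack exactly `_length` bits, dropping the zero padding.
        unpacked = np.unpackbits(
            self._bits, count=self._length, bitorder="little"
        )
        return unpacked.astype(bool)


# Nine entries need two bytes of packed storage.
values = np.array([True, False, True, False, False, False, False, False, True])
mask = PackedMask(values)
print(len(mask), mask.nbytes, mask[0], mask[-1], mask.sum(), mask.any())
```

The sketch omits `__setitem__`, the bitwise operators, 2-D support and `take_1d`, all of which the stub declares. It is only meant to show why element access needs a byte index plus a bit offset, and why `to_numpy()` has to trim padding bits.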