| 1 | +# PDEP-5: NoRowIndex |
| 2 | + |
| 3 | +- Created: 14 November 2022 |
| 4 | +- Status: Withdrawn |
| 5 | +- Discussion: [#49693](https://github.com/pandas-dev/pandas/pull/49693) |
| 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) |
| 7 | +- Revision: 2 |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like |
| 12 | +a ``RangeIndex``, but some methods would be stricter. This would be one |
| 13 | +step towards letting users who do not want to think about indices avoid them entirely. |
| 14 | + |
| 15 | +## Motivation |
| 16 | + |
| 17 | +The Index can be a source of confusion and frustration for pandas users. For example, let's consider the inputs |
| 18 | + |
| 19 | +```python |
| 20 | +In[37]: ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5]) |
| 21 | + |
| 22 | +In[38]: ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4]) |
| 23 | +``` |
| 24 | + |
| 25 | +Then: |
| 26 | + |
| 27 | +- it can be unexpected that adding `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315): |
| 28 | + |
| 29 | + ```python |
| 30 | + In [41]: ser1 + ser2 |
| 31 | + Out[41]: |
| 32 | + 1 20.0 |
| 33 | + 2 30.0 |
| 34 | + 3 40.0 |
| 35 | + 4 NaN |
| 36 | + 5 NaN |
| 37 | + dtype: float64 |
| 38 | + ``` |
| 39 | + |
| 40 | +- concatenation, even with `ignore_index=True`, still aligns on the index (https://github.com/pandas-dev/pandas/issues/25349): |
| 41 | + |
| 42 | + ```python |
| 43 | + In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True) |
| 44 | + Out[42]: |
| 45 | + 0 1 |
| 46 | + 1 10.0 10.0 |
| 47 | + 2 15.0 15.0 |
| 48 | + 3 20.0 20.0 |
| 49 | + 5 25.0 NaN |
| 50 | + 4 NaN 25.0 |
| 51 | + ``` |
| 52 | + |
| 53 | +- it can be frustrating to have to repeatedly call `.reset_index()` (https://twitter.com/chowthedog/status/1559946277315641345): |
| 54 | + |
| 55 | + ```python |
| 56 | + In [3]: ser1.reset_index(drop=True) + ser2.reset_index(drop=True) |
| 57 | + Out[3]: |
| 58 | + 0 20 |
| 59 | + 1 30 |
| 60 | + 2 40 |
| 61 | + 3 50 |
| 62 | + dtype: int64 |
| 63 | + ``` |
| 64 | + |
| 65 | +If a user did not want to think about row labels (which they may have ended up with after slicing / concatenating operations), |
| 66 | +then ``NoRowIndex`` would enable the above to work in a more intuitive |
| 67 | +manner (details and examples to follow below). |
| 68 | + |
| 69 | +## Scope |
| 70 | + |
| 71 | +This proposal deals exclusively with the ``NoRowIndex`` class. To allow users to fully "opt-out" of having to think |
| 72 | +about row labels, the following could also be useful: |
| 73 | +- a ``pd.set_option('mode.no_row_index', True)`` mode which would default to creating new ``DataFrame``s and |
| 74 | + ``Series`` with ``NoRowIndex`` instead of ``RangeIndex``; |
| 75 | +- giving ``as_index`` options to methods which currently create an index |
| 76 | + (e.g. ``value_counts``, ``.sum()``, ``.pivot_table``) to just insert a new column instead of creating an |
| 77 | + ``Index``. |
| 78 | + |
| 79 | +However, neither of the above will be discussed here. |
| 80 | + |
| 81 | +## Detailed Description |
| 82 | + |
| 83 | +The core pandas code would change as little as possible. The additional complexity should be handled |
| 84 | +within the ``NoRowIndex`` object. It would act just like ``RangeIndex``, but would be a bit stricter |
| 85 | +in some cases: |
| 86 | +- `name` could only be `None`; |
| 87 | +- `start` could only be `0`, `step` `1`; |
| 88 | +- when appending one ``NoRowIndex`` to another ``NoRowIndex``, the result would still be ``NoRowIndex``. |
| 89 | + Appending a ``NoRowIndex`` to any other index (or vice-versa) would raise; |
| 90 | +- the ``NoRowIndex`` class would be preserved under slicing; |
| 91 | +- a ``NoRowIndex`` could only be aligned with another ``Index`` if it's also ``NoRowIndex`` and if it's of the same length; |
| 92 | +- ``DataFrame`` columns cannot be `NoRowIndex` (so ``transpose`` would need some adjustments when called on a ``NoRowIndex`` ``DataFrame``); |
| 93 | +- `insert` and `delete` should raise. As a consequence, if ``df`` is a ``DataFrame`` with a |
| 94 | + ``NoRowIndex``, then `df.drop` with `axis=0` would always raise; |
| 95 | +- arithmetic operations (e.g. `NoRowIndex(3) + 2`) would always raise; |
| 96 | +- when printing a ``DataFrame``/``Series`` with a ``NoRowIndex``, the row labels would not be printed; |
| 97 | +- a ``MultiIndex`` could not be created with a ``NoRowIndex`` as one of its levels. |
| 98 | + |
| 99 | +Let's go into more detail for some of these. In the examples that follow, the ``NoRowIndex`` will be passed explicitly, |
| 100 | +but this is not how users would be expected to use it (see "Usage and Impact" section for details). |
| 101 | + |
| 102 | +### NoRowIndex.append |
| 103 | + |
| 104 | +If one has two ``DataFrame``s with ``NoRowIndex``, then one would expect that concatenating them would |
| 105 | +result in a ``DataFrame`` which still has ``NoRowIndex``. To do this, the following rule could be introduced: |
| 106 | + |
| 107 | +> If appending a ``NoRowIndex`` of length ``y`` to a ``NoRowIndex`` of length ``x``, the result will be a |
| 108 | + ``NoRowIndex`` of length ``x + y``. |
| 109 | + |
| 110 | +Example: |
| 111 | + |
| 112 | +```python |
| 113 | +In [6]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2)) |
| 114 | + |
| 115 | +In [7]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1)) |
| 116 | + |
| 117 | +In [8]: df1 |
| 118 | +Out[8]: |
| 119 | + a b |
| 120 | + 1 4 |
| 121 | + 2 5 |
| 122 | + |
| 123 | +In [9]: df2 |
| 124 | +Out[9]: |
| 125 | + a b |
| 126 | + 4 0 |
| 127 | + |
| 128 | +In [10]: pd.concat([df1, df2]) |
| 129 | +Out[10]: |
| 130 | + a b |
| 131 | + 1 4 |
| 132 | + 2 5 |
| 133 | + 4 0 |
| 134 | + |
| 135 | +In [11]: pd.concat([df1, df2]).index |
| 136 | +Out[11]: NoRowIndex(len=3) |
| 137 | +``` |
| 138 | + |
| 139 | +Appending anything other than another ``NoRowIndex`` would raise. |
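The append rule above can be illustrated with a minimal pure-Python sketch. `NoRowIndexSketch` is a hypothetical stand-in and not the class from the draft pull request; it assumes that such an index is fully described by its length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoRowIndexSketch:
    """Hypothetical stand-in for NoRowIndex: only the length is stored."""

    length: int

    def append(self, other):
        # Only another NoRowIndex-like object may be appended; anything else raises.
        if not isinstance(other, NoRowIndexSketch):
            raise TypeError("Can only append a NoRowIndex to a NoRowIndex")
        # Appending length y to length x gives length x + y.
        return NoRowIndexSketch(self.length + other.length)


combined = NoRowIndexSketch(2).append(NoRowIndexSketch(1))
print(combined)  # NoRowIndexSketch(length=3)
```

Because no labels are stored, there is nothing to deduplicate or reorder when appending, which is what keeps the result a ``NoRowIndex``.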
| 140 | + |
| 141 | +### Slicing a ``NoRowIndex`` |
| 142 | + |
| 143 | +If one has a ``DataFrame`` with ``NoRowIndex``, then one would expect that a slice of it would still have |
| 144 | +a ``NoRowIndex``. This could be accomplished with: |
| 145 | + |
| 146 | +> If a slice of length ``x`` is taken from a ``NoRowIndex`` of length ``y``, then one gets a |
| 147 | + ``NoRowIndex`` of length ``x``. Label-based slicing would not be allowed. |
| 148 | + |
| 149 | +Example: |
| 150 | + |
| 151 | +```python |
| 152 | +In [12]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3)) |
| 153 | + |
| 154 | +In [13]: df.loc[df['a']>1, 'b'] |
| 155 | +Out[13]: |
| 156 | +5 |
| 157 | +6 |
| 158 | +Name: b, dtype: int64 |
| 159 | + |
| 160 | +In [14]: df.loc[df['a']>1, 'b'].index |
| 161 | +Out[14]: NoRowIndex(len=2) |
| 162 | +``` |
| 163 | + |
| 164 | +Slicing by label, however, would be disallowed: |
| 165 | +```python |
| 166 | +In [15]: df.loc[0, 'b'] |
| 167 | +--------------------------------------------------------------------------- |
| 168 | +IndexError: Cannot use label-based indexing on NoRowIndex! |
| 169 | +``` |
| 170 | + |
| 171 | +Note too that: |
| 172 | +- other uses of ``.loc``, such as boolean masks, would still be allowed (see the FAQ); |
| 173 | +- ``.iloc`` and ``.iat`` would keep working as before; |
| 174 | +- ``.at`` would raise. |
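As a rough illustration of the slicing rule, the sketch below computes the length of the index that would remain after a selection. The helper `select_length` is hypothetical (it does not exist in pandas) and assumes that only positional slices and boolean masks are valid row selectors.

```python
def select_length(length, key):
    """Return the length of a hypothetical NoRowIndex after selecting `key`."""
    if isinstance(key, slice):
        # Positional slice: count the positions it covers.
        return len(range(*key.indices(length)))
    if (
        isinstance(key, list)
        and len(key) == length
        and all(isinstance(flag, bool) for flag in key)
    ):
        # Boolean mask: one flag per row, keep the rows flagged True.
        return sum(key)
    # Anything else is treated as a label, which a NoRowIndex does not have.
    raise IndexError("Cannot use label-based indexing on NoRowIndex!")


print(select_length(3, [False, True, True]))  # 2
print(select_length(3, slice(1, None)))       # 2
```

Passing a label such as `0` falls through to the final branch and raises, mirroring the `df.loc[0, 'b']` example.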
| 175 | + |
| 176 | +### Aligning ``NoRowIndex``s |
| 177 | + |
| 178 | +To minimise surprises, the rule would be: |
| 179 | + |
| 180 | +> A ``NoRowIndex`` can only be aligned with another ``NoRowIndex`` of the same length. |
| 181 | +> Attempting to align it with anything else would raise. |
| 182 | + |
| 183 | +Example: |
| 184 | +```python |
| 185 | +In [1]: ser1 = pd.Series([1, 2, 3], index=NoRowIndex(3)) |
| 186 | + |
| 187 | +In [2]: ser2 = pd.Series([4, 5, 6], index=NoRowIndex(3)) |
| 188 | + |
| 189 | +In [3]: ser1 + ser2 # works! |
| 190 | +Out[3]: |
| 191 | +5 |
| 192 | +7 |
| 193 | +9 |
| 194 | +dtype: int64 |
| 195 | + |
| 196 | +In [4]: ser1 + ser2.iloc[1:] # errors! |
| 197 | +--------------------------------------------------------------------------- |
| 198 | +TypeError: Cannot join NoRowIndex of different lengths |
| 199 | +``` |
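The alignment rule amounts to a length check followed by a purely positional operation. A minimal sketch, using plain lists in place of ``Series`` and a hypothetical helper name `add_without_labels`:

```python
def add_without_labels(left, right):
    """Add two label-free columns position by position."""
    # With no labels to match on, equal length is the only possible alignment.
    if len(left) != len(right):
        raise TypeError("Cannot join NoRowIndex of different lengths")
    return [a + b for a, b in zip(left, right)]


print(add_without_labels([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```

Raising on a length mismatch, rather than padding with missing values, is the design choice that avoids the surprising `NaN`s shown in the Motivation section.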
| 200 | + |
| 201 | +### Columns cannot be NoRowIndex |
| 202 | + |
| 203 | +This proposal deals exclusively with allowing users to not need to think about |
| 204 | +row labels. There's no suggestion to remove the column labels. |
| 205 | + |
| 206 | +In particular, calling ``transpose`` on a ``NoRowIndex`` ``DataFrame`` |
| 207 | +would raise. The error message would inform |
| 208 | +users that they should first set an index. E.g.: |
| 209 | +```python |
| 210 | +In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3)) |
| 211 | + |
| 212 | +In [5]: df.transpose() |
| 213 | +--------------------------------------------------------------------------- |
| 214 | +ValueError: Columns cannot be NoRowIndex. |
| 215 | +If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(_df))))` |
| 216 | +``` |
| 217 | + |
| 218 | +### DataFrameFormatter and SeriesFormatter changes |
| 219 | + |
| 220 | +When printing an object with a ``NoRowIndex``, the row labels would not be shown: |
| 221 | + |
| 222 | +```python |
| 223 | +In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3)) |
| 224 | + |
| 225 | +In [16]: df |
| 226 | +Out[16]: |
| 227 | + a b |
| 228 | + 1 4 |
| 229 | + 2 5 |
| 230 | + 3 6 |
| 231 | +``` |
| 232 | + |
| 233 | +Of the above changes, this may be the only one that would need implementing within |
| 234 | +``DataFrameFormatter`` / ``SeriesFormatter``, as opposed to within ``NoRowIndex``. |
| 235 | + |
| 236 | +## Usage and Impact |
| 237 | + |
| 238 | +Users would not be expected to work with the ``NoRowIndex`` class itself directly. |
| 239 | +Usage would probably involve a mode which would change the ``default_index`` |
| 240 | +function to return a ``NoRowIndex`` rather than a ``RangeIndex``. |
| 241 | +Then, if a ``mode.no_row_index`` option was introduced and a user opted in to it with |
| 242 | + |
| 243 | +```python |
| 244 | +pd.set_option("mode.no_row_index", True) |
| 245 | +``` |
| 246 | + |
| 247 | +then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they |
| 248 | +all call ``default_index``): |
| 249 | + |
| 250 | +- ``df.reset_index()``; |
| 251 | +- ``pd.concat([df1, df2], ignore_index=True)``; |
| 252 | +- ``df1.merge(df2, on=col)``; |
| 253 | +- ``df = pd.DataFrame({'col_1': [1, 2, 3]})``. |
| 254 | + |
| 255 | +Further discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would |
| 256 | +just be a first step towards getting there. |
| 257 | + |
| 258 | +## Implementation |
| 259 | + |
| 260 | +Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693. |
| 261 | + |
| 262 | +Note that implementation details could well change even if this PDEP were |
| 263 | +accepted. For example, ``NoRowIndex`` would not necessarily need to subclass |
| 264 | +``RangeIndex``, and it would not necessarily need to be accessible to the user |
| 265 | +(``df.index`` could well return ``None``). |
| 266 | + |
| 267 | +## Likely FAQ |
| 268 | + |
| 269 | +**Q: Could not users just use ``RangeIndex``? Why do we need a new class?** |
| 270 | + |
| 271 | +**A**: ``RangeIndex`` is not preserved under slicing and appending, e.g.: |
| 272 | + ```python |
| 273 | + In[1]: ser = pd.Series([1, 2, 3]) |
| 274 | + |
| 275 | + In[2]: ser[ser != 2].index |
| 276 | + Out[2]: Int64Index([0, 2], dtype="int64") |
| 277 | + ``` |
| 278 | + If someone does not want to think about row labels and starts off |
| 279 | + with a ``RangeIndex``, they'll very quickly lose it. |
| 280 | + |
| 281 | +**Q: Are indices not really powerful?** |
| 282 | + |
| 283 | +**A:** Yes! And they're also confusing to many users, even experienced developers. |
| 284 | + Often users are using ``.reset_index`` to avoid issues with indices and alignment. |
| 285 | + Such users would benefit from being able to not think about indices |
| 286 | + and alignment. Indices would be here to stay, and ``NoRowIndex`` would not be the |
| 287 | + default. |
| 288 | + |
| 289 | +**Q: How could one switch a ``NoRowIndex`` ``DataFrame`` back to one with an index?** |
| 290 | + |
| 291 | +**A:** The simplest way would probably be: |
| 292 | + ```python |
| 293 | + df.set_axis(pd.RangeIndex(len(df))) |
| 294 | + ``` |
| 295 | + There's probably no need to introduce a new method for this. |
| 296 | + |
| 297 | + Conversely, if the ``mode.no_row_index`` option was introduced, then |
| 298 | + one could get rid of the index simply with ``df.reset_index(drop=True)``. |
| 299 | + |
| 300 | +**Q: How would ``tz_localize`` and other methods which operate on the index work on a ``NoRowIndex`` ``DataFrame``?** |
| 301 | + |
| 302 | +**A:** Same way they work on other ``NumericIndex``s, which would typically be to raise: |
| 303 | + |
| 304 | + ```python |
| 305 | + In [2]: ser.tz_localize('UTC') |
| 306 | + --------------------------------------------------------------------------- |
| 307 | + TypeError: index is not a valid DatetimeIndex or PeriodIndex |
| 308 | + ``` |
| 309 | + |
| 310 | +**Q: Why not let transpose switch ``NoRowIndex`` to ``RangeIndex`` under the hood before swapping index and columns?** |
| 311 | + |
| 312 | +**A:** This is the kind of magic that can lead to surprising behaviour that's |
| 313 | + difficult to debug. For example, ``df.transpose().transpose()`` would not |
| 314 | + round-trip. After all, it's easy enough to set an index, and it's better to "force" users |
| 315 | + to be intentional about what they want and end up with fewer surprises later |
| 316 | + on. |
| 317 | + |
| 318 | +**Q: What would df.sum(), and other methods which introduce an index, return?** |
| 319 | + |
| 320 | +**A:** Such methods would still set an index and would work the same way they |
| 321 | + do now. There may be some way to change that (e.g. introducing ``as_index`` |
| 322 | + arguments and introducing a mode to set its default) but that's out of scope |
| 323 | + for this particular PDEP. |
| 324 | + |
| 325 | +**Q: How would a user opt-in to a ``NoRowIndex`` DataFrame?** |
| 326 | + |
| 327 | +**A:** This PDEP would only allow it via the constructor, passing |
| 328 | + ``index=NoRowIndex(len(df))``. A mode could be introduced to toggle |
| 329 | + making that the default, but would be out-of-scope for the current PDEP. |
| 330 | + |
| 331 | +**Q: Would ``.loc`` stop working?** |
| 332 | + |
| 333 | +**A:** No. It would only raise if used for label-based selection. Other uses |
| 334 | + of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[boolean_mask, col_1]``, would |
| 335 | + continue working. |
| 336 | + |
| 337 | +**Q: What's unintuitive about ``Series`` aligning indices when summing?** |
| 338 | + |
| 339 | +**A:** Not sure, but I once asked a group of experienced developers what the |
| 340 | + output of |
| 341 | + ```python |
| 342 | + ser1 = pd.Series([1, 1, 1], index=[1, 2, 3]) |
| 343 | + ser2 = pd.Series([1, 1, 1], index=[3, 4, 5]) |
| 344 | + print(ser1 + ser2) |
| 345 | + ``` |
| 346 | + would be, and _nobody_ got it right. |
| 347 | + |
| 348 | +## Reasons for withdrawal |
| 349 | + |
| 350 | +After some discussions, it has become clear there is not enough support for the proposal in its current state. |
| 351 | +In short, it would add too much complexity to justify the potential benefits. It would unacceptably increase |
| 352 | +the maintenance burden, the testing requirements, and the benefits would be minimal. |
| 353 | + |
| 354 | +Concretely: |
| 355 | +- maintenance burden: it would not be possible to handle all the complexity within the ``NoRowIndex`` class itself; some |
| 356 | + extra logic would need to go into the pandas core codebase, which is already very complex and hard to maintain; |
| 357 | +- the testing burden would be too high. Properly testing this would mean almost doubling the size of the test suite. |
| 358 | + Coverage for options already is not great: for example [this issue](https://github.com/pandas-dev/pandas/issues/49732) |
| 359 | + was caused by a PR which passed CI, but CI did not (and still does not) cover that option (plotting backends); |
| 360 | +- it would not benefit most users, as users do not tend to use or discover options which are not the default; |
| 361 | +- it would be difficult to reconcile with some existing behaviours: for example, ``df.sum()`` returns a Series with the |
| 362 | + column names in the index. |
| 363 | + |
| 364 | +In order to make no-index the pandas default and have a chance of benefiting users, a more comprehensive set of changes |
| 365 | +would need to be made at the same time. This would require a proposal much larger in scope, and would be a much more radical change. |
| 366 | +It may be that this proposal will be revisited in the future, but in its current state (as an option) it cannot be accepted. |
| 367 | + |
| 368 | +This has still been a useful exercise, though, as it has resulted in two related proposals (see below). |
| 369 | + |
| 370 | +## Related proposals |
| 371 | + |
| 372 | +- Deprecate automatic alignment, at least in some cases: https://github.com/pandas-dev/pandas/issues/49939; |
| 373 | +- ``.value_counts`` behaviour change: https://github.com/pandas-dev/pandas/issues/49497 |
| 374 | + |
| 375 | +## PDEP History |
| 376 | + |
| 377 | +- 14 November 2022: Initial draft |
| 378 | +- 18 November 2022: First revision (limited the proposal to a new class, leaving a ``mode`` to a separate proposal) |
| 379 | +- 14 December 2022: Withdrawal (difficulty reconciling with some existing methods, lack of strong support, |
| 380 | + maintenance burden increasing unjustifiably) |