PDEP-5: NoRowIndex (#49694)

MarcoGorelli · web-flow · commit a344dc904bb3 · 2023-03-01T21:56:02.000-05:00
* [skip ci] pdep-5 initial draft

* [skip ci] first revision

* [skip ci] note about multiindex

* [skip ci] clarify some points as per reviews

* [skip ci] fix typo

* [skip ci] withdraw

* [skip ci] typos

* [skip ci] clarify benefit to users part

* [skip ci] reword, reformat

* [skip ci] further reword

* status withdrawn

* clarify that mode.no_row_index would have been separate

* summarise revisions / withdrawal reasons

---------

Co-authored-by: MarcoGorelli &lt;&gt;
diff --git a/web/pandas/pdeps/0005-no-default-index-mode.md b/web/pandas/pdeps/0005-no-default-index-mode.md
@@ -0,0 +1,380 @@
+# PDEP-5: NoRowIndex
+
+- Created: 14 November 2022
+- Status: Withdrawn
+- Discussion: [#49693](https://github.com/pandas-dev/pandas/pull/49693)
+- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
+- Revision: 2
+
+## Abstract
+
+The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like
+a ``RangeIndex``, but some methods would be stricter. This would be one
+step towards enabling users who do not want to think about indices to not need to.
+
+## Motivation
+
+The Index can be a source of confusion and frustration for pandas users. For example, let's consider the inputs
+
+```python
+In[37]: ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5])
+
+In[38]: ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])
+```
+
+Then:
+
+- it can be unexpected that adding `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315):
+
+  ```python
+  In [41]: ser1 + ser2
+  Out[41]:
+  1    20.0
+  2    30.0
+  3    40.0
+  4     NaN
+  5     NaN
+  dtype: float64
+  ```
+
+- concatenation, even with `ignore_index=True`, still aligns on the index (https://github.com/pandas-dev/pandas/issues/25349):
+
+    ```python
+    In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True)
+    Out[42]:
+          0     1
+    1  10.0  10.0
+    2  15.0  15.0
+    3  20.0  20.0
+    5  25.0   NaN
+    4   NaN  25.0
+    ```
+
+- it can be frustrating to have to repeatedly call `.reset_index()` (https://twitter.com/chowthedog/status/1559946277315641345):
+
+  ```python
+  In [3]: ser1.reset_index(drop=True) + ser2.reset_index(drop=True)
+  Out[3]:
+  0    20
+  1    30
+  2    40
+  3    50
+  dtype: int64
+  ```
+
+If a user did not want to think about row labels (which they may have ended up after slicing / concatenating operations),
+then ``NoRowIndex`` would enable the above to work in a more intuitive
+manner (details and examples to follow below).
+
+## Scope
+
+This proposal deals exclusively with the ``NoRowIndex`` class. To allow users to fully "opt-out" of having to think
+about row labels, the following could also be useful:
+- a ``pd.set_option('mode.no_row_index', True)`` mode which would default to creating new ``DataFrame``s and
+  ``Series`` with ``NoRowIndex`` instead of ``RangeIndex``;
+- giving ``as_index`` options to methods which currently create an index
+  (e.g. ``value_counts``, ``.sum()``, ``.pivot_table``) to just insert a new column instead of creating an
+  ``Index``.
+
+However, neither of the above will be discussed here.
+
+## Detailed Description
+
+The core pandas code would change as little as possible. The additional complexity should be handled
+within the ``NoRowIndex`` object. It would act just like ``RangeIndex``, but would be a bit stricter
+in some cases:
+- `name` could only be `None`;
+- `start` could only be `0`, `step` `1`;
+- when appending one ``NoRowIndex`` to another ``NoRowIndex``, the result would still be ``NoRowIndex``.
+  Appending a ``NoRowIndex`` to any other index (or vice-versa) would raise;
+- the ``NoRowIndex`` class would be preserved under slicing;
+- a ``NoRowIndex`` could only be aligned with another ``Index`` if it's also ``NoRowIndex`` and if it's of the same length;
+- ``DataFrame`` columns cannot be `NoRowIndex` (so ``transpose`` would need some adjustments when called on a ``NoRowIndex`` ``DataFrame``);
+- `insert` and `delete` should raise. As a consequence, if ``df`` is a ``DataFrame`` with a
+  ``NoRowIndex``, then `df.drop` with `axis=0` would always raise;
+- arithmetic operations (e.g. `NoRowIndex(3) + 2`) would always raise;
+- when printing a ``DataFrame``/``Series`` with a ``NoRowIndex``, then the row labels would not be printed;
+- a ``MultiIndex`` could not be created with a ``NoRowIndex`` as one of its levels.
+
+Let's go into more detail for some of these. In the examples that follow, the ``NoRowIndex`` will be passed explicitly,
+but this is not how users would be expected to use it (see "Usage and Impact" section for details).
+
+### NoRowIndex.append
+
+If one has two ``DataFrame``s with ``NoRowIndex``, then one would expect that concatenating them would
+result in a ``DataFrame`` which still has ``NoRowIndex``. To do this, the following rule could be introduced:
+
+> If appending a ``NoRowIndex`` of length ``y`` to a ``NoRowIndex`` of length ``x``, the result will be a
+  ``NoRowIndex`` of length ``x + y``.
+
+Example:
+
+```python
+In [6]: df1 = pd.DataFrame({'a': [1,  2], 'b': [4, 5]}, index=NoRowIndex(2))
+
+In [7]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))
+
+In [8]: df1
+Out[8]:
+ a  b
+ 1  4
+ 2  5
+
+In [9]: df2
+Out[9]:
+ a  b
+ 4  0
+
+In [10]: pd.concat([df1, df2])
+Out[10]:
+ a  b
+ 1  4
+ 2  5
+ 4  0
+
+In [11]: pd.concat([df1, df2]).index
+Out[11]: NoRowIndex(len=3)
+```
+
+Appending anything other than another ``NoRowIndex`` would raise.
+
+### Slicing a ``NoRowIndex``
+
+If one has a ``DataFrame`` with ``NoRowIndex``, then one would expect that a slice of it would still have
+a ``NoRowIndex``. This could be accomplished with:
+
+> If a slice of length ``x`` is taken from a ``NoRowIndex`` of length ``y``, then one gets a
+ ``NoRowIndex`` of length ``x``. Label-based slicing would not be allowed.
+
+Example:
+
+```python
+In [12]: df = pd.DataFrame({'a': [1,  2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
+
+In [13]: df.loc[df['a']>1, 'b']
+Out[13]:
+5
+6
+Name: b, dtype: int64
+
+In [14]: df.loc[df['a']>1, 'b'].index
+Out[14]: NoRowIndex(len=2)
+```
+
+Slicing by label, however, would be disallowed:
+```python
+In [15]: df.loc[0, 'b']
+---------------------------------------------------------------------------
+IndexError: Cannot use label-based indexing on NoRowIndex!
+```
+
+Note too that:
+- other uses of ``.loc``, such as boolean masks, would still be allowed (see F.A.Q);
+- ``.iloc`` and ``.iat`` would keep working as before;
+- ``.at`` would raise.
+
+### Aligning ``NoRowIndex``s
+
+To minimise surprises, the rule would be:
+
+> A ``NoRowIndex`` can only be aligned with another ``NoRowIndex`` of the same length.
+> Attempting to align it with anything else would raise.
+
+Example:
+```python
+In [1]: ser1 = pd.Series([1, 2, 3], index=NoRowIndex(3))
+
+In [2]: ser2 = pd.Series([4, 5, 6], index=NoRowIndex(3))
+
+In [3]: ser1 + ser2  # works!
+Out[3]:
+5
+7
+9
+dtype: int64
+
+In [4]: ser1 + ser2.iloc[1:]  # errors!
+---------------------------------------------------------------------------
+TypeError: Cannot join NoRowIndex of different lengths
+```
+
+### Columns cannot be NoRowIndex
+
+This proposal deals exclusively with allowing users to not need to think about
+row labels. There's no suggestion to remove the column labels.
+
+In particular, calling ``transpose`` on a ``NoRowIndex`` ``DataFrame``
+would error. The error would come with a helpful error message, informing
+users that they should first set an index. E.g.:
+```python
+In [4]: df = pd.DataFrame({'a': [1,  2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
+
+In [5]: df.transpose()
+---------------------------------------------------------------------------
+ValueError: Columns cannot be NoRowIndex.
+If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(_df))))`
+```
+
+### DataFrameFormatter and SeriesFormatter changes
+
+When printing an object with a ``NoRowIndex``, then the row labels would not be shown:
+
+```python
+In [15]: df = pd.DataFrame({'a': [1,  2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
+
+In [16]: df
+Out[16]:
+ a  b
+ 1  4
+ 2  5
+ 3  6
+```
+
+Of the above changes, this may be the only one that would need implementing within
+``DataFrameFormatter`` / ``SerieFormatter``, as opposed to within ``NoRowIndex``.
+
+## Usage and Impact
+
+Users would not be expected to work with the ``NoRowIndex`` class itself directly.
+Usage would probably involve a mode which would change how the ``default_index``
+function to return a ``NoRowIndex`` rather than a ``RangeIndex``.
+Then, if a ``mode.no_row_index`` option was introduced and a user opted in to it with
+
+```python
+pd.set_option("mode.no_row_index", True)
+```
+
+then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they
+all call ``default_index``):
+
+- ``df.reset_index()``;
+- ``pd.concat([df1, df2], ignore_index=True)``
+- ``df1.merge(df2, on=col)``;
+- ``df = pd.DataFrame({'col_1': [1, 2, 3]})``
+
+Further discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would
+just be a first step towards getting there.
+
+## Implementation
+
+Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.
+
+Note that implementation details could well change even if this PDEP were
+accepted. For example, ``NoRowIndex`` would not necessarily need to subclass
+``RangeIndex``, and it would not necessarily need to be accessible to the user
+(``df.index`` could well return ``None``)
+
+## Likely FAQ
+
+**Q: Could not users just use ``RangeIndex``? Why do we need a new class?**
+
+**A**: ``RangeIndex`` is not preserved under slicing and appending, e.g.:
+  ```python
+  In[1]: ser = pd.Series([1, 2, 3])
+
+  In[2]: ser[ser != 2].index
+  Out[2]: Int64Index([0, 2], dtype="int64")
+  ```
+  If someone does not want to think about row labels and starts off
+  with a ``RangeIndex``, they'll very quickly lose it.
+
+**Q: Are indices not really powerful?**
+
+**A:** Yes! And they're also confusing to many users, even experienced developers.
+  Often users are using ``.reset_index`` to avoid issues with indices and alignment.
+  Such users would benefit from being able to not think about indices
+  and alignment. Indices would be here to stay, and ``NoRowIndex`` would not be the
+  default.
+
+**Q: How could one switch a ``NoRowIndex`` ``DataFrame`` back to one with an index?**
+
+**A:** The simplest way would probably be:
+  ```python
+  df.set_axis(pd.RangeIndex(len(df)))
+  ```
+  There's probably no need to introduce a new method for this.
+
+  Conversely, to get rid of the index, then if the ``mode.no_row_index`` option was introduced, then
+  one could simply do ``df.reset_index(drop=True)``.
+
+**Q: How would ``tz_localize`` and other methods which operate on the index work on a ``NoRowIndex`` ``DataFrame``?**
+
+**A:** Same way they work on other ``NumericIndex``s, which would typically be to raise:
+
+  ```python
+  In [2]: ser.tz_localize('UTC')
+  ---------------------------------------------------------------------------
+  TypeError: index is not a valid DatetimeIndex or PeriodIndex
+  ```
+
+**Q: Why not let transpose switch ``NoRowIndex`` to ``RangeIndex`` under the hood before swapping index and columns?**
+
+**A:** This is the kind of magic that can lead to surprising behaviour that's
+  difficult to debug. For example, ``df.transpose().transpose()`` would not
+  round-trip. It's easy enough to set an index after all, better to "force" users
+  to be intentional about what they want and end up with fewer surprises later
+  on.
+
+**Q: What would df.sum(), and other methods which introduce an index, return?**
+
+**A:** Such methods would still set an index and would work the same way they
+  do now. There may be some way to change that (e.g. introducing ``as_index``
+  arguments and introducing a mode to set its default) but that's out of scope
+  for this particular PDEP.
+
+**Q: How would a user opt-in to a ``NoRowIndex`` DataFrame?**
+
+**A:** This PDEP would only allow it via the constructor, passing
+  ``index=NoRowIndex(len(df))``. A mode could be introduced to toggle
+  making that the default, but would be out-of-scope for the current PDEP.
+
+**Q: Would ``.loc`` stop working?**
+
+**A:** No. It would only raise if used for label-based selection. Other uses
+  of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[boolean_mask, col_1]``, would
+  continue working.
+
+**Q: What's unintuitive about ``Series`` aligning indices when summing?**
+
+**A:** Not sure, but I once asked a group of experienced developers what the
+  output of
+  ```python
+  ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
+  ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])
+  print(ser1 + ser2)
+  ```
+  would be, and _nobody_ got it right.
+
+## Reasons for withdrawal
+
+After some discussions, it has become clear there is not enough for support for the proposal in its current state.
+In short, it would add too much complexity to justify the potential benefits. It would unacceptably increase
+the maintenance burden, the testing requirements, and the benefits would be minimal.
+
+Concretely:
+- maintenance burden: it would not be possible to handle all the complexity within the ``NoRowIndex`` class itself, some
+  extra logic would need to go into the pandas core codebase, which is already very complex and hard to maintain;
+- the testing burden would be too high. Properly testing this would mean almost doubling the size of the test suite.
+  Coverage for options already is not great: for example [this issue](https://github.com/pandas-dev/pandas/issues/49732)
+  was caused by a PR which passed CI, but CI did not (and still does not) cover that option (plotting backends);
+- it will not benefit most users, as users do not tend to use nor discover options which are not the default;
+- it would be difficult to reconcile with some existing behaviours: for example, ``df.sum()`` returns a Series with the
+  column names in the index.
+
+In order to make no-index the pandas default and have a chance of benefiting users, a more comprehensive set of changes
+would need to made at the same time. This would require a proposal much larger in scope, and would be a much more radical change.
+It may be that this proposal will be revisited in the future, but in its current state (as an option) it cannot be accepted.
+
+This has still been a useful exercise, though, as it has resulted in two related proposals (see below).
+
+## Related proposals
+
+- Deprecate automatic alignment, at least in some cases: https://github.com/pandas-dev/pandas/issues/49939;
+- ``.value_counts`` behaviour change: https://github.com/pandas-dev/pandas/issues/49497
+
+## PDEP History
+
+- 14 November 2022: Initial draft
+- 18 November 2022: First revision (limited the proposal to a new class, leaving a ``mode`` to a separate proposal)
+- 14 December 2022: Withdrawal (difficulty reconciling with some existing methods, lack of strong support,
+  maintenance burden increasing unjustifiably)
diff --git a/web/pandas_web.py b/web/pandas_web.py
@@ -252,7 +252,13 @@ def roadmap_pdeps(context):
         and linked from there. This preprocessor obtains the list of
         PDEP's in different status from the directory tree and GitHub.
         """
-        KNOWN_STATUS = {"Under discussion", "Accepted", "Implemented", "Rejected"}
+        KNOWN_STATUS = {
+            "Under discussion",
+            "Accepted",
+            "Implemented",
+            "Rejected",
+            "Withdrawn",
+        }
         context["pdeps"] = collections.defaultdict(list)
 
         # accepted, rejected and implemented