Commit a344dc9

PDEP-5: NoRowIndex (#49694)
* [skip ci] pdep-5 initial draft
* [skip ci] first revision
* [skip ci] note about multiindex
* [skip ci] clarify some points as per reviews
* [skip ci] fix typo
* [skip ci] withdraw
* [skip ci] typos
* [skip ci] clarify benefit to users part
* [skip ci] reword, reformat
* [skip ci] further reword
* status withdrawn
* clarify that mode.no_row_index would have been separate
* summarise revisions / withdrawal reasons

---------

Co-authored-by: MarcoGorelli <>
1 parent 1151e3b commit a344dc9

2 files changed: +387 -1 lines changed

Lines changed: 380 additions & 0 deletions
@@ -0,0 +1,380 @@
# PDEP-5: NoRowIndex

- Created: 14 November 2022
- Status: Withdrawn
- Discussion: [#49693](https://github.com/pandas-dev/pandas/pull/49693)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
- Revision: 2

## Abstract

The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like
a ``RangeIndex``, but some methods would be stricter. This would be one
step towards enabling users who do not want to think about indices to not need to.

## Motivation

The Index can be a source of confusion and frustration for pandas users. For example, let's consider the inputs

```python
In [37]: ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5])

In [38]: ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])
```

Then:

- it can be unexpected that adding `Series` with the same length (but different indices) produces `NaN`s in the result (https://stackoverflow.com/q/66094702/4451315):

```python
In [41]: ser1 + ser2
Out[41]:
1    20.0
2    30.0
3    40.0
4     NaN
5     NaN
dtype: float64
```

- concatenation, even with `ignore_index=True`, still aligns on the index (https://github.com/pandas-dev/pandas/issues/25349):

```python
In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True)
Out[42]:
      0     1
1  10.0  10.0
2  15.0  15.0
3  20.0  20.0
5  25.0   NaN
4   NaN  25.0
```

- it can be frustrating to have to repeatedly call `.reset_index()` (https://twitter.com/chowthedog/status/1559946277315641345):

```python
In [3]: ser1.reset_index(drop=True) + ser2.reset_index(drop=True)
Out[3]:
0    20
1    30
2    40
3    50
dtype: int64
```

If a user did not want to think about row labels (which they may have ended up with after slicing / concatenating operations),
then ``NoRowIndex`` would enable the above to work in a more intuitive
manner (details and examples to follow below).

## Scope

This proposal deals exclusively with the ``NoRowIndex`` class. To allow users to fully "opt-out" of having to think
about row labels, the following could also be useful:
- a ``pd.set_option('mode.no_row_index', True)`` mode which would default to creating new ``DataFrame``s and
  ``Series`` with ``NoRowIndex`` instead of ``RangeIndex``;
- giving ``as_index`` options to methods which currently create an index
  (e.g. ``value_counts``, ``.sum()``, ``.pivot_table``) to just insert a new column instead of creating an
  ``Index``.

However, neither of the above will be discussed here.

## Detailed Description

The core pandas code would change as little as possible. The additional complexity should be handled
within the ``NoRowIndex`` object. It would act just like ``RangeIndex``, but would be a bit stricter
in some cases:
- `name` could only be `None`;
- `start` could only be `0`, and `step` could only be `1`;
- when appending one ``NoRowIndex`` to another ``NoRowIndex``, the result would still be ``NoRowIndex``.
  Appending a ``NoRowIndex`` to any other index (or vice-versa) would raise;
- the ``NoRowIndex`` class would be preserved under slicing;
- a ``NoRowIndex`` could only be aligned with another ``Index`` if it's also ``NoRowIndex`` and if it's of the same length;
- ``DataFrame`` columns cannot be `NoRowIndex` (so ``transpose`` would need some adjustments when called on a ``NoRowIndex`` ``DataFrame``);
- `insert` and `delete` should raise. As a consequence, if ``df`` is a ``DataFrame`` with a
  ``NoRowIndex``, then `df.drop` with `axis=0` would always raise;
- arithmetic operations (e.g. `NoRowIndex(3) + 2`) would always raise;
- when printing a ``DataFrame``/``Series`` with a ``NoRowIndex``, the row labels would not be printed;
- a ``MultiIndex`` could not be created with a ``NoRowIndex`` as one of its levels.

Let's go into more detail for some of these. In the examples that follow, the ``NoRowIndex`` will be passed explicitly,
but this is not how users would be expected to use it (see "Usage and Impact" section for details).
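
To make a few of these rules concrete, here is a minimal pure-Python sketch. ``ToyNoRowIndex`` is a hypothetical stand-in written for illustration only: it is not a pandas class and not the draft implementation, and it covers only the append, slicing, ``insert``/``delete`` and arithmetic rules.

```python
from __future__ import annotations


class ToyNoRowIndex:
    """Hypothetical stand-in for the proposed NoRowIndex (not a pandas class)."""

    # `name` is fixed to None; `start` and `step` are fixed to 0 and 1.
    name = None
    start = 0
    step = 1

    def __init__(self, length: int) -> None:
        if length < 0:
            raise ValueError("length must be non-negative")
        self._length = length

    def __len__(self) -> int:
        return self._length

    def __repr__(self) -> str:
        return f"ToyNoRowIndex(len={self._length})"

    def append(self, other: object) -> ToyNoRowIndex:
        # Appending is only defined between two label-free indexes.
        if not isinstance(other, ToyNoRowIndex):
            raise TypeError("Can only append a ToyNoRowIndex to a ToyNoRowIndex")
        return ToyNoRowIndex(len(self) + len(other))

    def __getitem__(self, key: slice) -> ToyNoRowIndex:
        # Positional slicing keeps the class; only the length changes.
        if not isinstance(key, slice):
            raise IndexError("Only positional slices are supported in this sketch")
        return ToyNoRowIndex(len(range(self._length)[key]))

    def insert(self, loc: int, item: object) -> None:
        raise TypeError("insert is not supported")

    def delete(self, loc: int) -> None:
        raise TypeError("delete is not supported")

    def __add__(self, other: object) -> None:
        # Arithmetic on a label-free index has no meaning, so it raises.
        raise TypeError("arithmetic is not supported")


idx = ToyNoRowIndex(3)
print(idx.append(ToyNoRowIndex(2)))  # ToyNoRowIndex(len=5)
print(idx[1:])                       # ToyNoRowIndex(len=2)
```

The only state the sketch carries is a length, which is what lets every operation either return another label-free index or raise.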

### NoRowIndex.append

If one has two ``DataFrame``s with ``NoRowIndex``, then one would expect that concatenating them would
result in a ``DataFrame`` which still has ``NoRowIndex``. To do this, the following rule could be introduced:

> If appending a ``NoRowIndex`` of length ``y`` to a ``NoRowIndex`` of length ``x``, the result will be a
``NoRowIndex`` of length ``x + y``.

Example:

```python
In [6]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))

In [7]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))

In [8]: df1
Out[8]:
a b
1 4
2 5

In [9]: df2
Out[9]:
a b
4 0

In [10]: pd.concat([df1, df2])
Out[10]:
a b
1 4
2 5
4 0

In [11]: pd.concat([df1, df2]).index
Out[11]: NoRowIndex(len=3)
```

Appending anything other than another ``NoRowIndex`` would raise.
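
For comparison, the closest behaviour in pandas as it exists today is ``pd.concat`` with ``ignore_index=True``, which relabels the result ``0..x+y-1``. The sketch below uses only existing pandas calls and a default index. It illustrates the gap the rule above is meant to close; it does not implement ``NoRowIndex``.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [4, 5]})
df2 = pd.DataFrame({"a": [4], "b": [0]})

# Without ignore_index, each frame keeps its own labels: 0, 1, 0.
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())  # [0, 1, 0]

# With ignore_index=True, the result gets fresh labels 0..x+y-1.
relabelled = pd.concat([df1, df2], ignore_index=True)
print(relabelled.index.tolist())  # [0, 1, 2]
```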

### Slicing a ``NoRowIndex``

If one has a ``DataFrame`` with ``NoRowIndex``, then one would expect that a slice of it would still have
a ``NoRowIndex``. This could be accomplished with:

> If a slice of length ``x`` is taken from a ``NoRowIndex`` of length ``y``, then one gets a
``NoRowIndex`` of length ``x``. Label-based slicing would not be allowed.

Example:

```python
In [12]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [13]: df.loc[df['a']>1, 'b']
Out[13]:
5
6
Name: b, dtype: int64

In [14]: df.loc[df['a']>1, 'b'].index
Out[14]: NoRowIndex(len=2)
```

Slicing by label, however, would be disallowed:
```python
In [15]: df.loc[0, 'b']
---------------------------------------------------------------------------
IndexError: Cannot use label-based indexing on NoRowIndex!
```

Note too that:
- other uses of ``.loc``, such as boolean masks, would still be allowed (see the "Likely FAQ" section);
- ``.iloc`` and ``.iat`` would keep working as before;
- ``.at`` would raise.
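
As a point of reference, the following sketch shows what the same boolean-mask selection does in pandas today with a default index: the surviving rows keep their original labels unless they are dropped explicitly. Only existing pandas calls are used, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A boolean mask keeps the labels of the surviving rows (1 and 2 here).
masked = df.loc[df["a"] > 1, "b"]
print(masked.index.tolist())  # [1, 2]

# Dropping the labels afterwards leaves positions 0..len-1, ...
positional = masked.reset_index(drop=True)
print(positional.index.tolist())  # [0, 1]

# ... and .iloc selects by position regardless of the labels.
print(masked.iloc[0])  # 5
```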

### Aligning ``NoRowIndex``s

To minimise surprises, the rule would be:

> A ``NoRowIndex`` can only be aligned with another ``NoRowIndex`` of the same length.
> Attempting to align it with anything else would raise.

Example:
```python
In [1]: ser1 = pd.Series([1, 2, 3], index=NoRowIndex(3))

In [2]: ser2 = pd.Series([4, 5, 6], index=NoRowIndex(3))

In [3]: ser1 + ser2  # works!
Out[3]:
5
7
9
dtype: int64

In [4]: ser1 + ser2.iloc[1:]  # errors!
---------------------------------------------------------------------------
TypeError: Cannot join NoRowIndex of different lengths
```
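
The same "same length or raise" rule can be approximated in pandas today with a small helper. ``add_same_length`` below is a hypothetical function, not part of pandas: it adds two ``Series`` by position and refuses inputs of different lengths, mirroring the behaviour described above.

```python
import pandas as pd


def add_same_length(left: pd.Series, right: pd.Series) -> pd.Series:
    """Add two Series by position, refusing inputs of different lengths.

    Hypothetical helper, not part of pandas.
    """
    if len(left) != len(right):
        raise TypeError("Cannot add Series of different lengths by position")
    # to_numpy() drops the labels, so no alignment takes place.
    return pd.Series(left.to_numpy() + right.to_numpy())


ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5])
ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])
print(add_same_length(ser1, ser2).tolist())  # [20, 30, 40, 50]
```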

### Columns cannot be NoRowIndex

This proposal deals exclusively with allowing users to not need to think about
row labels. There's no suggestion to remove the column labels.

In particular, calling ``transpose`` on a ``NoRowIndex`` ``DataFrame``
would error. The error would come with a helpful error message, informing
users that they should first set an index. E.g.:
```python
In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [5]: df.transpose()
---------------------------------------------------------------------------
ValueError: Columns cannot be NoRowIndex.
If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(_df))))`
```

### DataFrameFormatter and SeriesFormatter changes

When printing an object with a ``NoRowIndex``, the row labels would not be shown:

```python
In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))

In [16]: df
Out[16]:
a b
1 4
2 5
3 6
```

Of the above changes, this may be the only one that would need implementing within
``DataFrameFormatter`` / ``SeriesFormatter``, as opposed to within ``NoRowIndex``.
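
pandas can already render a frame without its row labels on request, via ``DataFrame.to_string(index=False)``. The sketch below uses that existing call to show the kind of output meant here. Under the proposal this would have been the default repr for a ``NoRowIndex`` object rather than an opt-in argument.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# index=False omits the row labels from the rendered text.
text = df.to_string(index=False)
print(text)
```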

## Usage and Impact

Users would not be expected to work with the ``NoRowIndex`` class itself directly.
Usage would probably involve a mode which would change the ``default_index``
function to return a ``NoRowIndex`` rather than a ``RangeIndex``.
Then, if a ``mode.no_row_index`` option were introduced and a user opted in to it with

```python
pd.set_option("mode.no_row_index", True)
```

then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they
all call ``default_index``):

- ``df.reset_index()``;
- ``pd.concat([df1, df2], ignore_index=True)``;
- ``df1.merge(df2, on=col)``;
- ``df = pd.DataFrame({'col_1': [1, 2, 3]})``.

Further discussion of such a mode is out-of-scope for this proposal. A ``NoRowIndex`` would
just be a first step towards getting there.
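
The idea of routing every "fresh index" through one function that consults an option can be sketched in plain Python. Everything below is hypothetical: pandas has no ``mode.no_row_index`` option, and the ``_options`` store, ``set_option`` and ``default_index`` here are toy stand-ins that return strings rather than real index objects.

```python
# Hypothetical option store; pandas has no "mode.no_row_index" option.
_options = {"mode.no_row_index": False}


def set_option(key: str, value: bool) -> None:
    """Set a known option, rejecting unknown keys."""
    if key not in _options:
        raise KeyError(key)
    _options[key] = value


def default_index(n: int) -> str:
    """Describe the index a newly created object of length n would receive."""
    if _options["mode.no_row_index"]:
        return f"NoRowIndex(len={n})"
    return f"RangeIndex(start=0, stop={n}, step=1)"


print(default_index(3))  # RangeIndex(start=0, stop=3, step=1)
set_option("mode.no_row_index", True)
print(default_index(3))  # NoRowIndex(len=3)
```

Because every caller goes through ``default_index``, flipping the option changes all of them at once, which is the appeal of the mode described above.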

## Implementation

Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.

Note that implementation details could well change even if this PDEP were
accepted. For example, ``NoRowIndex`` would not necessarily need to subclass
``RangeIndex``, and it would not necessarily need to be accessible to the user
(``df.index`` could well return ``None``).

## Likely FAQ

**Q: Could users not just use ``RangeIndex``? Why do we need a new class?**

**A:** ``RangeIndex`` is not preserved under slicing and appending, e.g.:
```python
In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser[ser != 2].index
Out[2]: Int64Index([0, 2], dtype="int64")
```
If someone does not want to think about row labels and starts off
with a ``RangeIndex``, they'll very quickly lose it.

**Q: Are indices not really powerful?**

**A:** Yes! And they're also confusing to many users, even experienced developers.
Often users are using ``.reset_index`` to avoid issues with indices and alignment.
Such users would benefit from being able to not think about indices
and alignment. Indices would be here to stay, and ``NoRowIndex`` would not be the
default.

**Q: How could one switch a ``NoRowIndex`` ``DataFrame`` back to one with an index?**

**A:** The simplest way would probably be:
```python
df.set_axis(pd.RangeIndex(len(df)))
```
There's probably no need to introduce a new method for this.

Conversely, if the ``mode.no_row_index`` option were introduced, one could get rid of the index
simply by calling ``df.reset_index(drop=True)``.

**Q: How would ``tz_localize`` and other methods which operate on the index work on a ``NoRowIndex`` ``DataFrame``?**

**A:** Same way they work on other ``NumericIndex``s, which would typically be to raise:

```python
In [2]: ser.tz_localize('UTC')
---------------------------------------------------------------------------
TypeError: index is not a valid DatetimeIndex or PeriodIndex
```

**Q: Why not let transpose switch ``NoRowIndex`` to ``RangeIndex`` under the hood before swapping index and columns?**

**A:** This is the kind of magic that can lead to surprising behaviour that's
difficult to debug. For example, ``df.transpose().transpose()`` would not
round-trip. It's easy enough to set an index, after all. It is better to "force" users
to be intentional about what they want and end up with fewer surprises later
on.

**Q: What would df.sum(), and other methods which introduce an index, return?**

**A:** Such methods would still set an index and would work the same way they
do now. There may be some way to change that (e.g. introducing ``as_index``
arguments and introducing a mode to set its default) but that's out of scope
for this particular PDEP.

**Q: How would a user opt-in to a ``NoRowIndex`` DataFrame?**

**A:** This PDEP would only allow it via the constructor, passing
``index=NoRowIndex(len(df))``. A mode could be introduced to toggle
making that the default, but would be out-of-scope for the current PDEP.

**Q: Would ``.loc`` stop working?**

**A:** No. It would only raise if used for label-based selection. Other uses
of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[boolean_mask, col_1]``, would
continue working.

**Q: What's unintuitive about ``Series`` aligning indices when summing?**

**A:** Not sure, but I once asked a group of experienced developers what the
output of
```python
ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])
print(ser1 + ser2)
```
would be, and _nobody_ got it right.
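
For readers who want to check their own guess, the sketch below runs the same two ``Series`` through existing pandas calls and inspects the result. The sum is indexed by the union of the labels, and only label ``3`` appears in both inputs.

```python
import pandas as pd

ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])
result = ser1 + ser2

# The result is indexed by the union of the labels; only label 3 is shared.
print(result.index.tolist())     # [1, 2, 3, 4, 5]
print(result.loc[3])             # 2.0
print(int(result.isna().sum()))  # 4
```

So the answer is five rows, four of them ``NaN``, with a float dtype, even though both inputs hold three integers.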

## Reasons for withdrawal

After some discussions, it has become clear there is not enough support for the proposal in its current state.
In short, it would add too much complexity to justify the potential benefits. It would unacceptably increase
the maintenance burden and the testing requirements, while the benefits would be minimal.

Concretely:
- maintenance burden: it would not be possible to handle all the complexity within the ``NoRowIndex`` class itself; some
  extra logic would need to go into the pandas core codebase, which is already very complex and hard to maintain;
- the testing burden would be too high. Properly testing this would mean almost doubling the size of the test suite.
  Coverage for options is already not great: for example [this issue](https://github.com/pandas-dev/pandas/issues/49732)
  was caused by a PR which passed CI, but CI did not (and still does not) cover that option (plotting backends);
- it would not benefit most users, as users tend neither to use nor to discover options which are not the default;
- it would be difficult to reconcile with some existing behaviours: for example, ``df.sum()`` returns a Series with the
  column names in the index.

In order to make no-index the pandas default and have a chance of benefiting users, a more comprehensive set of changes
would need to be made at the same time. This would require a proposal much larger in scope, and would be a much more radical change.
It may be that this proposal will be revisited in the future, but in its current state (as an option) it cannot be accepted.

This has still been a useful exercise, though, as it has resulted in two related proposals (see below).

## Related proposals

- Deprecate automatic alignment, at least in some cases: https://github.com/pandas-dev/pandas/issues/49939;
- ``.value_counts`` behaviour change: https://github.com/pandas-dev/pandas/issues/49497

## PDEP History

- 14 November 2022: Initial draft
- 18 November 2022: First revision (limited the proposal to a new class, leaving a ``mode`` to a separate proposal)
- 14 December 2022: Withdrawal (difficulty reconciling with some existing methods, lack of strong support,
  maintenance burden increasing unjustifiably)

web/pandas_web.py

Lines changed: 7 additions & 1 deletion
@@ -252,7 +252,13 @@ def roadmap_pdeps(context):
     and linked from there. This preprocessor obtains the list of
     PDEP's in different status from the directory tree and GitHub.
     """
-    KNOWN_STATUS = {"Under discussion", "Accepted", "Implemented", "Rejected"}
+    KNOWN_STATUS = {
+        "Under discussion",
+        "Accepted",
+        "Implemented",
+        "Rejected",
+        "Withdrawn",
+    }
     context["pdeps"] = collections.defaultdict(list)

     # accepted, rejected and implemented
