Revisit Series implicit size mutability and implicit type conversions #32

wesm · 2016-09-27T18:37:50Z

I haven't been able to wrap my head around these behaviors:

In [10]: s = pd.Series()

In [11]: s['a'] = 7

In [12]: s
Out[12]: 
a    7
dtype: int64

or

In [2]: s = pd.Series([1, 2, 3])

In [3]: s['4'] = 'b'

In [4]: s
Out[4]: 
0    1
1    2
2    3
4    b
dtype: object

On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).

Let me know other thoughts about this.

The text was updated successfully, but these errors were encountered:

chris-b1 · 2016-09-27T18:52:57Z

Personally, I'd let the type conversion on mutation go away completely, I don't see any upside to that, especially with a first class NA value that doesn't force casts.

More ambivalent on setting with expansion - sometimes that is useful (or at least convenient), e.g., below with a MultiIndex would be kind of annoying to express with a reindex? Granted this isn't very idiomatic and allowing this does add plenty of complexity to the indexing code.

In [27]: df = pd.DataFrame({'a': [1, 2, 3, 4]},
    ...:    index=pd.MultiIndex.from_product([['a', 'b'], ['k', 'd']]))

In [28]: idx = pd.IndexSlice

In [29]: df.loc[idx['a','j'], :] = 55

In [30]: df
Out[30]: 
        a
a k   1.0
  d   2.0
b k   3.0
  d   4.0
a j  55.0

shoyer · 2016-09-27T20:43:26Z

I'm strongly supportive here -- this is a variation of #21.

If we disable automatic reindexing in __setitem__, for consistency I think we should do the same for __getitem__, too. We've been quite happy with only an explicit .reindex() method for doing reindexing in xarray.

jreback · 2016-09-27T22:05:46Z

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

However there is a special case that needs addressing. In particular, even though this is a discouraged pattern, its used a lot. e.g.

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

so an empty Series really can't default to dtype, and needs to be coerced on first fill. IOW, do we have a concept of .empty? This is further compilcated by residual dtypes from removals, IOW.

so this is clear

In [1]: s = Series([1,2,3])

In [2]: s[0:0]
Out[2]: Series([], dtype: int64)

but is this the same as Series()? or is Series() a dtype of 'empty' (or maybe 'any')?
IOW, what happens if I now add an element to [2]?

shoyer · 2016-09-27T22:23:28Z

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

Xarray is immutable in the same way proposed here for pandas: DataFrame columns can be mutated and existing rows can be mutated, but rows cannot be inserted or removed. This requires users to do things in a reasonable efficient way, because, like pandas, data is ultimately stored in non-resizeable arrays.

Implicit size mutability is problematic because a naive user might expect that data is backed by a data structure that can be efficiently resized, like Python's list, but this isn't the case (the .append method is problematic for similar reasons). Now that we are redoing the guts of pandas in C++, we could actually store data in resizeable arrays, but on the other hand that does entail some additional complexity and memory overhead.

wesm · 2016-09-29T13:08:48Z

The fact that this is used a lot tells me there is some API deficiency:

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

Series is not a dict and was never intended to be fully substitutable for one -- the lack of an empty method may be the culprit. We should look more at this

jreback · 2017-05-15T10:16:35Z

related: pandas-dev/pandas#9738

jreback added dtypes API labels Sep 30, 2016

chris-b1 mentioned this issue Mar 20, 2017

Proposal to change behaviour with .loc and missing keys pandas-dev/pandas#15747

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit Series implicit size mutability and implicit type conversions #32

Revisit Series implicit size mutability and implicit type conversions #32

wesm commented Sep 27, 2016

chris-b1 commented Sep 27, 2016 •

edited

Loading

shoyer commented Sep 27, 2016

jreback commented Sep 27, 2016 •

edited

Loading

shoyer commented Sep 27, 2016

wesm commented Sep 29, 2016

jreback commented May 15, 2017

Revisit Series implicit size mutability and implicit type conversions #32

Revisit Series implicit size mutability and implicit type conversions #32

Comments

wesm commented Sep 27, 2016

chris-b1 commented Sep 27, 2016 • edited Loading

shoyer commented Sep 27, 2016

jreback commented Sep 27, 2016 • edited Loading

shoyer commented Sep 27, 2016

wesm commented Sep 29, 2016

jreback commented May 15, 2017

chris-b1 commented Sep 27, 2016 •

edited

Loading

jreback commented Sep 27, 2016 •

edited

Loading