Skip to content
This repository was archived by the owner on Apr 10, 2024. It is now read-only.

Revisit Series implicit size mutability and implicit type conversions #32

Open
wesm opened this issue Sep 27, 2016 · 6 comments
Open

Revisit Series implicit size mutability and implicit type conversions #32

wesm opened this issue Sep 27, 2016 · 6 comments

Comments

@wesm
Copy link
Owner

wesm commented Sep 27, 2016

I haven't been able to wrap my head around these behaviors:

In [10]: s = pd.Series()

In [11]: s['a'] = 7

In [12]: s
Out[12]: 
a    7
dtype: int64

or

In [2]: s = pd.Series([1, 2, 3])

In [3]: s['4'] = 'b'

In [4]: s
Out[4]: 
0    1
1    2
2    3
4    b
dtype: object

On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).

Let me know other thoughts about this.

@chris-b1
Copy link

chris-b1 commented Sep 27, 2016

Personally, I'd let the type conversion on mutation go away completely, I don't see any upside to that, especially with a first class NA value that doesn't force casts.

More ambivalent on setting with expansion - sometimes that is useful (or at least convenient), e.g., below with a MultiIndex would be kind of annoying to express with a reindex? Granted this isn't very idiomatic and allowing this does add plenty of complexity to the indexing code.

In [27]: df = pd.DataFrame({'a': [1, 2, 3, 4]},
    ...:    index=pd.MultiIndex.from_product([['a', 'b'], ['k', 'd']]))

In [28]: idx = pd.IndexSlice

In [29]: df.loc[idx['a','j'], :] = 55

In [30]: df
Out[30]: 
        a
a k   1.0
  d   2.0
b k   3.0
  d   4.0
a j  55.0

@shoyer
Copy link

shoyer commented Sep 27, 2016

I'm strongly supportive here -- this is a variation of #21.

If we disable automatic reindexing in __setitem__, for consistency I think we should do the same for __getitem__, too. We've been quite happy with only an explicit .reindex() method for doing reindexing in xarray.

@jreback
Copy link

jreback commented Sep 27, 2016

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

However there is a special case that needs addressing. In particular, even though this is a discouraged pattern, its used a lot. e.g.

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

so an empty Series really can't default to dtype, and needs to be coerced on first fill. IOW, do we have a concept of .empty? This is further compilcated by residual dtypes from removals, IOW.

so this is clear

In [1]: s = Series([1,2,3])

In [2]: s[0:0]
Out[2]: Series([], dtype: int64)

but is this the same as Series()? or is Series() a dtype of 'empty' (or maybe 'any')?
IOW, what happens if I now add an element to [2]?

@shoyer
Copy link

shoyer commented Sep 27, 2016

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

Xarray is immutable in the same way proposed here for pandas: DataFrame columns can be mutated and existing rows can be mutated, but rows cannot be inserted or removed. This requires users to do things in a reasonable efficient way, because, like pandas, data is ultimately stored in non-resizeable arrays.

Implicit size mutability is problematic because a naive user might expect that data is backed by a data structure that can be efficiently resized, like Python's list, but this isn't the case (the .append method is problematic for similar reasons). Now that we are redoing the guts of pandas in C++, we could actually store data in resizeable arrays, but on the other hand that does entail some additional complexity and memory overhead.

@wesm
Copy link
Owner Author

wesm commented Sep 29, 2016

The fact that this is used a lot tells me there is some API deficiency:

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

Series is not a dict and was never intended to be fully substitutable for one -- the lack of an empty method may be the culprit. We should look more at this

@jreback
Copy link

jreback commented May 15, 2017

related: pandas-dev/pandas#9738

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants