BUG: Sparse incorrectly handle fill_value #12797

sinhrks · 2016-04-04T22:10:28Z

Sparse handles missing values (NaN) and fill_value confusingly. Based on the docs, I understand fill_value to be a user-specified value that is omitted from the sparse internal representation, so fill_value may differ from the missing value (NaN).

Code Sample, a copy-pastable example if possible

# NG (wrong): 2nd and last elements must be NaN
pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
# array([ 1.,  0.,  0.,  3.,  0.])

# NG (wrong): 2nd element must be NaN
orig = pd.Series([1, np.nan, 0, 3, np.nan], index=list('ABCDE'))
sparse = orig.to_sparse(fill_value=0)
sparse.reindex(['A', 'B', 'C'])
# A    1.0
# B    0.0
# C    0.0
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)

Expected Output

pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
# array([ 1., nan,  0.,  3., nan])

sparse = orig.to_sparse(fill_value=0)
sparse.reindex(['A', 'B', 'C'])
# A    1.0
# B    NaN
# C    0.0
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)

output of `pd.show_versions()`

Current master.

The fix itself looks straightforward, but it breaks some tests that use a dubious comparison.

https://github.com/pydata/pandas/blob/master/pandas/sparse/tests/test_sparse.py#L1730


jreback · 2016-04-04T22:23:33Z

hmm, I think it's using np.nan as the missing value indicator, which is right. THEN you fill those locations using the fill_value, not the other way around.

sinhrks · 2016-04-04T22:51:14Z

@jreback I may be misunderstanding, but fill_value becomes the missing value indicator if provided (np.nan is included in the SparseIndex indices).

pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0)
[1.0, nan, 0, 3.0, nan]
Fill: 0
IntIndex
Indices: array([0, 1, 3, 4], dtype=int32)

Thus I feel it is natural for .to_dense to return np.nan as is, not fill_value.
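To make the argument concrete, here is a minimal pure-Python sketch of the storage model described above. It is not pandas code: `sparsify` and `densify` are hypothetical helpers, assumed only for illustration. Only entries equal to fill_value are dropped, so NaN stays among the stored values and should come back unchanged when densifying.

```python
import math


def sparsify(values, fill_value):
    """Toy model (hypothetical helper, not the pandas implementation):
    keep only the entries that differ from fill_value."""
    def is_fill(v):
        # NaN never compares equal to itself, so a NaN fill_value
        # has to be matched explicitly.
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    indices = [i for i, v in enumerate(values) if not is_fill(v)]
    sp_values = [values[i] for i in indices]
    return indices, sp_values


def densify(length, indices, sp_values, fill_value):
    """Rebuild the dense list: fill_value everywhere, stored values on top."""
    dense = [fill_value] * length
    for i, v in zip(indices, sp_values):
        dense[i] = v
    return dense


data = [1.0, math.nan, 0.0, 3.0, math.nan]
indices, sp_values = sparsify(data, fill_value=0.0)
dense = densify(len(data), indices, sp_values, fill_value=0.0)

print(indices)  # [0, 1, 3, 4]
print(dense)    # [1.0, nan, 0.0, 3.0, nan]
```

In this model the only position missing from the index is 2 (the literal 0), which matches the IntIndex shown above, and the NaN entries survive the round trip.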

jreback · 2016-04-04T23:14:04Z

in your example the 0 (3rd element) is the missing one.

In [5]: pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
Out[5]: array([ 1.,  0.,  0.,  3.,  0.])

ahh so you think this should be
Out[5]: array([ 1., np.nan, 0., 3., np.nan])

yes that is prob right.

sinhrks · 2016-04-04T23:24:15Z

Ah sorry, added Expected Output section.

jreback · 2016-04-04T23:31:12Z

yep that looks right.

yeah, I think that comparison test equates NaN to the missing value, when in fact the fill_value entries are the missing ones.

sinhrks added the Bug, Missing-data, and Sparse labels Apr 4, 2016

sinhrks added this to the 0.18.1 milestone Apr 4, 2016

This was referenced Apr 5, 2016

BUG: filling doesn't work well for sparse blocks #6949

Closed

BUG: SparseSeries.reindex with fill_value #12831

Closed

jreback closed this as completed in a23a136 Apr 9, 2016
