Commit f351f74

DEPR: concat ignoring all-NA columns (#52613)
* DEPR: concat ignoring all-NA columns
* silence warning
* use code-block
* fix duplicate whatsnew entry
* remove duplicate whatsnew entry
* Fix duplicates in whatsnew
1 parent f780104 commit f351f74

4 files changed: +76 -15 lines changed
doc/source/whatsnew/v1.4.0.rst

Lines changed: 8 additions & 8 deletions
@@ -279,11 +279,11 @@ if one of the DataFrames was empty or had all-NA values, its dtype was
 *sometimes* ignored when finding the concatenated dtype. These are now
 consistently *not* ignored (:issue:`43507`).

-.. ipython:: python
+.. code-block:: ipython

-    df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
-    df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
-    res = pd.concat([df1, df2])
+    In [3]: df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
+    In [4]: df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
+    In [5]: res = pd.concat([df1, df2])

 Previously, the float-dtype in ``df2`` would be ignored so the result dtype
 would be ``datetime64[ns]``. As a result, the ``np.nan`` would be cast to
@@ -293,8 +293,8 @@ would be ``datetime64[ns]``. As a result, the ``np.nan`` would be cast to

 .. code-block:: ipython

-    In [4]: res
-    Out[4]:
+    In [6]: res
+    Out[6]:
              bar
     0 2013-01-01
     1        NaT
@@ -306,8 +306,8 @@ object, the ``np.nan`` is retained.

 .. code-block:: ipython

-    In [4]: res
-    Out[4]:
+    In [6]: res
+    Out[6]:
                        bar
     0  2013-01-01 00:00:00
     1                  NaN

doc/source/whatsnew/v2.1.0.rst

Lines changed: 1 addition & 0 deletions
@@ -233,6 +233,7 @@ Deprecations
 - Deprecated :meth:`DataFrame.applymap`. Use the new :meth:`DataFrame.map` method instead (:issue:`52353`)
 - Deprecated :meth:`DataFrame.swapaxes` and :meth:`Series.swapaxes`, use :meth:`DataFrame.transpose` or :meth:`Series.transpose` instead (:issue:`51946`)
 - Deprecated ``freq`` parameter in :class:`PeriodArray` constructor, pass ``dtype`` instead (:issue:`52462`)
+- Deprecated behavior of :func:`concat` when :class:`DataFrame` has columns that are all-NA, in a future version these will not be discarded when determining the resulting dtype (:issue:`40893`)
 - Deprecated behavior of :meth:`Series.dt.to_pydatetime`, in a future version this will return a :class:`Series` containing python ``datetime`` objects instead of an ``ndarray`` of datetimes; this matches the behavior of other :meth:`Series.dt` properties (:issue:`20306`)
 - Deprecated logical operations (``|``, ``&``, ``^``) between pandas objects and dtype-less sequences (e.g. ``list``, ``tuple``), wrap a sequence in a :class:`Series` or numpy array before operating instead (:issue:`51521`)
 - Deprecated making :meth:`Series.apply` return a :class:`DataFrame` when the passed-in callable returns a :class:`Series` object. In the future this will return a :class:`Series` whose values are themselves :class:`Series`. This pattern was very slow and it's recommended to use alternative methods to archive the same goal (:issue:`52116`)
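
The warning text added by this commit suggests casting all-NA columns to the desired dtype before the concat operation. A minimal sketch of that workaround, reusing the two frames from the v1.4.0 whatsnew example, might look as follows. The name `df2_cast` is introduced here for illustration and does not appear in the commit; whether a warning is emitted without the cast depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Frames from the v1.4.0 whatsnew example: one datetime column, one all-NaN float column.
df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))

# Assumed workaround: give the all-NA column the dtype we want to keep,
# so both inputs already agree and no dtype has to be ignored.
df2_cast = df2.astype({"bar": df1["bar"].dtype})

res = pd.concat([df1, df2_cast])

print(res["bar"].dtype)   # same dtype as df1["bar"]
print(res["bar"].isna())  # the second row is missing (NaT)
```

Because both inputs share a dtype after the cast, the result dtype no longer depends on whether all-NA columns are ignored.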

pandas/core/internals/concat.py

Lines changed: 49 additions & 5 deletions
@@ -5,6 +5,7 @@
     TYPE_CHECKING,
     Sequence,
 )
+import warnings

 import numpy as np

@@ -15,6 +16,7 @@
 )
 from pandas._libs.missing import NA
 from pandas.util._decorators import cache_readonly
+from pandas.util._exceptions import find_stack_level

 from pandas.core.dtypes.astype import astype_array
 from pandas.core.dtypes.cast import (
@@ -439,6 +441,19 @@ def is_na(self) -> bool:
                 return False
             return all(isna_all(row) for row in values)

+    @cache_readonly
+    def is_na_without_isna_all(self) -> bool:
+        blk = self.block
+        if blk.dtype.kind == "V":
+            return True
+        if not blk._can_hold_na:
+            return False
+
+        values = blk.values
+        if values.size == 0:
+            return True
+        return False
+
     def get_reindexed_values(self, empty_dtype: DtypeObj, upcasted_na) -> ArrayLike:
         values: ArrayLike

@@ -487,7 +502,7 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike
     """
     Concatenate values from several join units along axis=1.
     """
-    empty_dtype = _get_empty_dtype(join_units)
+    empty_dtype, empty_dtype_future = _get_empty_dtype(join_units)

     has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
     upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)
@@ -526,6 +541,19 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike
     else:
         concat_values = concat_compat(to_concat, axis=1)

+        if empty_dtype != empty_dtype_future:
+            if empty_dtype == concat_values.dtype:
+                # GH#40893
+                warnings.warn(
+                    "The behavior of DataFrame concatenation with all-NA entries is "
+                    "deprecated. In a future version, this will no longer exclude "
+                    "all-NA columns when determining the result dtypes. "
+                    "To retain the old behavior, cast the all-NA columns to the "
+                    "desired dtype before the concat operation.",
+                    FutureWarning,
+                    stacklevel=find_stack_level(),
+                )
+
     return concat_values


@@ -552,7 +580,7 @@ def _dtype_to_na_value(dtype: DtypeObj, has_none_blocks: bool):
     raise NotImplementedError


-def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
+def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> tuple[DtypeObj, DtypeObj]:
     """
     Return dtype and N/A values to use when concatenating specified units.

@@ -564,11 +592,11 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
     """
     if len(join_units) == 1:
         blk = join_units[0].block
-        return blk.dtype
+        return blk.dtype, blk.dtype

     if lib.dtypes_all_equal([ju.block.dtype for ju in join_units]):
         empty_dtype = join_units[0].block.dtype
-        return empty_dtype
+        return empty_dtype, empty_dtype

     has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)

@@ -581,7 +609,23 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
     dtype = find_common_type(dtypes)
     if has_none_blocks:
         dtype = ensure_dtype_can_hold_na(dtype)
-    return dtype
+
+    dtype_future = dtype
+    if len(dtypes) != len(join_units):
+        dtypes_future = [
+            unit.block.dtype for unit in join_units if not unit.is_na_without_isna_all
+        ]
+        if not len(dtypes_future):
+            dtypes_future = [
+                unit.block.dtype for unit in join_units if unit.block.dtype.kind != "V"
+            ]
+
+        if len(dtypes) != len(dtypes_future):
+            dtype_future = find_common_type(dtypes_future)
+            if has_none_blocks:
+                dtype_future = ensure_dtype_can_hold_na(dtype_future)
+
+    return dtype, dtype_future


 def _is_uniform_join_units(join_units: list[JoinUnit]) -> bool:
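
The change above computes two candidate dtypes, one under the current rule (empty and all-NA units are ignored) and one under the future rule (only empty units are ignored), and warns when they differ. The pure-Python sketch below imitates that comparison with toy objects. `ToyUnit`, `toy_common_type`, `toy_empty_dtypes` and `toy_concat_dtype` are hypothetical names invented for this illustration; the real code works on `JoinUnit` blocks, uses `find_common_type`, handles void ("V") blocks, and additionally checks the dtype of the concatenated values before warning.

```python
from __future__ import annotations

import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyUnit:
    """Hypothetical stand-in for a JoinUnit; not a pandas class."""

    dtype: str
    size: int
    all_na: bool

    @property
    def is_na(self) -> bool:
        # Current rule: empty or entirely-NA units are skipped.
        return self.size == 0 or self.all_na

    @property
    def is_na_without_isna_all(self) -> bool:
        # Future rule: only empty units are skipped.
        return self.size == 0


def toy_common_type(dtypes: list[str]) -> str:
    # Crude placeholder for find_common_type: equal dtypes are kept,
    # anything mixed falls back to object.
    unique = set(dtypes)
    return unique.pop() if len(unique) == 1 else "object"


def toy_empty_dtypes(units: list[ToyUnit]) -> tuple[str, str]:
    everything = [u.dtype for u in units]
    current = [u.dtype for u in units if not u.is_na] or everything
    future = [u.dtype for u in units if not u.is_na_without_isna_all] or everything
    return toy_common_type(current), toy_common_type(future)


def toy_concat_dtype(units: list[ToyUnit]) -> str:
    current, future = toy_empty_dtypes(units)
    if current != future:
        # Simplified: the commit also compares against the concatenated dtype.
        warnings.warn(
            "all-NA units will no longer be ignored", FutureWarning, stacklevel=2
        )
    return current


units = [
    ToyUnit("datetime64[ns]", size=1, all_na=False),
    ToyUnit("float64", size=1, all_na=True),
]
dtypes_now_and_later = toy_empty_dtypes(units)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = toy_concat_dtype(units)

print(dtypes_now_and_later, result, len(caught))
```

In this toy setup the all-NaN float unit is dropped under the current rule but kept under the future one, which is the situation the new `FutureWarning` is meant to flag.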

pandas/tests/reshape/concat/test_concat.py

Lines changed: 18 additions & 2 deletions
@@ -747,7 +747,9 @@ def test_concat_ignore_empty_object_float(empty_dtype, df_dtype):
     # https://github.com/pandas-dev/pandas/issues/45637
     df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype)
     empty = DataFrame(columns=["foo", "bar"], dtype=empty_dtype)
+
     result = concat([empty, df])
+
     expected = df
     if df_dtype == "int64":
         # TODO what exact behaviour do we want for integer eventually?
@@ -764,14 +766,24 @@ def test_concat_ignore_empty_object_float(empty_dtype, df_dtype):
 def test_concat_ignore_all_na_object_float(empty_dtype, df_dtype):
     df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype)
     empty = DataFrame({"foo": [np.nan], "bar": [np.nan]}, dtype=empty_dtype)
-    result = concat([empty, df], ignore_index=True)

     if df_dtype == "int64":
         # TODO what exact behaviour do we want for integer eventually?
         if empty_dtype == "object":
             df_dtype = "object"
         else:
             df_dtype = "float64"
+
+    msg = "The behavior of DataFrame concatenation with all-NA entries"
+    warn = None
+    if empty_dtype != df_dtype and empty_dtype is not None:
+        warn = FutureWarning
+    elif df_dtype == "datetime64[ns]":
+        warn = FutureWarning
+
+    with tm.assert_produces_warning(warn, match=msg):
+        result = concat([empty, df], ignore_index=True)
+
     expected = DataFrame({"foo": [None, 1, 2], "bar": [None, 1, 2]}, dtype=df_dtype)
     tm.assert_frame_equal(result, expected)

@@ -782,7 +794,11 @@ def test_concat_ignore_empty_from_reindex():
     df1 = DataFrame({"a": [1], "b": [pd.Timestamp("2012-01-01")]})
     df2 = DataFrame({"a": [2]})

-    result = concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
+    aligned = df2.reindex(columns=df1.columns)
+
+    msg = "The behavior of DataFrame concatenation with all-NA entries"
+    with tm.assert_produces_warning(FutureWarning, match=msg):
+        result = concat([df1, aligned], ignore_index=True)
     expected = df1 = DataFrame({"a": [1, 2], "b": [pd.Timestamp("2012-01-01"), pd.NaT]})
     tm.assert_frame_equal(result, expected)

