CLN: Use dedup_names in all instances where duplicate column names are renamed #50371

datapythonista · 2022-12-21T06:59:15Z

In #50370 the function dedup_names has been moved to pandas.io.common so it can be reused by any reader dealing with duplicate column names. The function can be expanded in the future to allow custom renaming patterns, so it should be used by any reader, to make sure we keep consistency with the behavior (as well as avoid duplicate code). There is at least one instance identified in #50370 where a different implementation is used to rename the duplicate columns. We should call dedup_names instead, and in case other alternative implementations exist, find them and also call dedup_names.


muddi900 · 2022-12-24T07:27:46Z

take

leftful · 2023-03-01T07:04:06Z

Hi @muddi900 how are you going with this issue?
Happy to take over if you don't have the time :)

muddi900 · 2023-03-01T07:35:46Z

You can take over if the maintainers allow.

leftful · 2023-03-02T06:51:26Z

take

shteken · 2023-04-22T09:15:17Z

Hi @RhysJohnLewis how are you going with this issue?
Happy to take over if you don't have the time :)

leftful · 2023-04-24T03:12:23Z

@shteken please do. I have not had the time.

hamedgibago · 2023-04-28T20:36:34Z

take

hamedgibago · 2023-04-28T20:44:58Z

Hello @datapythonista. Are there any other files that need changing for the duplicated renaming code you mentioned in #50370? Could you name a file for me to start with?

datapythonista · 2023-04-28T21:51:05Z

It's possible that the other function is the only one that does that. You can have a look at pandas functions that read data with possibly duplicated columns, or try to find a similar function with grep. But I'd start by just unifying those two functions for now.

hamedgibago · 2023-04-29T07:43:20Z

OK. I'll search, and if I find any such functions I'll post them here so we can check whether they are suitable.

hamedgibago · 2023-04-30T17:25:31Z

I took a look at the pandas/io folder and think this is where I have to look. Excel and XML files may not contain duplicate columns, but duplicates may be found when importing data from other formats. Am I heading in the right direction?

datapythonista · 2023-04-30T18:02:09Z

Can you simply unify the two existing implementations identified in the issue description into one for now? Probably there is nothing else to do for this issue, but even if another one exists, we can leave that for later.

hamedgibago · 2023-04-30T21:31:27Z

I read the code and comments in #50370. _is_potential_multi_index and _dedup_names were modified and moved from pandas/io/parsers/base_parser.py to pandas/io/common.py. Sorry for asking such a simple question, but I only see one implementation of dedup_names, and I think I got confused. Could you point out the two implementations?

hamedgibago · 2023-05-06T19:56:00Z

> It's possible that the other function is the only one that does that. You can have a look at pandas functions that read data with possibly duplicated columns, or try to find a similar function with grep. But I'd start by just unifying those two functions for now.

I reviewed some parts of the code and debugged test_frame_non_unique_columns and test_round_trip_exception_. In my attached picture, these functions use dedup_names. For example, for any engine other than pyarrow, test_round_trip_exception_ goes through the wrapper call self._engine.read in the read method in pandas\io\readers.py.

Should I look at other functions and check whether dedup_names should be added for reading? Or should I add a new generic method or wrapper that makes these dedup_names calls with the same arguments?

datapythonista · 2023-05-08T13:49:54Z

Sorry the description is not clear enough; feel free to ask anything you need. If you check the diff in https://github.com/pandas-dev/pandas/pull/50370/files you will see that a TODO comment was introduced at the location where an equivalent implementation of the dedup_names function exists. So the idea here is that, instead of having that code repeat the same functionality to rename duplicated columns, we can remove it and just call the function there.

There may be small differences between the two implementations (or maybe not). We can discuss after you give it a try and see the exact problems or differences, if any.

hamedgibago · 2023-05-13T21:19:12Z

I added a sample test and checked the code. I have not pushed yet because I want to clean up my code first and ask whether this is OK.

from io import StringIO

import pandas as pd
import pandas._testing as tm


def test_duplicate_column(python_parser_only):
    # GH#50371
    parser = python_parser_only
    # Keep the data lines unindented so no stray spaces end up in the values.
    data = "x,x\na,b\nd,e"

    result = parser.read_csv(StringIO(data))

    # The duplicated header "x" is expected to be renamed to "x.1".
    expected = pd.DataFrame([["a", "b"], ["d", "e"]], columns=["x", "x.1"])

    tm.assert_frame_equal(result, expected)

I also commented out the lines you mentioned earlier, added a call to dedup_names as shown below, and checked the results. What do you think?

this_columns = list(
    dedup_names(
        this_columns,
        is_potential_multi_index(this_columns, self.index_col),
    )
)
# for i in col_loop_order:
#     col = this_columns[i]
#     old_col = col
#     cur_count = counts[col]
#
#     if cur_count > 0:
#         while cur_count > 0:
#             counts[old_col] = cur_count + 1
#             col = f"{old_col}.{cur_count}"
#             if col in this_columns:
#                 cur_count += 1
#             else:
#                 cur_count = counts[col]
#
#         if (
#             self.dtype is not None
#             and is_dict_like(self.dtype)
#             and self.dtype.get(old_col) is not None
#             and self.dtype.get(col) is None
#         ):
#             self.dtype.update({col: self.dtype.get(old_col)})
#     this_columns[i] = col
#     counts[col] = cur_count + 1

rsm-23 · 2023-07-01T13:24:20Z

Can I take a jab at this? @datapythonista @hamedgibago

hamedgibago · 2023-07-01T16:08:11Z

Certainly, no problem. I commented out the code above and added the new line shown at the top of the snippet. The tests I was checking passed, but others failed, so I need to spend some time debugging.
@datapythonista please do not unassign me. Let us both work on issue. Thank you.

rsm-23 · 2023-07-01T16:13:30Z

Thanks @hamedgibago , I'll try independently when I get some time :)

rsm-23 · 2023-07-01T17:55:19Z

@datapythonista the two implementations are definitely different. One approach names columns as [col, col.1, col.1.1] while the other names them [col, col.1, col.2]. I need your input: should we change all the tests, or change the implementation of dedup_names?
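
For readers following along, here is a minimal sketch of how two renaming strategies can diverge. It is not the pandas source: rename_simple is a hypothetical per-name counter that never checks for clashes, and rename_checking_existing is a standalone port of the commented-out loop quoted earlier in this thread, with the dtype handling dropped and columns visited in plain order (both simplifications are assumptions made for illustration).

```python
from collections import defaultdict


def rename_simple(names):
    """Hypothetical strategy: append ".N" from a per-name counter,
    without checking whether the new name already appears in the input."""
    names = list(names)
    counts = defaultdict(int)
    for i, col in enumerate(names):
        cur_count = counts[col]
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]
        names[i] = col
        counts[col] = cur_count + 1
    return names


def rename_checking_existing(names):
    """Port of the loop quoted in this thread (dtype handling omitted):
    skip a candidate name if it is already present among the columns."""
    names = list(names)
    counts = defaultdict(int)
    for i, col in enumerate(names):
        old_col = col
        cur_count = counts[col]
        while cur_count > 0:
            counts[old_col] = cur_count + 1
            col = f"{old_col}.{cur_count}"
            if col in names:
                # Candidate clashes with an existing column: try the next suffix.
                cur_count += 1
            else:
                cur_count = counts[col]
        names[i] = col
        counts[col] = cur_count + 1
    return names


# Plain duplicates: both strategies agree.
print(rename_simple(["col", "col", "col"]))             # ['col', 'col.1', 'col.2']
print(rename_checking_existing(["col", "col", "col"]))  # ['col', 'col.1', 'col.2']

# A column literally named "col.1" is where they part ways.
print(rename_simple(["col", "col", "col.1"]))             # ['col', 'col.1', 'col.1.1']
print(rename_checking_existing(["col", "col", "col.1"]))  # ['col', 'col.2', 'col.1']
```

In this sketch the two strategies only disagree when a generated name collides with a column that is already present, which would explain why most existing tests pass with either one while a few fail.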

hamedgibago · 2023-07-02T07:24:23Z

As far as I know, we should not change existing tests unless we find a bug and report it to the maintainers. After changing the code, we can add new tests and make sure all the other tests still pass.
Good luck.

rsm-23 · 2023-07-02T14:58:24Z

@hamedgibago I think it really depends. Some existing tests expect the output of the custom method rather than that of dedup_names and, as I mentioned above, the two approaches handle de-duplication differently, so we need to adjust either the implementation of dedup_names or the unit tests. Even if we adjust dedup_names, there should be existing unit tests that validate its output, so changing its behavior would mean modifying those tests as well. Another approach would be to introduce a parameter that selects which algorithm dedup_names follows, but personally I am not a fan of this.

hamedgibago · 2023-07-02T21:28:31Z

> @hamedgibago I think it really depends. Some existing tests expect the output of the custom method rather than that of dedup_names and, as I mentioned above, the two approaches handle de-duplication differently, so we need to adjust either the implementation of dedup_names or the unit tests. Even if we adjust dedup_names, there should be existing unit tests that validate its output, so changing its behavior would mean modifying those tests as well. Another approach would be to introduce a parameter that selects which algorithm dedup_names follows, but personally I am not a fan of this.

@datapythonista What is your idea?

yoav-edelist · 2024-12-14T10:30:29Z

@hamedgibago @datapythonista Is this still in the works? Is this free?

hamedgibago · 2024-12-21T06:40:42Z

> @hamedgibago @datapythonista Is this still in the works? Is this free?

It's been a long time since I worked on it. I'll have to check.

leftful · 2024-12-21T08:36:12Z

merry christmas everyone


LifeAsPixels · 2025-05-14T18:03:17Z

@hamedgibago @datapythonista Is this issue available to be taken? This would be my first issue to work on. It looks like the function was created; however, it still needs to be called in place of other code that does the same thing elsewhere, as noted by the 'TODO' in #50370.

datapythonista added the labels "IO HTML" (read_html, to_html, Styler.apply, Styler.applymap), "Clean", and "good first issue" on Dec 21, 2022

datapythonista mentioned this issue Dec 21, 2022 in "Rename duplicate column names in read_json(orient='split')" #50370 (merged)

github-actions bot assigned muddi900 Dec 24, 2022

pandas-dev deleted a comment from jayam30 Feb 16, 2023

muddi900 removed their assignment Mar 1, 2023

github-actions bot assigned leftful Mar 2, 2023

leftful removed their assignment Apr 24, 2023

github-actions bot assigned hamedgibago Apr 28, 2023

rsm-23 mentioned this issue Jul 1, 2023 in "applying dedup_names func" #53964 (closed)
