Skip to content

CLN: Use dedup_names in all instances where duplicate column names are renamed #50371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
datapythonista opened this issue Dec 21, 2022 · 27 comments
Assignees
Labels
Clean good first issue IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@datapythonista
Copy link
Member

In #50370 the function dedup_names has been moved to pandas.io.common so it can be reused by any reader dealing with duplicate column names. The function can be expanded in the future to allow custom renaming patterns, so it should be used by any reader, to make sure we keep consistency with the behavior (as well as avoid duplicate code). There is at least one instance identified in #50370 where a different implementation is used to rename the duplicate columns. We should call dedup_names instead, and in case other alternative implementations exist, find them and also call dedup_names.

@muddi900
Copy link

take

@pandas-dev pandas-dev deleted a comment from jayam30 Feb 16, 2023
@leftful
Copy link

leftful commented Mar 1, 2023

Hi @muddi900 how are you going with this issue?
Happy to take over if you don't have the time :)

@muddi900
Copy link

muddi900 commented Mar 1, 2023

You can take over if the maintainers allow.

@muddi900 muddi900 removed their assignment Mar 1, 2023
@leftful
Copy link

leftful commented Mar 2, 2023

take

@shteken
Copy link
Contributor

shteken commented Apr 22, 2023

Hi @RhysJohnLewis how are you going with this issue?
Happy to take over if you don't have the time :)

@leftful
Copy link

leftful commented Apr 24, 2023

@shteken please do. I have not had the time.

@leftful leftful removed their assignment Apr 24, 2023
@hamedgibago
Copy link

take

@hamedgibago
Copy link

Hello @datapythonista. Is there any other files to change duplicate method names as you mentioned in #50370? Would you name a file for me to start working on it?

@datapythonista
Copy link
Member Author

It's possible that only that other function exists that do that. You can have a look at pandas functions that read data with possibly duplicated columns, or try to find a similar function with grep. But I'd start by just unifying those two functions for now.

@hamedgibago
Copy link

Ok. I search and if I find some functions, will show here if it will be ok or not.

@hamedgibago
Copy link

I took a look at folder pandas/io folder and think that I have to look at here. In excel and xml file formats may not be some duplicate columns, but when importing data from other formats, some duplicates may be found. Am I in the right direction?

@datapythonista
Copy link
Member Author

Can you simply unify the two existing implementations identified in the issue description into one for now? Probably there is nothing else to do for this issue, but even if another one exists, we can leave that for later.

@hamedgibago
Copy link

I read #50370 codes and comments. _is_potential_multi_index and _dedup_names were modified and moved from pandas/io/parsers/base_parser.py to pandas/io/common.py. Sorry for asking such simple question. I just see one implementation of dedup_names. I think I get confused. Would you mention two implementations?

@hamedgibago
Copy link

It's possible that only that other function exists that do that. You can have a look at pandas functions that read data with possibly duplicated columns, or try to find a similar function with grep. But I'd start by just unifying those two functions for now.

I reviewed some parts of code and debugged 'test_frame_non_unique_columns' and test_round_trip_exception_. In my attached picture, these functions use dedup_names. For example test_round_trip_exception_ for any engine other that pyarrow uses wrapper file self._engine.read in method read in pandas\io\readers.py.

image

Should I look forward to other functions and check if dedup_names should be added for reading? Or adding a new generic method or wrapper to use these calls of dedup_names with same arguments?

@datapythonista
Copy link
Member Author

Sorry the description is not clear enough, feel free to ask anything you need. If you check the diff in https://github.com/pandas-dev/pandas/pull/50370/files you will see a TODO comment was introduced, in the location where an equivalent implementation of the dedup_names function. So, the idea here is to instead of having that code repeating the same functionality to rename duplicated columns, we can remove it and just call the function there.

There may be small differences in both implementations (or maybe not), we can discuss after you give it a try and see the exact problems/differences if any.

@hamedgibago
Copy link

hamedgibago commented May 13, 2023

I added a sample test and check the code. I did not pushed to make my code cleaner and ask you if it is ok.

def test_duplicate_column(python_parser_only):
    #gh - 50371
    parser = python_parser_only
    data="""x,x
    a,b
    d,e"""

    import pandas as pd
    df = pd.read_csv(StringIO(data), header=None)
    df=df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)

    result=parser.read_csv(StringIO(data))

    expected = df
    expected.columns = ["x", "x.1"]

    tm.assert_frame_equal(result, expected)

and also commented those lines you mentioned before and check the results and added calling dedup_names function like this. Would you please say your opinion about it?

this_columns=list(dedup_names(this_columns,is_potential_multi_index(this_columns, self.index_col)))
                    # for i in col_loop_order:
                    #     col = this_columns[i]
                    #     old_col = col
                    #     cur_count = counts[col]
                    #
                    #     if cur_count > 0:
                    #         while cur_count > 0:
                    #             counts[old_col] = cur_count + 1
                    #             col = f"{old_col}.{cur_count}"
                    #             if col in this_columns:
                    #                 cur_count += 1
                    #             else:
                    #                 cur_count = counts[col]
                    #
                    #         if (
                    #             self.dtype is not None
                    #             and is_dict_like(self.dtype)
                    #             and self.dtype.get(old_col) is not None
                    #             and self.dtype.get(col) is None
                    #         ):
                    #             self.dtype.update({col: self.dtype.get(old_col)})
                    #     this_columns[i] = col
                    #     counts[col] = cur_count + 1

@rsm-23
Copy link
Contributor

rsm-23 commented Jul 1, 2023

Can I take a jab at this? @datapythonista @hamedgibago

@hamedgibago
Copy link

hamedgibago commented Jul 1, 2023

Certainly, no problem. I commented the code above and added new line as you can see in the first line, despite results of current tests were ok, but others failed. I should spend some time to debug.
@datapythonista please do not unassign me. Let us both work on issue. Thank you.

@rsm-23
Copy link
Contributor

rsm-23 commented Jul 1, 2023

Thanks @hamedgibago , I'll try independently when I get some time :)

@rsm-23 rsm-23 mentioned this issue Jul 1, 2023
5 tasks
@rsm-23
Copy link
Contributor

rsm-23 commented Jul 1, 2023

@datapythonista the two implementations are definitely different. One approach names columns as [col, col.1, col.1.1] while the other one names it as [col, col.1, col.2] . Need your input. Should we make changes in all the tests or do we change the implementation of dedup_names ?

@hamedgibago
Copy link

As far as I know, we are not make any changes to existing tests unless we find a bug and inform it to maintainer. After changing the code, we can add new tests and also make sure all other tests will pass.
Good luck.

@rsm-23
Copy link
Contributor

rsm-23 commented Jul 2, 2023

@hamedgibago I think it would really depend. Some tests are already present that consider the output from the custom method and not dedup_names and like I mentioned above the way this de-duplication is handled is different in the two approaches so we need to either adjust the implementation of dedup_names or adjust the unit tests. Even if we adjust the result of dedup_names there should be existing unit tests that validate output from this method, so changing it's behavior would mean modifying those tests as well. There could be one more approach where we probably introduce a param to decide what kind of algorithm to follow inside the dedup_names method but personally, I am not a fan of this.

@hamedgibago
Copy link

@hamedgibago I think it would really depend. Some tests are already present that consider the output from the custom method and not dedup_names and like I mentioned above the way this de-duplication is handled is different in the two approaches so we need to either adjust the implementation of dedup_names or adjust the unit tests. Even if we adjust the result of dedup_names there should be existing unit tests that validate output from this method, so changing it's behavior would mean modifying those tests as well. There could be one more approach where we probably introduce a param to decide what kind of algorithm to follow inside the dedup_names method but personally, I am not a fan of this.

@datapythonista What is your idea?

@yoav-edelist
Copy link

@hamedgibago @datapythonista Is this still in the works? Is this free?

@hamedgibago
Copy link

@hamedgibago @datapythonista Is this still in the works? Is this free?

Its long time I do not working on it. I have to check it.

@leftful
Copy link

leftful commented Dec 21, 2024 via email

@LifeAsPixels
Copy link

@hamedgibago @datapythonista Is this issue available to be taken? This would be my first issue to work on. It looks like the function was made however it needs to be called as a replacement for some other code attempting to do the same thing elsewhere as noted with 'TODO' in #50370

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean good first issue IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants