Skip to content

ENH: Be able to add lines for all index levels, not just visible ones [fix #59877] #59916

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

edbennett
Copy link

Per discussion in #59877, add an -invisible suffix that can be added to the index selection in the clines option to Styler.to_latex(), allowing hidden indices to be used when deciding where to place lines between rows.

This approach does not break compatibility with existing code, at the expense of starting a potential combinatoric explosion of options being combined into this one string. There may be a more elegant and/or expressive way of achieving the same result, at the cost of breaking compatibility (or needing to maintain two interfaces until the previous one is deprecated).

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions. [No new arguments, methods, or functions added, outside of tests.]
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

…as-dev#59877]

- implement a new suffix for the `clines` option, `-invisible`, doubling the number of available options, specifying that hidden indices should be included when deciding whether to add \clines
- add tests for this behaviour
@edbennett edbennett requested a review from attack68 as a code owner September 28, 2024 19:56
@mroeschke mroeschke added the Styler conditional formatting using DataFrame.style label Oct 31, 2024
Copy link
Contributor

github-actions bot commented Dec 1, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 1, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Dec 16, 2024
@edbennett
Copy link
Author

@mroeschke I merged in the main branch two weeks ago at the request of the bot, but forgot to add a comment saying such.

if I’m going to keep doing regular merging from main and fixing of tests, is there any way to know if or when this PR might be merged?

@mroeschke mroeschke reopened this Dec 16, 2024
@mroeschke
Copy link
Member

Sorry for the lack of response. We don't have many core team members familiar with this part of the codebase, only really @attack68. To ensure that this PR is in a mergeable state, please ensure all the CI checks are passing (unless there's a check failing in other PRs too)

@attack68
Copy link
Contributor

I quite like this PR, Matt's comments look valid amendments though. I have three points worth requiring a response.

  1. The string entry for the clines argument wrapped multiple items into a single argument because there were limited options. This is making it more complicated (see 3)) but perhaps its still valid since the use of clines is quite esoteric anyway.

  2. Is "invisible" the right word to use here? Why not "hidden" or "including-hidden" or "inc-hidden". Not arguing to or for any option just making sure its been considered.

  3. Hiding elements on a MultiIndex comes in three varieties:

  • You can hide individual rows identified by their index values,
  • You can hide the whole vertical column of index values for a particular level index.
  • You can hide the whole index (semantically the same as hiding all levels in the index).

Your "invisible" does not appear to make any distinction between the different cases here. What is hidden will have a cline if "invisible" is input. What about the potential case of:

  • A user hiding 100 rows and level 1 of the index, but he wants clines for the missing level but not clines for the hidden 100 rows? (Probably a more likely case)
  • A user hides 100 rows and level 1 of the index but he wants clines for the missing rows but not clines for the hidden index level? (Probably a more unlikely case)

Ultimately I think this PR is an overall feature addition so I am happy to support it, but I think it is worth documenting its short comings particularly if someone wants to follow it up later.

@edbennett
Copy link
Author

Many thanks for the feedback both.

The string entry for the clines argument wrapped multiple items into a single argument because there were limited options. This is making it more complicated (see 3)) but perhaps its still valid since the use of clines is quite esoteric anyway.

Yes; as I mentioned upthread, I chose this route to avoid introducing a change that would break existing code.

I guess a compatible alternative would be to have a second boolean kwarg clines_include_hidden_indexes=False (or a better name), but since clines already controls what index levels are considered for drawing lines, that seems very counterintuitive.

Were I designing it from clean I'd be inclined to split these options up, either to take a tuple of booleans like clines_include=(last_index, hidden_indexes, data) (e.g. clines_include=(True, True, True) would be equivalent to clines="all-invisible;data"), or as a list of strings to add options (e.g. clines_include=["last_index"] to be equivalent to clines="all;index", or clines_include=["hidden_indexes", "data]" for clines="skip-last-invisible;data"). I'm happy to implement one of these or similar if breaking compatibility is seen as worthwhile.

Is "invisible" the right word to use here? Why not "hidden" or "including-hidden" or "inc-hidden". Not arguing to or for any option just making sure it's been considered.

I was hung up here on the fact that ideally I'd have wanted to use all for this, and then to use visible for what is currently called all, but that would silently break existing code. As a result I had a failure of imagination around alternative wordings. I like including-hidden, I think it's clearer.

Hiding elements on a MultiIndex comes in three varieties:

  • You can hide individual rows identified by their index values,

  • You can hide the whole vertical column of index values for a particular level index.

  • You can hide the whole index (semantically the same as hiding all levels in the index).
    Your "invisible" does not appear to make any distinction between the different cases here. What is hidden will have a cline if "invisible" is input. What about the potential case of:

  • A user hiding 100 rows and level 1 of the index, but he wants clines for the missing level but not clines for the hidden 100 rows? (Probably a more likely case)

  • A user hides 100 rows and level 1 of the index but he wants clines for the missing rows but not clines for the hidden index level? (Probably a more unlikely case)

This is where I demonstrate my ignorance of Pandas indexing; I was focusing on my own use case and wasn't aware of this subtlety.

Would you please be able to send/link to an example code block to construct a DataFrame for each of the three cases so I can test?

Thanks again

@attack68
Copy link
Contributor

OK, good points.

For 1) and 2)

This is potentially going into v3.0.0 which means a little more flexibility in breaking things.
I think we can change the clines argument from str input to Tuple, (my previous construct was a bit rubbish), and this negates the "invisible" naming issue.
It is probably quite a rarely used argument, with the default being applied in most cases therefore breaking very little.
However, lets check with @mroeschke @rhshadrach @Aloqeely if we can straight up change this with a release note without any change cycle notice, or if we need to temporarily handle the change with a conversion, for backwards compatibility.

if isinstance(clines, str):
    clines = _convert_clines_to_tuple(clines)

For your PR I would then suggest (All levels or Skip last?, Applied to Index or Whole table?, Ignores hidden values?). Default remians at None for no clines.

Requires careful documentation.

@attack68
Copy link
Contributor

attack68 commented Dec 18, 2024

For 3) here are some examples for your consideration:

m = MultiIndex.from_product([["l00", "l01","l02"], ["l10", "l11", "l12"], ["l20", "l21", "l22"]])
d = DataFrame([[_] for _ in range(27)], index=m)

# Hiding Specific Rows
d.style.hide(d.index[3:22], axis=0)

image

# Hiding Specific Rows and hiding a level
d.style.hide(d.index[3:22], axis=0).hide(level=[1], axis=0)

image

I would be inclined to believe that if a user has hidden specific rows they would not expect a cline to be drawn for that hidden row, I think, information wise, you are only interested in drawing clines for entire levels that are hidden, which serves as a useful visual indicator to which category the displayed element is a part of.

@edbennett
Copy link
Author

Thanks for the feedback

This is potentially going into v3.0.0 which means a little more flexibility in breaking things. I think we can change the clines argument from str input to Tuple, (my previous construct was a bit rubbish), and this negates the "invisible" naming issue. It is probably quite a rarely used argument, with the default being applied in most cases therefore breaking very little.

Makes sense. Would it be a good idea to define a namedtuple so that calling code can be made more readable than passing three anonymous bools? Something like

class ClinesOptions(NamedTuple):
    rule_last_index: bool = False
    rule_data: bool = False
    rule_hidden_indexes: bool = False

which would then allow calling as e.g.

.to_latex(clines=ClinesOptions(rule_data=True))

However, lets check with @mroeschke @rhshadrach @Aloqeely if we can straight up change this with a release note without any change cycle notice, or if we need to temporarily handle the change with a conversion, for backwards compatibility.

OK, I'll hold off implementing this pending more feedback, and look at point 3.

@edbennett
Copy link
Author

edbennett commented Dec 18, 2024

For 3) here are some examples for your consideration:

Thanks for the examples.

I would be inclined to believe that if a user has hidden specific rows they would not expect a cline to be drawn for that hidden row, I think, information wise, you are only interested in drawing clines for entire levels that are hidden, which serves as a useful visual indicator to which category the displayed element is a part of.

That's right.

The block of hidden rows in the example you sent spans a change of all index levels, so I'd always expect a line to be drawn there, and indeed that's what I see in testing.

Modifying to only hide row 1, the first example shows no difference between the options with and without -invisible. The second example does what I expect—all and all-invisible draw lines for every row, skip-last only draws for the topmost index level (skipping the hidden level and the last visible level) and skip-last-invisible draws for both visible levels, but doesn't draw any extra line for hidden rows.

I'm not sure that "draw clines where a specific row has been hidden regardless of other cline-drawing options" is relevant to this discussion, although possibly a feature some would find useful at some point. So would you agree that the action is to (once the final interface is agreed on) clarify in the documentation that the include-hidden-levels option controls whether hidden index levels are used when deciding which index levels for which to draw clines? Or have I missed an edge case?

@attack68
Copy link
Contributor

NamedTuples are nice for internal code structuring, but in my experience they are horrible for UX. A user would have to import the ClinesOption and then use it, rather than just reading the docs and substituting with (True, False, True). I don't think there are any examples of pandas args requiring NamedTuples but I could be wrong.

@attack68
Copy link
Contributor

I think you've captured the required cases. Because this is a visual issue I think it will be worth rendering the test cases and going through them I can try to do this in the thread for those already constructed and if it is worth adding more.

@edbennett
Copy link
Author

NamedTuples are nice for internal code structuring, but in my experience they are horrible for UX. A user would have to import the ClinesOption and then use it, rather than just reading the docs and substituting with (True, False, True). I don't think there are any examples of pandas args requiring NamedTuples but I could be wrong.

Thanks—in principle a tuple can always be passed in place of a namedtuple (and the former could be coerced to the latter internally to the function, so you still get the benefits of named arguments), but if that's the intended way of calling then I agree there's little point in having it, and I agree that needing to import the specific class is a little clumsy. (It might be more idiomatic in Matplotlib, since that library makes heavier use of composition.) Potentially one could add a staticmethod to Styler that generates it, but that seems overkill for a relatively niche option.

@attack68
Copy link
Contributor

Agreed, I think overkill.

According to #59125 we need a DeprecationCycle. I think the best thing to do is test for str input and raise the warning, otherwise just code it to accept the new Tuple format.

@rhshadrach
Copy link
Member

I do not think changing this argument to a tuple is a good idea. I find

styler.to_latext(clines_include=(True, False, True))

to be hard to read code, whereas something like

styler.to_latex(clines_last_index=True, clines_hidden_indexes=False, clines_data=True)

is much better.

NamedTuples are nice for internal code structuring, but in my experience they are horrible for UX. A user would have to import the ClinesOption and then use it,

I don't think this holds for pandas - namely I would be expecting virtually all users to be importing pandas globally, so it would just be pd.ClinesOption or pandas.ClinesOption. That said, I'm not suggesting we go the NameTuple (or some other class of options) route, but thought this was worth noting.

@attack68
Copy link
Contributor

I do not think changing this argument to a tuple is a good idea.

clines is a niche feature that is off by default: clines=None.

If you turn it on a user has to currently specify;

  • how it handles the last index level in a multindex
  • whether it horizontally spans only the index or the whole data table.
    Currently this is handled by a string argument.

This PR proposes adding a third option to that list which is:

  • whether to additionally place a cline for hidden multindex levels.

So that's potentially 4 keyword arguments for clines:

styler.to_latex(
    clines=True,
    clines_include_last_level=True,
    clines_span_all_data=False,
    clines_include_hidden_levels=True
)

I agree that this is clearer in the keyword arguments as specified here, but once all this is added to the documentation (which has a lot of features anyway) I think it obfuscates the whole. Additionally clines=False is a master off switch which may conflict with these other options.

That is the reason I prefer:

styler.to_latex(
    clines=(True, False, True)  # or None
)

with dedicated documentation.

Ultimately though I would rather support a PR that gets these feature in and is likely to be more generally approved.

@rhshadrach
Copy link
Member

rhshadrach commented Dec 23, 2024

Can we name another common API that accepts tuples of Boolean arguments like this? flake8-boolean-trap was created to lint the anti-pattern of specifying Boolean argument by position. This API wouldn't give users the option to not specify 3 Boolean arguments by position.

@attack68
Copy link
Contributor

Not without looking for it, but the problem with the catch-all argument no one else has done it assumes that no new approach should ever be considered. I would prefer to evaluate merits of solution in its own context. (Styler.to_latex only accepts 1 positional argument, all other are keyword).

But what about the following as replacement: the current solution uses string identifiers, e.g. "skip-last;data". Now that more options are in scope what about amending that to a dictionary (with some defaults consistent with current version)? This is also easily extended going forward without top-line method argument changes needed?

styler.to_latex(
    clines={
        "include_last_level": True,
        "span_all_data": True,
        "include_hidden_levels": False
    }
)

@edbennett
Copy link
Author

FWIW I was thinking along the same lines as the dict suggestion; I find it significantly more readable than the boolean tuple.

@attack68
Copy link
Contributor

Just for additional consideration , a tuple recognising specific strings(?). I see this pattern sometimes (e.g. scipy boundary condition specifications). Reads fairly well on one line.

styler.to_latex(
    clines=("all-levels", "across-data", "exclude-hidden")
)

@attack68
Copy link
Contributor

attack68 commented Jan 3, 2025

@mroeschke @rhshadrach any preference on steering the development here: outlined choices for the expansion of the clines option in to_latex are:

  1. Stay with string flag with separator character: clines="skip-last;data:include-hidden".
  2. Use a boolean tuple for documented options: clines=(True, True, True).
  3. Use a string tuple defining some pre-set actions: clines=("skip-last", "data", "include-hidden").
  4. Use a new dict format with boolean flags: clines={"skip-last": True, "across-data": True, "include-hidden": True}.
  5. Introduce new keyword arguments to the header to_latex() function: clines=True, clines_skip_last=True, clines_across_data=True, ..)

My personal order of preference is 3 (top), 4, 2, 1, 5
OP is willing to contribute but would be nice if we can agree clear direction.

@mroeschke
Copy link
Member

I suppose 3 is fine with me too. If these latex API arguments were reworked, I would be +1 for them to be namedtuples/dataclasses such that the object name is the "argument theme" with the parameters being the various options that can be set within that "theme" since I find that more explicit

deprecate passing the single string parameter
convert the single string parameter to the new tuple format in the interim
@edbennett
Copy link
Author

Thanks for the steer; I've pushed changes that implement option 3.

@attack68
Copy link
Contributor

attack68 commented Jan 6, 2025

You have designed this as a tuple input where deafult settings are applied with a unordered set of override inputs. E.g. to just turn on clines with default settings we have clines=(), which is a bit strange. My vision of this was a little different. It requires a three tuple entry specifying the options directly. This is more verbose but it forces the user to understand and enter their parameters.

clines : tuple, optional
    In *None* will not produce any *clines*.
    If a tuple will produce *clines* defined by the parameter options in the tuple:
      - Which index levels affect the result: {"all-levels", "skip-last"},
      - Where to extend the horizontal lines across the table: {"rule-data", "rule-index"},
      - Whether to include or exclude hidden index levels: {"exclude-hidden", "include-hidden"}.
   An example, with the usual default values is ("all-levels", "rule-data", "exclude-hidden"). 

This provides 8 current possible settings as well as the additional None. If options were added in the future, e.g. "rule-2-cols" it would easier to add in because the option is placed in tuple.1 and can be easily parsed.

This is also not a major departure from the current config:
clines="all;data" is converted to clines=("all-levels", "rule-data", "exclude-hidden")

What do we think?

@attack68
Copy link
Contributor

attack68 commented Jan 6, 2025

Also, as a side note, I'd be interested in understanding what happens if you have a 3 level index and hide the last index level:

Does clines then interpret the last-level as level 1 in [0,1] because 2 is hidden or is last level always 2 even though it is hidden. This has relevance for the combinations "skip-last" and "include-hidden" or "exclude-hidden".

I think it can be interpreted either way so not suggesting a re-implementation, just an understanding of the status quo.

@edbennett
Copy link
Author

ou have designed this as a tuple input where deafult settings are applied with a unordered set of override inputs. E.g. to just turn on clines with default settings we have clines=(), which is a bit strange. My vision of this was a little different. It requires a three tuple entry specifying the options directly.

Ah, I see; apologies for misunderstanding your intention. My thoughts, in no particular order:

  • I realise now I had intended for True to be an alias for () (and I guess False for None), which would make things slightly less strange, but this slipped my mind when I was doing the implementation.
  • If one is going to have fully unambiguous keys for each option, then I'm not sure why one would want the ordering to matter—this just imposes an extra burden on downstream maintainers to remember the ordering convention.
  • On the other hand, if the ordering convention doesn't matter, then I'm not sure if there's an argument for this being a tuple over a list
  • Having default options that are overridden means that any future additions to the options for this argument will not break downstream code. (Imagining where a fourth option were added, in the model you suggest, now a four-element tuple would be expected instead of a three-element tuple. So to allow for a deprecation cycle, then a default would need to be set for the fourth element, temporarily. Having a default for every option preempts that need, and reduces the potential downstream maintenance burden.)

@edbennett
Copy link
Author

Does clines then interpret the last-level as level 1 in [0,1] because 2 is hidden or is last level always 2 even though it is hidden. This has relevance for the combinations "skip-last" and "include-hidden" or "exclude-hidden".

I had intended for it to be the former (so that it is possible to skip the last visible index, if that is interesting), but testing shows that the current implementation gives the latter—this is the output of taking the example DataFrame you list upthread, hiding level 2, and outputting both with ("skip-last", "rule-data") and ("skip-last", "rule-data", "include-hidden") in the current implementation.

image

If it's possible that anyone would want to skip the last visible level, I think it would be worth adjusting the implementation now, since this would be very awkward to do a deprecation cycle for later.

@attack68
Copy link
Contributor

attack68 commented Jan 6, 2025

Ah, I see; apologies for misunderstanding your intention. My thoughts, in no particular order: ...

We have the existing framework of either None or two-parameter string e.g. "all;data". If we work to upset that the least then:

Converting to an ordered tuple is most similar to a NamedTuple without the names so provides structure that might be useful later, and is also easy to parse in the code (possibly reduces mainetnance burden). I prefer having the explicitly named features rather than reverting to any system defaults because it is not particularly obvious what those defaults should be (e.g. why choose only across index or across data, its purely a subjective choice for visualisation). Thus forcing a user to specify means they must naturally become aware of the options.

You do raise a good point about possibility of adding a 4th option. I had thought of this but was hoping to defer it. However, here's a middle ground.

clines: tuple, optional
    If *None* no clines are created.
    If a tuple will create clines. The tuple must provide at least the first two options, but can extend to three. Options are:
      - Which index levels affect the result: {"all-levels", "skip-last"},
      - Where to extend the horizontal lines across the table: {"rule-data", "rule-index"},
      - Whether to include or exclude hidden index levels: {"exclude-hidden", "include-hidden"}.
   As an example ("all-levels","rule-data"). The third option will default to "exclude-hidden".

This also means the easiest/cleanest form of translating the existing string form to tuple form.

@attack68
Copy link
Contributor

attack68 commented Jan 6, 2025

I had intended for it to be the former (so that it is possible to skip the last visible index, if that is interesting), but testing shows that the current implementation gives the latter—this is the output of taking the example DataFrame you list upthread, hiding level 2, and outputting both with ("skip-last", "rule-data") and ("skip-last", "rule-data", "include-hidden") in the current implementation.

We are starting to get into territory where it is difficult for anyone to legitimately keep track of the possible combinations. Developers or users.

Since it is always possible to reindex a DataFrame (drop levels, etc.) and then create a Styler from that reindexed DataFrame one can always achieve precisely what one wants with the following suggestion of keeping these options as high-level as possible:

I think "skip-last" should refer only to the explicit last level of the index (whether it is hidden or not), as in the current implementation, and not the last visible level.

@rhshadrach
Copy link
Member

I suppose 3 is fine with me too. If these latex API arguments were reworked, I would be +1 for them to be namedtuples/dataclasses such that the object name is the "argument theme" with the parameters being the various options that can be set within that "theme" since I find that more explicit

+1

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Apr 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale Styler conditional formatting using DataFrame.style
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants