Skip to content

Support unicode ids in toc #970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 1, 2020
Merged

Conversation

Hsn723
Copy link
Contributor

@Hsn723 Hsn723 commented May 20, 2020

I'd like to propose a change to the way slugify works in toc.py.

Currently, slugify uses ascii to encode/decode strings and as a result if a header only contains unicode characters the resulting id and href becomes the default #_1, #_2, etc.

Instead, as the HTML standard does not impose any particular restriction on the character set used for the id attribute (https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute), I think it would be sensible to instead use utf-8 for encoding/decoding strings in slugify in order to better support international input.

@waylan
Copy link
Member

waylan commented May 20, 2020

To make this change would mean that many users existing links would break the next time they update to the latest version of Markdown. I'm not comfortable with that.

However, the added functionality has value. I see a few ways forward:

  1. The existing option available to everyone is to provide the extension with your own slugify function. We don't need to change anything.
  2. Provide this new functionality as an option which users can enable through the extensions configuration.

@facelessuser
Copy link
Collaborator

Providing a Unicode slugify wouldn't be a bad idea (we live in a Unicode world). I know pymdown-extensions provides a number of variants so users can just pick the one they like: https://facelessuser.github.io/pymdown-extensions/extras/slugs/#alternate-slugify. It probably wouldn't be a bad idea for Python Markdown to have an optional Unicode variant, and if people want something above and beyond that, they need to provide their own function or use a publicly available one.

@waylan waylan added extension Related to one or more of the included extensions. feature Feature request. requires-changes Awaiting updates after a review. labels May 20, 2020
@Hsn723
Copy link
Contributor Author

Hsn723 commented May 20, 2020

@waylan @facelessuser Thank you for the quick feedback.
Indeed this change would introduce a noticeable breaking change. If possible, I'd like to implement option 2 since option 1 may not always be approachable for end users.

For instance, this functionality could be provided through a encoding: utf-8 configuration flag and having slugify accept it as an optional argument:

def slugify(value, separator, encoding='ascii')

If that sounds acceptable, I'll go ahead and implement the changes.

@waylan
Copy link
Member

waylan commented May 21, 2020

For instance, this functionality could be provided through a encoding: utf-8 configuration flag and having slugify accept it as an optional argument:

That is certainly one possible solution. Another might be to provide a separate slugify function which the user could pass into the slugify config option. Note that we document that any third party functions must accept two arguments value and separator. If we start passing a third option, that would break any existing third party functions. In other words, by simply providing a separate function, we don't change/break the existing API. Also, that keeps the number of ways to alter slugify at one.

In practice, defining two functions may not seem very DRY. Perhaps we could create a single function which accepts the third argument encoding and then have two wrappers which only accept value and separator. On would call the base function with ascii and the other with utf-8.

@Hsn723
Copy link
Contributor Author

Hsn723 commented May 25, 2020

@waylan
Got it. I went ahead and implemented the suggested changes. Please take a look.

@waylan
Copy link
Member

waylan commented May 25, 2020

Looks good. Now we need documentation.

Sent with GitHawk

@Hsn723
Copy link
Contributor Author

Hsn723 commented May 27, 2020

@waylan Sorry for the delay. I've updated the documentation.

Copy link
Member

@waylan waylan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. We need to add a note to the release notes. However, this would be in a new point release, for which release notes don't exist yet. I'll likely wait to merge this until after the new release notes are created.

@USSX-Hares
Copy link

USSX-Hares commented Sep 17, 2020

I would like to ask to add the slugify_encoding configuration parameter as the slugify parameter of function type cannot be loaded properly from the text configuration.

For backwards compatibility, when the configured encoding matches the default encoding, the extra argument may not be passed to the slugify() method.

Example of Implementation:

DEFAULT_ENCODING = 'ascii'


def _slugify(value, separator, encoding=DEFAULT_ENCODING):
    value = unicodedata.normalize('NFKD', value).encode(encoding, 'ignore')
    value = re.sub(r'[^\w\s-]', '', value.decode(encoding)).strip().lower()
    return re.sub(r'[%s\s]+' % separator, separator, value)


def slugify(value, separator):
    """ Slugify a string, to make it URL friendly. """
    return _slugify(value, separator)


def slugify_unicode(value, separator):
    """ Slugify a string, to make it URL friendly. """
    return _slugify(value, separator, 'utf-8')

<...>
class TocTreeprocessor(Treeprocessor):
    def __init__(self, md, config):
        super().__init__(md)
<...>
        self.slugify = config["slugify"]
        self.slugify_encoding = config["slugify_encoding"]
        self.sep = config["separator"]
<...>

    def run(self, doc):
<...>
                # Do not override pre-existing ids
                if "id" not in el.attrib:
                    innertext = unescape(stashedHTML2text(text, self.md))
                    slugify_kwargs = {}
                    if self.slugify_encoding != DEFAULT_ENCODING:
                        slugify_kwargs['encoding'] = self.slugify_encoding
                    el.attrib["id"] = unique(self.slugify(innertext, self.sep, **slugify_kwargs), used_ids)

<...>
class TocExtension(Extension):

    TreeProcessorClass = TocTreeprocessor

    def __init__(self, **kwargs):
        self.config = {
<...>
            "slugify": [_slugify,
                        "Function to generate anchors based on header text - "
                        "Defaults to the headerid ext's slugify function."],
            "slugify_encoding": [DEFAULT_ENCODING,
                                 "If set to non-default, will use the custom "
                                 "slugify encoding"],
            'separator': ['-', 'Word separator. Defaults to "-".'],
<...>

@waylan
Copy link
Member

waylan commented Sep 17, 2020

I would like to ask to add the slugify_encoding configuration parameter as the slugify parameter of function type cannot be loaded properly from the text configuration.

Then don't use text configuration. In fact, our documentation shows preference for creating an instance of an extension in Python code. The text-based methods are only provided for backward compatibility. Even the CLI includes support for a -c/--extension_configs option which accepts a YAML config file. In your YAML file you can point at a Python function (for example: !!python/name:markdown.extensions.toc.slugify_unicode).

I realize that this is more complicated for the user than setting a slugify_encoding setting. But I'm only interested in implementing one way to set the encoding. If you can provide an alternate solution that does that, I'll consider it. However, as we already support the option for the user to provide their own slugify function, I expect that any such solution would always be a second way to address the issue.

Finally, your suggested solution requires that a third parameter be passed to the slugify function. However, that would break any existing third party functions which only accept two parameters. To maintain backward compatibility, we can only pass two parameters to slugify, Therefore, there is no way to pass a slugify_encoding configuration option to the function without breaking existing users' custom functions.

@USSX-Hares
Copy link

@waylan did not know there is an option to point the function reference through YAML.
Thank you.

@waylan
Copy link
Member

waylan commented Sep 22, 2020

If an entry is added to the release notes, I'll merge this.

@waylan waylan merged commit 2e0962e into Python-Markdown:master Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension Related to one or more of the included extensions. feature Feature request. requires-changes Awaiting updates after a review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants