Support unicode ids in toc #970

Hsn723 · 2020-05-20T02:36:17Z

I'd like to propose a change to the way slugify works in toc.py.

Currently, slugify uses ascii to encode/decode strings and as a result if a header only contains unicode characters the resulting id and href becomes the default #_1, #_2, etc.

Instead, as the HTML standard does not impose any particular restriction on the character set used for the id attribute (https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute), I think it would be sensible to instead use utf-8 for encoding/decoding strings in slugify in order to better support international input.

waylan · 2020-05-20T12:53:35Z

To make this change would mean that many users existing links would break the next time they update to the latest version of Markdown. I'm not comfortable with that.

However, the added functionality has value. I see a few ways forward:

The existing option available to everyone is to provide the extension with your own slugify function. We don't need to change anything.
Provide this new functionality as an option which users can enable through the extensions configuration.

facelessuser · 2020-05-20T13:08:30Z

Providing a Unicode slugify wouldn't be a bad idea (we live in a Unicode world). I know pymdown-extensions provides a number of variants so users can just pick the one they like: https://facelessuser.github.io/pymdown-extensions/extras/slugs/#alternate-slugify. It probably wouldn't be a bad idea for Python Markdown to have an optional Unicode variant, and if people want something above and beyond that, they need to provide their own function or use a publicly available one.

Hsn723 · 2020-05-20T23:13:44Z

@waylan @facelessuser Thank you for the quick feedback.
Indeed this change would introduce a noticeable breaking change. If possible, I'd like to implement option 2 since option 1 may not always be approachable for end users.

For instance, this functionality could be provided through a encoding: utf-8 configuration flag and having slugify accept it as an optional argument:

def slugify(value, separator, encoding='ascii')

If that sounds acceptable, I'll go ahead and implement the changes.

waylan · 2020-05-21T13:26:43Z

For instance, this functionality could be provided through a encoding: utf-8 configuration flag and having slugify accept it as an optional argument:

That is certainly one possible solution. Another might be to provide a separate slugify function which the user could pass into the slugify config option. Note that we document that any third party functions must accept two arguments value and separator. If we start passing a third option, that would break any existing third party functions. In other words, by simply providing a separate function, we don't change/break the existing API. Also, that keeps the number of ways to alter slugify at one.

In practice, defining two functions may not seem very DRY. Perhaps we could create a single function which accepts the third argument encoding and then have two wrappers which only accept value and separator. On would call the base function with ascii and the other with utf-8.

Hsn723 · 2020-05-25T00:08:09Z

@waylan
Got it. I went ahead and implemented the suggested changes. Please take a look.

waylan · 2020-05-25T20:52:56Z

Looks good. Now we need documentation.

_{Sent with GitHawk}

Hsn723 · 2020-05-27T00:37:57Z

@waylan Sorry for the delay. I've updated the documentation.

waylan

This is looking good. We need to add a note to the release notes. However, this would be in a new point release, for which release notes don't exist yet. I'll likely wait to merge this until after the new release notes are created.

docs/extensions/toc.md

USSX-Hares · 2020-09-17T16:38:32Z

I would like to ask to add the slugify_encoding configuration parameter as the slugify parameter of function type cannot be loaded properly from the text configuration.

For backwards compatibility, when the configured encoding matches the default encoding, the extra argument may not be passed to the slugify() method.

Example of Implementation:

DEFAULT_ENCODING = 'ascii'


def _slugify(value, separator, encoding=DEFAULT_ENCODING):
    value = unicodedata.normalize('NFKD', value).encode(encoding, 'ignore')
    value = re.sub(r'[^\w\s-]', '', value.decode(encoding)).strip().lower()
    return re.sub(r'[%s\s]+' % separator, separator, value)


def slugify(value, separator):
    """ Slugify a string, to make it URL friendly. """
    return _slugify(value, separator)


def slugify_unicode(value, separator):
    """ Slugify a string, to make it URL friendly. """
    return _slugify(value, separator, 'utf-8')

<...>
class TocTreeprocessor(Treeprocessor):
    def __init__(self, md, config):
        super().__init__(md)
<...>
        self.slugify = config["slugify"]
        self.slugify_encoding = config["slugify_encoding"]
        self.sep = config["separator"]
<...>

    def run(self, doc):
<...>
                # Do not override pre-existing ids
                if "id" not in el.attrib:
                    innertext = unescape(stashedHTML2text(text, self.md))
                    slugify_kwargs = {}
                    if self.slugify_encoding != DEFAULT_ENCODING:
                        slugify_kwargs['encoding'] = self.slugify_encoding
                    el.attrib["id"] = unique(self.slugify(innertext, self.sep, **slugify_kwargs), used_ids)

<...>
class TocExtension(Extension):

    TreeProcessorClass = TocTreeprocessor

    def __init__(self, **kwargs):
        self.config = {
<...>
            "slugify": [_slugify,
                        "Function to generate anchors based on header text - "
                        "Defaults to the headerid ext's slugify function."],
            "slugify_encoding": [DEFAULT_ENCODING,
                                 "If set to non-default, will use the custom "
                                 "slugify encoding"],
            'separator': ['-', 'Word separator. Defaults to "-".'],
<...>

waylan · 2020-09-17T18:23:32Z

I would like to ask to add the slugify_encoding configuration parameter as the slugify parameter of function type cannot be loaded properly from the text configuration.

Then don't use text configuration. In fact, our documentation shows preference for creating an instance of an extension in Python code. The text-based methods are only provided for backward compatibility. Even the CLI includes support for a -c/--extension_configs option which accepts a YAML config file. In your YAML file you can point at a Python function (for example: !!python/name:markdown.extensions.toc.slugify_unicode).

I realize that this is more complicated for the user than setting a slugify_encoding setting. But I'm only interested in implementing one way to set the encoding. If you can provide an alternate solution that does that, I'll consider it. However, as we already support the option for the user to provide their own slugify function, I expect that any such solution would always be a second way to address the issue.

Finally, your suggested solution requires that a third parameter be passed to the slugify function. However, that would break any existing third party functions which only accept two parameters. To maintain backward compatibility, we can only pass two parameters to slugify, Therefore, there is no way to pass a slugify_encoding configuration option to the function without breaking existing users' custom functions.

USSX-Hares · 2020-09-17T22:47:20Z

@waylan did not know there is an option to point the function reference through YAML.
Thank you.

waylan · 2020-09-22T14:50:58Z

If an entry is added to the release notes, I'll merge this.

waylan added extension Related to one or more of the included extensions. feature Feature request. requires-changes Awaiting updates after a review. labels May 20, 2020

waylan requested changes May 28, 2020

View reviewed changes

docs/extensions/toc.md Outdated Show resolved Hide resolved

Hsn723 force-pushed the toc-unicode branch from 08af512 to 95cf156 Compare May 31, 2020 23:21

waylan approved these changes Jun 1, 2020

View reviewed changes

USSX-Hares mentioned this pull request Sep 17, 2020

Feature Request: Support Unicode Letters in Header Links mkdocs/mkdocs#2190

Closed

Hsn723 and others added 5 commits October 1, 2020 14:16

fix support for unicode ids in toc

e07de28

fix test

96a09f6

separate ascii and utf-8 slugify

a5ce891

update documentation

f0b1360

Release notes and cleanup.

7790c06

waylan force-pushed the toc-unicode branch from 95cf156 to 7790c06 Compare October 1, 2020 18:45

fix typos

5747cac

waylan merged commit 2e0962e into Python-Markdown:master Oct 1, 2020

mitya57 mentioned this pull request Mar 22, 2021

TOC:Anchor link written in Japanese does not work #1118

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support unicode ids in toc #970

Support unicode ids in toc #970

Hsn723 commented May 20, 2020

waylan commented May 20, 2020

facelessuser commented May 20, 2020

Hsn723 commented May 20, 2020

waylan commented May 21, 2020

Hsn723 commented May 25, 2020

waylan commented May 25, 2020

Hsn723 commented May 27, 2020

waylan left a comment

USSX-Hares commented Sep 17, 2020 •

edited

Loading

waylan commented Sep 17, 2020 •

edited

Loading

USSX-Hares commented Sep 17, 2020

waylan commented Sep 22, 2020

Support unicode ids in toc #970

Support unicode ids in toc #970

Conversation

Hsn723 commented May 20, 2020

waylan commented May 20, 2020

facelessuser commented May 20, 2020

Hsn723 commented May 20, 2020

waylan commented May 21, 2020

Hsn723 commented May 25, 2020

waylan commented May 25, 2020

Hsn723 commented May 27, 2020

waylan left a comment

Choose a reason for hiding this comment

USSX-Hares commented Sep 17, 2020 • edited Loading

Example of Implementation:

waylan commented Sep 17, 2020 • edited Loading

USSX-Hares commented Sep 17, 2020

waylan commented Sep 22, 2020

USSX-Hares commented Sep 17, 2020 •

edited

Loading

waylan commented Sep 17, 2020 •

edited

Loading