Support unicode ids in toc #970
To make this change would mean that many users' existing links would break the next time they update to the latest version of Markdown. I'm not comfortable with that. However, the added functionality has value. I see a few ways forward:
Providing a Unicode slugify wouldn't be a bad idea (we live in a Unicode world). I know pymdown-extensions provides a number of variants so users can just pick the one they like: https://facelessuser.github.io/pymdown-extensions/extras/slugs/#alternate-slugify. It probably wouldn't be a bad idea for Python Markdown to have an optional Unicode variant, and if people want something above and beyond that, they need to provide their own function or use a publicly available one.
@waylan @facelessuser Thank you for the quick feedback. For instance, this functionality could be provided through a
If that sounds acceptable, I'll go ahead and implement the changes.
That is certainly one possible solution. Another might be to provide a separate slugify function which the user could pass into the In practice, defining two functions may not seem very DRY. Perhaps we could create a single function which accepts the third argument
@waylan
Looks good. Now we need documentation.
@waylan Sorry for the delay. I've updated the documentation.
This is looking good. We need to add a note to the release notes. However, this would be in a new point release, for which release notes don't exist yet. I'll likely wait to merge this until after the new release notes are created.
I would like to ask to add the For backwards compatibility, when the configured encoding matches the default encoding, the extra argument may not be passed to the

Example of implementation:

    DEFAULT_ENCODING = 'ascii'

    def _slugify(value, separator, encoding=DEFAULT_ENCODING):
        value = unicodedata.normalize('NFKD', value).encode(encoding, 'ignore')
        value = re.sub(r'[^\w\s-]', '', value.decode(encoding)).strip().lower()
        return re.sub(r'[%s\s]+' % separator, separator, value)

    def slugify(value, separator):
        """ Slugify a string, to make it URL friendly. """
        return _slugify(value, separator)

    def slugify_unicode(value, separator):
        """ Slugify a string, to make it URL friendly. """
        return _slugify(value, separator, 'utf-8')

    <...>

    class TocTreeprocessor(Treeprocessor):
        def __init__(self, md, config):
            super().__init__(md)
            <...>
            self.slugify = config["slugify"]
            self.slugify_encoding = config["slugify_encoding"]
            self.sep = config["separator"]
            <...>

        def run(self, doc):
            <...>
                # Do not override pre-existing ids
                if "id" not in el.attrib:
                    innertext = unescape(stashedHTML2text(text, self.md))
                    slugify_kwargs = {}
                    if self.slugify_encoding != DEFAULT_ENCODING:
                        slugify_kwargs['encoding'] = self.slugify_encoding
                    el.attrib["id"] = unique(self.slugify(innertext, self.sep, **slugify_kwargs), used_ids)
            <...>

    class TocExtension(Extension):

        TreeProcessorClass = TocTreeprocessor

        def __init__(self, **kwargs):
            self.config = {
                <...>
                "slugify": [_slugify,
                            "Function to generate anchors based on header text - "
                            "Defaults to the headerid ext's slugify function."],
                "slugify_encoding": [DEFAULT_ENCODING,
                                     "If set to non-default, will use the custom "
                                     "slugify encoding"],
                'separator': ['-', 'Word separator. Defaults to "-".'],
                <...>
Then don't use text configuration. In fact, our documentation shows preference for creating an instance of an extension in Python code. The text-based methods are only provided for backward compatibility. Even the CLI includes support for a I realize that this is more complicated for the user than setting a Finally, your suggested solution requires that a third parameter be passed to the slugify function. However, that would break any existing third party functions which only accept two parameters. To maintain backward compatibility, we can only pass two parameters to slugify, Therefore, there is no way to pass a
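To make the compatibility concern concrete, here is a minimal sketch. The names `third_party_slugify`, `slugify_with_encoding`, and `call_with_extra_argument` are hypothetical and exist only for this illustration; they are not part of Python-Markdown. The sketch assumes a user already has a two-argument slugify function, shows that calling it with an extra `encoding` keyword fails, and shows one way (`functools.partial`) to bind the extra argument up front so the caller still passes only two.

```python
import re
import unicodedata
from functools import partial


def third_party_slugify(value, separator):
    # Hypothetical existing user function with the two-argument signature.
    return separator.join(value.lower().split())


def slugify_with_encoding(value, separator, encoding='ascii'):
    # Hypothetical three-argument variant, modelled on the example above.
    value = unicodedata.normalize('NFKD', value)
    value = value.encode(encoding, 'ignore').decode(encoding)
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[%s\s]+' % re.escape(separator), separator, value)


def call_with_extra_argument(func):
    # Returns False if func rejects the extra keyword argument.
    try:
        func('Hello World', '-', encoding='utf-8')
    except TypeError:
        return False
    return True


# Binding the encoding in advance yields a callable that still takes
# exactly (value, separator), so the caller never needs a third argument.
utf8_slugify = partial(slugify_with_encoding, encoding='utf-8')
```

With this setup, `call_with_extra_argument(third_party_slugify)` reports a failure, while `utf8_slugify(value, separator)` can be called exactly like any two-argument slugify function.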
@waylan I did not know there was an option to pass a function reference through YAML.
If an entry is added to the release notes, I'll merge this.
I'd like to propose a change to the way slugify works in toc.py.
Currently, slugify uses ascii to encode/decode strings and, as a result, if a header only contains unicode characters, the resulting id and href become the default #_1, #_2, etc.
As the HTML standard does not impose any particular restriction on the character set used for the id attribute (https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute), I think it would be sensible to use utf-8 for encoding/decoding strings in slugify instead, in order to better support international input.
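The behaviour described above can be reproduced with a short sketch. The function names `slugify_ascii`, `slugify_utf8`, and `unique_id` are assumptions made for this example; `unique_id` is only a simplified stand-in for whatever de-duplication the toc extension performs, not its actual code.

```python
import re
import unicodedata


def slugify_ascii(value, separator):
    # ASCII round trip: characters with no ASCII form are silently dropped.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = re.sub(r'[^\w\s-]', '', value.decode('ascii')).strip().lower()
    return re.sub(r'[%s\s]+' % re.escape(separator), separator, value)


def slugify_utf8(value, separator):
    # Same steps without the lossy ASCII round trip.
    value = unicodedata.normalize('NFKD', value)
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[%s\s]+' % re.escape(separator), separator, value)


def unique_id(slug, used):
    # Simplified stand-in: append _1, _2, ... until the id is
    # non-empty and not already in use.
    candidate, n = slug, 0
    while not candidate or candidate in used:
        n += 1
        candidate = '%s_%d' % (slug, n)
    used.add(candidate)
    return candidate


used_ids = set()
first = unique_id(slugify_ascii('Привет мир', '-'), used_ids)
second = unique_id(slugify_ascii('你好', '-'), used_ids)
readable = slugify_utf8('Привет мир', '-')
```

Here both ASCII slugs come out empty, so the ids fall back to numbered placeholders, while the utf-8 variant keeps the header text.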