Feature request: levenshtein with multibyte support #10180

tztztztz · 2022-12-28T23:14:17Z

Description

Could you please add an mb_levenshtein function with multibyte support, so that substituting, for example, 'a' with 'ą' would produce a distance of 1 instead of the 2 returned by the current non-multibyte levenshtein function?
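
A minimal illustration of the behaviour described above, assuming the source file is saved as UTF-8, where 'ą' (U+0105) is encoded as the two bytes 0xC4 0x85. The built-in function compares bytes, so one substitution plus one insertion is counted:

```php
<?php
// levenshtein() works on bytes, not characters.
$byteLength   = strlen('ą');            // 2 bytes in UTF-8
$byteDistance = levenshtein('a', 'ą');  // substitute one byte, insert one byte

var_dump($byteLength);
var_dump($byteDistance);
```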

There are implementations of the Levenshtein algorithm in pure PHP that support multibyte characters, but they are much slower than the built-in function written in C.

When you have to compare a word against a whole dictionary containing millions of words, that makes a significant difference.
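
For reference, a pure-PHP version of the kind mentioned above might look like the following sketch. The function name `utf8_levenshtein` is hypothetical, the input is assumed to be valid UTF-8, and `mb_str_split` requires PHP 7.4+ with the mbstring extension. It compares code points, not grapheme clusters.

```php
<?php
// Hypothetical userland helper, not a built-in:
// Levenshtein distance counted in Unicode code points.
function utf8_levenshtein(string $a, string $b): int
{
    // Split both strings into arrays of code points.
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($s);
    $m = count($t);
    if ($n === 0) {
        return $m;
    }
    if ($m === 0) {
        return $n;
    }

    // Classic two-row dynamic programming table.
    $prev = range(0, $m);
    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[$m];
}
```

Every cell of the table is computed by interpreted PHP code, which is where the speed gap to the C implementation comes from.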

highlyprofessionalscum · 2022-12-29T09:08:12Z

JIT may help.

cmb69 · 2022-12-29T12:16:40Z

That may make sense, but I presume it's harder than it looks (e.g. dealing with Unicode normalization issues). And I don't think this is a common requirement for PHP developers, so this should go through the RFC process. Do you want to pursue that?
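
To illustrate the kind of normalization issue meant here: the same visible character can be stored as one precomposed code point or as a base letter plus a combining mark. The sketch below assumes the intl extension (for `Normalizer`) and mbstring are available; the variable names are only for illustration.

```php
<?php
// "ą" as a single precomposed code point (U+0105) ...
$precomposed = "\u{0105}";
// ... and as "a" followed by COMBINING OGONEK (U+0328).
$decomposed = "a\u{0328}";

// Both render the same, but the strings differ byte for byte,
// and the decomposed form contains two code points.
var_dump($precomposed === $decomposed);
var_dump(mb_strlen($decomposed, 'UTF-8'));

// Normalizing to NFC makes the two forms identical.
$normalized = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($normalized === $precomposed);
```

A code-point-based distance would report a non-zero distance between the two forms unless the inputs are normalized first.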

An alternative might be to write a PECL extension, which might be a good first step anyway.

tztztztz · 2022-12-29T12:22:33Z

To make things easier, it could be a function with an extra parameter for providing a matrix of substitutions, like:

mb_levenshtein ($str1, $str2, [$substitution_matrix], ...other parameters..);

For example, for the Polish language, $substitution_matrix could look like:

['a' => 'ą', 'A' => 'Ą', 'e' => 'ę', 'E' => 'Ę', ... and so on ...]

Would that reduce the complexity of creating such a function?

alexdowad · 2023-01-13T19:47:55Z

I agree that an RFC is needed.

If there is significant popular demand, and an RFC is submitted and passes, then an implementation can be done. That is not a problem.

@tztztztz You are correct that built-in functions implemented in C are faster than those implemented in PHP user code. But this does not mean that the core developers should create built-in C-based functions for everything. Doing so would clutter up the PHP manual with a huge number of functions which most users don't want or need. It would also create a burden for the maintainers, and in the long run actually hinder ongoing improvements to the language and its implementation.

Sometimes, if you really, really need C/C++/Rust/etc. levels of CPU performance, the answer is to use C/C++/Rust/etc. For example, you could create your own C-based PHP extension, as @cmb69 suggested. Or, you could create an external binary and use exec, shell_exec, etc. to call it.

info-universeorange · 2023-03-25T23:44:07Z

Could you not just strip the accents beforehand, for example replacing an accented a with a plain a, as a preprocessing step before computing the Levenshtein distance?
It should not make a difference. Alternatively, you could assign your own weights when transforming the characters.

tztztztz · 2023-03-26T11:57:36Z

No, it makes a difference.
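
One way to see the difference, as a hedged sketch: in Polish, "łaska" (grace) and "laska" (cane) are different words, but stripping diacritics before comparing makes them identical. The `$strip` map below is an illustrative, incomplete lowercase-only table, not part of any proposal in this thread.

```php
<?php
// Illustrative map of Polish lowercase letters to their ASCII look-alikes.
$strip = [
    'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
    'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
];

// strtr() with an array replaces each key with its value.
$first  = strtr('łaska', $strip);
$second = strtr('laska', $strip);

// After stripping, two distinct words are reported as identical.
$distanceAfterStripping = levenshtein($first, $second);
var_dump($distanceAfterStripping);
```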

info-universeorange · 2023-03-26T20:15:48Z

That's a complicated function to create.

You might even want a different cost for replacing an accent. I don't think accents produce many differences in meaning; the main issue is that you can write the wrong accent, and you might want to score that (for example, in an exam).

With a Levenshtein distance, replacing a letter can produce a different word with a different meaning. Replacing an accent (or converting an accented letter to ASCII) should not make such a difference, except in pronunciation.

Can you explain why it makes a difference, other than when learning a language and its accents?

How many accented characters are there in the text you want the distance for? You have 255 single-byte symbols to choose from, so you could replace each accented character with its own symbol on both sides and then use the current levenshtein function.
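
The remapping idea in the last paragraph could be sketched roughly as follows. The function name `levenshtein_via_byte_map` is hypothetical, the input is assumed to be valid UTF-8, and the approach only works while the two strings together contain at most 128 distinct non-ASCII characters.

```php
<?php
// Hypothetical helper: map every distinct multibyte character to a
// spare single byte (128..255), then reuse the built-in levenshtein().
function levenshtein_via_byte_map(string $a, string $b): int
{
    $map  = [];
    $next = 128; // byte values above the ASCII range

    $encode = function (string $s) use (&$map, &$next): string {
        $out = '';
        foreach (mb_str_split($s, 1, 'UTF-8') as $ch) {
            if (strlen($ch) === 1) {
                $out .= $ch; // ASCII character, keep as-is
                continue;
            }
            if (!isset($map[$ch])) {
                if ($next > 255) {
                    throw new RuntimeException('Too many distinct non-ASCII characters');
                }
                $map[$ch] = chr($next++);
            }
            $out .= $map[$ch];
        }
        return $out;
    };

    // The same map is shared by both strings, so equal characters
    // always receive equal replacement bytes.
    return levenshtein($encode($a), $encode($b));
}
```

The dynamic programming still runs in C, but each call pays for the character splitting and mapping in PHP, so whether this helps for a large dictionary would need measuring.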

alexdowad · 2023-03-27T12:00:28Z

Thanks for all the comments. If @tztztztz intends to create an RFC and seek approval to have this function added to PHP, I would like to suggest that any discussion can be done on the RFC and not here.

If @tztztztz does not intend to create an RFC, then this issue can be closed.

Thanks!

tztztztz · 2023-03-27T12:04:04Z

@alexdowad Do you want me to create an RFC following the guidance in this document?

https://wiki.php.net/rfc/howto

alexdowad · 2023-03-27T12:06:44Z

@tztztztz Yes.

Just one note... the RFC howto document states: "If you don't have the skills to fully implement your RFC and no-one volunteers to code it, there is little chance your RFC will be successful." However, in this case, I don't think there is any problem with this. If the RFC passes and nobody else steps up to implement the new function, then I can do it. That is if the RFC passes, though.

Personally, I have no opinion about whether the RFC should be passed or not. You can work that out with the developers who have RFC voting rights.

github-actions · 2023-06-26T00:18:24Z

There has not been any recent activity in this feature request. It will automatically be closed in 14 days if no further action is taken. Please see https://github.com/probot/stale#is-closing-stale-issues-really-a-good-idea to understand why we auto-close stale feature requests.

youkidearitai · 2024-09-25T07:15:39Z

I have tried implementing an mb_levenshtein function, and I am discussing it on the internals mailing list.
See #16043.

nielsdos · 2025-05-18T19:39:01Z

mb_levenshtein got declined, grapheme_levenshtein got merged.
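
For readers arriving from search, a guarded sketch of calling the merged function. This thread does not state which PHP release ships it, so the code checks for `grapheme_levenshtein` at runtime; the wrapper name `try_grapheme_levenshtein` is hypothetical, and only the two string arguments are used here.

```php
<?php
// Hypothetical wrapper: returns the grapheme-based distance, or null
// when grapheme_levenshtein() is unavailable in this PHP build.
function try_grapheme_levenshtein(string $a, string $b): ?int
{
    if (!function_exists('grapheme_levenshtein')) {
        return null; // intl missing, or PHP version without the function
    }
    $distance = grapheme_levenshtein($a, $b);
    return $distance === false ? null : $distance;
}

var_dump(try_grapheme_levenshtein('a', 'ą'));
```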

tztztztz added the Feature and Status: Needs Triage labels Dec 28, 2022

cmb69 added the Extension: intl, Extension: mbstring, and Status: Requires RFC labels and removed the Status: Needs Triage label Dec 29, 2022

github-actions bot added the Stale label Jun 26, 2023

github-actions bot closed this as not planned Jul 11, 2023

youkidearitai mentioned this issue Sep 25, 2024

[Draft][Require RFC] mb_levenshtein function #16043

Closed

youkidearitai self-assigned this Sep 25, 2024

youkidearitai reopened this Sep 25, 2024

nielsdos closed this as completed May 18, 2025
