Skip to content

Feature request: levenshtein with multibyte support #10180

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tztztztz opened this issue Dec 28, 2022 · 13 comments
Closed

Feature request: levenshtein with multibyte support #10180

tztztztz opened this issue Dec 28, 2022 · 13 comments

Comments

@tztztztz
Copy link

Description

Please, could you make mb_levenshtein function with multibyte support, so substituting for example 'a' with 'ą' would produce 1 instead of 2 in current non-multibyte levenshtein function?

There are implementations of Levenshtein alghoritm in pure php that support multibyte characters, but they are much slower that buit-in function written in C.

When you have to compare a word with whole dictionary containing millions of words it makes significant difference.

@highlyprofessionalscum
Copy link

jit may help

@cmb69
Copy link
Member

cmb69 commented Dec 29, 2022

That may make sense, but I presume it's harder than it looks (e.g. dealing with Unicode normalization issues). And I don't think this is a common requirement for PHP developers, so this should go through the RFC process. Do you want to pursue that?

An alternative might be to write a PECL extension, which might be a good first step anyway.

@tztztztz
Copy link
Author

For making things easier, it could be a function with extra parameter for providing matrix of substitutions like:

mb_levenshtein ($str1, $str2, [$substition_matrix], ...other parameters..);

For example for Polish language $substition_matrix could look like:

['a' => 'ą', 'A' => 'Ą', 'e' => 'ę', 'E' => 'Ę', ... and so on ...]

it would probably simplify the complexity of creating such function?

@alexdowad
Copy link
Contributor

I agree that an RFC is needed.

If there is significant popular demand, and an RFC is submitted and passes, then an implementation can be done. That is not a problem.

@tztztztz You are correct that built-in functions implemented in C are faster than those implemented in PHP user code. But this does not mean that the core developers should create built-in C-based functions for everything. Doing so would clutter up the PHP manual with a huge number of functions which most users don't want or need. It would also create a burden for the maintainers, and in the long run actually hinder ongoing improvements to the language and its implementation.

Sometimes, if you really, really need C/C++/Rust/etc. levels of CPU performance, the answer is to use C/C++/Rust/etc. For example, you could create your own C-based PHP extension, as @cmb69 suggested. Or, you could create an external binary and use exec, shell_exec, etc. to call it.

@info-universeorange
Copy link

Could you not just transform those accents before so replace: a accent with a.
It should not make a difference, or make the weight yourself when transforming chars.
When preprocessing a levensthein distance

@tztztztz
Copy link
Author

No, it makes a difference

@info-universeorange
Copy link

thats a complicated function to create.

Basically you might even want different cost for replacing an accent, because i think there are not a lot of different meanings along accents, only thing is, that you can do it wrong, writing the wrong accent and might want to give it a score then (an Exam)

by a levensthein distance a letter replacement can mean a different word with meaning, an accent replacement (or accent to ASCII) should and would not make such a difference only in audio.

Can you explain why it makes a difference except on learning a language and their accents.

How much accents are there in a sentence where you want the distance from. you have 255 symbols (ASCII) to choose from to replace the accents with a symbol each on both sides so you can use the current levensthein method ?

@alexdowad
Copy link
Contributor

Thanks for all the comments. If @tztztztz intends to create an RFC and seek approval to have this function added to PHP, I would like to suggest that any discussion can be done on the RFC and not here.

If @tztztztz does not intend to create an RFC, then this issue can be closed.

Thanks!

@tztztztz
Copy link
Author

@alexdowad do you want me to create RFC according to guidance included in this document:

https://wiki.php.net/rfc/howto

?

@alexdowad
Copy link
Contributor

@tztztztz Yes.

Just one note... the RFC howto document states: "If you don't have the skills to fully implement your RFC and no-one volunteers to code it, there is little chance your RFC will be successful." However, in this case, I don't think there is any problem with this. If the RFC passes and nobody else steps up to implement the new function, then I can do it. That is if the RFC passes, though.

Personally, I have no opinion about whether the RFC should be passed or not. You can work that out with the developers who have RFC voting rights.

@github-actions
Copy link

There has not been any recent activity in this feature request. It will automatically be closed in 14 days if no further action is taken. Please see https://github.com/probot/stale#is-closing-stale-issues-really-a-good-idea to understand why we auto-close stale feature requests.

@github-actions github-actions bot added the Stale label Jun 26, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 11, 2023
@youkidearitai youkidearitai self-assigned this Sep 25, 2024
@youkidearitai youkidearitai reopened this Sep 25, 2024
@youkidearitai
Copy link
Contributor

youkidearitai commented Sep 25, 2024

I tried implement to mb_levenshtein function. I discuss internal mailing lists.
#16043

@nielsdos
Copy link
Member

mb_levenshtein got declined, grapheme_levenshtein got merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants