-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Feature request: levenshtein with multibyte support #10180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
jit may help |
That may make sense, but I presume it's harder than it looks (e.g. dealing with Unicode normalization issues). And I don't think this is a common requirement for PHP developers, so this should go through the RFC process. Do you want to pursue that? An alternative might be to write a PECL extension, which might be a good first step anyway. |
For making things easier, it could be a function with extra parameter for providing matrix of substitutions like: mb_levenshtein ($str1, $str2, [$substition_matrix], ...other parameters..); For example for Polish language $substition_matrix could look like: ['a' => 'ą', 'A' => 'Ą', 'e' => 'ę', 'E' => 'Ę', ... and so on ...] it would probably simplify the complexity of creating such function? |
I agree that an RFC is needed. If there is significant popular demand, and an RFC is submitted and passes, then an implementation can be done. That is not a problem. @tztztztz You are correct that built-in functions implemented in C are faster than those implemented in PHP user code. But this does not mean that the core developers should create built-in C-based functions for everything. Doing so would clutter up the PHP manual with a huge number of functions which most users don't want or need. It would also create a burden for the maintainers, and in the long run actually hinder ongoing improvements to the language and its implementation. Sometimes, if you really, really need C/C++/Rust/etc. levels of CPU performance, the answer is to use C/C++/Rust/etc. For example, you could create your own C-based PHP extension, as @cmb69 suggested. Or, you could create an external binary and use |
Could you not just transform those accents before so replace: a accent with a. |
No, it makes a difference |
thats a complicated function to create. Basically you might even want different cost for replacing an accent, because i think there are not a lot of different meanings along accents, only thing is, that you can do it wrong, writing the wrong accent and might want to give it a score then (an Exam) by a levensthein distance a letter replacement can mean a different word with meaning, an accent replacement (or accent to ASCII) should and would not make such a difference only in audio. Can you explain why it makes a difference except on learning a language and their accents. How much accents are there in a sentence where you want the distance from. you have 255 symbols (ASCII) to choose from to replace the accents with a symbol each on both sides so you can use the current levensthein method ? |
@alexdowad do you want me to create RFC according to guidance included in this document: https://wiki.php.net/rfc/howto ? |
@tztztztz Yes. Just one note... the RFC howto document states: "If you don't have the skills to fully implement your RFC and no-one volunteers to code it, there is little chance your RFC will be successful." However, in this case, I don't think there is any problem with this. If the RFC passes and nobody else steps up to implement the new function, then I can do it. That is if the RFC passes, though. Personally, I have no opinion about whether the RFC should be passed or not. You can work that out with the developers who have RFC voting rights. |
There has not been any recent activity in this feature request. It will automatically be closed in 14 days if no further action is taken. Please see https://github.com/probot/stale#is-closing-stale-issues-really-a-good-idea to understand why we auto-close stale feature requests. |
I tried implement to |
mb_levenshtein got declined, grapheme_levenshtein got merged. |
Description
Please, could you make mb_levenshtein function with multibyte support, so substituting for example 'a' with 'ą' would produce 1 instead of 2 in current non-multibyte levenshtein function?
There are implementations of Levenshtein alghoritm in pure php that support multibyte characters, but they are much slower that buit-in function written in C.
When you have to compare a word with whole dictionary containing millions of words it makes significant difference.
The text was updated successfully, but these errors were encountered: