Problem/Motivation
Our current transliteration class has a method to remove diacritics from certain Latin characters. It would be desirable to expand this to other scripts.
The current behavior is custom. It was written to remove diacritics from a range of characters with an eye on how HTML entities are named, not from a Unicode perspective. The commonly suggested approach, used by both LibreOffice and Firefox, is to apply the Unicode decomposition rules, remove every character in Unicode major category "M" (aka combining characters), and then recompose the result. Our solution is different. For example, it changes o with stroke to plain o, whereas, as this SO answer notes, Unicode doesn't decompose that character and leaves it alone. There's no right or wrong answer as to whether it should be rewritten: Wikipedia notes this character is treated as a separate letter in Norwegian and Danish, so it probably shouldn't be rewritten for those languages, but in other languages it perhaps should be.
In other words, the word Øresund is currently stored as oresund, but if we were to follow the standard way above it would be stored with the Ø left as is. Searching for oresund finds it today, but would not if we followed the standard way.
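The decompose / strip / recompose route described above can be sketched in a few lines. This is only an illustration in Python using the standard unicodedata module, not the PHP code in question; the function name strip_marks is made up for the example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, drop every category-M code point, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    )
    return unicodedata.normalize("NFC", kept)


print(strip_marks("Škoda"))    # Skoda: Š decomposes to S + combining caron
print(strip_marks("Øresund"))  # Øresund: Ø has no decomposition, so it survives
```

The second call shows the difference from the current behavior: the stroke is part of the letter as far as Unicode is concerned, so nothing is removed.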
Out of the 323 characters falling within the ranges PhpTransliteration::removeDiacritics() operates on, as many as 48 (~15%) exhibit different behavior under the suggested standard route compared to the current one. Six of them are genuine bugs fixed in #3151364: diacritics are not removed from ǢǣǼǽǮǯ.
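For the six characters above, the Unicode data does carry a decomposition, which is why the standard route handles them. A small Python illustration (not the PHP code; base_letter is a name invented for this sketch):

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return the first code point of the NFD form, i.e. the letter without its marks."""
    return unicodedata.normalize("NFD", ch)[0]


for ch in "ǢǣǼǽǮǯ":
    marks = unicodedata.normalize("NFD", ch)[1:]
    # Print the original, its base letter, and the names of the combining marks.
    print(ch, "->", base_letter(ch), [unicodedata.name(m) for m in marks])
```

Each of the six splits into a base letter (Æ, æ, Ʒ or ʒ) followed by a single combining mark, so stripping category "M" leaves the base letter.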
Proposed resolution
If intl is not enabled, we stick to the current behavior. It is certainly possible to generate the NFD and NFC tables from the UCD, https://gist.github.com/hoehrmann/5652435 and the XO_NFD / XO_NFC properties, but I doubt the effort is worth it, and the result, especially for NFC, would be abominably slow.
If intl is enabled, we introduce a pair of new classes, similar to the current Php pair: the Component one does the work and the Core one handles the alter. We also pass the langcode from search to removeDiacritics(). We add the rules necessary to achieve parity with PhpTransliteration on the range it operates on.
Remaining tasks
Immediate: write the test for the alter. Possibly separate the component tests for the two classes by moving the class name to a constant, extending the test class for intl, and skipping the test if intl is not enabled.
Either here or in follow-ups: give the community some way to supply sensible rules and settings as appropriate. A Hungarian site for a Hungarian audience doesn't want to remove the diacritics Hungarian uses at all. However, a site about Hungary for a more international audience? I do not even know.
User interface changes
API changes
TransliterationInterface::removeDiacritics() gets a new, optional argument.
Data model changes
Release notes snippet
Original report
When removing diacritics in the function search_simplify(), Arabic diacritics are not considered. How to test: add this text to any article: "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ" then search for "السلام". Learning from this article, I developed an ugly patch to fix it. It may be better to move it to PhpTransliteration.php, but I am not sure.
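The test case above can be reproduced outside Drupal. The following Python sketch is only an illustration of why stripping Unicode category "M" also covers Arabic vowel marks; it is not the patch mentioned in the report, and strip_marks is a name chosen for the example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks (category M), which includes the Arabic harakat."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return unicodedata.normalize("NFC", kept)


sample = "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ"
simplified = strip_marks(sample)

# After stripping, the undecorated search term should be a plain substring.
print("السلام" in simplified)
```

Damma, fatha, kasra, sukun and shadda are all non-spacing marks, so no Arabic-specific rule is needed under this approach.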