Problem/Motivation
The 8.x change "Search removes diacritics in indexing rather than relying on database collation" described in https://www.drupal.org/node/2447357 (based on the issue at https://www.drupal.org/node/731298) is incompatible with several languages. It adds a call to a new removeDiacritics function, \Drupal::service('transliteration')->removeDiacritics($text), to the search_simplify function. This call always runs (both during indexing and during actual searches) and removes all diacritical marks from the input text. It runs before hook_search_preprocess implementations have had a chance to process the text, which makes e.g. stemming (a suggested use case for the hook) impossible when the stemmer relies on accented characters being present.
Some examples of common stemming algorithms that expect input to have accented characters to produce reliable results are the following Snowball stemmers:
- Swedish: cannot replace 'löst' with 'lös' in step 3 when input only has 'lost',
- Danish: similarly, in step 3, cannot replace 'løst' with 'løs',
- Italian: in step 1, 'ità' won't get removed; in step 2, 'erà', 'erò' or 'irà' won't get removed,
- Spanish: all steps rely on the existence of diacritical marks, and
- French: all steps rely on the existence of diacritical marks.
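The Swedish case from the list above can be reproduced in isolation. The following is only a minimal sketch: toy_remove_diacritics() and toy_swedish_step3() are hypothetical stand-ins written for this illustration, not the core removeDiacritics implementation or a real Snowball stemmer, and they cover only the characters and the single rule mentioned above.

```php
<?php
// Hypothetical stand-in for removeDiacritics(); it only maps the few
// characters used in the examples above.
function toy_remove_diacritics(string $text): string {
  return strtr($text, ['ö' => 'o', 'ø' => 'o', 'à' => 'a']);
}

// Hypothetical, heavily simplified stand-in for one rule of the Swedish
// Snowball stemmer (step 3): a trailing 'löst' is replaced with 'lös'.
function toy_swedish_step3(string $word): string {
  return preg_replace('/löst$/u', 'lös', $word);
}

$word = 'arbetslöst';

// Stemming the original text: the rule matches.
echo toy_swedish_step3($word), "\n"; // arbetslös

// Stemming after diacritics were removed: the rule can no longer match.
echo toy_swedish_step3(toy_remove_diacritics($word)), "\n"; // arbetslost
```

Because the diacritics are stripped first, the stemmer never sees 'löst', so the inflected form and its stem end up as different index terms.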
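Removing diacritics in the actual search phase also makes the search too greedy: words that differ only in their diacritical marks (e.g. 'löst' and 'lost') become the same search term, so the results include content unrelated to what the user was searching for.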
Proposed resolution
There are several possibilities:
- Don't remove diacritical marks.
- Make removing them optional per language (and probably per diacritical mark for best results) as originally planned in the linked issue, with sensible defaults.
- Run the diacritical mark removal function after hook_search_preprocess implementations.
Workaround
It's possible to change the class that provides the transliteration service to a custom class. This custom class can extend PhpTransliteration to retain all of its functionality, but override removeDiacritics with a version that does not alter the input text. The override will then get called instead of the one in PhpTransliteration.
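One way this workaround might look is sketched below. The module name "mymodule" and both class names are hypothetical, and the sketch assumes the service is swapped through a service provider's alter() method, which is one common way to change a core service's class. Note that this also changes the behavior for any other code that calls removeDiacritics on the transliteration service, not only for search.

```php
<?php
// Sketch only. In a real module the two classes live in separate files:
//   mymodule/src/NoDiacriticsRemovalTransliteration.php
//   mymodule/src/MymoduleServiceProvider.php
// "mymodule" and both class names are hypothetical.

namespace Drupal\mymodule;

use Drupal\Core\DependencyInjection\ContainerBuilder;
use Drupal\Core\DependencyInjection\ServiceProviderBase;
use Drupal\Core\Transliteration\PhpTransliteration;

/**
 * Transliteration class whose removeDiacritics() leaves the text untouched.
 */
class NoDiacriticsRemovalTransliteration extends PhpTransliteration {

  /**
   * {@inheritdoc}
   */
  public function removeDiacritics($string) {
    // Return the input unchanged so that accented characters are still
    // present when hook_search_preprocess() implementations run.
    return $string;
  }

}

/**
 * Points the 'transliteration' service at the class above.
 */
class MymoduleServiceProvider extends ServiceProviderBase {

  /**
   * {@inheritdoc}
   */
  public function alter(ContainerBuilder $container) {
    // Keep the existing service definition (including its arguments) and
    // only replace the class that is instantiated.
    $container->getDefinition('transliteration')
      ->setClass('Drupal\mymodule\NoDiacriticsRemovalTransliteration');
  }

}
```

After enabling such a module, the container has to be rebuilt (for example by clearing caches), and existing content would need to be reindexed so that the index and the search queries are processed the same way.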