Postponed on #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters
Problem/Motivation
The Drupal MySQL driver does not currently provide full UTF-8 support. The (confusingly named) MySQL utf8 charset only supports the Basic Multilingual Plane (one of 17 planes in total), which covers about 6% of possible code points. As of MySQL 5.5.3 (2010), the utf8mb4 charset provides full UTF-8 support. It is completely backwards compatible: it requires no more space than utf8 for characters that are in the utf8 set, and uses one extra byte for characters outside of it.
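To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch. The helper names utf8_width and fits_mysql_utf8 are hypothetical and not part of Drupal or MySQL; the check simply treats anything above U+FFFF as outside what MySQL's utf8 can store.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane (<= U+FFFF),
    i.e. needs at most 3 bytes in UTF-8 and so fits MySQL's utf8 charset."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Latin letter, accented letter, euro sign, emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
widths = [utf8_width(ch) for ch in samples]
```

Only the last sample needs a fourth byte, which is the extra byte utf8mb4 allows for.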
For reference, WordPress implemented this in February 2015: https://core.trac.wordpress.org/ticket/21212
With the current partial UTF-8 support we lack the 16 other character planes. This would include:
- Emojis!
- Mathematical/scientific symbols
- Some Chinese, Japanese and Korean characters
- Musical notation
- Obscure/ancient languages
- Emoticons, astral symbols, game symbols and other pictographic sets
See also:
- Wikipedia
- http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
- http://mathiasbynens.be/notes/mysql-utf8mb4
The Drupal PostgreSQL and SQLite drivers do implement full UTF-8 support.
Proposed resolution
Use the utf8mb4 character set by default.
Not relevant anymore after #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters. We can just reduce the size of any UTF8 primary keys/unique fields to 190.
The original concern was: implementing utf8mb4 would currently pose a problem with indexes. In order to have large enough indexes on varchar fields we need the innodb_large_prefix option (available as of MySQL 5.5.14) to be enabled.
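The index concern comes down to simple arithmetic. The following is a rough sketch, assuming InnoDB's default 767-byte limit on index key prefixes (a figure not stated in this summary) and the worst-case bytes per character for each charset. The function names are hypothetical.

```python
# Assumption: default InnoDB index key prefix limit in bytes,
# i.e. without innodb_large_prefix enabled.
INNODB_PREFIX_LIMIT = 767

# Worst-case storage per character for each MySQL charset.
BYTES_PER_CHAR = {"ascii": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset: str, limit: int = INNODB_PREFIX_LIMIT) -> int:
    """Largest column length (in characters) that can be fully indexed."""
    return limit // BYTES_PER_CHAR[charset]


def index_fits(length: int, charset: str, limit: int = INNODB_PREFIX_LIMIT) -> bool:
    """True if a column of `length` characters fits in the index prefix limit."""
    return length * BYTES_PER_CHAR[charset] <= limit


limits = {name: max_indexable_chars(name) for name in BYTES_PER_CHAR}
```

Under these assumptions a utf8mb4 column can be fully indexed up to 191 characters, so the 190 mentioned above stays within the limit, while a 255-character column only fits with ascii or utf8.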
Remaining tasks
- Unblock this issue by fixing #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters. Once that is in, confirm that we don't have any UTF8 prefixes left that are larger than 190 characters.
- #2473301: Raise MySQL requirement to 5.5.3
- Add a note to Drupal system requirements page that people need to re-configure mysql innodb_large_prefix to support this. Not needed anymore after #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters.
User interface changes
N/A
API changes
N/A
Original Issue Summary by mdupont
According to the MySQL documentation, its "utf8" encoding only supports a maximum of 3 bytes per character.
As such, if we try to save valid UTF-8 content that passes drupal_validate_utf8() but contains 4-byte characters, it triggers buggy behavior (see #1199736: Error on pasting in different language):
- in D6, content is truncated at the first 4-byte character; any content after it is not saved (data loss)
- in D7 and later, it triggers an error in PDO: "PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value:"
When using SQLite there are no issues with 4-byte characters. I haven't tested with PostgreSQL.
MySQL can handle 4-byte characters since version 5.5.3 by using the "utf8mb4" encoding.
How could we avoid data loss or error when using regular MySQL utf8 encoding?
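One possible stopgap, sketched in Python for illustration only: detect 4-byte characters before saving and substitute the Unicode replacement character, so the write neither truncates nor raises an error. This is an assumption about how such a filter could look, not something Drupal does; the function names are hypothetical.

```python
import re

# Matches any code point outside the Basic Multilingual Plane,
# i.e. any character that needs 4 bytes in UTF-8.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def has_four_byte_chars(text: str) -> bool:
    """True if the text contains at least one 4-byte UTF-8 character."""
    return _NON_BMP.search(text) is not None


def replace_four_byte_chars(text: str, replacement: str = "\ufffd") -> str:
    """Replace each 4-byte character so the text fits MySQL's 3-byte utf8."""
    return _NON_BMP.sub(replacement, text)


cleaned = replace_four_byte_chars("caf\u00e9 \U0001F600 ok")
```

The substitution is lossy, since the original characters are gone after saving, so it only trades an error for a visible placeholder.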