Problem/Motivation
The filters 'htmlcorrector' and 'html' and the testing system need HTML parsing and a valid DOM to work with. This is done by the libxml2 library bundled with PHP, which cleans the HTML and transforms it into a DOM. Libxml2 assumes all HTML is HTML4 and corrects it using HTML4 rules. Because Drupal will be based on HTML5, typical HTML5 tags and constructions will be marked invalid, added, or stripped.
A small test example: <span lang="en"> is transformed into <span lang="en" xml:lang="en">.
Proposed resolution
Chosen solution
The consensus is that the best solution is to use html5-php (a rewrite of html5lib). The library already implements almost all the functionality that we need. There was a problem with the library, which has been fixed. The one missing piece: when the DOMDocument has a default namespace, it cannot be queried with XPath unless the namespace is registered under a prefix. The patch contains two helper classes that add this functionality.
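The namespace limitation can be reproduced with PHP's stock DOM extension. The following is a minimal sketch, not the helper classes from the patch: the sample markup and the prefix "x" are assumptions chosen for illustration, and it uses only DOMDocument::loadXML, DOMXPath::query, and DOMXPath::registerNamespace.

```php
<?php
// Hypothetical sample: a document whose root declares a default namespace,
// similar in shape to what an HTML5 parser may produce.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// In XPath 1.0 an unprefixed name matches only elements in no namespace,
// so this query does not find the namespaced <p>.
$plain = $xpath->query('//p');

// Register the default namespace under an arbitrary prefix ("x" is an
// assumption; any prefix works) and query with that prefix instead.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:p');

$plain_count = $plain->length;
$prefixed_count = $prefixed->length;
$first_text = $prefixed->item(0)->textContent;
```

A helper that hides this step presumably only needs to register the document's default namespace once and rewrite or prefix the queries, so callers can keep writing XPath as before.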
Remaining tasks
#2441373: [pp-1] Upgrade tests to HTML5
#2667340: Usage of field_prefix and container-inline creates invalid markup.
#2441811: Upgrade filter system to HTML5
@todo: needs an issue created, or an existing issue linked from this one, to convert the filter module to use masterminds/html5
User interface changes
None
API changes
Change in behavior of filter_dom_load, the HTML filter, and the HTML corrector filter
Beta phase evaluation
| Issue priority | Major because the bug has many repercussions on filters using the HTML filter, and under some circumstances could result in an invalid DOM. Not critical because it does not render the HTML filter unusable. |
|---|---|
| Prioritized changes | The main goal of this issue is a bugfix for HTML5 support, which is part of the Drupal 8 product. |
Blocked Issues
#1277290: Use a proper HTML parser for every core filter
Related Issues
PHP Bug #60021 DOMDocument errors on HTML5 tags
Issue from 2009 that replaced Drupal's own (faulty) function with libxml for the HTML corrector filter: #374441: Refactor Drupal HTML corrector (PHP5)
#725260: Use PHP's Tidy component for the clean-up HTML filter
Original report
The HTML corrector filter is currently based on XHTML, but Drupal 8 should output HTML5.
At the moment, <span lang="en"> is transformed into <span lang="en" xml:lang="en">
(see issue #1328768: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss)
Two possible causes:
function filter_dom_load currently loads @$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
function filter_dom_serialize uses $dom_document->saveXML
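
The round trip through loadHTML and saveXML can be tried in isolation. This is a simplified sketch, not the actual filter_dom_load or filter_dom_serialize code: serializing the children of body one by one is an assumption about how the output is collected, and the exact attributes libxml2 emits may vary with its version, so no expected output is claimed.

```php
<?php
// Input taken from the example in this report.
$text = '<span lang="en">Hello</span>';

// Wrap the fragment in the same XHTML 1.0 Strict skeleton and let the
// libxml2 HTML parser build the DOM. Parser warnings are suppressed with @.
$dom_document = new DOMDocument();
@$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');

// Serialize the children of <body> with saveXML (assumed collection strategy).
$body = $dom_document->getElementsByTagName('body')->item(0);
$output = '';
foreach ($body->childNodes as $child) {
  $output .= $dom_document->saveXML($child);
}

// Inspect the result to see which attributes the serializer wrote.
echo $output, "\n";
```

Comparing the printed markup with $text shows whether the XHTML doctype, the XML serializer, or both are responsible for the extra attribute.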