Problem/Motivation
The 'htmlcorrector' and 'html' filters and the testing system need to parse HTML and need a valid DOM to work with. This is done with the libxml2 library bundled with PHP, which cleans up the HTML and transforms it into a DOM. Libxml2 assumes all HTML is HTML4 and corrects it using HTML4 rules. As Drupal will be based on HTML5, typical HTML5 tags and constructions will be marked invalid, have markup added, or be stripped.
A small test example: <span lang="en"> is transformed to <span lang="en" xml:lang="en">.
Proposed resolution
There are three possible solutions:
- Suppress errors (plus some fine-tuning)
- Implement another parser in Drupal or Symfony
- Provide patches for PHP/libxml2 or wait until libxml2 gets fixed
Suppress errors
This solution takes the least effort. We don't need to worry about warnings for invalid HTML, as we only want a correct DOM structure out of possibly incorrect HTML soup, and possibly do not need to validate whether the input is valid HTML5 (?). If so, we have to change the doctype in filter_dom_load() and silence the libxml2 warnings with libxml_use_internal_errors(true). The same error suppression is also used in the Symfony DomCrawler: https://github.com/symfony/symfony/issues/1733
Patches for this are provided in the issue queue, but they create an invalid HTML5 DOM. One issue is that empty tags like <div></div> change to <div />, which is possibly a deal breaker.
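As a rough illustration of the approach described above, here is a minimal sketch that loads a fragment with an HTML5 doctype while collecting libxml2 errors internally instead of emitting warnings. The function name example_dom_load() is hypothetical and is not the patch from the issue queue; only standard PHP DOM and libxml functions are used.

```php
<?php
/**
 * Hypothetical helper (not the actual filter_dom_load() patch): parses an
 * HTML fragment into a DOMDocument without emitting libxml2 warnings.
 */
function example_dom_load(string $text): DOMDocument {
  // Collect parse errors internally; remember the previous setting.
  $previous = libxml_use_internal_errors(TRUE);

  $dom_document = new DOMDocument();
  // Assumption: an HTML5 doctype replaces the XHTML 1.0 Strict doctype.
  $dom_document->loadHTML('<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');

  // Discard the collected errors and restore the previous error handling.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $dom_document;
}
```

Note that this only hides the warnings; it does not make libxml2 apply HTML5 parsing rules, so the serialization problems mentioned above remain.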
Implement another parser in Drupal or Symfony
Current approach: try to use HTML Purifier.
Using a real HTML5 parser library might solve the problem. This can be implemented in Drupal or Symfony.
There are at the moment two serious candidates to investigate:
Provide patches for PHP/libxml2 or wait until libxml2 gets fixed
The best solution would be to improve libxml2 in PHP, as we are (hopefully :) not the only software that has to parse HTML5. But even if libxml2 were fixed, the fix would only ship in a future release of PHP, and Drupal cannot require that as a minimum version. One way around this is for Symfony to implement the new library for older PHP versions. This would keep Drupal clean and make Symfony an even stronger universal layer. A complication is that there doesn't even seem to be a ticket or roadmap for implementing HTML5 in libxml2.
Chosen solution
The consensus is that the best solution is to use html5-php (a rewrite of html5lib). The library already implements almost all the functionality we need. There was a problem with the library, which has been fixed. The functionality we are missing: when the DOMDocument has a default namespace, it is not possible to query it using XPath unless the namespace is registered under a prefix. The patch includes two helper classes to add this functionality.
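The default-namespace limitation can be reproduced with PHP's DOM extension alone. The sketch below is only an illustration and does not show the helper classes from the patch; the prefix 'h' is an arbitrary choice, and the document is loaded with loadXML() simply to get elements in the XHTML namespace.

```php
<?php
// A small document whose elements live in a default namespace.
$dom = new DOMDocument();
$dom->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>');

$xpath = new DOMXPath($dom);

// An unprefixed name in XPath 1.0 only matches elements in no namespace,
// so this query finds nothing.
$unprefixed = $xpath->query('//p');

// Registering the namespace under a prefix makes the elements reachable.
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//h:p');

printf("unprefixed: %d, prefixed: %d\n", $unprefixed->length, $prefixed->length);
```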
Remaining tasks
- Review
- Commit
User interface changes
None
API changes
Change in behavior of filter_dom_load, the html-filter and html-corrector
Blocked Issues
#1277290: Use a proper HTML parser for every core filter
Related Issues
PHP Bug #60021 DOMDocument errors on HTML5 tags
Issue that replaced Drupal's own (faulty) function with libxml for the HTML corrector filter in 2009: #374441: Refactor Drupal HTML corrector (PHP5)
#725260: Use PHP's Tidy component for the clean-up HTML filter
Original report
The HTML corrector filter is currently based on XHTML, but Drupal 8 should output HTML5.
At the moment, <span lang="en"> is transformed to <span lang="en" xml:lang="en">
(see issue #1328768: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss)
Two possible causes:
- function filter_dom_load currently loads @$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
- function filter_dom_serialize uses $dom_document->saveXML