Channel: Issues for Drupal core

[Meta] PHP DOM (libxml2) only understands XHTML4, misinterprets HTML5, but D8 must cope with HTML5


Problem/Motivation

The 'htmlcorrector' and 'html' filters and the testing system need HTML parsing and a valid DOM to work with. This is done by the libxml2 library bundled with PHP, which cleans up the HTML and transforms it into a DOM. Libxml2 assumes all HTML is HTML4 and corrects it using HTML4 rules. As Drupal will be based on HTML5, typical HTML5 tags and constructions will be marked invalid, and markup will be added or stripped.
A small test example: <span lang="en"> is transformed to <span lang="en" xml:lang="en">

Proposed resolution

Chosen solution

The consensus is that the best solution is to use html5-php (a rewrite of html5lib). The library already implements almost all the functionality we need. There was a problem with the library, which has since been fixed. The one piece of functionality still missing concerns namespaces: when the DOMDocument has a default namespace, it cannot be queried with XPath unless that namespace is registered under a prefix. The patch contains two helper classes that add this functionality.
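
The default-namespace limitation can be reproduced with PHP's built-in DOM extension alone. The following is a minimal sketch, not the helper classes from the patch: the prefix `h` and the sample markup are assumptions chosen for illustration, and the document is loaded with loadXML so that it really carries a default namespace.

```php
<?php
// Sample document with a default (unprefixed) XHTML namespace.
// The markup is illustrative only.
$doc = new DOMDocument();
$doc->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>');

$xpath = new DOMXPath($doc);

// XPath 1.0 has no notion of a default namespace: an unprefixed name
// only matches elements in no namespace, so this finds nothing.
$unprefixed = $xpath->query('//p');

// Registering the namespace under an arbitrary prefix (here the
// hypothetical "h") makes the same element reachable.
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//h:p');

printf("unprefixed matches: %d, prefixed matches: %d\n", $unprefixed->length, $prefixed->length);
```

Any helper that hides this from callers would presumably have to register the prefix and rewrite the queries accordingly; how the patch does that is not shown here.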

Remaining tasks

#2441373: [pp-1] Upgrade tests to HTML5
#2667340: Usage of field_prefix and container-inline creates invalid markup.
#2441811: Upgrade filter system to HTML5

@todo: create an issue (or link an existing one from this issue) for converting the filter module to use masterminds/html5

User interface changes

None

API changes

Change in behavior of filter_dom_load, the html-filter and html-corrector

Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue priority: Major, because the bug has many repercussions for filters that use the HTML filter, and under some circumstances it could result in an invalid DOM. Not critical, because it does not render the HTML filter unusable.
Prioritized changes: The main goal of this issue is a bug fix for HTML5 support, which is part of the Drupal 8 product.

Blocked Issues

#1277290: Use a proper HTML parser for every core filter

PHP Bug #60021 DOMDocument errors on HTML5 tags
Issue that replaced Drupal's own (faulty) function with libxml for the HTML corrector filter in 2009: #374441: Refactor Drupal HTML corrector (PHP5)
#725260: Use PHP's Tidy component for the clean-up HTML filter

Original report

The HTML corrector filter is currently based on XHTML, but Drupal 8 should output HTML5.
At the moment:
<span lang="en"> transforms to <span lang="en" xml:lang="en">
(see issue #1328768: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss)

Two possible causes:
The function filter_dom_load currently loads @$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');

The function filter_dom_serialize uses $dom_document->saveXML
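
To illustrate why serializing with saveXML matters, here is a rough sketch of the wrap, load, and serialize pattern described above. It is not the actual filter_dom_load or filter_dom_serialize code: the function name `sketch_serialize_body` is hypothetical, and the wrapper uses a simplified `<!DOCTYPE html>` instead of the XHTML 1.0 Strict doctype quoted in the report.

```php
<?php
// Hypothetical helper: wraps a fragment in a full document, loads it
// with DOMDocument::loadHTML(), and serializes the body's children
// twice, once as XML and once as HTML, for comparison.
function sketch_serialize_body(string $text): array {
  $doc = new DOMDocument();
  // Collect parser warnings internally instead of emitting them.
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML('<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $doc->getElementsByTagName('body')->item(0);
  $xml = '';
  $html = '';
  foreach ($body->childNodes as $child) {
    $xml .= $doc->saveXML($child);   // XML rules: void elements self-close.
    $html .= $doc->saveHTML($child); // HTML rules: void elements stay open.
  }
  return ['xml' => $xml, 'html' => $html];
}

$out = sketch_serialize_body('<p>One<br>Two</p>');
echo $out['xml'], "\n";
echo $out['html'], "\n";
```

With XML serialization the line break is written as `<br/>`, while HTML serialization writes `<br>`, which shows that the choice of serializer alone already changes the markup the filter emits.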

