Problem/Motivation
Under certain circumstances, Drupal\Component\Utility\Html::normalize() (in D7, it's _filter_htmlcorrector()) will add a messy </body></html> to the resulting HTML. This happens when the HTML ends in the middle of an attribute, for example:
<p>Here <img alt="ao
This will produce output like that in the screenshot.
You can reproduce this on Drupal 7 or 8 by following these steps:
Drupal 8
- Install Drupal 8.3.1 with the standard profile
- Go to /admin/structure/types/manage/article/display/teaser and configure the "Body" field to trim at 20 characters
- Go to /admin/config/content/formats/manage/basic_html and both (a) enable the "Correct faulty and chopped off HTML" filter and (b) disable the "Restrict images to this site" filter (only necessary for the example HTML, not necessary to trigger the bug)
- Go to /node/add/article and use this HTML as the body (be sure to click the "Source" button in the WYSIWYG toolbar before pasting it, otherwise you're adding text, not HTML):
  Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png" /> it is
- Go to /node and observe output like in the screenshot
Drupal 7
- Install Drupal 7.54 with the standard profile
- Go to /admin/structure/types/manage/article/display/teaser and configure the "Body" field to trim at 20 characters
- Go to /admin/config/content/formats/filtered_html and add <img> to the "Allowed HTML tags" (under "Limit allowed HTML tags")
- Go to /node/add/article and use this HTML as the body:
  Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png" /> it is
- Go to /node and observe output like in the screenshot
Proposed resolution
Remove broken tags at the end of the text before parsing into the DOM. Views has some code that does this to avoid this exact problem when trimming fields in Views. Here's the Views code:
$value = rtrim(preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value));
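To make the effect of that expression concrete, here is a minimal sketch that wraps it in a standalone function and runs it on the truncated examples from this issue. The function name strip_trailing_fragment() is hypothetical (it is not a Drupal or Views API), and the null fallback is an addition for the sketch, not part of the Views code.

```php
<?php

/**
 * Removes an unterminated tag or entity from the end of a string.
 *
 * Hypothetical helper for illustration only; the regular expression is the
 * one quoted from Views above.
 */
function strip_trailing_fragment(string $value): string {
  // Match a "<" with no ">" anywhere after it, or an "&" with no ";"
  // anywhere after it, and drop everything from that point to the end.
  // The "s" modifier lets "." span newlines; "u" treats input as UTF-8.
  $result = preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value);
  // Assumption for this sketch: preg_replace() returns NULL on failure
  // (for example, invalid UTF-8), so fall back to the original input.
  return rtrim($result ?? $value);
}

// Text cut off in the middle of an attribute.
echo strip_trailing_fragment('<p>Here <img alt="ao'), "\n";
// prints: <p>Here

// Text cut off in the middle of a link tag.
echo strip_trailing_fragment('Nulla sed risus eu ipsum venenatis <a class="sample" href="http://www.example.com/partial/path'), "\n";
// prints: Nulla sed risus eu ipsum venenatis

// Text cut off in the middle of an entity; the complete entity survives.
echo strip_trailing_fragment('Fish &amp; chips &am'), "\n";
// prints: Fish &amp; chips
```

Note that the expression only removes the broken fragment; the unclosed <p> in the first example is left for the HTML corrector to close afterwards.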
Remaining tasks
- Port patch to Drupal 8
- Write automated tests
- Review
- Backport final patch to Drupal 7
User interface changes
None.
API changes
None.
Data model changes
None.
Original summary
_filter_htmlcorrector() leaves in place any fragmentary tags that are passed to it, and these can break the rest of the page. This is most likely to happen when using the "Field can contain HTML" filter in the Views module, but it can also occur any other time a developer trims a string that contains HTML and passes it to this function.
Example (trimmed to 250 chars):
Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis <a class="sample" href="http://www.example.com/partial/path
Output is identical to input, breaking any HTML that follows on the page. Ideal output would be:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis
Patch attached.