PHPFixing

Tuesday, May 17, 2022

[FIXED] How to prevent DOMDocument from converting &nbsp; to unicode

Tags: domdocument, php

Issue

I am trying to get the inner HTML of a DOMElement in PHP. Example markup:

<div>...</div>
<div id="target"><p>Here's some &nbsp; <em>funny</em> &nbsp; text</p></div>
<div>...</div>
<div>...</div>

With the above string stored in the variable $html, I run:

$doc = new DOMDocument();
@$doc->loadHTML("<html><body>$html</body></html>");
$node = $doc->getElementById('target');
$markup = '';
foreach ($node->childNodes as $child) {
  $markup .= $child->ownerDocument->saveXML($child);
}

The resulting $markup string looks like this (converted to JSON to reveal the invisible characters):

"<p>Here's some \u00a0 <em>funny<\/em> \u00a0 text<\/p>"

All &nbsp; entities have been converted to Unicode non-breaking spaces (U+00A0), which breaks my application.
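
For readers who want to reproduce the check, here is a minimal sketch of how json_encode() can be used to reveal the invisible character. It assumes the markup already contains a UTF-8 non-breaking space; the variable names are illustrative and not from the original post.

```php
<?php
// Illustrative input: "\u{a0}" is U+00A0 (NO-BREAK SPACE), the character
// that DOMDocument produces from &nbsp;.
$markup = "<p>Here's some \u{a0} <em>funny</em> \u{a0} text</p>";

// With default flags, json_encode() escapes non-ASCII characters as \uXXXX
// and forward slashes as \/, so the non-breaking spaces become visible.
$encoded = json_encode($markup);
echo $encoded, PHP_EOL;

// In UTF-8 the same character is stored as the two bytes 0xC2 0xA0.
$nbspBytes = bin2hex("\u{a0}");
echo $nbspBytes, PHP_EOL;
```

The second echo shows why any byte-level search for this character has to look for the two-byte sequence rather than a single byte.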

Ideally, there would be a way to retrieve the original HTML string inside the target div as-is, without DOMDocument altering it at all. That doesn't seem to be possible, so the next best thing would be to turn off this character conversion. So far I've tried:

  • Setting $doc->substituteEntities = false; with no result. Changing it to true doesn't help either.
  • Toggling $doc->preserveWhiteSpace, with no change either way.
  • Changing saveXML to saveHTML, which makes no difference.

Finally, I resorted to tacking on this hack, which works but doesn't feel like the right solution:

$markup = str_replace("\xc2\xa0", '&nbsp;', $markup);

Surely there is a better way?


Solution

You can use mb_convert_encoding() to convert the non-ASCII characters back to HTML entities without touching the markup itself (angle brackets, quotes, and so on):

<?php
$html = '
<div>...</div>
<div id="target"><p>Here\'s some &nbsp; <em>funny</em> &nbsp; text</p></div>
<div>...</div>
<div>...</div>
';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML("<html><body>$html</body></html>");
$node = $doc->getElementById('target');
$markup = '';
foreach ($node->childNodes as $child) {
  $markup .= $child->ownerDocument->saveHTML($child);
}

// Turn non-ASCII characters such as U+00A0 back into entities; tags are left alone.
$markup = mb_convert_encoding($markup, 'HTML-ENTITIES', 'UTF-8');
echo $markup;

Output:

<p>Here's some &nbsp; <em>funny</em> &nbsp; text</p>
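
Note that the 'HTML-ENTITIES' pseudo-encoding used with mb_convert_encoding() is deprecated as of PHP 8.2. One possible substitute, sketched here rather than taken from the original answer, builds a replacement map from get_html_translation_table() and drops the entries for &, < and > so that existing tags are left alone. The function name encodeNonAsciiEntities is hypothetical, and the choice of the HTML 4.01 entity table is an assumption; characters without a named entity in that table pass through unchanged.

```php
<?php
/**
 * Replace characters that have a named HTML 4.01 entity (such as U+00A0)
 * with that entity, leaving &, <, > and quotes untouched so tags survive.
 * Hypothetical helper, not part of the original answer.
 */
function encodeNonAsciiEntities(string $markup): string
{
    // Character => entity map, e.g. "\u{a0}" => "&nbsp;".
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

    // Remove the markup-significant characters from the map.
    unset($table['&'], $table['<'], $table['>']);

    // strtr() with an array replaces each key with its mapped value.
    return strtr($markup, $table);
}

// Illustrative input containing two UTF-8 non-breaking spaces.
$markup = "<p>Here's some \u{a0} <em>funny</em> \u{a0} text</p>";
$result = encodeNonAsciiEntities($markup);
echo $result, PHP_EOL;
```

Unlike the mb_convert_encoding() call, this sketch does not emit numeric entities for characters outside the table, so it fits best when only common named entities such as &nbsp; need to be restored.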


Answered By - miken32
Answer Checked By - Katrina (PHPFixing Volunteer)