Convert named entities to numeric in PHP
XML doesn't recognise most named HTML entities (e.g. &nbsp;), so if you're taking HTML content and presenting it as XML, you will either need to declare those entities yourself, or convert them. The easiest way to convert them is to make your XML document UTF-8 (that's the default anyway) and then use PHP's built-in html_entity_decode() function:
<?php
$out = htmlspecialchars(
    html_entity_decode($in, ENT_QUOTES, 'UTF-8'),
    ENT_QUOTES, 'UTF-8'
);
?>
The htmlspecialchars() is there to make sure that <, >, and & all get converted. You don't need it if you're using an XML writer, like SimpleXML or XMLWriter (you'll just end up double-escaping, which makes you look silly). In this example I'm converting all quotes (single and double) to entities (&quot; and &#039;) because it's the paranoid option, which is always nice for example code. If you aren't actually printing inside an XML tag (e.g., the content of an attribute) then you can safely use ENT_NOQUOTES.
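To illustrate the point about XML writers, here is a minimal sketch using XMLWriter, assuming the xmlwriter extension is available. The element name and sample string are made up for the example. The decoded text goes straight into text(), with no htmlspecialchars() call, because the writer does the escaping itself.

```php
<?php
// Sample input; the content is invented for this example.
$in = 'Fish &amp; chips &lt; caf&eacute;';

// Decode every named entity to a real UTF-8 character first.
$text = html_entity_decode($in, ENT_QUOTES, 'UTF-8');

$writer = new XMLWriter();
$writer->openMemory();
$writer->startElement('item');   // 'item' is an arbitrary element name
$writer->text($text);            // the writer escapes & and < for us
$writer->endElement();
$xml = $writer->outputMemory();

echo $xml, "\n";
```

Wrapping $text in htmlspecialchars() before passing it to text() would give you &amp;amp; in the output, which is the double-escaping mentioned above.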
There are two possible problems with this approach. The first is invalid entities: html_entity_decode() won't touch them, which means you'll still get XML errors. The second is encoding. I suppose it's possible that you don't actually want UTF-8. You should, because it's awesome, but maybe you have a good reason. If you don't tell html_entity_decode() to use UTF-8, it won't convert entities that don't exist in the character set you specify. If you tell it to output in UTF-8 and then use something like iconv() to convert it, then you'll lose any characters that aren't in the output encoding.
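Both problems are easy to see in a few lines. This is only a sketch with an invented input string: &bogus; stands in for any invalid entity, and ISO-8859-1 stands in for "an encoding that isn't UTF-8".

```php
<?php
// Invented input: one Latin-1 entity, one invalid entity, one entity
// (&euro;) that has no character in ISO-8859-1.
$in = 'caf&eacute; &bogus; &euro;';

// Problem 1: the invalid entity is left untouched, even with UTF-8.
$utf8 = html_entity_decode($in, ENT_QUOTES, 'UTF-8');

// Problem 2: with a smaller character set, &euro; cannot be represented,
// so it is left as a named entity as well.
$latin1 = html_entity_decode($in, ENT_QUOTES, 'ISO-8859-1');

var_dump(strpos($utf8, '&bogus;') !== false);   // bool(true)
var_dump(strpos($latin1, '&euro;') !== false);  // bool(true)
```

Either leftover will still trip up an XML parser, which is what the functions below are for.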
This pair of functions converts all named entities to numeric entities, and gets rid of all invalid entities. It should leave existing numeric entities alone, so it's safe (but pointless) to run it multiple times on the same input. There are some notes about the code further down.
<?php
/* html_convert_entities($string) -- convert named HTML entities to
 * XML-compatible numeric entities.
 */
function html_convert_entities($string) {
    return preg_replace_callback('/&([a-zA-Z][a-zA-Z0-9]+);/',
                                 'convert_entity', $string);
}

/* Swap HTML named entity with its numeric equivalent. If the entity
 * isn't in the lookup table, this function returns a blank, which
 * destroys the character in the output - this is probably the
 * desired behaviour when producing XML. */
function convert_entity($matches) {
    static $table = array('quot' => '&#34;',
                          'amp' => '&#38;',
                          'lt' => '&#60;',
                          'gt' => '&#62;',
                          'OElig' => '&#338;',
                          'oelig' => '&#339;',
                          'Scaron' => '&#352;',
                          'scaron' => '&#353;',
                          'Yuml' => '&#376;',
                          'circ' => '&#710;',
                          'tilde' => '&#732;',
                          'ensp' => '&#8194;',
                          'emsp' => '&#8195;',
                          'thinsp' => '&#8201;',
                          'zwnj' => '&#8204;',
                          'zwj' => '&#8205;',
                          'lrm' => '&#8206;',
                          'rlm' => '&#8207;',
                          'ndash' => '&#8211;',
                          'mdash' => '&#8212;',
                          'lsquo' => '&#8216;',
                          'rsquo' => '&#8217;',
                          'sbquo' => '&#8218;',
                          'ldquo' => '&#8220;',
                          'rdquo' => '&#8221;',
                          'bdquo' => '&#8222;',
                          'dagger' => '&#8224;',
                          'Dagger' => '&#8225;',
                          'permil' => '&#8240;',
                          'lsaquo' => '&#8249;',
                          'rsaquo' => '&#8250;',
                          'euro' => '&#8364;',
                          'fnof' => '&#402;',
                          'Alpha' => '&#913;',
                          'Beta' => '&#914;',
                          'Gamma' => '&#915;',
                          'Delta' => '&#916;',
                          'Epsilon' => '&#917;',
                          'Zeta' => '&#918;',
                          'Eta' => '&#919;',
                          'Theta' => '&#920;',
                          'Iota' => '&#921;',
                          'Kappa' => '&#922;',
                          'Lambda' => '&#923;',
                          'Mu' => '&#924;',
                          'Nu' => '&#925;',
                          'Xi' => '&#926;',
                          'Omicron' => '&#927;',
                          'Pi' => '&#928;',
                          'Rho' => '&#929;',
                          'Sigma' => '&#931;',
                          'Tau' => '&#932;',
                          'Upsilon' => '&#933;',
                          'Phi' => '&#934;',
                          'Chi' => '&#935;',
                          'Psi' => '&#936;',
                          'Omega' => '&#937;',
                          'alpha' => '&#945;',
                          'beta' => '&#946;',
                          'gamma' => '&#947;',
                          'delta' => '&#948;',
                          'epsilon' => '&#949;',
                          'zeta' => '&#950;',
                          'eta' => '&#951;',
                          'theta' => '&#952;',
                          'iota' => '&#953;',
                          'kappa' => '&#954;',
                          'lambda' => '&#955;',
                          'mu' => '&#956;',
                          'nu' => '&#957;',
                          'xi' => '&#958;',
                          'omicron' => '&#959;',
                          'pi' => '&#960;',
                          'rho' => '&#961;',
                          'sigmaf' => '&#962;',
                          'sigma' => '&#963;',
                          'tau' => '&#964;',
                          'upsilon' => '&#965;',
                          'phi' => '&#966;',
                          'chi' => '&#967;',
                          'psi' => '&#968;',
                          'omega' => '&#969;',
                          'thetasym' => '&#977;',
                          'upsih' => '&#978;',
                          'piv' => '&#982;',
                          'bull' => '&#8226;',
                          'hellip' => '&#8230;',
                          'prime' => '&#8242;',
                          'Prime' => '&#8243;',
                          'oline' => '&#8254;',
                          'frasl' => '&#8260;',
                          'weierp' => '&#8472;',
                          'image' => '&#8465;',
                          'real' => '&#8476;',
                          'trade' => '&#8482;',
                          'alefsym' => '&#8501;',
                          'larr' => '&#8592;',
                          'uarr' => '&#8593;',
                          'rarr' => '&#8594;',
                          'darr' => '&#8595;',
                          'harr' => '&#8596;',
                          'crarr' => '&#8629;',
                          'lArr' => '&#8656;',
                          'uArr' => '&#8657;',
                          'rArr' => '&#8658;',
                          'dArr' => '&#8659;',
                          'hArr' => '&#8660;',
                          'forall' => '&#8704;',
                          'part' => '&#8706;',
                          'exist' => '&#8707;',
                          'empty' => '&#8709;',
                          'nabla' => '&#8711;',
                          'isin' => '&#8712;',
                          'notin' => '&#8713;',
                          'ni' => '&#8715;',
                          'prod' => '&#8719;',
                          'sum' => '&#8721;',
                          'minus' => '&#8722;',
                          'lowast' => '&#8727;',
                          'radic' => '&#8730;',
                          'prop' => '&#8733;',
                          'infin' => '&#8734;',
                          'ang' => '&#8736;',
                          'and' => '&#8743;',
                          'or' => '&#8744;',
                          'cap' => '&#8745;',
                          'cup' => '&#8746;',
                          'int' => '&#8747;',
                          'there4' => '&#8756;',
                          'sim' => '&#8764;',
                          'cong' => '&#8773;',
                          'asymp' => '&#8776;',
                          'ne' => '&#8800;',
                          'equiv' => '&#8801;',
                          'le' => '&#8804;',
                          'ge' => '&#8805;',
                          'sub' => '&#8834;',
                          'sup' => '&#8835;',
                          'nsub' => '&#8836;',
                          'sube' => '&#8838;',
                          'supe' => '&#8839;',
                          'oplus' => '&#8853;',
                          'otimes' => '&#8855;',
                          'perp' => '&#8869;',
                          'sdot' => '&#8901;',
                          'lceil' => '&#8968;',
                          'rceil' => '&#8969;',
                          'lfloor' => '&#8970;',
                          'rfloor' => '&#8971;',
                          'lang' => '&#9001;',
                          'rang' => '&#9002;',
                          'loz' => '&#9674;',
                          'spades' => '&#9824;',
                          'clubs' => '&#9827;',
                          'hearts' => '&#9829;',
                          'diams' => '&#9830;',
                          'nbsp' => '&#160;',
                          'iexcl' => '&#161;',
                          'cent' => '&#162;',
                          'pound' => '&#163;',
                          'curren' => '&#164;',
                          'yen' => '&#165;',
                          'brvbar' => '&#166;',
                          'sect' => '&#167;',
                          'uml' => '&#168;',
                          'copy' => '&#169;',
                          'ordf' => '&#170;',
                          'laquo' => '&#171;',
                          'not' => '&#172;',
                          'shy' => '&#173;',
                          'reg' => '&#174;',
                          'macr' => '&#175;',
                          'deg' => '&#176;',
                          'plusmn' => '&#177;',
                          'sup2' => '&#178;',
                          'sup3' => '&#179;',
                          'acute' => '&#180;',
                          'micro' => '&#181;',
                          'para' => '&#182;',
                          'middot' => '&#183;',
                          'cedil' => '&#184;',
                          'sup1' => '&#185;',
                          'ordm' => '&#186;',
                          'raquo' => '&#187;',
                          'frac14' => '&#188;',
                          'frac12' => '&#189;',
                          'frac34' => '&#190;',
                          'iquest' => '&#191;',
                          'Agrave' => '&#192;',
                          'Aacute' => '&#193;',
                          'Acirc' => '&#194;',
                          'Atilde' => '&#195;',
                          'Auml' => '&#196;',
                          'Aring' => '&#197;',
                          'AElig' => '&#198;',
                          'Ccedil' => '&#199;',
                          'Egrave' => '&#200;',
                          'Eacute' => '&#201;',
                          'Ecirc' => '&#202;',
                          'Euml' => '&#203;',
                          'Igrave' => '&#204;',
                          'Iacute' => '&#205;',
                          'Icirc' => '&#206;',
                          'Iuml' => '&#207;',
                          'ETH' => '&#208;',
                          'Ntilde' => '&#209;',
                          'Ograve' => '&#210;',
                          'Oacute' => '&#211;',
                          'Ocirc' => '&#212;',
                          'Otilde' => '&#213;',
                          'Ouml' => '&#214;',
                          'times' => '&#215;',
                          'Oslash' => '&#216;',
                          'Ugrave' => '&#217;',
                          'Uacute' => '&#218;',
                          'Ucirc' => '&#219;',
                          'Uuml' => '&#220;',
                          'Yacute' => '&#221;',
                          'THORN' => '&#222;',
                          'szlig' => '&#223;',
                          'agrave' => '&#224;',
                          'aacute' => '&#225;',
                          'acirc' => '&#226;',
                          'atilde' => '&#227;',
                          'auml' => '&#228;',
                          'aring' => '&#229;',
                          'aelig' => '&#230;',
                          'ccedil' => '&#231;',
                          'egrave' => '&#232;',
                          'eacute' => '&#233;',
                          'ecirc' => '&#234;',
                          'euml' => '&#235;',
                          'igrave' => '&#236;',
                          'iacute' => '&#237;',
                          'icirc' => '&#238;',
                          'iuml' => '&#239;',
                          'eth' => '&#240;',
                          'ntilde' => '&#241;',
                          'ograve' => '&#242;',
                          'oacute' => '&#243;',
                          'ocirc' => '&#244;',
                          'otilde' => '&#245;',
                          'ouml' => '&#246;',
                          'divide' => '&#247;',
                          'oslash' => '&#248;',
                          'ugrave' => '&#249;',
                          'uacute' => '&#250;',
                          'ucirc' => '&#251;',
                          'uuml' => '&#252;',
                          'yacute' => '&#253;',
                          'thorn' => '&#254;',
                          'yuml' => '&#255;'
                          );
    // Entity not found? Destroy it.
    return isset($table[$matches[1]]) ? $table[$matches[1]] : '';
}
?>
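If typing out a table like that doesn't appeal, a similar lookup can probably be generated from PHP's own HTML 4.01 entity list. This is a sketch rather than a drop-in replacement: build_entity_table() and html_convert_entities_generated() are hypothetical names, and it assumes PHP 7.2 or later with the mbstring extension (for mb_ord()).

```php
<?php
// Hypothetical helper: build a name => numeric-entity map from
// get_html_translation_table(), which maps characters to entities.
function build_entity_table() {
    $table = array();
    $map = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    foreach ($map as $char => $entity) {
        // Skip values that are already numeric, such as &#039;.
        if (preg_match('/^&([a-zA-Z][a-zA-Z0-9]+);$/', $entity, $m)) {
            $table[$m[1]] = '&#' . mb_ord($char, 'UTF-8') . ';';
        }
    }
    return $table;
}

// Hypothetical variant of html_convert_entities() using the generated table.
function html_convert_entities_generated($string) {
    static $table = null;
    if ($table === null) {
        $table = build_entity_table();   // built once, on first call
    }
    return preg_replace_callback(
        '/&([a-zA-Z][a-zA-Z0-9]+);/',
        function ($matches) use ($table) {
            // Entity not found? Destroy it, as in the original.
            return isset($table[$matches[1]]) ? $table[$matches[1]] : '';
        },
        $string
    );
}

$result = html_convert_entities_generated('caf&eacute; &amp; cr&egrave;me &bogus;&#169;');
echo $result, "\n";   // caf&#233; &#38; cr&#232;me &#169;
```

The sample call shows the three behaviours described above: named entities become numeric, the unknown &bogus; is removed, and the existing numeric entity is left alone.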
Notes
- If you don't care about removing unknown entities, it might be quicker to use str_replace() with two big arrays. It might not be, though; I haven't tried. If you're bored, you could try it out, and even use preg_replace() afterward to remove the unknown entities. Let me know how you get on! My intuition is that the speed difference isn't worth the hassle. If you really care about performance, use PHP's html_entity_decode() with UTF-8 output as described above. If you really really care about performance, you shouldn't be using PHP ;)
- convert_entity() declares its translation table as a static variable. Again, that's because I assume it wastes less time than building the table every time the function is called, and because the obvious alternative is to declare it as a global, which seems messy to me.
- Yes you're free to use this code, no you don't have to credit me with anything. Just don't sue me if you use it and it goes wrong. In as much as it's possible to do in the UK, I release this code into the public domain and waive all rights granted to me (and obligations required of me) as its creator.
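For anyone who does want to try the str_replace() idea from the first note, it might look roughly like this. It's an untested-for-speed sketch: the two arrays are deliberately tiny (a real version would need every entity), and html_convert_entities_simple() is a made-up name.

```php
<?php
// Deliberately tiny, hypothetical lookup arrays; a full version would
// need one entry per named entity, in matching order.
$names   = array('&amp;', '&lt;', '&gt;', '&quot;', '&nbsp;', '&eacute;');
$numeric = array('&#38;', '&#60;', '&#62;', '&#34;', '&#160;', '&#233;');

function html_convert_entities_simple($string, $names, $numeric) {
    // Swap every known named entity for its numeric equivalent.
    $string = str_replace($names, $numeric, $string);
    // Anything that still looks like a named entity is unknown: drop it.
    return preg_replace('/&[a-zA-Z][a-zA-Z0-9]+;/', '', $string);
}

$simple = html_convert_entities_simple('a&nbsp;&lt;&nbsp;b &bogus;', $names, $numeric);
echo $simple, "\n";   // a&#160;&#60;&#160;b
```

Whether this beats preg_replace_callback() is something you'd have to measure on your own data.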