Hướng dẫn php remove html entities

I am trying to sanitize a string and have ended up with the following:

Nội dung chính

  • Not the answer you're looking for? Browse other questions tagged php string replace html-entities or ask your own question.
  • Description
  • Return Values
  • How do you remove HTML tags from data in PHP?
  • How do I remove special characters from a string in HTML?
  • How remove all special characters from a string in PHP?
  • What is the use of Htmlentities () function in PHP?

Characterisation of the arsenic resistance genes in lt i gt Bacillus lt i gt sp UWC isolated from maturing fly ash acid mine drainage neutralised solids

I am trying to remove the lt, i, gt as those are reduced HTML entities which do not seem to be removed. What would be the best way to approach this or another solution that I could look at?

Here is my current solution for now:

/**
 * @return string
 */
public function getFormattedTitle()
{
    $string = preg_replace('/[^A-Za-z0-9\-]/', ' ',  filter_var($this->getTitle(), FILTER_SANITIZE_STRING));
    return $string;
}

And here is an example input string:

Assessing Clivia taxonomy using the core DNA barcode regions, matK and rbcLa

Thanks!

asked Jul 31, 2018 at 19:44

Hướng dẫn php remove html entities

1

The telltale lt and gt in your output tell me that the string you have is actually more like:

"Assessing <i>Clivia</i> taxonomy using the core DNA barcode regions, <i>matK</i> and <i>rbcLa</i>"

when viewed as plain text.

The string you show above is what would show in a browser which would interpret '<' as '<' and '>' as '>'. (These are usually called "HTML entities" and offer a way to encode a character that would otherwise be interpreted as HTML.)

One option is to process like this:

$s = "Assessing <i>Clivia</i> taxonomy …";
$s = html_entity_decode($s); // $s is now "Assessing Clivia taxonomy …"
$s = strip_tags($s); // $s is now "Assessing Clivia taxonomy"

But do be aware that strip_tags is an exceedingly naïve function. For example it would turn '1<5 and 6>2' into '12'! So you need to be sure that all your input text is double-HTML encoded as the example is for it to work perfectly.

answered Jul 31, 2018 at 20:14

0

Instead of filter_var, try strip_tags: http://php.net/manual/en/function.strip-tags.php

Clivia taxonomy using the core DNA barcode regions, matK and rbcLa';

  //strip away all html tags but leave whats inside
  $output_string = strip_tags($input_string);

  echo $output_string;
  //echos: Assessing Clivia taxonomy using the core DNA barcode regions, matK and rbcLa 

?>

answered Jul 31, 2018 at 19:48

The DogThe Dog

4622 silver badges11 bronze badges

2

Great, but if didn't clean UTF-8 icon char but it is a great begin. I have added

preg_replace('/[^(\x20-\x7F)]*/','', $s);

answered Jul 24 at 5:25

SebSeb

481 silver badge6 bronze badges

Better way is strip_tags(); See a manual here: http://php.net/manual/ru/function.strip-tags.php An example:

   public function getFormattedTitle()
    {
        return strip_tags($this->getTitle(), '');
    }

answered Jul 31, 2018 at 19:49

NikolaiNikolai

1951 gold badge1 silver badge12 bronze badges

Not the answer you're looking for? Browse other questions tagged php string replace html-entities or ask your own question.

(PHP 4 >= 4.3.0, PHP 5, PHP 7, PHP 8)

html_entity_decodeConvert HTML entities to their corresponding characters

Description

html_entity_decode(string $string, int $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, ?string $encoding = null): string

More precisely, this function decodes all the entities (including all numeric entities) that a) are necessarily valid for the chosen document type — i.e., for XML, this function does not decode named entities that might be defined in some DTD — and b) whose character or characters are in the coded character set associated with the chosen encoding and are permitted in the chosen document type. All other entities are left as is.

Parameters

string

The input string.

flags

A bitmask of one or more of the following flags, which specify how to handle quotes and which document type to use. The default is ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.

Available flags constants
Constant NameDescription
ENT_COMPAT Will convert double-quotes and leave single-quotes alone.
ENT_QUOTES Will convert both double and single quotes.
ENT_NOQUOTES Will leave both double and single quotes unconverted.
ENT_SUBSTITUTE Replace invalid code unit sequences with a Unicode Replacement Character U+FFFD (UTF-8) or � (otherwise) instead of returning an empty string.
ENT_HTML401 Handle code as HTML 4.01.
ENT_XML1 Handle code as XML 1.
ENT_XHTML Handle code as XHTML.
ENT_HTML5 Handle code as HTML 5.
encoding

An optional argument defining the encoding used when converting characters.

If omitted, encoding defaults to the value of the default_charset configuration option.

Although this argument is technically optional, you are highly encouraged to specify the correct value for your code if the default_charset configuration option may be set incorrectly for the given input.

The following character sets are supported:

Supported charsets
CharsetAliasesDescription
ISO-8859-1 ISO8859-1 Western European, Latin-1.
ISO-8859-5 ISO8859-5 Little used cyrillic charset (Latin/Cyrillic).
ISO-8859-15 ISO8859-15 Western European, Latin-9. Adds the Euro sign, French and Finnish letters missing in Latin-1 (ISO-8859-1).
UTF-8   ASCII compatible multi-byte 8-bit Unicode.
cp866 ibm866, 866 DOS-specific Cyrillic charset.
cp1251 Windows-1251, win-1251, 1251 Windows-specific Cyrillic charset.
cp1252 Windows-1252, 1252 Windows specific charset for Western European.
KOI8-R koi8-ru, koi8r Russian.
BIG5 950 Traditional Chinese, mainly used in Taiwan.
GB2312 936 Simplified Chinese, national standard character set.
BIG5-HKSCS   Big5 with Hong Kong extensions, Traditional Chinese.
Shift_JIS SJIS, SJIS-win, cp932, 932 Japanese
EUC-JP EUCJP, eucJP-win Japanese
MacRoman   Charset that was used by Mac OS.
''   An empty string activates detection from script encoding (Zend multibyte), default_charset and current locale (see nl_langinfo() and setlocale()), in this order. Not recommended.

Note: Any other character sets are not recognized. The default encoding will be used instead and a warning will be emitted.

Return Values

Returns the decoded string.

Changelog

VersionDescription
8.1.0 flags changed from ENT_COMPAT to ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.
8.0.0 encoding is nullable now.

Examples

Example #1 Decoding HTML entities

$orig "I'll \"walk\" the dog now";$a htmlentities($orig);$b html_entity_decode($a);

echo

$a// I'll "walk" the <b>dog</b> nowecho $b// I'll "walk" the dog now
?>

Notes

Note:

You might wonder why trim(html_entity_decode(' ')); doesn't reduce the string to an empty string, that's because the ' ' entity is not ASCII code 32 (which is stripped by trim()) but ASCII code 160 (0xa0) in the default ISO 8859-1 encoding.

See Also

  • htmlentities() - Convert all applicable characters to HTML entities
  • htmlspecialchars() - Convert special characters to HTML entities
  • get_html_translation_table() - Returns the translation table used by htmlspecialchars and htmlentities
  • urldecode() - Decodes URL-encoded string

Martin

11 years ago

If you need something that converts &#[0-9]+ entities to UTF-8, this is simple and works:

/* Entity crap. /
$input = "Fovič";

$output = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $input);

/* Plain UTF-8. */

echo $output;
?>

txnull

7 years ago

Use the following to decode all entities:
($string, ENT_QUOTES | ENT_XML1, 'UTF-8') ?>

I've checked these special entities:
- double quotes (")
- single quotes (' and ')
- non printable chars (e.g. )
With other $flags some or all won't be decoded.

It seems that ENT_XML1 and ENT_XHTML are identical when decoding.

Daniel A.

4 years ago

I wanted to use this function today and I found the documentation, especially about the flags, not particularly helpful.

Running the code below, for example, failed because the flag I used was the wrong one...

$string = 'Donna's Bakery';
$title = html_entity_decode($string, ENT_HTML401, 'UTF-8');
echo $title;

The correct flag to use in this case is ENT_QUOTES.

My understanding of the flag to use is the one that would correspond to the expected, converted outcome. So, ENT_QUOTES for a character that would be a single or double quote when converted... and so on.

Please help make the documentation a bit clearer.

Benjamin

9 years ago

The following function decodes named and numeric HTML entities and works on UTF-8. Requires iconv.

function decodeHtmlEnt($str) {
    $ret = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
    $p2 = -1;
    for(;;) {
        $p = strpos($ret, '&#', $p2+1);
        if ($p === FALSE)
            break;
        $p2 = strpos($ret, ';', $p);
        if ($p2 === FALSE)
            break;

                    if (substr($ret, $p+2, 1) == 'x')
            $char = hexdec(substr($ret, $p+3, $p2-$p-3));
        else
            $char = intval(substr($ret, $p+2, $p2-$p-2));

                    //echo "$char\n";
        $newchar = iconv(
            'UCS-4', 'UTF-8',
            chr(($char>>24)&0xFF).chr(($char>>16)&0xFF).chr(($char>>8)&0xFF).chr($char&0xFF)
        );
        //echo "$newchar<$p<$p2<<\n";
        $ret = substr_replace($ret, $newchar, $p, 1+$p2-$p);
        $p2 = $p + strlen($newchar);
    }
    return $ret;
}

php dot net at c dash ovidiu dot tk

17 years ago

Quick & dirty code that translates numeric entities to UTF-8.

function replace_num_entity($ord)
    {
       
$ord = $ord[1];
        if (
preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
        {
           
$ord = hexdec($match[1]);
        }
        else
        {
           
$ord = intval($ord);
        }
$no_bytes = 0;
       
$byte = array();

        if (

$ord < 128)
        {
            return
chr($ord);
        }
        elseif (
$ord < 2048)
        {
           
$no_bytes = 2;
        }
        elseif (
$ord < 65536)
        {
           
$no_bytes = 3;
        }
        elseif (
$ord < 1114112)
        {
           
$no_bytes = 4;
        }
        else
        {
            return;
        }

        switch(

$no_bytes)
        {
            case
2:
            {
               
$prefix = array(31, 192);
                break;
            }
            case
3:
            {
               
$prefix = array(15, 224);
                break;
            }
            case
4:
            {
               
$prefix = array(7, 240);
            }
        }

        for (

$i = 0; $i < $no_bytes; $i++)
        {
           
$byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
        }
$byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];$ret = '';
        for (
$i = 0; $i < $no_bytes; $i++)
        {
           
$ret .= chr($byte[$i]);
        }

        return

$ret;
    }
$test = 'This is a čא test'';

    echo

$test . "
\n"
;
    echo
preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);?>

Anonymous

1 year ago

Why doesn't the html_entity_decode() function convert entities without the last semicolon (like A or A) to characters?

---
echo 'like A or A';
---
Browser displays fine:
----
like A or A

neurotic dot neu at gmail dot com

12 years ago

This is a safe rawurldecode with utf8 detection:

function utf8_rawurldecode($raw_url_encoded){
   
$enc = rawurldecode($raw_url_encoded);
    if(
utf8_encode(utf8_decode($enc))==$enc){;
        return
rawurldecode($raw_url_encoded);
    }else{
        return
utf8_encode(rawurldecode($raw_url_encoded));
    }
}
?>

Free at Key dot no

12 years ago

Handy function to convert remaining HTML-entities into human readable chars (for entities which do not exist in target charset):

function cleanString($in,$offset=null)
{
   
$out = trim($in);
    if (!empty(
$out))
    {
       
$entity_start = strpos($out,'&',$offset);
        if (
$entity_start === false)
        {
           
// ideal
           
return $out;   
        }
        else
        {
           
$entity_end = strpos($out,';',$entity_start);
            if (
$entity_end === false)
            {
                 return
$out;
            }
           
// zu lang um eine entity zu sein
           
else if ($entity_end > $entity_start+7)
            {
                
// und weiter gehts
                
$out = cleanString($out,$entity_start+1);
            }
           
// gottcha!
           
else
            {
                
$clean = substr($out,0,$entity_start);
                
$subst = substr($out,$entity_start+1,1);
                
// š => "s" / š => "_"
                
$clean .= ($subst != "#") ? $subst : "_";
                
$clean .= substr($out,$entity_end+1);
                
// und weiter gehts
                
$out = cleanString($clean,$entity_start+1);
            }
        }
    }
    return
$out;
}
?>

Matt Robinson

12 years ago

I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'.

If you don't want a UTF-8 string, you'll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding().

If you're producing XML, which doesn't recognise most HTML entities:

When producing a UTF-8 document (the default), then htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8') (because you only need to escape < and > and & unless you're printing inside the XML tags themselves).

Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document's DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).

me at richardsnazell dot com

14 years ago

I had a problem getting the 'TM' trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn't work, but directly replacing the entity with it's ASCII equivalent did:

$subject = str_replace('™', chr(153), $subject);

Victor

10 years ago

We were having very peculiar behavior regarding foreign characters such as e-acute.

However, it was only showing up as a problem when extracting those characters out of our mysql database and when being displayed through a proxy server of ours that handles dns issues.

As other users have made a note of, the default character setting wasn't what they were expecting it to be when they left theirs blank.

When we changed our default_charset to "UTF-8", our problems and needs for using functions like these were no longer necessary in handling foreign characters such as e-acute. Good enough for us!

jojo

15 years ago

The decipherment does the character encoded by the escape function of JavaScript.
When the multi byte is used on the page, it is effective.

javascript escape('aaああaa') ..... 'aa%u3042%u3042aa'
php  jsEscape_decode('aa%u3042%u3042aa')..'aaああaa'

function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){
   
$arrMojis = explode("%u",$jsEscaped);
    for (
$i = 1;$i < count($arrMojis);$i++){
       
$c = substr($arrMojis[$i],0,4);
       
$cc = mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16');
       
$arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4);
    }
    return
implode('',$arrMojis);
}
?>

daniel at brightbyte dot de

17 years ago

This function seems to have to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character codings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, than convert to UTF-8 again. But that's quite ugly and detroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

function decode_entities($text) {
   
$text= html_entity_decode($text,ENT_QUOTES,"ISO-8859-1"); #NOTE: UTF-8 does not work!
   
$text= preg_replace('/&#(\d+);/me',"chr(\\1)",$text); #decimal notation
   
$text= preg_replace('/&#x([a-f0-9]+);/mei',"chr(0x\\1)",$text);  #hex notation
   
return $text;
}
?>

HTH

grvg (at) free (dot) fr

16 years ago

Here is the ultimate functions to convert HTML entities to UTF-8 :
The main function is htmlentities2utf8
Others are helper functions

function chr_utf8($code)
    {
        if (
$code < 0) return false;
        elseif (
$code < 128) return chr($code);
        elseif (
$code < 160) // Remove Windows Illegals Cars
       
{
            if (
$code==128) $code=8364;
            elseif (
$code==129) $code=160; // not affected
           
elseif ($code==130) $code=8218;
            elseif (
$code==131) $code=402;
            elseif (
$code==132) $code=8222;
            elseif (
$code==133) $code=8230;
            elseif (
$code==134) $code=8224;
            elseif (
$code==135) $code=8225;
            elseif (
$code==136) $code=710;
            elseif (
$code==137) $code=8240;
            elseif (
$code==138) $code=352;
            elseif (
$code==139) $code=8249;
            elseif (
$code==140) $code=338;
            elseif (
$code==141) $code=160; // not affected
           
elseif ($code==142) $code=381;
            elseif (
$code==143) $code=160; // not affected
           
elseif ($code==144) $code=160; // not affected
           
elseif ($code==145) $code=8216;
            elseif (
$code==146) $code=8217;
            elseif (
$code==147) $code=8220;
            elseif (
$code==148) $code=8221;
            elseif (
$code==149) $code=8226;
            elseif (
$code==150) $code=8211;
            elseif (
$code==151) $code=8212;
            elseif (
$code==152) $code=732;
            elseif (
$code==153) $code=8482;
            elseif (
$code==154) $code=353;
            elseif (
$code==155) $code=8250;
            elseif (
$code==156) $code=339;
            elseif (
$code==157) $code=160; // not affected
           
elseif ($code==158) $code=382;
            elseif (
$code==159) $code=376;
        }
        if (
$code < 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code & 63));
        elseif (
$code < 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
        else return
chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
    }
// Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);
   
function html_entity_replace($matches)
    {
        if (
$matches[2])
        {
            return
chr_utf8(hexdec($matches[3]));
        } elseif (
$matches[1])
        {
            return
chr_utf8($matches[3]);
        }
        switch (
$matches[3])
        {
            case
"nbsp": return chr_utf8(160);
            case
"iexcl": return chr_utf8(161);
            case
"cent": return chr_utf8(162);
            case
"pound": return chr_utf8(163);
            case
"curren": return chr_utf8(164);
            case
"yen": return chr_utf8(165);
           
//... etc with all named HTML entities
       
}
        return
false;
    }

        function

htmlentities2utf8 ($string) // because of the html_entity_decode() bug with UTF-8
   
{
       
$string = preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $string);
        return
$string;
    }
?>

slickriptide at gmail dot com

5 years ago

When using this function, it's a good idea to pay attention when it says that leaving the charset parameter empty is "not recommended".

I had an issue where I was storing text files, with entities converted, into a database. When I retrieved them later and ran

$text_file = html_entity_decode($text_data);

the entities were NOT decoded.

Once I was aware of the problem, I changed the decode call to fully specify all of the parameters:

$text_file = html_entity_decode($text_data, ENT_COMPAT | ENT_HTML5,'utf-8');

This converted the entities as expected.

marion at figmentthinking dot com

13 years ago

I just ran into the:
Bug #27626 html_entity_decode bug - cannot yet handle MBCS in html_entity_decode()!

The simple solution if you're still running PHP 4 is to wrap the html_entity_decode() function with the utf8_decode() function.

$string = ' ';
$utf8_encode = utf8_encode(html_entity_decode($string));
?>

By default html_entity_decode() returns the ISO-8859-1 character set, and by default utf8_decode()...

http://us.php.net/manual/en/function.utf8-decode.php
"Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"

jl dot garcia at gmail dot com

13 years ago

I created this function to filter all the text that goes in or comes out of the database.

function filter_string($string, $nohtml='', $save='') {
    if(!empty(
$nohtml)) {
       
$string = trim($string);
        if(!empty(
$save)) $string = htmlentities(trim($string), ENT_QUOTES, 'ISO-8859-15');
        else
$string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-15');
    }
    if(!empty(
$save)) $string = mysql_real_escape_string($string);
    else
$string = stripslashes($string);
    return(
$string);
}
?>

kae at verens dot com

14 years ago

the references to 'chr()' in the example unhtmlentities() function should be changed to unichr, using the example unichr() function described in the 'chr' reference (http://php.net/chr).

the reason for this is characters such as € which do not break down into an ASCII number (that's the Euro, by the way).

How do you remove HTML tags from data in PHP?

The strip_tags() function strips a string from HTML, XML, and PHP tags. Note: HTML comments are always stripped. This cannot be changed with the allow parameter.

How do I remove special characters from a string in HTML?

Your answer function clean($string) { $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens. return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars. }

How remove all special characters from a string in PHP?

Using str_ireplace() Method: The str_ireplace() method is used to remove all the special characters from the given string str by replacing these characters with the white space (” “).

What is the use of Htmlentities () function in PHP?

htmlentities() Function: The htmlentities() function is an inbuilt function in PHP that is used to transform all characters which are applicable to HTML entities. This function converts all characters that are applicable to HTML entities.