I have a string with HTML encoding like below:
&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.
I want to convert this String to Unicode. Expected output:
Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.
I found a solution in "Convert Decimal NCRs Code into UTF-8 in java (JSP)", but it only works for strings in which every encoded character begins with &#.
For the characters that begin with &xxxx, the page "HTML encoding of foreign language characters" shows that they are HTML named entities, and my input string is a combination of HTML Entity (named) and HTML Entity (decimal).
Does anyone have a suggestion? It would be best if we could do it without adding any additional libraries.
[UPDATE] I solved my problem by using the Apache library:
String encodeString = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
String unEncodeString = StringEscapeUtils.unescapeHtml4(encodeString);
System.out.println("OUTPUT : " + unEncodeString);
=====> OUTPUT : Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.
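If adding a library is not an option, the mixed named/decimal input from the question can also be decoded with the standard library alone. What follows is only a minimal sketch under stated assumptions: the class name EntityDecoderSketch and its decode method are made up for illustration, and the named-entity table is deliberately tiny (the basic four plus the entities that occur in the sample string), so a real version would need the full HTML 4 entity list. Entities it does not recognise are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoderSketch {

    // Hypothetical, deliberately incomplete table: entity name -> code point.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", 0x26);
        NAMED.put("lt", 0x3C);
        NAMED.put("gt", 0x3E);
        NAMED.put("quot", 0x22);
        NAMED.put("ETH", 0xD0);
        NAMED.put("ecirc", 0xEA);
        NAMED.put("acirc", 0xE2);
        NAMED.put("oacute", 0xF3);
    }

    // Matches &#1234; (decimal), &#x4D2; (hexadecimal) or &name; (named).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());   // copy text before the entity
            Integer codePoint = toCodePoint(m.group(1));
            if (codePoint == null) {
                out.append(m.group());            // unknown entity: keep as written
            } else {
                out.appendCodePoint(codePoint);
            }
            last = m.end();
        }
        out.append(input, last, input.length());  // copy the remaining tail
        return out.toString();
    }

    // Returns the code point for an entity body, or null if it cannot be resolved.
    private static Integer toCodePoint(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int cp = hex ? Integer.parseInt(body.substring(2), 16)
                         : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(cp) ? cp : null;
        } catch (NumberFormatException e) {
            return null;                          // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        String encoded = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c";
        System.out.println(decode(encoded));
    }
}
```

The regular expression treats decimal, hexadecimal and named references uniformly, so the input does not have to be all of one kind, which was the limitation of the decimal-only approach.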
Java examples to escape the characters in a String using HTML entities. This converts the Java String into equivalent HTML content that browsers can display.
1) StringEscapeUtils.escapeHtml4() (Apache Commons Text)
- This method takes the raw string as parameter and then escapes the characters using HTML entities.
- It supports all known HTML 4.0 entities.
- The apostrophe escape character (&apos;) is not a legal entity and so is not supported.
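Because &apos; is not defined in HTML 4, an apostrophe that must be escaped (for example inside a single-quoted attribute value) is usually written as the numeric reference &#39; instead. The following is a minimal sketch of that idea using only the standard library; BasicEscaperSketch is a hypothetical helper written for this illustration and is not part of Commons Text.

```java
public class BasicEscaperSketch {

    // Escapes the four basic HTML characters and, as an assumption of this
    // sketch, writes the apostrophe as the numeric reference &#39;.
    public static String escape(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<a href='x'>Tom & \"Jerry\"</a>"));
        // prints: &lt;a href=&#39;x&#39;&gt;Tom &amp; &quot;Jerry&quot;&lt;/a&gt;
    }
}
```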
To use StringEscapeUtils, import the commons-text dependency.
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.4</version>
</dependency>
Now use the StringEscapeUtils.escapeHtml4() method.
import org.apache.commons.text.StringEscapeUtils;

public class HTMLEscapeExample {
    public static void main(String[] args) {
        String unEscapedString = "<java>public static void main(String[] args) { ... }</java>";

        String escapedHTML = StringEscapeUtils.escapeHtml4(unEscapedString);

        System.out.println(escapedHTML); // Browser can now parse this and print
    }
}

// Output: &lt;java&gt;public static void main(String[] args) { ... }&lt;/java&gt;
2) Custom StringUtils.encodeHtml() method
If you have a requirement that forces you to modify the logic provided by the library methods, then you can write your own method. This approach should mostly be avoided, but it may be handy when such a requirement arises.
package com.howtodoinjava.demo;

public class HTMLEscapeExample {
    public static void main(String[] args) {
        String unEscapedString = "<java>public static void main(String[] args) { ... }</java>";

        String escapedHTML = StringUtils.encodeHtml(unEscapedString);

        System.out.println(escapedHTML); // Browser can now parse this and print
    }
}

// Output: &lt;java&gt;public static void main(String[] args) { ... }&lt;/java&gt;
StringUtils.java class
package com.howtodoinjava.demo;

import java.util.HashMap;
import java.util.Map;

public class StringUtils {

    // Lookup table: character -> HTML 4 named entity
    private static final Map<Character, String> htmlEncodeChars = new HashMap<>();

    static {
        // Special characters for HTML
        htmlEncodeChars.put('\u0026', "&amp;");
        htmlEncodeChars.put('\u003C', "&lt;");
        htmlEncodeChars.put('\u003E', "&gt;");
        htmlEncodeChars.put('\u0022', "&quot;");
        htmlEncodeChars.put('\u0152', "&OElig;");
        htmlEncodeChars.put('\u0153', "&oelig;");
        htmlEncodeChars.put('\u0160', "&Scaron;");
        htmlEncodeChars.put('\u0161', "&scaron;");
        htmlEncodeChars.put('\u0178', "&Yuml;");
        htmlEncodeChars.put('\u02C6', "&circ;");
        htmlEncodeChars.put('\u02DC', "&tilde;");
        htmlEncodeChars.put('\u2002', "&ensp;");
        htmlEncodeChars.put('\u2003', "&emsp;");
        htmlEncodeChars.put('\u2009', "&thinsp;");
        htmlEncodeChars.put('\u200C', "&zwnj;");
        htmlEncodeChars.put('\u200D', "&zwj;");
        htmlEncodeChars.put('\u200E', "&lrm;");
        htmlEncodeChars.put('\u200F', "&rlm;");
        htmlEncodeChars.put('\u2013', "&ndash;");
        htmlEncodeChars.put('\u2014', "&mdash;");
        htmlEncodeChars.put('\u2018', "&lsquo;");
        htmlEncodeChars.put('\u2019', "&rsquo;");
        htmlEncodeChars.put('\u201A', "&sbquo;");
        htmlEncodeChars.put('\u201C', "&ldquo;");
        htmlEncodeChars.put('\u201D', "&rdquo;");
        htmlEncodeChars.put('\u201E', "&bdquo;");
        htmlEncodeChars.put('\u2020', "&dagger;");
        htmlEncodeChars.put('\u2021', "&Dagger;");
        htmlEncodeChars.put('\u2030', "&permil;");
        htmlEncodeChars.put('\u2039', "&lsaquo;");
        htmlEncodeChars.put('\u203A', "&rsaquo;");
        htmlEncodeChars.put('\u20AC', "&euro;");

        // Character entity references for ISO 8859-1 characters
        htmlEncodeChars.put('\u00A0', "&nbsp;");
        htmlEncodeChars.put('\u00A1', "&iexcl;");
        htmlEncodeChars.put('\u00A2', "&cent;");
        htmlEncodeChars.put('\u00A3', "&pound;");
        htmlEncodeChars.put('\u00A4', "&curren;");
        htmlEncodeChars.put('\u00A5', "&yen;");
        htmlEncodeChars.put('\u00A6', "&brvbar;");
        htmlEncodeChars.put('\u00A7', "&sect;");
        htmlEncodeChars.put('\u00A8', "&uml;");
        htmlEncodeChars.put('\u00A9', "&copy;");
        htmlEncodeChars.put('\u00AA', "&ordf;");
        htmlEncodeChars.put('\u00AB', "&laquo;");
        htmlEncodeChars.put('\u00AC', "&not;");
        htmlEncodeChars.put('\u00AD', "&shy;");
        htmlEncodeChars.put('\u00AE', "&reg;");
        htmlEncodeChars.put('\u00AF', "&macr;");
        htmlEncodeChars.put('\u00B0', "&deg;");
        htmlEncodeChars.put('\u00B1', "&plusmn;");
        htmlEncodeChars.put('\u00B2', "&sup2;");
        htmlEncodeChars.put('\u00B3', "&sup3;");
        htmlEncodeChars.put('\u00B4', "&acute;");
        htmlEncodeChars.put('\u00B5', "&micro;");
        htmlEncodeChars.put('\u00B6', "&para;");
        htmlEncodeChars.put('\u00B7', "&middot;");
        htmlEncodeChars.put('\u00B8', "&cedil;");
        htmlEncodeChars.put('\u00B9', "&sup1;");
        htmlEncodeChars.put('\u00BA', "&ordm;");
        htmlEncodeChars.put('\u00BB', "&raquo;");
        htmlEncodeChars.put('\u00BC', "&frac14;");
        htmlEncodeChars.put('\u00BD', "&frac12;");
        htmlEncodeChars.put('\u00BE', "&frac34;");
        htmlEncodeChars.put('\u00BF', "&iquest;");
        htmlEncodeChars.put('\u00C0', "&Agrave;");
        htmlEncodeChars.put('\u00C1', "&Aacute;");
        htmlEncodeChars.put('\u00C2', "&Acirc;");
        htmlEncodeChars.put('\u00C3', "&Atilde;");
        htmlEncodeChars.put('\u00C4', "&Auml;");
        htmlEncodeChars.put('\u00C5', "&Aring;");
        htmlEncodeChars.put('\u00C6', "&AElig;");
        htmlEncodeChars.put('\u00C7', "&Ccedil;");
        htmlEncodeChars.put('\u00C8', "&Egrave;");
        htmlEncodeChars.put('\u00C9', "&Eacute;");
        htmlEncodeChars.put('\u00CA', "&Ecirc;");
        htmlEncodeChars.put('\u00CB', "&Euml;");
        htmlEncodeChars.put('\u00CC', "&Igrave;");
        htmlEncodeChars.put('\u00CD', "&Iacute;");
        htmlEncodeChars.put('\u00CE', "&Icirc;");
        htmlEncodeChars.put('\u00CF', "&Iuml;");
        htmlEncodeChars.put('\u00D0', "&ETH;");
        htmlEncodeChars.put('\u00D1', "&Ntilde;");
        htmlEncodeChars.put('\u00D2', "&Ograve;");
        htmlEncodeChars.put('\u00D3', "&Oacute;");
        htmlEncodeChars.put('\u00D4', "&Ocirc;");
        htmlEncodeChars.put('\u00D5', "&Otilde;");
        htmlEncodeChars.put('\u00D6', "&Ouml;");
        htmlEncodeChars.put('\u00D7', "&times;");
        htmlEncodeChars.put('\u00D8', "&Oslash;");
        htmlEncodeChars.put('\u00D9', "&Ugrave;");
        htmlEncodeChars.put('\u00DA', "&Uacute;");
        htmlEncodeChars.put('\u00DB', "&Ucirc;");
        htmlEncodeChars.put('\u00DC', "&Uuml;");
        htmlEncodeChars.put('\u00DD', "&Yacute;");
        htmlEncodeChars.put('\u00DE', "&THORN;");
        htmlEncodeChars.put('\u00DF', "&szlig;");
        htmlEncodeChars.put('\u00E0', "&agrave;");
        htmlEncodeChars.put('\u00E1', "&aacute;");
        htmlEncodeChars.put('\u00E2', "&acirc;");
        htmlEncodeChars.put('\u00E3', "&atilde;");
        htmlEncodeChars.put('\u00E4', "&auml;");
        htmlEncodeChars.put('\u00E5', "&aring;");
        htmlEncodeChars.put('\u00E6', "&aelig;");
        htmlEncodeChars.put('\u00E7', "&ccedil;");
        htmlEncodeChars.put('\u00E8', "&egrave;");
        htmlEncodeChars.put('\u00E9', "&eacute;");
        htmlEncodeChars.put('\u00EA', "&ecirc;");
        htmlEncodeChars.put('\u00EB', "&euml;");
        htmlEncodeChars.put('\u00EC', "&igrave;");
        htmlEncodeChars.put('\u00ED', "&iacute;");
        htmlEncodeChars.put('\u00EE', "&icirc;");
        htmlEncodeChars.put('\u00EF', "&iuml;");
        htmlEncodeChars.put('\u00F0', "&eth;");
        htmlEncodeChars.put('\u00F1', "&ntilde;");
        htmlEncodeChars.put('\u00F2', "&ograve;");
        htmlEncodeChars.put('\u00F3', "&oacute;");
        htmlEncodeChars.put('\u00F4', "&ocirc;");
        htmlEncodeChars.put('\u00F5', "&otilde;");
        htmlEncodeChars.put('\u00F6', "&ouml;");
        htmlEncodeChars.put('\u00F7', "&divide;");
        htmlEncodeChars.put('\u00F8', "&oslash;");
        htmlEncodeChars.put('\u00F9', "&ugrave;");
        htmlEncodeChars.put('\u00FA', "&uacute;");
        htmlEncodeChars.put('\u00FB', "&ucirc;");
        htmlEncodeChars.put('\u00FC', "&uuml;");
        htmlEncodeChars.put('\u00FD', "&yacute;");
        htmlEncodeChars.put('\u00FE', "&thorn;");
        htmlEncodeChars.put('\u00FF', "&yuml;");

        // Mathematical, Greek and Symbolic characters for HTML
        htmlEncodeChars.put('\u0192', "&fnof;");
        htmlEncodeChars.put('\u0391', "&Alpha;");
        htmlEncodeChars.put('\u0392', "&Beta;");
        htmlEncodeChars.put('\u0393', "&Gamma;");
        htmlEncodeChars.put('\u0394', "&Delta;");
        htmlEncodeChars.put('\u0395', "&Epsilon;");
        htmlEncodeChars.put('\u0396', "&Zeta;");
        htmlEncodeChars.put('\u0397', "&Eta;");
        htmlEncodeChars.put('\u0398', "&Theta;");
        htmlEncodeChars.put('\u0399', "&Iota;");
        htmlEncodeChars.put('\u039A', "&Kappa;");
        htmlEncodeChars.put('\u039B', "&Lambda;");
        htmlEncodeChars.put('\u039C', "&Mu;");
        htmlEncodeChars.put('\u039D', "&Nu;");
        htmlEncodeChars.put('\u039E', "&Xi;");
        htmlEncodeChars.put('\u039F', "&Omicron;");
        htmlEncodeChars.put('\u03A0', "&Pi;");
        htmlEncodeChars.put('\u03A1', "&Rho;");
        htmlEncodeChars.put('\u03A3', "&Sigma;");
        htmlEncodeChars.put('\u03A4', "&Tau;");
        htmlEncodeChars.put('\u03A5', "&Upsilon;");
        htmlEncodeChars.put('\u03A6', "&Phi;");
        htmlEncodeChars.put('\u03A7', "&Chi;");
        htmlEncodeChars.put('\u03A8', "&Psi;");
        htmlEncodeChars.put('\u03A9', "&Omega;");
        htmlEncodeChars.put('\u03B1', "&alpha;");
        htmlEncodeChars.put('\u03B2', "&beta;");
        htmlEncodeChars.put('\u03B3', "&gamma;");
        htmlEncodeChars.put('\u03B4', "&delta;");
        htmlEncodeChars.put('\u03B5', "&epsilon;");
        htmlEncodeChars.put('\u03B6', "&zeta;");
        htmlEncodeChars.put('\u03B7', "&eta;");
        htmlEncodeChars.put('\u03B8', "&theta;");
        htmlEncodeChars.put('\u03B9', "&iota;");
        htmlEncodeChars.put('\u03BA', "&kappa;");
        htmlEncodeChars.put('\u03BB', "&lambda;");
        htmlEncodeChars.put('\u03BC', "&mu;");
        htmlEncodeChars.put('\u03BD', "&nu;");
        htmlEncodeChars.put('\u03BE', "&xi;");
        htmlEncodeChars.put('\u03BF', "&omicron;");
        htmlEncodeChars.put('\u03C0', "&pi;");
        htmlEncodeChars.put('\u03C1', "&rho;");
        htmlEncodeChars.put('\u03C2', "&sigmaf;");
        htmlEncodeChars.put('\u03C3', "&sigma;");
        htmlEncodeChars.put('\u03C4', "&tau;");
        htmlEncodeChars.put('\u03C5', "&upsilon;");
        htmlEncodeChars.put('\u03C6', "&phi;");
        htmlEncodeChars.put('\u03C7', "&chi;");
        htmlEncodeChars.put('\u03C8', "&psi;");
        htmlEncodeChars.put('\u03C9', "&omega;");
        htmlEncodeChars.put('\u03D1', "&thetasym;");
        htmlEncodeChars.put('\u03D2', "&upsih;");
        htmlEncodeChars.put('\u03D6', "&piv;");
        htmlEncodeChars.put('\u2022', "&bull;");
        htmlEncodeChars.put('\u2026', "&hellip;");
        htmlEncodeChars.put('\u2032', "&prime;");
        htmlEncodeChars.put('\u2033', "&Prime;");
        htmlEncodeChars.put('\u203E', "&oline;");
        htmlEncodeChars.put('\u2044', "&frasl;");
        htmlEncodeChars.put('\u2118', "&weierp;");
        htmlEncodeChars.put('\u2111', "&image;");
        htmlEncodeChars.put('\u211C', "&real;");
        htmlEncodeChars.put('\u2122', "&trade;");
        htmlEncodeChars.put('\u2135', "&alefsym;");
        htmlEncodeChars.put('\u2190', "&larr;");
        htmlEncodeChars.put('\u2191', "&uarr;");
        htmlEncodeChars.put('\u2192', "&rarr;");
        htmlEncodeChars.put('\u2193', "&darr;");
        htmlEncodeChars.put('\u2194', "&harr;");
        htmlEncodeChars.put('\u21B5', "&crarr;");
        htmlEncodeChars.put('\u21D0', "&lArr;");
        htmlEncodeChars.put('\u21D1', "&uArr;");
        htmlEncodeChars.put('\u21D2', "&rArr;");
        htmlEncodeChars.put('\u21D3', "&dArr;");
        htmlEncodeChars.put('\u21D4', "&hArr;");
        htmlEncodeChars.put('\u2200', "&forall;");
        htmlEncodeChars.put('\u2202', "&part;");
        htmlEncodeChars.put('\u2203', "&exist;");
        htmlEncodeChars.put('\u2205', "&empty;");
        htmlEncodeChars.put('\u2207', "&nabla;");
        htmlEncodeChars.put('\u2208', "&isin;");
        htmlEncodeChars.put('\u2209', "&notin;");
        htmlEncodeChars.put('\u220B', "&ni;");
        htmlEncodeChars.put('\u220F', "&prod;");
        htmlEncodeChars.put('\u2211', "&sum;");
        htmlEncodeChars.put('\u2212', "&minus;");
        htmlEncodeChars.put('\u2217', "&lowast;");
        htmlEncodeChars.put('\u221A', "&radic;");
        htmlEncodeChars.put('\u221D', "&prop;");
        htmlEncodeChars.put('\u221E', "&infin;");
        htmlEncodeChars.put('\u2220', "&ang;");
        htmlEncodeChars.put('\u2227', "&and;");
        htmlEncodeChars.put('\u2228', "&or;");
        htmlEncodeChars.put('\u2229', "&cap;");
        htmlEncodeChars.put('\u222A', "&cup;");
        htmlEncodeChars.put('\u222B', "&int;");
        htmlEncodeChars.put('\u2234', "&there4;");
        htmlEncodeChars.put('\u223C', "&sim;");
        htmlEncodeChars.put('\u2245', "&cong;");
        htmlEncodeChars.put('\u2248', "&asymp;");
        htmlEncodeChars.put('\u2260', "&ne;");
        htmlEncodeChars.put('\u2261', "&equiv;");
        htmlEncodeChars.put('\u2264', "&le;");
        htmlEncodeChars.put('\u2265', "&ge;");
        htmlEncodeChars.put('\u2282', "&sub;");
        htmlEncodeChars.put('\u2283', "&sup;");
        htmlEncodeChars.put('\u2284', "&nsub;");
        htmlEncodeChars.put('\u2286', "&sube;");
        htmlEncodeChars.put('\u2287', "&supe;");
        htmlEncodeChars.put('\u2295', "&oplus;");
        htmlEncodeChars.put('\u2297', "&otimes;");
        htmlEncodeChars.put('\u22A5', "&perp;");
        htmlEncodeChars.put('\u22C5', "&sdot;");
        htmlEncodeChars.put('\u2308', "&lceil;");
        htmlEncodeChars.put('\u2309', "&rceil;");
        htmlEncodeChars.put('\u230A', "&lfloor;");
        htmlEncodeChars.put('\u230B', "&rfloor;");
        htmlEncodeChars.put('\u2329', "&lang;");
        htmlEncodeChars.put('\u232A', "&rang;");
        htmlEncodeChars.put('\u25CA', "&loz;");
        htmlEncodeChars.put('\u2660', "&spades;");
        htmlEncodeChars.put('\u2663', "&clubs;");
        htmlEncodeChars.put('\u2665', "&hearts;");
        htmlEncodeChars.put('\u2666', "&diams;");
    }

    // Utility class: no instances
    private StringUtils() {
    }

    public static String encodeHtml(String source) {
        return encode(source, htmlEncodeChars);
    }

    private static String encode(String source, Map<Character, String> encodingTable) {
        if (null == source) {
            return null;
        }
        if (null == encodingTable) {
            return source;
        }

        // Allocated lazily, so a string with nothing to encode is returned as is
        StringBuilder encoded_string = null;
        char[] string_to_encode_array = source.toCharArray();
        int last_match = -1;
        int difference = 0;

        for (int i = 0; i < string_to_encode_array.length; i++) {
            char char_to_encode = string_to_encode_array[i];

            if (encodingTable.containsKey(char_to_encode)) {
                if (null == encoded_string) {
                    encoded_string = new StringBuilder(source.length());
                }
                // Copy the unencoded run since the previous match
                difference = i - (last_match + 1);
                if (difference > 0) {
                    encoded_string.append(string_to_encode_array, last_match + 1, difference);
                }
                encoded_string.append(encodingTable.get(char_to_encode));
                last_match = i;
            }
        }

        if (null == encoded_string) {
            return source;
        } else {
            // Copy whatever follows the last match
            difference = string_to_encode_array.length - (last_match + 1);
            if (difference > 0) {
                encoded_string.append(string_to_encode_array, last_match + 1, difference);
            }
            return encoded_string.toString();
        }
    }
}
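One example of a requirement that might justify custom logic is emitting a decimal numeric reference for every non-ASCII character that has no named entity in the table, instead of passing it through. The sketch below illustrates that variation only; NumericFallbackSketch, its encode method and its three-entry-plus table are assumptions made for this example, not part of the StringUtils class above or of any library.

```java
import java.util.HashMap;
import java.util.Map;

public class NumericFallbackSketch {

    // Hypothetical, shortened table: code point -> named entity.
    private static final Map<Integer, String> NAMED = new HashMap<>();
    static {
        NAMED.put(0x26, "&amp;");
        NAMED.put(0x3C, "&lt;");
        NAMED.put(0x3E, "&gt;");
        NAMED.put(0x22, "&quot;");
        NAMED.put(0xEA, "&ecirc;");
    }

    public static String encode(String source) {
        if (source == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(source.length());
        // Iterate by code point so surrogate pairs become a single reference.
        for (int i = 0; i < source.length(); ) {
            int cp = source.codePointAt(i);
            String named = NAMED.get(cp);
            if (named != null) {
                out.append(named);                        // known named entity
            } else if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');  // decimal fallback
            } else {
                out.appendCodePoint(cp);                  // plain ASCII
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("nhi\u00EAn \u1ED9 <b>"));
        // prints: nhi&ecirc;n &#7897; &lt;b&gt;
    }
}
```

Output produced this way mixes named and decimal references, which is the same shape as the encoded string in the question at the top.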
Happy Learning !!
References:
HTML 4.01 Character References
StringEscapeUtils.escapeHtml4()