PostgreSQL: convert HTML to text

Use xpath

Store your content in the XML datatype rather than as "second-class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM loadHTML() and saveXML() methods).
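
As a rough illustration of why the HTML-to-XHTML step matters, the sketch below checks two hypothetical input strings before they would be cast to xml. It assumes PostgreSQL 9.1 or later built with libxml support, where xml_is_well_formed_document() is available; the sample markup is invented for the example.

```sql
-- Hypothetical inputs: the same fragment as XHTML (closed <br/>) and as plain HTML (<br>).
SELECT
  xml_is_well_formed_document(
    '<html><body><p>Line one<br/>Line two</p></body></html>') AS xhtml_ok,  -- true
  xml_is_well_formed_document(
    '<html><body><p>Line one<br>Line two</p></body></html>')  AS html_ok;   -- false: <br> is never closed
```

Only the first string could be stored in an xml column; the second would have to go through a converter such as HTML-Tidy first.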

IT IS FAST AND VERY SAFE!

The common information-retrieval need is not the full content but something inside the XHTML, so the power of xpath is welcome.

Example: retrieve all paragraphs with class="fn":

  WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] as frags
    FROM t 
  ) SELECT array_to_string(frags,' ') AS my_p_fn2txt
    FROM needinfo
    WHERE array_length(frags , 1)>0
  -- for full content use xpath('//text()',xhtml)
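
One thing that may bite with real converter output: XHTML usually declares the XHTML namespace, and then an unprefixed step such as //p matches nothing. A minimal sketch, assuming a hypothetical one-row sample in place of table t and an arbitrary prefix x, using the three-argument form of xpath():

```sql
-- Hypothetical sample row standing in for table t; the xmlns declaration is
-- what HTML-to-XHTML converters typically emit.
WITH sample(xhtml) AS (
  SELECT '<html xmlns="http://www.w3.org/1999/xhtml"><body><p class="fn">First <b>note</b></p><p>other</p></body></html>'::xml
)
SELECT
  -- No namespace mapping: the unprefixed step "p" does not match namespaced
  -- elements, so this returns an empty array.
  xpath('//p[@class="fn"]//text()', xhtml)::text[] AS without_ns,
  -- Prefix "x" (an arbitrary alias) mapped to the XHTML namespace: this
  -- returns the two text nodes 'First ' and 'note'.
  xpath('//x:p[@class="fn"]//text()', xhtml,
        ARRAY[ARRAY['x', 'http://www.w3.org/1999/xhtml']])::text[] AS with_ns
FROM sample;
```

If the stored documents carry no xmlns declaration, the two-argument form used in the query above works as is.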

regex solutions...

I do not recommend them, because they are not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.

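 CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
     -- Inner call: replace each tag that carries an alt attribute with its alt text.
     -- Outer call: remove every remaining tag.
     -- Backslashes are doubled because E'...' strings treat \ as an escape character.
     SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3', 'g'),
        E'(?x)(< [^>]*? >)', '', 'g')
 $$ LANGUAGE SQL;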
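
To see what the two passes do, here is a small sketch that inlines both regexp_replace calls on a hypothetical input. The patterns are dollar-quoted ($re$...$re$) so that no backslash doubling is needed; the sample markup and the column alias are assumptions made for the example.

```sql
SELECT regexp_replace(
         -- Pass 1: a tag carrying alt="..." (or alt='...') is replaced by its alt text (group 3).
         regexp_replace(
           '<p>Logo: <img src="a.png" alt="ACME"></p>',
           $re$(?x)<[^>]*?(\s alt \s* = \s* (['"]) ([^>]*?) \2) [^>]*? >$re$,
           E'\\3', 'g'),
         -- Pass 2: every remaining tag is removed.
         $re$(?x)(< [^>]*? >)$re$, '', 'g') AS stripped;
```

The intent is that the image is reduced to its alt text before the surrounding tags are stripped, so the alt text survives in the result.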

See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.

Author - Kailash

Problem: How to create a function in Postgres that removes HTML tags from a piece of text?

Solution: Create the function in Postgres:

CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
    SELECT regexp_replace($1, '<[^>]*>', '', 'g')
$$ LANGUAGE SQL;

How to use:

SELECT strip_tags('<br/>Kailash<br/>Kumar');
Output: KailashKumar

Note: This function removes everything between a < and the next > symbol. If the HTML is malformed, or the text itself contains a literal < character, real text may be removed as well, so check your HTML before passing it through this function.
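
The caveat in the note can be reproduced with the same pattern applied directly to two hypothetical strings (the inputs and column aliases are made up for illustration):

```sql
SELECT
  -- A literal "<" in the text is treated as the start of a tag, so " b then " is lost.
  regexp_replace('if a < b then <b>bold</b> text', '<[^>]*>', '', 'g') AS stray_bracket,
  -- result: if a bold text

  -- A ">" inside an attribute value ends the "tag" early, leaving attribute debris behind.
  regexp_replace('<a title="x > y">link</a>', '<[^>]*>', '', 'g') AS gt_in_attribute;
  -- result:  y">link
```

Both cases are valid HTML, which is why the regex approach is best kept for markup you already know to be simple.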

| Function | Return Type | Description | Example | Result |
|---|---|---|---|---|
| ascii(string) | int | ASCII code of the first byte of the argument | ascii('x') | 120 |
| btrim(string text [, characters text]) | text | Remove the longest string consisting only of characters in characters (a space by default) from the start and end of string | btrim('xyxtrimyyx', 'xy') | trim |
| chr(int) | text | Character with the given ASCII code | chr(65) | A |
| convert(string text, [src_encoding name,] dest_encoding name) | text | Convert string to dest_encoding. The original encoding is specified by src_encoding. If src_encoding is omitted, database encoding is assumed. | convert('text_in_utf8', 'UTF8', 'LATIN1') | text_in_utf8 represented in ISO 8859-1 encoding |
| decode(string text, type text) | bytea | Decode binary data from string previously encoded with encode. Parameter type is same as in encode. | decode('MTIzAAE=', 'base64') | 123\000\001 |
| encode(data bytea, type text) | text | Encode binary data to different representation. Supported types are: base64, hex, escape. Escape merely outputs null bytes as \000 and doubles backslashes. | encode(E'123\\000\\001', 'base64') | MTIzAAE= |
| initcap(string) | text | Convert the first letter of each word to uppercase and the rest to lowercase. Words are sequences of alphanumeric characters separated by non-alphanumeric characters. | initcap('hi THOMAS') | Hi Thomas |
| length(string) | int | Number of characters in string | length('jose') | 4 |
| lpad(string text, length int [, fill text]) | text | Fill up the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right). | lpad('hi', 5, 'xy') | xyxhi |
| ltrim(string text [, characters text]) | text | Remove the longest string containing only characters from characters (a space by default) from the start of string | ltrim('zzzytrim', 'xyz') | trim |
| md5(string) | text | Calculates the MD5 hash of string, returning the result in hexadecimal | md5('abc') | 900150983cd24fb0d6963f7d28e17f72 |
| pg_client_encoding() | name | Current client encoding name | pg_client_encoding() | SQL_ASCII |
| quote_ident(string) | text | Return the given string suitably quoted to be used as an identifier in an SQL statement string. Quotes are added only if necessary (i.e., if the string contains non-identifier characters or would be case-folded). Embedded quotes are properly doubled. | quote_ident('Foo bar') | "Foo bar" |
| quote_literal(string) | text | Return the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled. | quote_literal('O\'Reilly') | 'O''Reilly' |
| regexp_replace(string text, pattern text, replacement text [,flags text]) | text | Replace substring matching POSIX regular expression. See Section 9.7 for more information on pattern matching. | regexp_replace('Thomas', '.[mN]a.', 'M') | ThM |
| repeat(string text, number int) | text | Repeat string the specified number of times | repeat('Pg', 4) | PgPgPgPg |
| replace(string text, from text, to text) | text | Replace all occurrences in string of substring from with substring to | replace('abcdefabcdef', 'cd', 'XX') | abXXefabXXef |
| rpad(string text, length int [, fill text]) | text | Fill up the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated. | rpad('hi', 5, 'xy') | hixyx |
| rtrim(string text [, characters text]) | text | Remove the longest string containing only characters from characters (a space by default) from the end of string | rtrim('trimxxxx', 'x') | trim |
| split_part(string text, delimiter text, field int) | text | Split string on delimiter and return the given field (counting from one) | split_part('abc~@~def~@~ghi', '~@~', 2) | def |
| strpos(string, substring) | int | Location of specified substring (same as position(substring in string), but note the reversed argument order) | strpos('high', 'ig') | 2 |
| substr(string, from [, count]) | text | Extract substring (same as substring(string from from for count)) | substr('alphabet', 3, 2) | ph |
| to_ascii(string text [, encoding text]) | text | Convert string to ASCII from another encoding (only supports conversion from LATIN1, LATIN2, LATIN9, and WIN1250 encodings) | to_ascii('Karel') | Karel |
| to_hex(number int or bigint) | text | Convert number to its equivalent hexadecimal representation | to_hex(2147483647) | 7fffffff |
| translate(string text, from text, to text) | text | Any character in string that matches a character in the from set is replaced by the corresponding character in the to set | translate('12345', '14', 'ax') | a23x5 |
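
Several of the functions listed above combine naturally with tag stripping. A minimal sketch on a hypothetical input: tags are replaced with a space rather than an empty string (so adjacent words do not run together), whitespace runs are collapsed, and btrim removes the leading and trailing space.

```sql
SELECT btrim(
         regexp_replace(
           -- Step 1: replace every tag with a single space.
           regexp_replace('<p>  Hello <b>world</b> </p>', '<[^>]*>', ' ', 'g'),
           -- Step 2: collapse runs of whitespace (backslash doubled inside E'...').
           E'\\s+', ' ', 'g')
       ) AS cleaned;
-- cleaned: Hello world
```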