Postgresql convert html to text

Use xpath

Feed your database with XML datatype, not with "second class" TEXT, because is very simple to convert HTML into XHTML [see HTML-Tidy or standard DOM's loadHTML[] and saveXML[] methods].

! IT IS FAST AND IS VERY SAFE !

The commom information retrieval need, is not a full content, but something into the XHTML, so the power of xpath is wellcome.

Example: retrive all paragraphs with class="fn":

  WITH needinfo AS [
    SELECT *, xpath['//p[@class="fn"]//text[]', xhtml]::text[] as frags
    FROM t 
  ] SELECT array_to_string[frags,' '] AS my_p_fn2txt
    FROM needinfo
    WHERE array_length[frags , 1]>0
  -- for full content use xpath['//text[]',xhtml]

regex solutions...

I not recomend because is not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL", for me is better than use Perl [se @Daniel's solution] or another.

 CREATE OR REPLACE FUNCTION strip_tags[TEXT] RETURNS TEXT AS $$
     SELECT regexp_replace[
        regexp_replace[$1, E'[?x]]*?[\s alt \s* = \s* [[\'"]] [[^>]*?] \2] [^>]*? >', E'\3'], 
       E'[?x][< [^>]*? >]', '', 'g']
 $$ LANGUAGE SQL;

See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.

Author - Kailash

Problem : How to create a function in Postgres that will remove HTML tags from a piece of text?

Solution : Create function in postgres :

CREATE OR REPLACE FUNCTION strip_tags[TEXT] RETURNS TEXT AS $$
    SELECT regexp_replace[$1, ']*>', '', 'g']
$$ LANGUAGE SQL;

How to use :

SELECT strip_tags['
Kailash
Kumar'];
Output: KailashKumar

Note: This function will remove all the content between < and > symbol. If HTML tags are not proper then your text may also get removed so check your HTML before parsing it through this function.

Webner Solutions is a Software Development company focused on developing Insurance Agency Management Systems, Learning Management Systems and Salesforce apps. Contact us at for your Insurance, eLearning and Salesforce applications.

ascii[string] int ASCII code of the first byte of the argument ascii['x'] 120 btrim[string text [, characters text]] text Remove the longest string consisting only of characters in characters [a space by default] from the start and end of string btrim['xyxtrimyyx', 'xy'] trim chr[int] text Character with the given ASCII code chr[65] A convert[string text, [src_encoding name,] dest_encoding name] text Convert string to dest_encoding. The original encoding is specified by src_encoding. If src_encoding is omitted, database encoding is assumed. convert[ 'text_in_utf8', 'UTF8', 'LATIN1'] text_in_utf8 represented in ISO 8859-1 encoding decode[string text, type text] bytea Decode binary data from string previously encoded with encode. Parameter type is same as in encode. decode['MTIzAAE=', 'base64'] 123\000\001 encode[data bytea, type text] text Encode binary data to different representation. Supported types are: base64, hex, escape. Escape merely outputs null bytes as \000 and doubles backslashes. encode[ E'123\\000\\001', 'base64'] MTIzAAE= initcap[string] text Convert the first letter of each word to uppercase and the rest to lowercase. Words are sequences of alphanumeric characters separated by non-alphanumeric characters. initcap['hi THOMAS'] Hi Thomas length[string] int Number of characters in string length['jose'] 4 lpad[string text, length int [, fill text]] text Fill up the string to length length by prepending the characters fill [a space by default]. If the string is already longer than length then it is truncated [on the right]. lpad['hi', 5, 'xy'] xyxhi ltrim[string text [, characters text]] text Remove the longest string containing only characters from characters [a space by default] from the start of string ltrim['zzzytrim', 'xyz'] trim md5[string] text Calculates the MD5 hash of string, returning the result in hexadecimal md5['abc'] 900150983cd24fb0 d6963f7d28e17f72 pg_client_encoding[] name Current client encoding name pg_client_encoding[] SQL_ASCII quote_ident[string] text Return the given string suitably quoted to be used as an identifier in an SQL statement string. Quotes are added only if necessary [i.e., if the string contains non-identifier characters or would be case-folded]. Embedded quotes are properly doubled. quote_ident['Foo bar'] "Foo bar" quote_literal[string] text Return the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled. quote_literal[ 'O\'Reilly'] 'O''Reilly' regexp_replace[string text, pattern text, replacement text [,flags text]] text Replace substring matching POSIX regular expression. See Section 9.7 for more information on pattern matching. regexp_replace['Thomas', '.[mN]a.', 'M'] ThM repeat[string text, number int] text Repeat string the specified number of times repeat['Pg', 4] PgPgPgPg replace[string text, from text, to text] text Replace all occurrences in string of substring from with substring to replace[ 'abcdefabcdef', 'cd', 'XX'] abXXefabXXef rpad[string text, length int [, fill text]] text Fill up the string to length length by appending the characters fill [a space by default]. If the string is already longer than length then it is truncated. rpad['hi', 5, 'xy'] hixyx rtrim[string text [, characters text]] text Remove the longest string containing only characters from characters [a space by default] from the end of string rtrim['trimxxxx', 'x'] trim split_part[string text, delimiter text, field int] text Split string on delimiter and return the given field [counting from one] split_part['abc~@~def~@~ghi', '~@~', 2] def strpos[string, substring] int Location of specified substring [same as position[substring in string], but note the reversed argument order] strpos['high', 'ig'] 2 substr[string, from [, count]] text Extract substring [same as substring[string from from for count]] substr['alphabet', 3, 2] ph to_ascii[string text [, encoding text]] text Convert string to ASCII from another encoding [only supports conversion from LATIN1, LATIN2, LATIN9, and WIN1250 encodings] to_ascii['Karel'] Karel to_hex[number int or bigint] text Convert number to its equivalent hexadecimal representation to_hex[2147483647] 7fffffff translate[string text, from text, to text] text Any character in string that matches a character in the from set is replaced by the corresponding character in the to set translate['12345', '14', 'ax'] a23x5

Chủ Đề