Use xpath
Feed your database with XML datatype, not with "second class" TEXT, because is very simple to convert HTML into XHTML [see HTML-Tidy or standard DOM's loadHTML[]
and saveXML[]
methods].
! IT IS FAST AND IS VERY SAFE !
The commom information retrieval need, is not a full content, but something into the XHTML, so
the power of xpath
is wellcome.
Example: retrive all paragraphs with class="fn"
:
WITH needinfo AS [
SELECT *, xpath['//p[@class="fn"]//text[]', xhtml]::text[] as frags
FROM t
] SELECT array_to_string[frags,' '] AS my_p_fn2txt
FROM needinfo
WHERE array_length[frags , 1]>0
-- for full content use xpath['//text[]',xhtml]
regex solutions...
I not recomend because is not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.
I like "pure SQL", for me is better than use Perl [se @Daniel's solution] or another.
CREATE OR REPLACE FUNCTION strip_tags[TEXT] RETURNS TEXT AS $$
SELECT regexp_replace[
regexp_replace[$1, E'[?x]]*?[\s alt \s* = \s* [[\'"]] [[^>]*?] \2] [^>]*? >', E'\3'],
E'[?x][< [^>]*? >]', '', 'g']
$$ LANGUAGE SQL;
See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.
Author - Kailash
Problem : How to create a function in Postgres that will remove HTML tags from a piece of text?
Solution : Create function in postgres :
CREATE OR REPLACE FUNCTION strip_tags[TEXT] RETURNS TEXT AS $$ SELECT regexp_replace[$1, ']*>', '', 'g'] $$ LANGUAGE SQL;
How to use :
SELECT strip_tags['KailashKumar']; Output: KailashKumar
Note: This function will remove all the content between < and > symbol. If HTML tags are not proper then your text may also get removed so check your HTML before parsing it through this function.
Webner Solutions is a Software Development company focused on developing Insurance Agency Management Systems, Learning Management Systems and Salesforce apps. Contact us at for your Insurance, eLearning and Salesforce applications.
ascii
[string]
int
ASCII code of the first byte of the argument
ascii['x']
120
btrim
[string text [, characters text]]
text
Remove the longest string consisting only of characters in characters [a space by default] from the start and end of string
btrim['xyxtrimyyx', 'xy']
trim
chr
[int]
text
Character with the given ASCII code
chr[65]
A
convert
[string text, [src_encoding name,] dest_encoding name]
text
Convert string to dest_encoding. The original encoding is specified by src_encoding. If src_encoding is omitted, database encoding is assumed.
convert[ 'text_in_utf8', 'UTF8', 'LATIN1']
text_in_utf8 represented in ISO 8859-1 encoding
decode
[string text, type text]
bytea
Decode binary data from string previously encoded with encode
. Parameter type is same as in encode
.
decode['MTIzAAE=', 'base64']
123\000\001
encode
[data bytea, type text]
text
Encode binary data to different representation. Supported types are: base64, hex, escape. Escape merely outputs null bytes as \000 and doubles backslashes.
encode[ E'123\\000\\001', 'base64']
MTIzAAE=
initcap
[string]
text
Convert the first letter of each word to uppercase and the rest to lowercase. Words are sequences of alphanumeric characters separated by non-alphanumeric
characters.
initcap['hi THOMAS']
Hi Thomas
length
[string]
int
Number of characters in string
length['jose']
4
lpad
[string text, length int [, fill text]]
text
Fill up the string to length length by prepending the characters fill [a space by default]. If the string is already longer than length then it is truncated [on the right].
lpad['hi', 5, 'xy']
xyxhi
ltrim
[string text [, characters text]]
text
Remove the longest string containing only characters from characters [a space by default] from the start of string
ltrim['zzzytrim', 'xyz']
trim
md5
[string]
text
Calculates the MD5 hash of string, returning the result in
hexadecimal
md5['abc']
900150983cd24fb0 d6963f7d28e17f72
pg_client_encoding
[]
name
Current client encoding name
pg_client_encoding[]
SQL_ASCII
quote_ident
[string]
text
Return the given string suitably quoted to be used as an identifier in an SQL statement string. Quotes are added only if necessary [i.e., if the string contains non-identifier characters or would be case-folded]. Embedded quotes are properly doubled.
quote_ident['Foo bar']
"Foo bar"
quote_literal
[string]
text
Return the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled.
quote_literal[ 'O\'Reilly']
'O''Reilly'
regexp_replace
[string text, pattern text, replacement text [,flags text]]
text
Replace substring matching POSIX regular expression. See Section 9.7 for more information on pattern matching.
regexp_replace['Thomas', '.[mN]a.', 'M']
ThM
repeat
[string text, number int]
text
Repeat string the specified number of times
repeat['Pg', 4]
PgPgPgPg
replace
[string text, from text, to text]
text
Replace all occurrences in string of substring from with substring to
replace[ 'abcdefabcdef', 'cd', 'XX']
abXXefabXXef
rpad
[string text, length int [, fill text]]
text
Fill up the string to length length by appending the characters fill [a space by default]. If the string is already longer than length then it is truncated.
rpad['hi', 5, 'xy']
hixyx
rtrim
[string text [, characters text]]
text
Remove the longest string containing only characters from characters [a space by default] from the end of string
rtrim['trimxxxx', 'x']
trim
split_part
[string text, delimiter text, field int]
text
Split string on delimiter and return the given field [counting from one]
split_part['abc~@~def~@~ghi', '~@~', 2]
def
strpos
[string, substring]
int
Location of specified substring [same as position[substring in string], but note the reversed argument order]
strpos['high', 'ig']
2
substr
[string, from [, count]]
text
Extract substring [same as substring[string from from for count]]
substr['alphabet', 3, 2]
ph
to_ascii
[string text [, encoding text]]
text
Convert string to ASCII from another encoding [only supports conversion from LATIN1, LATIN2, LATIN9, and WIN1250 encodings]
to_ascii['Karel']
Karel
to_hex
[number int or bigint]
text
Convert number to its equivalent hexadecimal representation
to_hex[2147483647]
7fffffff
translate
[string text, from text, to text]
text
Any character in string that matches a character in the from set is replaced by the corresponding character in the to set
translate['12345', '14', 'ax']
a23x5