Introduction to PostgreSQL Full-Text Retrieval

PostgreSQL comes with a simple full-text retrieval engine that enables full-text retrieval of small amounts of data.In this article, we will guide you through this feature, which is sufficient for small-volume searches without the need to build additional heavy-weight full-text retrieval servers such as ES.

For detailed full-text retrieval functions, see Official Documents .Thank you for the PostgreSQL Chinese Community Translate Documents

Getting Started with Full-Text Retrieval of PostgreSQL

The full-text search operator for PG is @@, returns true when a tsvector (document) and tsquery (condition) match, and the order is unaffected:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f

Starting with PostgreSQL 10, jsonb also supports full-text retrieval.

As with ordinary SQL queries, simply using this symbol in a WHERE condition means filtering the document using a full-text retrieval condition.For example:

SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

The @@operator supports implicit conversions and does not require strong type conversions for text types (:: tsvectors or to_tsvector(config_name, text)), so the parameter types actually supported by this operator are as follows:

tsvector @@ tsquery
tsquery  @@ tsvector
text @@ tsquery
text @@ text

NOTE: In practice, it is recommended that to_tsvector(config_name, text) and to_tsquery(config_name, text) be used as explicit strong type conversions.Because if you use an implicit conversion or a default parameter conversion such as:tsvectors, you will use default_text_search_config Configure word segmentation, which defaults to pg_catalog.simple and may not apply to most query scenarios.It is therefore recommended that to_tsvector() and to_tsquery() functions be used for explicit strong type conversions, and that a word breaker dictionary be specified to achieve more precise query requirements.

About tsquery

tsquery Query criteria are not simple rules, but a set of search terms that use and combine Boolean operators &(AND), |(OR), and! (NOT), as well as phrase search operators <-> (FOLLOWED BY).See for more detailed syntax This document.

PostgreSQL also provides two relatively simplified versions of plain_tsquery and phraseto_tsquery.

Plain_tsquery (plain_tsquery ([config regconfig,] querytext) returns tsquery) Users change the unformatted text to tsquery after partitioning:

SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery 
-----------------
 'fat' & 'rat'

Phraseto_tsquery (phraseto_tsquery ([config regconfig,] querytext) behaves like plain_tsquery, but it does not insert after the word break <-> (FOLLOWED BY):

SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' <-> 'rat'

About Indexes

Using an index can speed up full-text retrieval.For full-text retrieval, the optional index types are GIN (Universal Inverted Index) and GIST (Universal Search Tree), which are preferred for official documents. GIN Index .Example of creating a GIN index:

CREATE INDEX pgweb_idx ON pgweb USING GIN(to_tsvector('english', body));

It can also be a join column:

CREATE INDEX pgweb_idx ON pgweb USING GIN(to_tsvector('english', title || ' ' || body));

You can also create a tsvector column separately to index it:

ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
     to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));

CREATE INDEX textsearch_idx ON pgweb USING GIN(textsearchable_index_col);

NOTE:

  • To create an index based on a GIN (Universal Inverted Index), the column must be of type tsvector s.Therefore, you need to explicitly type the column.
  • The two-parameter version using the to_tsvector() function specifies the full-text retrieval configuration, so the same full-text retrieval configuration must be used to _tsvector() to hit the index.That is, WHERE to_tsvector ('english', body) @@'a & B'can use the index, but WHERE to_tsvector (body) @@'a & B' cannot.
  • When using a separate column to store the tsvectors representation, it is necessary to create a trigger to ensure that the tsvectors are listed as the current value when the title or body changes. See documentation for details.
  • GIN indexes only store words (word positions) for tsvector s and do not store their weight labels.Therefore, when using queries involving weights, you need a recheck on the table rows.

About sorting

In addition to the normal ORDER BY condition, PostgreSQL provides two optional sorting functions ts_rank (weights float4[],] vector TS vector, query TS query [, normalization integer]) returns float4 and ts_rank_cd (weights float4[],] vector TS vector, query TS query [, normalization integer]) returns float4 for full-text retrieval to achieve weight-based sorting.

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

Additionally, PostgreSQL version 9.6 or higher can be used RUM index Sort.(Note that this is an extension and is not included by default).

Chinese Full Text Retrieval in PostgreSQL

PostgreSQL's default word breaker dictionary does not contain a Chinese word breaker dictionary, so we must introduce it manually.A better project at the moment is zhparser At the same time, this plugin is also included by default by RDS of Aliyun.There's nothing to say about installing and enabling.It is worth mentioning that the word segmentation configuration parameters.

After CREATE EXTENSION, you must configure the word breaking parameters to correct word breaking and lookup or nothing will be found.One of the configuration strategies provided by the official documentation is:

CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;

The letters n,v,a,i,e,l represent a token policy. Only these token mapping s are enabled, and the rest are blocked.Specific supported parameters and meanings can be displayed with \dFp+zhparser:

postgres=# \dFp+ zhparser 
      Text search parser "public.zhparser"
     Method      |    Function     | Description 
-----------------+-----------------+-------------
 Start parse     | zhprs_start     | 
 Get next token  | zhprs_getlexeme | 
 End parse       | zhprs_end       | 
 Get headline    | prsd_headline   | (internal)
 Get token types | zhprs_lextype   | 

Token types for parser "public.zhparser"
 Token name |      Description       
------------+------------------------
 a          | adjective,Adjective
 b          | differentiation,Differential Words
 c          | conjunction,Conjunction
 d          | adverb,adverb
 e          | exclamation,Interjection
 f          | position,Location Words
 g          | root,Root
 h          | head,Anterior connector
 i          | idiom,idiom
 j          | abbreviation,Abbreviation
 k          | tail,Posterior Connecting Component
 l          | tmp,Idiom
 m          | numeral,Numeral
 n          | noun,noun
 o          | onomatopoeia,An onomatopoeia
 p          | prepositional,Preposition
 q          | quantity,Classifier
 r          | pronoun,Pronoun
 s          | space,place
 t          | time,Temporal morphemes
 u          | auxiliary,Auxiliary word
 v          | verb,verb
 w          | punctuation,Punctuation
 x          | unknown,Unknown word
 y          | modal,Statement label designator
 z          | status,State Words
(26 rows)

WITH simple means that the dictionary uses a built-in simple dictionary, that is, only lowercase conversion.Dictionary and token mapping can be defined flexibly as needed to achieve functions such as masking words and merging synonyms.

For example, let's look at the following example:

-- The following full-text retrieval configuration comes from zhparser Examples of documents
ALTER TEXT SEARCH CONFIGURATION zhparser ADD MAPPING FOR n,v,a,i,e,l WITH simple;
postgres=# SELECT to_tsquery('zhparser','Jianghuai Duvui');
     to_tsquery     
--------------------
 'Du' & 'Volt' & 'WEI'
(1 row)

You can see that the phrase "Jianghuai" has been ignored in the query. Let's turn on j(abbreviation) to see the result:

postgres=# ALTER TEXT SEARCH CONFIGURATION zhparser ADD MAPPING FOR j WITH simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# SELECT to_tsquery('zhparser','Jianghuai Duvui');
         to_tsquery          
-----------------------------
 'Yangtze and Huai rivers' & 'Du' & 'Volt' & 'WEI'
(1 row)

Therefore, to set reasonable token types in actual use, too little will lead to inaccurate search results, too much will lead to poor performance.In addition, there are other options such as compound short words: zhparser.multi_short = f to control the result of word segmentation, which can be turned on as appropriate according to actual use.

Keywords: Programming PostgreSQL SQL REST

Added by pumaf1 on Fri, 17 May 2019 00:39:42 +0300