Label
PostgreSQL, Fuzzy Query, Regular Query, pg_trgm, bytea, gin, Functional Index
background
Pre-fuzziness (with prefix fuzziness), post-fuzziness (with suffix fuzziness), pre-and post-fuzziness (without prefix fuzziness), and regular matching are common requirements in the field of text search.
PostgreSQL has a strong text search capability. Besides supporting full-text search, it also supports fuzzy query and regular query. The built-in pg_trgm plug-in is not available in general databases and may not have been heard of by many people. At the same time, the functions of expression index and GIN index are built-in.
Different fuzzy query requirements have different optimization methods.
PostgreSQL, like other databases, can be accelerated with btree for forward and backward ambiguities. Post-ambiguity can be accelerated by using functional index of inversion function.
For forward and backward fuzziness and regular matching, one method is to use pg_trgm plug-in and use GIN index to speed up the fuzziness and regular query (the effect of inputting three or more characters is very good). Another method is to customize GIN expression index, which is suitable for customized fuzzy query.
I. Optimization of Front-Fuzzy and Back-Fuzzy
1. Pre-Fuzzy (Pre-Fuzzy) Optimization Method
Pre-ambiguous queries can be supported with b-tree. Only suitable for collate="C" queries, when the default database lc_collate<>C, index and query need to specify collate "C" explicitly.
The collate s of indexes and query conditions must be consistent in order to use indexes.
Example
test=# create table test(id int, info text); CREATE TABLE test=# insert into test select generate_series(1,1000000),md5(random()::text); INSERT 0 1000000 test=# create index idx on test(info collate "C"); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1) Output: id, info Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "C") Buffers: shared hit=18 read=3 Planning time: 0.424 ms Execution time: 0.124 ms (7 rows)
2. Optimizing Method of Postfuzzy (Fuzzy with Suffix)
Using reverse function index, post-ambiguous queries can be supported. Only suitable for collate="C" queries, when the default database lc_collate<>C, index and query need to specify collate "C" explicitly.
The collate s of indexes and query conditions must be consistent in order to use indexes.
Example
test=# create index idx1 on test(reverse(info) collate "C"); CREATE INDEX test=# select * from test limit 1; id | info ----+---------------------------------- 1 | b3275976cdd437a033d4329775a52514 (1 row) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C"; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1) Output: id, info Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text)) Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C") Buffers: shared hit=18 read=3 Planning time: 0.128 ms Execution time: 0.122 ms (7 rows) test=# select * from test where reverse(info) like '4152%' collate "C"; id | info --------+---------------------------------- 847904 | abe2ecd90393b5275df8e34a39702514 414702 | 97f66d26545329321164042657d02514 191232 | 7820972c6220c2b01d46c11ebb532514 752742 | 93232ac39c6632e2540df44627c42514 217302 | 39e518893a1a7b1e691619bd1fc42514 1 | b3275976cdd437a033d4329775a52514 615718 | 4948f94c484c13dc6c4fae8a3db52514 308815 | fc2918ceff7c7a4dafd2e04031062514 149521 | 546d963842ea5ca593e622c810262514 811093 | 4b6eca2eb6b665af67b2813e91a62514 209000 | 1dfd0d4e326715c1739f031cca992514 937616 | 8827fd81f5b673fb5afecbe3e11b2514 419553 | bd6e01ce360af16137e8b6abc8ab2514 998324 | 7dff51c19dc5e5d9979163e7d14c2514 771518 | 8a54e30003a48539fff0aedc73ac2514 691566 | f90368348e3b6bf983fcbe10db2d2514 652274 | 8bf4a97b5f122a5540a21fa85ead2514 233437 | 739ed715fc203d47e37e79b5bcbe2514 (18 rows)
3. Integrative optimization method of forward and backward ambiguity
Using the pg_trgm index, you can support forward and backward ambiguous queries. Note:
(Fuzzy with prefix) Enter at least one character (Fuzzy with suffix) and at least two characters. Only good index filtering effect. To support Chinese, the database lc_collate and lc_ctype cannot be "C".
The collate s of indexes and query conditions must be consistent in order to use indexes.
test=# \l+ test List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description ------+----------+----------+------------+------------+-------------------+--------+------------+------------- test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default | (1 row) test=# create extension pg_trgm; test=# create table test001(c1 text); CREATE TABLE
Functions for Generating Random Chinese Strings
test=# create or replace function gen_hanzi(int) returns text as $$ declare res text; begin if $1 >=1 then select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1); return res; end if; return null; end; $$ language plpgsql strict; CREATE FUNCTION
Generating Random Data
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000); INSERT 0 100000 test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops); CREATE INDEX test=# select * from test001 limit 5; c1 ------------------------------------------ Noise-provoking Office of Xing He's flourishing and eloquent. It's a good idea to be very skillful and well-organized at the same time. It's a good idea to be a good friend of younger generations and to be a good friend of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations of younger generations. The two sorrows of Jiazhao Peng and Zheng Zheng violate the law. Swimming, crashing and penetrating, and staring at each other. Publishing and punishing the dragon in Qiaopu is full of vigor and vigour. (5 rows)
Fuzzy Query
test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'you%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ 'you%'::text) Heap Blocks: exact=3 Buffers: shared hit=7 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1) Index Cond: (test001.c1 ~~ 'you%'::text) Buffers: shared hit=4 Planning time: 0.119 ms Execution time: 0.063 ms (10 rows) test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'% skeleton'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%Chinese style'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=2 Buffers: shared hit=6 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1) Index Cond: (test001.c1 ~~ '%Chinese style'::text) Buffers: shared hit=4 Planning time: 0.136 ms Execution time: 0.062 ms (11 rows)
2. Fuzzy optimization
pg_trgm plug-in is used to support ambiguous queries. Be careful:
If pg_trgm is needed to support Chinese fuzzy queries, the database lc_collate and lc_ctype cannot be "C".
For example, input three or more characters, otherwise the effect is not good.
Example
test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'% Xinxing He%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%Xin Xinghe%'::text) Heap Blocks: exact=1 Buffers: shared hit=5 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1) Index Cond: (test001.c1 ~~ '%Xin Xinghe%'::text) Buffers: shared hit=4 Planning time: 0.170 ms Execution time: 0.076 ms (10 rows) test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'% Xing%'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%Xing Xing%'::text) Rows Removed by Index Recheck: 99999 Heap Blocks: exact=1137 Buffers: shared hit=14429 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1) Index Cond: (test001.c1 ~~ '%Xing Xing%'::text) Buffers: shared hit=13292 Planning time: 0.133 ms Execution time: 178.265 ms (11 rows)
3. Optimization of Regular Matching
At present, the effect of pg_trgm on the regular matching of Chinese (or multi-byte characters) is not good, and that of ascii characters is good.
Rows Removed by Index Recheck is very small, indicating that the index is very filterable.
Example
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1) Output: c1 Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=1 Buffers: shared hit=50 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1) Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text) Buffers: shared hit=49 Planning time: 0.452 ms Execution time: 0.221 ms (11 rows) test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 ~'Xinxing He'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=143.846..232.278 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~ 'Xin Xinghe'::text) Rows Removed by Index Recheck: 99999 Heap Blocks: exact=1137 Buffers: shared hit=14429 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=143.688..143.688 rows=100000 loops=1) Index Cond: (test001.c1 ~ 'Xin Xinghe'::text) Buffers: shared hit=13292 Planning time: 0.254 ms Execution time: 232.312 ms (11 rows)
Principle of pg_trgm Fuzzy Query
First, pg_trgm adds two spaces to the front end of the string and one space to the end.
Then, each consecutive three characters is a TOKEN, which is disassembled.
Finally, GIN inverted index is established for TOKEN.
To view the TOKEN of a string, you can use the following method.
test=# select show_trgm('123'); show_trgm ------------------------- {" 1"," 12",123,"23 "} (1 row)
Reasons for Requirement of Number of Fuzzy Characters around pg_trgm
When using pg_trgm, if you want to get the best results, you'd better satisfy these conditions.
1. Fuzzy queries with prefixes, such as a%, require at least one character. (The search is for token ='a')
2. Fuzzy queries with suffixes, such as% ab, need to provide at least two characters. (The search is for token='ab')
3. Fuzzy queries, such as% abcd, need to provide at least three characters. (This uses an array search for token(s) containing {"a", "ab", "abc", "bcd"}
Why?
Because the TOKEN generated by pg_trgm is three characters, only under the above three conditions can the corresponding TOKEN be matched.
test=# select show_trgm('123'); show_trgm ------------------------- {" 1"," 12",123,"23 "} (1 row)
4. Optimization of Fuzzy Query with Less than 3 Input Characters
pg_trgm can't satisfy the requirement when we need to fuzzily search one or two characters before and after, but we can use the expression GIN index.
Using expressions, the string is split into a single word, two consecutive character arrays, and the GIN index can be established for logarithmic groups.
Example
test=# create or replace function split001(text) returns text[] as $$ declare res text[]; begin select regexp_split_to_array($1,'') into res; for i in 1..length($1)-1 loop res := array_append(res, substring($1,i,2)); end loop; return res; end; $$ language plpgsql strict immutable; CREATE FUNCTION test=# create index idx_test001_2 on test001 using gin (split001(c1)); test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where split001 (c1) @ > array ['hello']; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1) Output: c1 Recheck Cond: (split001(test001.c1) @> '{Hello}'::text[]) Buffers: shared hit=4 -> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1) Index Cond: (split001(test001.c1) @> '{Hello}'::text[]) Buffers: shared hit=4 Planning time: 0.104 ms Execution time: 0.068 ms (9 rows) test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where split001 (c1) @ > array ['you']; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1) Output: c1 Recheck Cond: (split001(test001.c1) @> '{you}'::text[]) Heap Blocks: exact=80 Buffers: shared hit=84 -> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1) Index Cond: (split001(test001.c1) @> '{you}'::text[]) Buffers: shared hit=4 Planning time: 0.101 ms Execution time: 0.217 ms (10 rows) test=# Select * from test001 where split001 (c1) @ > array ['you']; c1 ------------------------------------------ //The great and dirty breasts of Hong Xian and Xian Xian collapsed and your captain collapsed in the vicinity of Hong Xian and Xian Xian.and your captain's captain's captain's captain's captain's captain's captain's captain's captain's captain's captain's captain's captain's captain's captain's Captain //With a warm face, the eel always accompanies you. ......
V. Similar Query Optimization
Fuzzy query and regular matching are both to find records that are fully qualified. Another requirement is similar query.
For example, the postgresql string, input p0stgresgl, can also be matched according to similarity.
Example
test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops); CREATE INDEX test=# Explain (analysis, verbose, timing, costs, buffers) SELECT, C1 <->'cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach cockroach as'AS dist FROM test001 t ORDER BY dist LIMIT 5; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1) Output: t.*, ((c1 <-> 'Bromoconaze in the milk of cockroach cockroach'::text)) Buffers: shared hit=1631 -> Index Scan using idx_test001_3 on public.test001 t (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1) Output: t.*, (c1 <-> 'Bromoconaze in the milk of cockroach cockroach'::text) Order By: (t.c1 <-> 'Bromoconaze in the milk of cockroach cockroach'::text) Buffers: shared hit=1631 Planning time: 0.089 ms Execution time: 37.668 ms (9 rows) test=# SELECT, C1 <->'Cockroach gills'AS dist FROM test001 t ORDER BY dist LIMIT 5; t | dist --------------------------------------------+---------- (The gills of the cockroaches tell you that the milk of the cockroaches is bromoconaze.) | 0.307692 (Coral peas, cups, cups and lead batteries are in full swing.) | 0.976744 (He intimidated and praised the toughness of the cuttlefish.) | 0.976744 (There is a great deal of anxiety in the world.) | 0.976744 (The shield of the Yuanxia Gorge bears the burden of the valley, the pulse, and the parochial silence.) | 0.976744 (5 rows)
Summary
1. If there are only pre-Ambiguous query requirements, use collate "C" b-tree index.
2. If there are only post-ambiguous query requirements, use the b-tree index of the reverse() expression of collate "C".
3. If you have ambiguous query requirements and include Chinese, please use lc_collate, lc_ctype <> "C" database, and use gin index of pg_trgm plug-in.
4. If you have ambiguous query requirements and do not include Chinese, please use gin index of pg_trgm plug-in.
5. If there is a query requirement for regular expressions containing ascii characters, use the gin index of the pg_trgm plug-in.
6. If there is a need for fuzzy query with input conditions less than 3 characters, we can use GIN expression index to search by array inclusion, and the performance is as good as before.
7. Performance
There are 100 million records, 15 random Chinese characters per record, a total of 8 GB. Fuzzy query performance before and after testing.
1. Generating test data
vi test.sql insert into test001 select gen_hanzi(15) from generate_series(1,2500000); pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test test=# select count(*) from test001; count ----------- 100000000 (1 row) test=# select * from test001 limit 10; c1 -------------------------------- The flatfish are very choosy. The winding line of the town's dike is rolling and winding. There is a great deal of interest in the world. It's very beautiful and graceful. To be open-minded, to be strong, to be jealous Coconut coconut bun Ba Weng is one of the most important people in the world. Lofty and elegant dressing It's a gorgeous cuckoo. A sudden wound on the cake and a round rope. (10 rows)
2. Creating Index
test=# set maintenance_work_mem ='32GB'; test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
3. Fuzzy Query Performance Test
3.1 Pre-ambiguity
Response time: 9 ms
Return to line 4701
select * from test001 where c1 like 'you%'; test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'you%'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ 'you%'::text) Rows Removed by Index Recheck: 85 Heap Blocks: exact=4776 Buffers: shared hit=4784 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1) Index Cond: (test001.c1 ~~ 'you%'::text) Buffers: shared hit=8 Planning time: 0.099 ms Execution time: 9.166 ms (11 rows)
3.2 Post-ambiguity
Response time: 0.25 milliseconds
Return 2 rows
select * from test001 where c1 like '%Chinese style'; test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'% bow'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%Chinese style'::text) Rows Removed by Index Recheck: 87 Heap Blocks: exact=89 Buffers: shared hit=94 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1) Index Cond: (test001.c1 ~~ '%Chinese style'::text) Buffers: shared hit=5 Planning time: 0.113 ms Execution time: 0.249 ms (11 rows)
3.3 Ambiguity
Response time: 0.2 milliseconds
Return one line
select * from test001 where c1 like '%Cockroach%'; test=# Explain (analysis, verbose, timing, costs, buffers) select * from test001 where C1 like'% cockroach'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%Cockroach%'::text) Rows Removed by Index Recheck: 81 Heap Blocks: exact=82 Buffers: shared hit=87 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1) Index Cond: (test001.c1 ~~ '%Cockroach%'::text) Buffers: shared hit=5 Planning time: 0.112 ms Execution time: 0.201 ms (11 rows)