Massive Data, Smlar-Aliyun RDS PosgreSQL Best Practice

Label

PostgreSQL, Heming Distance, Smlar, GiST Index

background

http://www.cnblogs.com/lushilin/p/6549665.html

Application of SimHash

Through the above steps, we can use SimHash algorithm to generate a vector fingerprint for each web page, so the question arises, how to judge the similarity of two texts?
This is mainly applied to Heming distance.

(1) What is Heming Distance
The number of bits with different values for the corresponding bits of two codewords is called the Hamming distance of the two codewords. In an effective encoding set, the minimum Hamming distance of any two codewords is called the Hamming distance of the encoding set. Examples are as follows: 10101 and 00110 are different from the first, the fourth and the fifth in order, and the Heming distance is 3.

(2) Geometric Significance of Heming Distance
N-bit codewords can be represented by a vertex of a hypercube in n-dimensional space. The Hamming distance between two codewords is an edge between two vertices of a hypercube and the shortest distance between the two vertices.

(3) Application scenarios of Heming distance
Error Detection and Correction for Coding

Fingerprints extracted by SimHash algorithm (Simhash is more suitable for long text 500 words + and short text may deviate greatly, which needs to be tested according to the actual scene). Finally, we use Heming distance to find similarities. In the data given by google's paper, 64-bit signatures can be considered as similar or duplicate when Heming distance is 3. Of course, this is the case. Values are only reference values and may be tested differently for their own applications.

The problem of similarity is basically solved here, but according to this idea, the problem of efficiency is still unsolved under the tens of billions of data, because data is constantly added in, it is impossible for every data to be compared with the database data, according to this idea, the processing speed will be slower and slower, and the linear growth.

For the efficiency of mass data de-duplication, we can divide 64-bit fingerprints into 4 16-bit data blocks. According to the drawer principle, in the case of Hamming distance of 3, if the two documents are similar, then it must have one block of data is equal.

Does the database support such efficient retrieval?

PostgreSQL is supported anyway, through the black technology smlar plug-in.

I. Demand

II. Architectural Design

In PostgreSQL, there are many design methods for searching data with Heming distance less than N from massive data. The energy consumption ratio of each method is different, and readers can choose it on demand.

1 Violence Computation

1. Single computer multi-core parallel computing, violent scanning. It uses the multi-core parallel capability provided by Aliyun RDS PostgreSQL 10 to scan violently.

2. Multi-computer multi-core parallel computing and violent scanning. Using the multi-level parallel computing capabilities provided by Aliyun HybridDB for PostgreSQL, violent scanning.

3. Using GPU and FPGA to speed up violent computing.

PostgreSQL provides an extended interface for computing data using the capabilities of GPU and FPGA.

4. Computing instructions with CPU vectors and computing violence.

PostgreSQL provides an extended interface to speed up the computation of instructions using CPU vectors.

2 Index

Indexing is an efficient method, such as the PostgreSQL smlar plug-in, which has been used in Ali's shopping guide platform for massive similarity queries of real-time shopping guide.

If smlar is to speed up the search of hemming distance, more scientific methods, such as slicing, are needed.

Direct use of location is problematic because the first process of smlar is block-level convergence, while Hamming code is bit64 encoding. In a data block, there are several records, zero and 1 may occur at any location, and any data block contains 0 and 1, so the first filter cannot be completed.

We can use slices to reduce this possibility. For example, one for every 2 BITs, or one for every 4 BITs, or more.

Usually when the Heming distance is greater than 3, there is no correlation.

DEMO and Performance

1 Violence Computation

1. Full scan, parallel scan

Create test tables

create table hm (  
  id int,        -- id  
  hmval bit(64)  -- Hamming HASH  
);

Write 10 million test data

postgres=# insert into hm select id, val::int8::bit(64) from (select id, sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric as val from generate_series(1,10000000) t(id)) t;  
INSERT 0 10000000  
  
postgres=# select * from hm limit 10;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
  1 | 0000101001110110110101010111101011100110101010000111100011110111  
  2 | 0110011100110101101000001010101111010001011101100111111011001110  
  3 | 1010110111001011011110110000111111101101101111010111111100101110  
  4 | 0110011110110000001011000010010000101011100101010100111000101001  
  5 | 0101110100101111010110010110000000101110000010001011010110110000  
  6 | 0011010000100000101011011100000101111110010110111101100001100001  
  7 | 1011110011101101101000011101011101010111011001011010110111101000  
  8 | 1110010011000101001101110010001111110100001101010101111101110010  
  9 | 0110111111110011101001001000101101011011111100010010111010001111  
 10 | 0011100011000010101011010001111000000110100011100100111011011001  
(10 rows)

Setting the Parallel Degree of Violence

postgres=# set force_parallel_mode = on;  
postgres=# set min_parallel_table_scan_size = 0;  
postgres=# set parallel_setup_cost = 0;  
postgres=# set parallel_tuple_cost = 0;  
postgres=# alter table hm set (parallel_workers = 128);  
postgres=# set max_parallel_workers_per_gather = 64;

Parallel query of records with Hamming distance less than 4 takes 463 milliseconds.

postgres=# select * from hm where length(replace(bitxor(bit'0011110001011010110010001011010101001000111110000111110010010110', hmval)::text,'0','')) < 4;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
 16 | 0011110001011010110010001011010101001000111110000111110010010110  
(1 row)  
  
Time: 463.314 ms

It takes 16 seconds to query records whose Heming distance is less than 4.

postgres=# select * from hm where length(replace(bitxor(bit'0011110001011010110010001011010101001000111110000111110010010110', hmval)::text,'0','')) < 4;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
 16 | 0011110001011010110010001011010101001000111110000111110010010110  
(1 row)  
  
Time: 16791.215 ms (00:16.791)

There are more efficient ways to find the different digits of two BIT s. In theory, it can reach less than 100 milliseconds.

https://www.postgresql.org/message-id/flat/ab1ea6540903121110l2a3021d4h6632b206e2419898%40mail.gmail.com#ab1ea6540903121110l2a3021d4h6632b206e2419898@mail.gmail.com

2 Index

Aliyun RDS PostgreSQL provides a smlar plug-in for efficient calculation of array similarity.

We need to convert Heming HASH into arrays. According to the previous design, we use 8 BIT slices to support index queries for values within 8 Heming distances.

Before slicing, verify the filterability after slicing:

postgres=# select relpages from pg_class where relname='hm';  
 relpages   
----------  
    63695  
(1 row)  
  
1,When a single piece is 1, it goes without saying that each block contains.  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,1)='0' group by 1)t;  
 count   
-------  
 63695  
(1 row)  
  
2,When a single slice is 8, nearly half of the blocks contain it.  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,8)='00000000' group by 1)t;  
 count   
-------  
 29100  
(1 row)  
  
3,When a single chip is 16, only more than 100 blocks are included.  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,16)='0000000000000000' group by 1)t;  
 count   
-------  
   160  
(1 row)

Performance verification of 8-slice method

Create plug-ins

create extension smlar;

Create test tables

create table hm1 (id int, hmval bit(64), hmarr text[]);

Generate 10 million test data, generate test data, according to the means of segmentation, record as TEXT array.

insert into hm1   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,8)||',2_'||substring(val,9,8)||',3_'||substring(val,17,8)||',4_'||substring(val,25,8)||',5_'||substring(val,33,8)||',6_'||substring(val,41,8)||',7_'||substring(val,49,8)||',8_'||substring(val,57,8), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm1 limit 10;  
 id |                              hmval                               |                                           hmarr                                             
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------  
  1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}  
  2 | 0001001000010101001100100010101010111001001000000110101101100100 | {1_00010010,2_00010101,3_00110010,4_00101010,5_10111001,6_00100000,7_01101011,8_01100100}  
  3 | 0011111111010100011001001010110110100010101110101001101111010000 | {1_00111111,2_11010100,3_01100100,4_10101101,5_10100010,6_10111010,7_10011011,8_11010000}  
  4 | 1100110010011001001110101110111111111111010000100000010011000010 | {1_11001100,2_10011001,3_00111010,4_11101111,5_11111111,6_01000010,7_00000100,8_11000010}  
  5 | 0011000011010001011111010101010111100110000110000011101100000101 | {1_00110000,2_11010001,3_01111101,4_01010101,5_11100110,6_00011000,7_00111011,8_00000101}  
  6 | 0111101101111110101000010110101101110011011110100100010111011001 | {1_01111011,2_01111110,3_10100001,4_01101011,5_01110011,6_01111010,7_01000101,8_11011001}  
  7 | 0010001011111111100010101011110001001101001011100100011000010000 | {1_00100010,2_11111111,3_10001010,4_10111100,5_01001101,6_00101110,7_01000110,8_00010000}  
  8 | 1110001111100011011110110111101111010101000111000100111111111101 | {1_11100011,2_11100011,3_01111011,4_01111011,5_11010101,6_00011100,7_01001111,8_11111101}  
  9 | 0111110010111000010111001000000101111000000110110110000011101110 | {1_01111100,2_10111000,3_01011100,4_10000001,5_01111000,6_00011011,7_01100000,8_11101110}  
 10 | 0111001101100010001101101111000000100100000000010001010011100101 | {1_01110011,2_01100010,3_00110110,4_11110000,5_00100100,6_00000001,7_00010100,8_11100101}  
(10 rows)

Create smlar index

postgres=# create index idx_hm1 on hm1 using gin(hmarr _text_sml_ops );

The search distance is less than 1 VALUE. The smlar index takes 63 milliseconds.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 7;  
  
select    
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=117.83..420.48 rows=100 width=169) (actual time=62.928..62.929 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[]))  
   Buffers: shared hit=166  
   ->  Bitmap Heap Scan on public.hm1  (cost=117.83..10205.17 rows=3333 width=169) (actual time=62.927..62.927 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
         Recheck Cond: (hm1.hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
         Filter: (length(replace((bitxor(B'0000001110101101100110011000100111100100001100100101101010010011'::"bit", hm1.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=166  
         ->  Bitmap Index Scan on idx_hm1  (cost=0.00..117.00 rows=10000 width=0) (actual time=62.898..62.898 rows=1 loops=1)  
               Index Cond: (hm1.hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
               Buffers: shared hit=165  
 Planning time: 0.147 ms  
 Execution time: 62.975 ms  
(14 rows)  
  
postgres=# select                  
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                           hmarr                                           | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-------  
  1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} |     8  
(1 row)  
Time: 61.227 ms

If we only need to query Hamming distances within 4, we can actually use 16 groupings, or we can use mixed tangent.

Performance Verification of 6-slice Mixed Slicing Method

The cut method is 8,16,8,8,16,8. Support queries within 6 Hamming distances.

create table hm2 (id int, hmval bit(64), hmarr text[]);  
  
insert into hm2   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,8)||',2_'||substring(val,9,16)||',3_'||substring(val,25,8)||',4_'||substring(val,33,8)||',5_'||substring(val,41,16)||',6_'||substring(val,57,8), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm2 limit 10;  
 id |                              hmval                               |                                        hmarr                                          
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------  
  1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}  
  2 | 0111111000101011000111010011011000000010010001111001000111011101 | {1_01111110,2_0010101100011101,3_00110110,4_00000010,5_0100011110010001,6_11011101}  
  3 | 0111111000101111000101011100100000001111011101101100110100000101 | {1_01111110,2_0010111100010101,3_11001000,4_00001111,5_0111011011001101,6_00000101}  
  4 | 0111010101010010100000110001100011110010111000001011000010010010 | {1_01110101,2_0101001010000011,3_00011000,4_11110010,5_1110000010110000,6_10010010}  
  5 | 1111101100110100101111000011001011111110111000100110101001100001 | {1_11111011,2_0011010010111100,3_00110010,4_11111110,5_1110001001101010,6_01100001}  
  6 | 0011110000100010101001000001100010000010111011100010011001000110 | {1_00111100,2_0010001010100100,3_00011000,4_10000010,5_1110111000100110,6_01000110}  
  7 | 0000111111001110100110011110000110001101110111111111111010111001 | {1_00001111,2_1100111010011001,3_11100001,4_10001101,5_1101111111111110,6_10111001}  
  8 | 0110100010010100111100110110000011101110101001001111010101011111 | {1_01101000,2_1001010011110011,3_01100000,4_11101110,5_1010010011110101,6_01011111}  
  9 | 0111001111001100101011001001100100000000111100000110110001000011 | {1_01110011,2_1100110010101100,3_10011001,4_00000000,5_1111000001101100,6_01000011}  
 10 | 1101111101011000111100101010101000100001101100101110100001111000 | {1_11011111,2_0101100011110010,3_10101010,4_00100001,5_1011001011101000,6_01111000}  
(10 rows)  
  
create index idx_hm2 on hm2 using gin(hmarr _text_sml_ops );

The query Hamming distance is less than or equal to 1, increased to 2 milliseconds.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 5;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}')    
  from    
    hm2   
  where    
    hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'      
    and length(replace(bitxor(bit'1100111011000001100100100111111110100011100111111101101001101010', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                        hmarr                                        | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------+-------  
  1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} |     6  
(1 row)  
Time: 1.954 ms  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}')    
  from    
    hm2   
  where    
    hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'      
    and length(replace(bitxor(bit'1100111011000001100100100111111110100011100111111101101001101010', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=103.83..406.06 rows=100 width=153) (actual time=2.414..2.416 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[]))  
   Buffers: shared hit=102  
   ->  Bitmap Heap Scan on public.hm2  (cost=103.83..10177.17 rows=3333 width=153) (actual time=2.414..2.415 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
         Recheck Cond: (hm2.hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
         Filter: (length(replace((bitxor(B'1100111011000001100100100111111110100011100111111101101001101010'::"bit", hm2.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=102  
         ->  Bitmap Index Scan on idx_hm2  (cost=0.00..103.00 rows=10000 width=0) (actual time=2.374..2.374 rows=1 loops=1)  
               Index Cond: (hm2.hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
               Buffers: shared hit=101  
 Planning time: 0.149 ms  
 Execution time: 2.463 ms  
(14 rows)

Performance verification of 4-slice method

create table hm3 (id int, hmval bit(64), hmarr text[]);  
  
insert into hm3   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,16)||',2_'||substring(val,17,16)||',3_'||substring(val,33,16)||',4_'||substring(val,41,16), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm3 limit 10;  
 id |                              hmval                               |                                     hmarr                                       
----+------------------------------------------------------------------+-------------------------------------------------------------------------------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}  
  2 | 1101011000010000000000000000111011011111011101110100000010101011 | {1_1101011000010000,2_0000000000001110,3_1101111101110111,4_0111011101000000}  
  3 | 0101000010110110110010001010100010101001001010111111011000110011 | {1_0101000010110110,2_1100100010101000,3_1010100100101011,4_0010101111110110}  
  4 | 0111000111100011111000100111000011101111110000011110101101000100 | {1_0111000111100011,2_1110001001110000,3_1110111111000001,4_1100000111101011}  
  5 | 0010111010101011111010011110110010011110111111110011101110010011 | {1_0010111010101011,2_1110100111101100,3_1001111011111111,4_1111111100111011}  
  6 | 0110111110011100100110010111010000000011100011000011110001010110 | {1_0110111110011100,2_1001100101110100,3_0000001110001100,4_1000110000111100}  
  7 | 0100110100111001110011011110100111101110101001000101010110110110 | {1_0100110100111001,2_1100110111101001,3_1110111010100100,4_1010010001010101}  
  8 | 0110010111001100111000011011011100001100111111101111011010100010 | {1_0110010111001100,2_1110000110110111,3_0000110011111110,4_1111111011110110}  
  9 | 0110111010110000001010101111000101110000010011100011100101000100 | {1_0110111010110000,2_0010101011110001,3_0111000001001110,4_0100111000111001}  
 10 | 0101101000000110100101100011111111000101110001010011100110101011 | {1_0101101000000110,2_1001011000111111,3_1100010111000101,4_1100010100111001}  
(10 rows)  
  
create index idx_hm3 on hm3 using gin(hmarr _text_sml_ops );

The query Hamming distance is less than or equal to 1, increased to 0.2 milliseconds.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 3;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                     hmarr                                     | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} |     4  
(1 row)  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=99.83..401.19 rows=100 width=134) (actual time=0.169..0.170 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[]))  
   Buffers: shared hit=14  
   ->  Bitmap Heap Scan on public.hm3  (cost=99.83..10144.17 rows=3333 width=134) (actual time=0.168..0.169 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
         Recheck Cond: (hm3.hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
         Filter: (length(replace((bitxor(B'0101011111111010000001001011101101100011111101111101101100000011'::"bit", hm3.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=14  
         ->  Bitmap Index Scan on idx_hm3  (cost=0.00..99.00 rows=10000 width=0) (actual time=0.145..0.145 rows=1 loops=1)  
               Index Cond: (hm3.hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
               Buffers: shared hit=13  
 Planning time: 0.101 ms  
 Execution time: 0.200 ms  
(14 rows)

Query Hamming distance less than or equal to 4, still returned in milliseconds.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 0;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 5  
  limit 100;  
 id |                              hmval                               |                                     hmarr                                     | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} |     4  
(1 row)  
Time: 6.983 ms

No index, 23 seconds.

postgres=# select * from hm3 where length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 5;  
 id |                              hmval                               |                                     hmarr                                       
----+------------------------------------------------------------------+-------------------------------------------------------------------------------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}  
(1 row)  
  
Time: 22954.686 ms

Performance has been improved from 23 seconds to 0.2 milliseconds compared to the absence of indexes. It has increased by 114,800 times.

automatic segmentation

Some people will say, how to automatically generate simhash shredded arrays?

Don't worry, PostgreSQL provides a powerful UDF function, which can set up UDF and TRIGGER, and automatically generate shredded arrays when writing data.

Example

create table hm4 (id int, hmval bit(64), hmarr text[]);  

create or replace function sp(val bit(64)) returns text[] as $$
select regexp_split_to_array('1_'||substring(val::text,1,10)||',2_'||substring(val::text,11,10)||',3_'||substring(val::text,21,10)||',4_'||substring(val::text,31,10)||',5_'||substring(val::text,41,10)||',6_'||substring(val::text,51,14), ',') ;            
$$ language sql strict;

postgres=# select sp(123::bit(64));
                                         sp                                          
-------------------------------------------------------------------------------------
 {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}
(1 row)

-- Write 100 million records

insert into hm4  
select   
  id,   
  val::bit(64),   
  sp(val::bit(64))    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,100000000) t(id)) t;  

-- Indexes

create index idx_hm4 on hm4 using gin(hmarr _text_sml_ops );  

-- Searching for records with Hamming distance less than or equal to 1, performance bar

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 5; 

select    
    *,    
    smlar( hmarr, sp(123::bit(64)))     -- Queries and 123::bit(64)Records with Heming Distance Less than 2
  from    
    hm3  
  where    
    hmarr % sp(123::bit(64))      -- Queries and 123::bit(64)Records with Heming Distance Less than 2
    and length(replace(bitxor(123::bit(64), hmval)::text,'0','')) < 2      -- Queries and 123::bit(64)Records with Heming Distance Less than 2
  limit 100;

Create a trigger to write to simhash automatically to the cut-off group

create or replace function tg() returns trigger as $$
declare
begin
  NEW.hmarr := sp(NEW.hmval);
  return NEW;
end;
$$ language plpgsql strict;

postgres=# create trigger tg before insert or update on hm4 for each row execute procedure tg();
CREATE TRIGGER

-- The effect is very good.

postgres=# truncate hm4;
TRUNCATE TABLE
postgres=# insert into hm4 values (1,1::bit(64));
INSERT 0 1
postgres=# select * from hm4;
 id |                              hmval                               |                                        hmarr                                        
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------
  1 | 0000000000000000000000000000000000000000000000000000000000000001 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000000000001}
(1 row)

postgres=# update hm4 set hmval=123456::bit(64);
UPDATE 1
postgres=# select * from hm4;
 id |                              hmval                               |                                        hmarr                                        
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------
  1 | 0000000000000000000000000000000000000000000000011110001001000000 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000111,6_10001001000000}
(1 row)

Just give me a compliment.

IV. Technical Points

1. Aliyun RDS PostgreSQL smlar plug-in, using GIN index, block-level convergence, double filtering. The response speed is 0.2 milliseconds. In 10 million data, records within 2 Hamming distances are retrieved.

2. Aliyun RDS PostgreSQL 10 uses multi-core parallel, violent scanning and 10 million data to retrieve data with a distance of less than N, about 450 milliseconds.

3. Aliyun HybridDB for PostgreSQL, which uses parallel computers, extends computing power horizontally, and can also achieve violent scanning. The query efficiency is calculated according to the actual number of nodes.

V. Cloud end products

Aliyun RDS PostgreSQL

Aliyun HybridDB for PostgreSQL

6. Similar scenarios and cases

E-commerce Content Reduplication Content Screening Application (Real-time Recognition Reprint Piracy Infringement?) - Optimizing and Indexing Technology for Text, Picture Set, Commodity Set, Array Similarity Decision)

"Building Real-time User Portrait Recommendation System Based on Aliyun RDS PostgreSQL"

Summary

Similar text is retrieved from tens of millions of simhash data using ALIYUN RDS PostgreSQL's SMAR plug-in, and the performance is improved from 23 seconds to 0.2 milliseconds when there is no index. It has increased by 114,800 times.

It's not unexpected to be unhappy.