PgSQL. Application Case. HDB for PG Features (Data Diskette and High Efficiency Filtering for Arbitrary Columns)

background

The data also have eight characters of birthdays, do you believe it? How do elements and columns, rows and rows interrelate? Is the query slow? Don't believe that everything is destined by heaven. How to change the data? After reading this article, you can do it.

One talent, nine efforts. Fate is destined. The destiny is decided by heaven. It was also said that my fate was beyond my control. It seems that the Chinese ancients had a thorough study of the things congenitally destined, and had a good view of them. However, they also made up for their congenital deficiencies by means of acquired efforts, or by means of transportation reform.

In fact, when I was preparing to write this article, I found that the arrangement of data in the database, the storage of data and the traditional culture related to Chinese numerology are very similar. There are also congenital factors and claims of acquired remedies.

What's the matter? Listen to me carefully.

In order to speed up the efficiency of data retrieval, we usually need to create an index to improve the accuracy of data location. For example, to query someone's monthly call streaming data, without an index, we need to search all the data and match them one by one. By indexing, you can directly locate the records that need to be queried.

Especially when storage and computation are separated, the larger the amount of search, the larger the amount of data transmitted in the network. The bottleneck is obvious.

In addition, in the field of OLAP, a large amount of data needs to be processed. If all the data are indexed, the cost of index introduction is still considerable.

So is there any other way to reduce the amount of scans without indexing?

Storage layer statistics and filtering push-downs

I believe you must have thought that statistics, yes, we can store data by block statistics, such as the data range within each block.

There are several very common technical implementations:

1. PostgreSQL BRIN Index.

PostgreSQL Internet of Things Black Technology - Index of Several Hundreds of Times of Weight Loss (BRIN index)

https://www.postgresql.org/docs/10/static/brin-intro.html

PostgreSQL brin index is a block-level index, which records the statistics of each block or batch of consecutive blocks.

When searching by this column, irrelevant blocks are filtered through metadata.

2. cstore_fdw column storage plug-in. In fact, it is also stored in columns arranged by BATCH, and the metadata (maximum and minimum) of each BATCH can be used for filtering during scanning.

https://github.com/citusdata/cstore_fdw

Skip indexes: Stores min/max statistics for row groups, and uses them to skip over unrelated rows.

Using Skip Indexes

cstore_fdw partitions each column into multiple blocks. Skip indexes store minimum and maximum values for each of these blocks. While scanning the table, if min/max values of the block contradict the WHERE clause, then the block is completely skipped. This way, the query processes less data and hence finishes faster.

To use skip indexes more efficiently, you should load the data after sorting it on a column that is commonly used in the WHERE clause. This ensures that there is a minimum overlap between blocks and the chance of them being skipped is higher.

In practice, the data generally has an inherent dimension (for example a time field) on which it is naturally sorted. Usually, the queries also have a filter clause on that column (for example you want to query only the last week's data), and hence you don't need to sort the data in such cases.

When searching by this column, irrelevant blocks are filtered through metadata.

Example

Some 300 GB External appearance, sampling skip index Scanning, accelerating scanning.  
//It takes 103 milliseconds.      
      
explain (analyze,verbose,timing,costs,buffers) select c400,sum(c2) from ft_tbl1 where c400=1 group by c400;      
      
         Filter: (ft_tbl1.c400 = 1)      
         Rows Removed by Filter: 89996        
         CStore File: /data01/digoal/pg_root1921/cstore_fdw/13146/41038      
         CStore File Size: 325166400400      
         Buffers: shared hit=8004      
 Planning time: 52.524 ms      
 Execution time: 103.555 ms      
(13 rows)      
      
//Do not use where c400=1,  
//89 seconds      
explain (analyze,verbose,timing,costs,buffers) select c400,sum(c2) from ft_tbl1  group by c400;      
      
         CStore File: /data01/digoal/pg_root1921/cstore_fdw/13146/41038      
         CStore File Size: 325166400400      
         Buffers: shared hit=8004      
 Planning time: 52.691 ms      
 Execution time: 89428.721 ms

It should be mentioned that the plug-in cstore_fdw does not support parallel computing at present. In fact, the FDW interface of PostgreSQL already supports parallel computing. cstore_fdw can support parallel computing with only a modification.

as follows

https://www.postgresql.org/docs/10/static/fdw-callbacks.html#fdw-callbacks-parallel

Filtration efficiency and linear correlation

Note that not all columns are well filtered because of data storage. For instance:

Writing of a column is very random, resulting in a random distribution of values, so the data range contained in a data block may be relatively large, and the storage element information filtering of this column is very poor.

create table a(id int, c1 int);      
insert into a select generate_series(1,1000000), random()*1000000;

The data are distributed as follows

postgres=# select substring(ctid::text, '(\d+),')::int8 blkid, min(c1) min_c1, max(c1) max_c1, min(id) min_id, max(id) max_id from a group by 1 order by 1;      
 blkid | min_c1 | max_c1 | min_id | max_id        
-------+--------+--------+--------+---------      
     0 |   2697 | 998322 |      1 |     909      
     1 |   1065 | 998817 |    910 |    1818      
     2 |    250 | 998025 |   1819 |    2727      
     3 |     62 | 997316 |   2728 |    3636      
     4 |   1556 | 998640 |   3637 |    4545      
     5 |    999 | 999536 |   4546 |    5454      
     6 |   1385 | 999196 |   5455 |    6363      
     7 |   1809 | 999042 |   6364 |    7272      
     8 |   3044 | 999606 |   7273 |    8181      
     9 |   1719 | 999186 |   8182 |    9090      
    10 |    618 | 997031 |   9091 |    9999      
    11 |     80 | 997581 |  10000 |   10908      
    12 |    781 | 997710 |  10909 |   11817      
    13 |   1539 | 998857 |  11818 |   12726      
    14 |   2097 | 999932 |  12727 |   13635      
    15 |    114 | 999913 |  13636 |   14544      
    16 |    136 | 999746 |  14545 |   15453      
    17 |   2047 | 997439 |  15454 |   16362      
    18 |   1955 | 996937 |  16363 |   17271      
    19 |   1487 | 999705 |  17272 |   18180      
    20 |     97 | 999549 |  18181 |   19089      
    21 |    375 | 999161 |  19090 |   19998      
    22 |    645 | 994457 |  19999 |   20907      
    23 |   4468 | 998612 |  20908 |   21816      
    24 |    865 | 996342 |  21817 |   22725      
    25 |    402 | 998151 |  22726 |   23634      
    26 |    429 | 998823 |  23635 |   24543      
    27 |   1305 | 999521 |  24544 |   25452      
    28 |    974 | 998874 |  25453 |   26361      
    29 |   1056 | 999271 |  26362 |   27270      
. . . . . .

For ID columns, the distribution is very clear (with good linear correlation) and the metadata storage is filtered well. The C1 column has very scattered distribution and poor filterability for storing metadata.

For example, I want to look up the data with id=10000, look up block 11 directly, and skip the scanning of other data blocks.

If I want to look up c1=10000 data, I need to look up many data blocks, because there are few data blocks that can be skipped.

How to Improve the Filtration-Storage Arrangement of Each Column

For a single column, the method to improve filtering is very simple, and can be stored in sequence.

For example, in the previous test table, we need to improve the filter performance of C1, which can be achieved by rearranging it according to C1.

After rearrangement, the correlation between C1 column and physical storage (line number) becomes 1 or - 1, that is, linear correlation, so filtering is particularly good.

postgres=# create temp table tmp_a (like a);      
CREATE TABLE      
postgres=# insert into tmp_a select * from a order by c1;      
INSERT 0 1000000      
postgres=# truncate a;      
TRUNCATE TABLE      
postgres=# insert into a select * from tmp_a;      
INSERT 0 1000000      
postgres=# end;      
COMMIT      
postgres=# select substring(ctid::text, '(\d+),')::int8 blkid, min(c1) min_c1, max(c1) max_c1, min(id) min_id, max(id) max_id from a group by 1 order by 1;      
 blkid | min_c1 | max_c1 | min_id | max_id        
-------+--------+--------+--------+---------      
     0 |      0 |    923 |   2462 |  999519      
     1 |    923 |   1846 |   1487 |  997619      
     2 |   1847 |   2739 |    710 |  999912      
     3 |   2741 |   3657 |   1930 |  999053      
     4 |   3658 |   4577 |   1635 |  999579      
     5 |   4577 |   5449 |    852 |  999335      
     6 |   5450 |   6410 |    737 |  998277      
     7 |   6414 |   7310 |   3262 |  999024      
     8 |   7310 |   8245 |    927 |  997907      
     9 |   8246 |   9146 |    441 |  999209      
    10 |   9146 |  10015 |    617 |  999828      
    11 |  10016 |  10920 |   1226 |  998264      
    12 |  10923 |  11859 |   1512 |  997404      
    13 |  11862 |  12846 |    151 |  998737      
    14 |  12847 |  13737 |   1007 |  999250      
. . . . . .       
      
c1 Linear correlation between column and physical storage (line number)      
postgres=# select correlation from pg_stats where tablename='a' and attname='c1';      
 correlation       
-------------      
           1      
(1 row)

Unfortunately, with this arrangement, the filterability of the ID field becomes worse.

Why is that?

Relative linear correlation of two columns in global/full table

In fact, the correlation between ID and C1 columns controls the problem of ID columns becoming discrete after sorting by C1.

What is the correlation between ID and C1?

postgres=# select corr(c1,id) from (select row_number() over(order by c1) c1, row_number() over(order by id) id from a) t;      
         corr                
-----------------------      
 -0.000695987373950136      
(1 row)

The poor global (full table) correlation between c1 and id leads to this problem.

(It can be understood that the eight-character inconsistency between the two fields)

Relative linear correlation between two columns of local/partial records

If the whole table is sorted by C1 or ID, the dispersion of the other column becomes very high.

However, in some cases, there may be situations where some records have good correlation between fields A and B, while others have poor correlation.

Example

On the basis of previous records, insert a batch of records.

postgres=# insert into a select id, id*2 from generate_series(1,100000) t(id);      
INSERT 0 100000

This part of the data id, the correlation of the C1 field is 1. (Local correlation)

postgres=# select ctid from a offset 1000000 limit 1;      
    ctid          
------------      
 (1113,877)      
(1 row)      
      
postgres=# select corr(c1,id) from (select row_number() over(order by c1) c1, row_number() over(order by id) id from a where ctid >'(1113,877)') t;      
 corr       
------      
    1      
(1 row)

Global relevance has also improved a lot.

postgres=# select corr(c1,id) from (select row_number() over(order by c1) c1, row_number() over(order by id) id from a) t;      
       corr              
-------------------      
 0.182542794451908      
(1 row)

Partial Demand Change Method

Data scattered storage brings problems: even access to a small amount of data will cause a large number of IO reads, the principle is as follows:

"Statistical Principles and Solutions behind the Scaling-up of heap scan IO by Index Sequential Scanning"

Data storage is destined by heaven (it's decided when we write), but we can change our order on demand. For example, one business is the operator's call pipeline, and the query requirement is usually one month's call pipeline based on a certain mobile phone number. In fact, the data is written to the database immediately when it is generated, so the storage is scattered. Query time consumes a lot of IO.

Example

User call data is written in real time, and user data is Brownian distribution.

create table phone_list(phone_from char(11), phone_to char(11), crt_time timestamp, duration interval);  
create index idx_phone_list on phone_list(phone_from, crt_time);  
  
insert into phone_list   
select   
  lpad((random()*1000)::int8::text, 11, '1'),   
  lpad((random()*1000)::int8::text, 11, '1'),   
  now()+(id||' second')::interval,  
  ((random()*1000)::int||' second')::interval  
from generate_series(1,10000000) t(id);  
  
postgres=# select * from phone_list limit 10;  
 phone_from  |  phone_to   |          crt_time          | duration   
-------------+-------------+----------------------------+----------  
 14588832692 | 11739044013 | 2017-08-11 10:17:04.752157 | 00:03:25  
 15612918106 | 11808103578 | 2017-08-11 10:17:05.752157 | 00:11:33  
 14215811756 | 15983559210 | 2017-08-11 10:17:06.752157 | 00:08:05  
 13735246090 | 15398474974 | 2017-08-11 10:17:07.752157 | 00:13:18  
 19445131039 | 17771201972 | 2017-08-11 10:17:08.752157 | 00:00:10  
 11636458384 | 16356298444 | 2017-08-11 10:17:09.752157 | 00:06:30  
 15771059012 | 14717265381 | 2017-08-11 10:17:10.752157 | 00:13:45  
 19361008150 | 14468133189 | 2017-08-11 10:17:11.752157 | 00:05:58  
 13424293799 | 16589177297 | 2017-08-11 10:17:12.752157 | 00:16:29  
 12243665890 | 13538149386 | 2017-08-11 10:17:13.752157 | 00:16:03  
(10 rows)

Inquiries are inefficient. It takes 26 milliseconds to retrieve 29937 calls from mobile phones.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from phone_list where phone_from='11111111111' order by crt_time;  
                                                                   QUERY PLAN                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_phone_list on public.phone_list  (cost=0.56..31443.03 rows=36667 width=48) (actual time=0.016..24.348 rows=29937 loops=1)  
   Output: phone_from, phone_to, crt_time, duration  
   Index Cond: (phone_list.phone_from = '11111111111'::bpchar)  
   Buffers: shared hit=25843  
 Planning time: 0.082 ms  
 Execution time: 25.821 ms  
(6 rows)

Renovation method, local adjustment on demand.

The requirement is to efficiently query the phone and monthly call details, so we need to rearrange the user's monthly data (usually partitioned monthly).

See the usage of the partition table: PostgreSQL 10.0 preview Enhancement - Built-in Partition Table

postgres=# cluster phone_list using idx_phone_list ;

The efficiency of query has been improved dramatically and the change of order has been successful.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from phone_list where phone_from='11111111111' order by crt_time;  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_phone_list on public.phone_list  (cost=0.56..31443.03 rows=36667 width=48) (actual time=0.012..4.590 rows=29937 loops=1)  
   Output: phone_from, phone_to, crt_time, duration  
   Index Cond: (phone_list.phone_from = '11111111111'::bpchar)  
   Buffers: shared hit=432  
 Planning time: 0.038 ms  
 Execution time: 5.968 ms  
(6 rows)

You are the hand of God, and the fate of data is in your hands.

How to Improve the Filtration-Storage Arrangement of Each Column

In order to achieve the best filtering performance (each column can be filtered well), global sorting can not meet the requirements.

In fact, you need to sort locally, for example, in the previous example, the first 1 million rows are sorted by C1, and the last 100,000 rows are sorted by ID.

In this way, the ID with 100,000 records is very filterable, and C1 with 1.1 million records is also very filterable.

But numbers are all logical, as if a person's name is also divided into five squares.

Through the remedy acquired, we can change our transportation. Like data arrangement, data rearrangement can affect global filtering and local filtering. Is it interesting?

According to your query target needs, rearrange the data, together to change the transport bar.

Relative Linear Dependence of Multiple Columns in Compound Sorting

How can multiple columns achieve good aggregation for each column?

1. The most sophisticated method, multi-column sorting, but the effect is not necessarily good. In order to achieve better results, we need to adjust the order of columns. The algorithm is as follows:

I remember writing a document like this before:

A Simple Algorithms Can Help Internet of Things, Financial Users Save 98% of Data Storage Cost (PostgreSQL,Greenplum Help You Do it)

In fact, it is also the essence of storage arrangement. By arranging and combining, we can calculate the linear correlation of each two columns. According to this, we can find out the best multi-column sorting combination, so as to improve the overall correlation (improve the compression ratio).

The same applies to improving the filterability of all columns mentioned in this article.

2. k-means algorithm, for multi-column aggregation calculation, completes the best local distribution, so that it can achieve the filtering of each column is very commendable.

K-Means Data Aggregation Algorithms

quintessence

1. Relative correlation between local and global columns. Determines the dispersion of another column after sorting by one column.

2. The purpose of orchestration is to allow as many columns as possible to be stored in an orderly manner so that the most rows can be filtered.

3. Global correlation determines the dispersion of another column when sorted by one column.

4. Local correlation determines the linear correlation of two columns in some records.

5. Arranged by local correlation, more columns can be stored in order as much as possible, so that the most rows can be filtered. However, the algorithm is more complex, which rows need to be calculated together and stored in what order to obtain the best filtering performance.

6. Regarding the arrangement of multi-column (or array) data, method 1 calculates the linear correlation of each two columns (elements) through arrangement and combination, and finds out the best combination of multi-column sorting according to this, so as to improve the overall correlation (increase the compression ratio).

7. After arrangement, if the column with poor linear correlation to storage (line number) has better selectivity (more DISTINCT VALUE) and the service has the need to filter data, it is suggested that index should be built.

8. Regarding the data arrangement of multiple columns (or arrays), method 2 calculates the type of data by kmean, and stores each type aggregately, so as to improve the local aggregation and filtering of data. This method is the most elegant.

9. By choreographing and combining with the BRIN index of PG, the efficient filtering of any column can be realized.

Case Study of Data Renewal

1. Multi-column Change of Order

Low-level methods, A Simple Algorithms Can Help Internet of Things, Financial Users Save 98% of Data Storage Cost (PostgreSQL,Greenplum Help You Do it)

Advanced methods, K-Means Data Aggregation Algorithms

Examples of advanced methods

-- Write 100 million records
-- Heaven's fate, scattered, five elements disorder, inefficient inquiry. 
postgres=# create table tab(c1 int, c2 int, c3 int, c4 int, c5 int);
CREATE TABLE
postgres=# insert into tab select * from (select id,100000000-id,50000000-id, sqrt(id*2), sqrt(id) from generate_series(1,100000000) t(id)) t order by random();
INSERT 0 100000000
postgres=# select ctid,* from tab limit 10;
  ctid  |    c1    |    c2    |    c3     |  c4   |  c5  
--------+----------+----------+-----------+-------+------
 (0,1)  | 76120710 | 23879290 | -26120710 | 12339 | 8725
 (0,2)  | 98295593 |  1704407 | -48295593 | 14021 | 9914
 (0,3)  | 56133647 | 43866353 |  -6133647 | 10596 | 7492
 (0,4)  |   787639 | 99212361 |  49212361 |  1255 |  887
 (0,5)  | 89844299 | 10155701 | -39844299 | 13405 | 9479
 (0,6)  | 92618459 |  7381541 | -42618459 | 13610 | 9624
 (0,7)  | 93340303 |  6659697 | -43340303 | 13663 | 9661
 (0,8)  | 17164665 | 82835335 |  32835335 |  5859 | 4143
 (0,9)  |  2694394 | 97305606 |  47305606 |  2321 | 1641
 (0,10) | 41736122 | 58263878 |   8263878 |  9136 | 6460
(10 rows)
  
-- Change your life, press K-MEAN Aggregate and adjust five elements, adopt BRIN Indexing achieves efficient filtering of any column.
-- Keep each column consistent in all directions, for example(a,b) (1,100)(2,101), (100,9)(105,15)，If categorized into two categories, in filtering A Field selection is good, filtering B Selectivity is also good for fields.  
postgres=# create table tbl1(like tab);
CREATE TABLE

-- Because data is stored in blocks, BRIN The index has the smallest granularity as blocks, so the maximum number of clusters can be the number of blocks in the table. For example, 636943 data blocks can be categorized as 636943.
-- Categorizing more than 636943 is meaningless, and categorizing less is possible, for example BRIN The index stores one meta-information per 10 consecutive data blocks, so we can choose to classify it into 63694 categories.  
postgres=# select relpages from pg_class where relname='tab';
 relpages 
----------
    636943
(1 row)
postgres=# insert into tbl1 select c1,c2,c3,c4,c5 from (select kmeans(array[c1,c2,c3,c4,c5],63694) over() km, * from tab) t order by km;
  
-- Create any column BRIN Indexes
create index idx_tab_1 on tab using brin(c1,c2,c3) with (pages_per_range=1);
create index idx_tbl1_1 on tbl1 using brin(c1,c2,c3) with (pages_per_range=1);

Using BRIN index, after changing the data, search any column range, improve efficiency, praise

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tab where c1 between 1 and 100000;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tab  (cost=4184.33..906532.40 rows=83439 width=20) (actual time=165.626..1582.402 rows=100000 loops=1)
   Output: c1, c2, c3, c4, c5
   Recheck Cond: ((tab.c1 >= 1) AND (tab.c1 <= 100000))
   Rows Removed by Index Recheck: 14427159
   Heap Blocks: lossy=92530
   Buffers: shared hit=96745
   ->  Bitmap Index Scan on idx_tab_1  (cost=0.00..4163.47 rows=17693671 width=0) (actual time=165.307..165.307 rows=925300 loops=1)
         Index Cond: ((tab.c1 >= 1) AND (tab.c1 <= 100000))
         Buffers: shared hit=4215
 Planning time: 0.088 ms
 Execution time: 1588.852 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl1 where c1 between 1 and 100000;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbl1  (cost=4159.34..111242.78 rows=95550 width=20) (actual time=157.084..169.314 rows=100000 loops=1)
   Output: c1, c2, c3, c4, c5
   Recheck Cond: ((tbl1.c1 >= 1) AND (tbl1.c1 <= 100000))
   Rows Removed by Index Recheck: 9
   Heap Blocks: lossy=637
   Buffers: shared hit=4852
   ->  Bitmap Index Scan on idx_tbl1_1  (cost=0.00..4135.45 rows=95613 width=0) (actual time=157.074..157.074 rows=6370 loops=1)
         Index Cond: ((tbl1.c1 >= 1) AND (tbl1.c1 <= 100000))
         Buffers: shared hit=4215
 Planning time: 0.083 ms
 Execution time: 174.069 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tab where c2 between 1 and 100000;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tab  (cost=4183.50..902041.63 rows=82011 width=20) (actual time=165.901..1636.587 rows=100000 loops=1)
   Output: c1, c2, c3, c4, c5
   Recheck Cond: ((tab.c2 >= 1) AND (tab.c2 <= 100000))
   Rows Removed by Index Recheck: 14446835
   Heap Blocks: lossy=92655
   Buffers: shared hit=96870
   ->  Bitmap Index Scan on idx_tab_1  (cost=0.00..4163.00 rows=17394342 width=0) (actual time=165.574..165.574 rows=926550 loops=1)
         Index Cond: ((tab.c2 >= 1) AND (tab.c2 <= 100000))
         Buffers: shared hit=4215
 Planning time: 0.087 ms
 Execution time: 1643.089 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl1 where c2 between 1 and 100000;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbl1  (cost=4156.97..101777.70 rows=86127 width=20) (actual time=157.245..169.934 rows=100000 loops=1)
   Output: c1, c2, c3, c4, c5
   Recheck Cond: ((tbl1.c2 >= 1) AND (tbl1.c2 <= 100000))
   Rows Removed by Index Recheck: 115
   Heap Blocks: lossy=638
   Buffers: shared hit=4853
   ->  Bitmap Index Scan on idx_tbl1_1  (cost=0.00..4135.44 rows=86193 width=0) (actual time=157.227..157.227 rows=6380 loops=1)
         Index Cond: ((tbl1.c2 >= 1) AND (tbl1.c2 <= 100000))
         Buffers: shared hit=4215
 Planning time: 0.084 ms
 Execution time: 174.692 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tab where c3 between 1 and 10000;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tab  (cost=4141.01..672014.67 rows=9697 width=20) (actual time=191.075..10765.038 rows=10000 loops=1)
   Output: c1, c2, c3, c4, c5
   Recheck Cond: ((tab.c3 >= 1) AND (tab.c3 <= 10000))
   Rows Removed by Index Recheck: 99990000
   Heap Blocks: lossy=636943
   Buffers: shared hit=641158
   ->  Bitmap Index Scan on idx_tab_1  (cost=0.00..4138.58 rows=2062044 width=0) (actual time=190.292..190.292 rows=6369430 loops=1)
         Index Cond: ((tab.c3 >= 1) AND (tab.c3 <= 10000))
         Buffers: shared hit=4215
 Planning time: 0.086 ms
 Execution time: 10766.036 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl1 where c3 between 1 and 10000;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbl1  (cost=4137.85..17069.21 rows=10133 width=20) (actual time=150.710..152.040 rows=10000 loops=1)
   Output: c1, c2, c3, c4, c5
   Recheck Cond: ((tbl1.c3 >= 1) AND (tbl1.c3 <= 10000))
   Rows Removed by Index Recheck: 205
   Heap Blocks: lossy=65
   Buffers: shared hit=4280
   ->  Bitmap Index Scan on idx_tbl1_1  (cost=0.00..4135.32 rows=10205 width=0) (actual time=150.692..150.692 rows=650 loops=1)
         Index Cond: ((tbl1.c3 >= 1) AND (tbl1.c3 <= 10000))
         Buffers: shared hit=4215
 Planning time: 0.083 ms
 Execution time: 152.546 ms
(11 rows)

2. Array Renewal

"Index Scan Optimizing - GIN Data Reorganization Optimizing (Aggregated by Elements) Imagine Playing a Multilevel Magic Cube"

From One-Dimensional Arrangement to Multi-Dimensional Arrangement, From Planar Storage to 3D Storage-The Way to Optimize Data Storage

K-Means Data Aggregation Algorithms

3. Spatiotemporal Data Renewal

"Time, Space, Object Multidimensional Attribute Massive Data Arbitrary Multidimensional Efficient Retrieval - Aliyun RDS PostgreSQL Best Practice"

4. Renewal of Securities System

"PostgreSQL Time Series Best Practices - Stock Exchange System Database Design - Aliyun RDS PostgreSQL Best Practices"

Related technology

1. Column Storage Plug-in cstore

https://github.com/citusdata/cstore_fdw

https://www.postgresql.org/docs/10/static/fdw-callbacks.html#fdw-callbacks-parallel

2,A Simple Algorithms Can Help Internet of Things, Financial Users Save 98% of Data Storage Cost (PostgreSQL,Greenplum Help You Do it)

3,PostgreSQL Internet of Things Black Technology - Index of Several Hundreds of Times of Weight Loss (BRIN index)

https://www.postgresql.org/docs/10/static/brin-intro.html

4. metascan is a database function developed by Aliyun PostgreSQL Kernel Team, which has been used RDS PostgreSQL and HybridDB for PostgreSQL In the future, it can also be integrated into the storage engine level, pushing the FILTER of data down to the storage layer. According to the query conditions provided by users, the scanning amount of data can be reduced and the efficiency can be improved without indexing.

We have tested that query performance has improved by three to five times (compared with no index). Writing performance is at least twice as good as indexing.

Cloud products

Aliyun RDS PostgreSQL

Aliyun HybridDB for PostgreSQL

Keywords: PostgreSQL Database GreenPlum github

Added by mattclements on Sun, 02 Jun 2019 01:06:38 +0300

Programming VIP

PgSQL. Application Case. HDB for PG Features (Data Diskette and High Efficiency Filtering for Arbitrary Columns)

background

Storage layer statistics and filtering push-downs

Filtration efficiency and linear correlation

How to Improve the Filtration-Storage Arrangement of Each Column

Relative linear correlation of two columns in global/full table

Relative linear correlation between two columns of local/partial records

Partial Demand Change Method

How to Improve the Filtration-Storage Arrangement of Each Column

Relative Linear Dependence of Multiple Columns in Compound Sorting

quintessence

Case Study of Data Renewal

Related technology

Cloud products

Popular Keywords