PostgreSQL 11 one-trillion-row tpcb performance test on Aliyun ECS + ESSD + zfs/lvm2 stripe + block_size=32K

Tags

PostgreSQL, pgbench, tpcb, tpcc, tpch, lvm2, zfs, stripe

Background

Several recent PostgreSQL OLTP and OLAP tests:

PostgreSQL 11 tpcc Test (103,000 tpmC on ECS) - use sysbench-tpcc by Percona-Lab

(TPC-H test SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen

PostgreSQL 10 billion tpcb performance on ECS

PostgreSQL on Aliyun ECS + ESSD - 100 billion tpcb, 1000W tpcc test

Coverage:

1. SF=10, SF=200 TPC-H

2. 1000W TPCC

3. 10 billion TPCB

4. 100 billion TPCB

5. 1 trillion TPCB (about 125 TB in a single table), which is what this article tests

This test uses 16 ESSD cloud disks, with two file systems: ZFS and EXT4.

Environment

1. ECS, CentOS 7.4 x64

2. CPU

lscpu  
  
Architecture:          x86_64  
CPU op-mode(s):        32-bit, 64-bit  
Byte Order:            Little Endian  
CPU(s):                64  
On-line CPU(s) list:   0-63  
Thread(s) per core:    2  
Core(s) per socket:    32  
Socket(s):             1  
NUMA node(s):          1  
Vendor ID:             GenuineIntel  
CPU family:            6  
Model:                 85  
Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz  
Stepping:              4  
CPU MHz:               2499.996  
BogoMIPS:              4999.99  
Hypervisor vendor:     KVM  
Virtualization type:   full  
L1d cache:             32K  
L1i cache:             32K  
L2 cache:              1024K  
L3 cache:              33792K  
NUMA node0 CPU(s):     0-63  
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1  

3. Kernel

uname -a  
Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux  

4. Memory

free -g  
              total        used        free      shared  buff/cache   available  
Mem:            503           2         500           0           0         498  
Swap:             0           0           0  

5. Clock source

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource  

6. Block devices

lsblk  
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT  
vda    253:0    0  200G  0 disk   
└─vda1 253:1    0  200G  0 part /  
vdb    253:16   0   20T  0 disk   
vdc    253:32   0   20T  0 disk   
vdd    253:48   0   20T  0 disk   
vde    253:64   0   20T  0 disk   
vdf    253:80   0   20T  0 disk   
vdg    253:96   0   20T  0 disk   
vdh    253:112  0   20T  0 disk   
vdi    253:128  0   20T  0 disk   
vdj    253:144  0   20T  0 disk   
vdk    253:160  0   20T  0 disk   
vdl    253:176  0   20T  0 disk   
vdm    253:192  0   20T  0 disk   
vdn    253:208  0   20T  0 disk   
vdo    253:224  0   20T  0 disk   
vdp    253:240  0   20T  0 disk   
vdq    253:256  0   20T  0 disk   

Configuring OS parameters of ECS virtual machine

1. Kernel parameters

vi /etc/sysctl.conf    
    
# add by digoal.zhou        
fs.aio-max-nr = 1048576        
fs.file-max = 76724600        
        
# Optional: kernel.core_pattern=/data01/corefiles/core_%e_%u_%t_%s.%p                 
# /data01/corefiles must be created beforehand with 777 permissions. If it is a symlink, the target directory must also be set to 777.        
        
kernel.sem = 4096 2147483647 2147483646 512000            
# Semaphores; view with ipcs -l or -u. Each group of 16 processes needs 17 semaphores.        
        
kernel.shmall = 107374182              
# Upper limit on the total size of all shared memory segments (recommend 80% of memory), in pages.        
kernel.shmmax = 274877906944           
# Maximum size of a single shared memory segment (recommend half of memory), in bytes. Since 9.2, PostgreSQL's use of shared memory has dropped significantly.        
kernel.shmmni = 819200                 
# Total number of shared memory segments that may be created; each PG database cluster needs at least 2.        
        
net.core.netdev_max_backlog = 10000        
net.core.rmem_default = 262144               
# The default setting of the socket receive buffer in bytes.        
net.core.rmem_max = 4194304                  
# The maximum receive socket buffer size in bytes        
net.core.wmem_default = 262144               
# The default setting (in bytes) of the socket send buffer.        
net.core.wmem_max = 4194304                  
# The maximum send socket buffer size in bytes.        
net.core.somaxconn = 4096        
net.ipv4.tcp_max_syn_backlog = 4096        
net.ipv4.tcp_keepalive_intvl = 20        
net.ipv4.tcp_keepalive_probes = 3        
net.ipv4.tcp_keepalive_time = 60        
net.ipv4.tcp_mem = 8388608 12582912 16777216        
net.ipv4.tcp_fin_timeout = 5        
net.ipv4.tcp_synack_retries = 2        
net.ipv4.tcp_syncookies = 1            
# Enable SYN cookies. When the SYN wait queue overflows, cookies are used to defend against small-scale SYN flood attacks.        
net.ipv4.tcp_timestamps = 1            
# Reduce time_wait        
net.ipv4.tcp_tw_recycle = 0            
# If set to 1, fast recycling of TIME-WAIT sockets is enabled, but this can break connections in NAT environments; it is recommended to keep it off on servers.        
net.ipv4.tcp_tw_reuse = 1              
# Open reuse. Allow TIME-WAIT sockets to be reused for new TCP connections        
net.ipv4.tcp_max_tw_buckets = 262144        
net.ipv4.tcp_rmem = 8192 87380 16777216        
net.ipv4.tcp_wmem = 8192 65536 16777216        
        
net.nf_conntrack_max = 1200000        
net.netfilter.nf_conntrack_max = 1200000        
        
vm.dirty_background_bytes = 409600000               
#  When dirty pages reach this many bytes, the background flush process (pdflush or similar) starts writing out dirty pages older than dirty_expire_centisecs/100 seconds.        
#  The default is 10% of memory; on machines with large memory it is recommended to specify the value in bytes directly.        
        
vm.dirty_expire_centisecs = 3000                     
#  Dirty pages older than this are flushed to disk; 3000 means 30 seconds.        
vm.dirty_ratio = 95                                  
#  If background flushing is too slow and dirty pages exceed 95% of memory, user processes that write to disk (fsync, fdatasync, etc.) have to flush dirty pages themselves.        
#  A high value effectively keeps user processes from being forced to flush dirty pages. Very useful when running multiple instances on one machine with cgroups limiting per-instance IOPS.          
        
vm.dirty_writeback_centisecs = 100                    
#  Wake-up interval of the background flush process (pdflush or similar); 100 means 1 second.        
        
vm.swappiness = 0        
#  No swap partition        
        
vm.mmap_min_addr = 65536        
vm.overcommit_memory = 0             
#  Allow a small amount of memory overcommit when allocating. If set to 1, the kernel always assumes there is enough memory; test machines with little memory can use 1.          
        
vm.overcommit_ratio = 90             
#  When overcommit_memory = 2, this ratio is used to calculate the amount of memory that may be allocated.        
vm.swappiness = 0                    
#  Close swap partition        
vm.zone_reclaim_mode = 0             
# Disable NUMA memory reclaim; NUMA can also be disabled in the kernel boot parameters.         
net.ipv4.ip_local_port_range = 40000 65535            
# Range of port numbers automatically assigned for local TCP and UDP connections        
fs.nr_open=20480000        
# The upper limit of file handle allowed to open by a single process        
        
# Note the following parameters        
# vm.extra_free_kbytes = 4096000        
# vm.min_free_kbytes = 2097152  # It is recommended to reserve 1 GB of vm.min_free_kbytes for every 32 GB of memory       
# If it is a small memory machine, the above two values are not recommended.        
# vm.nr_hugepages = 66536            
#  When shared_buffers exceeds 64GB it is recommended to use huge pages; check the huge page size in /proc/meminfo.        
# vm.lowmem_reserve_ratio = 1 1 1        
# For memory larger than 64G, it is recommended to set it, otherwise the default value is 256 256 32.    
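
The settings can be loaded without a reboot; sysctl -p rereads /etc/sysctl.conf:

sysctl -p

# spot-check one value
sysctl kernel.shmmax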

2. Resource constraints

vi /etc/security/limits.conf    
    
# If nofile exceeds 1048576, first raise fs.nr_open via sysctl; only after that takes effect can nofile be set higher.        
        
* soft    nofile  1024000        
* hard    nofile  1024000        
* soft    nproc   unlimited        
* hard    nproc   unlimited        
* soft    core    unlimited        
* hard    core    unlimited        
* soft    memlock unlimited        
* hard    memlock unlimited       
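
These limits apply to new sessions; after logging in again they can be verified with ulimit:

ulimit -n   # open files, should report 1024000
ulimit -u   # max user processes
ulimit -c   # core file size
ulimit -l   # max locked memory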

3. Disable transparent huge pages and use a precise clock source (optional)

vi /etc/rc.local    
    
touch /var/lock/subsys/local    
         
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then        
   echo never > /sys/kernel/mm/transparent_hugepage/enabled        
fi      
  
# Clock  
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource  
  
su - postgres -c "pg_ctl start"      

Deployment of PostgreSQL 11

rpm -ivh https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm    
    
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm     
    
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex  openjade bzip2 git iotop lvm2 perf      
    
yum install -y postgresql11*    

Block device deployment strategy 1 - zfs

zfsonlinux

PostgreSQL on ECS Efficient Continuous Backup Design - By ZFS on Linux

PostgreSQL OLTP on ZFS Performance Optimization

1. zfs yum configuration

yum install -y http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm  

2. kernel-devel corresponding to the current kernel

http://vault.centos.org

uname -a  
Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux  
rpm -ivh http://vault.centos.org/7.4.1708/updates/x86_64/Packages/kernel-devel-3.10.0-693.2.2.el7.x86_64.rpm  

3. Installation of zfs

yum install -y zfs   

Check the logs for errors; normally there are none.

Test availability

modprobe zfs  
  
zfs get -o all  

zfs will be loaded automatically at system startup.

Create zpool

ESSD already keeps three replicas at the storage layer, so there is no need to use ZFS's RAID features.

parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdk mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdl mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdm mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdn mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdo mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdp mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdq mklabel gpt mkpart primary 1MiB 100%FREE  
zpool create zp1 -f -o ashift=13 vdb1 vdc1 vdd1 vde1 vdf1 vdg1 vdh1 vdi1 vdj1 vdk1 vdl1 vdm1 vdn1 vdo1 vdp1 vdq1  
zpool get all zp1  
NAME  PROPERTY                       VALUE                          SOURCE  
zp1   size                           318T                           -  
zp1   capacity                       0%                             -  
zp1   altroot                        -                              default  
zp1   health                         ONLINE                         -  
zp1   guid                           12407519490197584982           -  
zp1   version                        -                              default  
zp1   bootfs                         -                              default  
zp1   delegation                     on                             default  
zp1   autoreplace                    off                            default  
zp1   cachefile                      -                              default  
zp1   failmode                       wait                           default  
zp1   listsnapshots                  off                            default  
zp1   autoexpand                     off                            default  
zp1   dedupditto                     0                              default  
zp1   dedupratio                     1.00x                          -  
zp1   free                           318T                           -  
zp1   allocated                      960K                           -  
zp1   readonly                       off                            -  
zp1   ashift                         13                             local  
zp1   comment                        -                              default  
zp1   expandsize                     -                              -  
zp1   freeing                        0                              -  
zp1   fragmentation                  0%                             -  
zp1   leaked                         0                              -  
zp1   multihost                      off                            default  
zp1   feature@async_destroy          enabled                        local  
zp1   feature@empty_bpobj            enabled                        local  
zp1   feature@lz4_compress           active                         local  
zp1   feature@multi_vdev_crash_dump  enabled                        local  
zp1   feature@spacemap_histogram     active                         local  
zp1   feature@enabled_txg            active                         local  
zp1   feature@hole_birth             active                         local  
zp1   feature@extensible_dataset     active                         local  
zp1   feature@embedded_data          active                         local  
zp1   feature@bookmarks              enabled                        local  
zp1   feature@filesystem_limits      enabled                        local  
zp1   feature@large_blocks           enabled                        local  
zp1   feature@large_dnode            enabled                        local  
zp1   feature@sha512                 enabled                        local  
zp1   feature@skein                  enabled                        local  
zp1   feature@edonr                  enabled                        local  
zp1   feature@userobj_accounting     active                         local  
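
The 16 parted commands above can also be written as a loop over the data disks (a sketch, assuming the same /dev/vdb through /dev/vdq devices):

for dev in /dev/vd{b..q}; do
  parted -a optimal -s $dev mklabel gpt mkpart primary 1MiB 100%FREE
done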

Create zfs

zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off -o primarycache=metadata -o logbias=throughput -o secondarycache=none zp1/data01  
  
  
zfs set canmount=off zp1  

Optimize two parameters

cd /sys/module/zfs/parameters/  
echo 1 > zfs_prefetch_disable  
echo 15 > zfs_arc_shrink_shift   
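
These module parameters do not survive a reboot; one way to persist them (not part of the original setup, but standard modprobe usage) is an options file:

cat > /etc/modprobe.d/zfs.conf <<EOF
options zfs zfs_prefetch_disable=1 zfs_arc_shrink_shift=15
EOF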

View the current parameters

zfs get all zp1/data01  
NAME        PROPERTY              VALUE                  SOURCE  
zp1/data01  type                  filesystem             -  
zp1/data01  creation              Wed Sep 19 10:26 2018  -  
zp1/data01  used                  192K                   -  
zp1/data01  available             308T                   -  
zp1/data01  referenced            192K                   -  
zp1/data01  compressratio         1.00x                  -  
zp1/data01  mounted               yes                    -  
zp1/data01  quota                 none                   default  
zp1/data01  reservation           none                   default  
zp1/data01  recordsize            8K                     local  
zp1/data01  mountpoint            /data01                local  
zp1/data01  sharenfs              off                    default  
zp1/data01  checksum              on                     default  
zp1/data01  compression           off                    default  
zp1/data01  atime                 off                    local  
zp1/data01  devices               on                     default  
zp1/data01  exec                  on                     default  
zp1/data01  setuid                on                     default  
zp1/data01  readonly              off                    default  
zp1/data01  zoned                 off                    default  
zp1/data01  snapdir               hidden                 default  
zp1/data01  aclinherit            restricted             default  
zp1/data01  createtxg             81                     -  
zp1/data01  canmount              on                     default  
zp1/data01  xattr                 on                     default  
zp1/data01  copies                1                      default  
zp1/data01  version               5                      -  
zp1/data01  utf8only              off                    -  
zp1/data01  normalization         none                   -  
zp1/data01  casesensitivity       sensitive              -  
zp1/data01  vscan                 off                    default  
zp1/data01  nbmand                off                    default  
zp1/data01  sharesmb              off                    default  
zp1/data01  refquota              none                   default  
zp1/data01  refreservation        none                   default  
zp1/data01  guid                  3373300831209850945    -  
zp1/data01  primarycache          metadata               local  
zp1/data01  secondarycache        none                   default  
zp1/data01  usedbysnapshots       0B                     -  
zp1/data01  usedbydataset         192K                   -  
zp1/data01  usedbychildren        0B                     -  
zp1/data01  usedbyrefreservation  0B                     -  
zp1/data01  logbias               throughput             local  
zp1/data01  dedup                 off                    default  
zp1/data01  mlslabel              none                   default  
zp1/data01  sync                  standard               default  
zp1/data01  dnodesize             legacy                 default  
zp1/data01  refcompressratio      1.00x                  -  
zp1/data01  written               192K                   -  
zp1/data01  logicalused           76K                    -  
zp1/data01  logicalreferenced     76K                    -  
zp1/data01  volmode               default                default  
zp1/data01  filesystem_limit      none                   default  
zp1/data01  snapshot_limit        none                   default  
zp1/data01  filesystem_count      none                   default  
zp1/data01  snapshot_count        none                   default  
zp1/data01  snapdev               hidden                 default  
zp1/data01  acltype               off                    default  
zp1/data01  context               none                   default  
zp1/data01  fscontext             none                   default  
zp1/data01  defcontext            none                   default  
zp1/data01  rootcontext           none                   default  
zp1/data01  relatime              off                    default  
zp1/data01  redundant_metadata    all                    default  
zp1/data01  overlay               off                    default  

Initialize the database

1. Directory

mkdir /data01/pg11  
  
chown postgres:postgres /data01/pg11  

2. Environment variables

su - postgres  
  
vi .bash_profile  
  
export PS1="$USER@`/bin/hostname -s`-> "        
export PGPORT=1921        
export PGDATA=/data01/pg11/pg_root$PGPORT        
export LANG=en_US.utf8        
export PGHOME=/usr/pgsql-11      
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH        
export DATE=`date +"%Y%m%d%H%M"`      
export PATH=$PGHOME/bin:$PATH:.        
export MANPATH=$PGHOME/share/man:$MANPATH        
export PGHOST=$PGDATA        
export PGUSER=postgres        
export PGDATABASE=postgres        
alias rm='rm -i'        
alias ll='ls -lh'        
unalias vi   

3. Initialization

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024   

Huge page configuration

Because zfs can bypass the file system cache, it is recommended to configure a larger PostgreSQL shared_buffers and use huge pages.

PostgreSQL Huge Page Use Suggestions - Large Memory Host, Example Note

300GB/2MB=153600  
sysctl -w vm.nr_hugepages=159600     
  
echo "vm.nr_hugepages=159600" >> /etc/sysctl.conf  

postgresql.auto.conf

listen_addresses = '0.0.0.0'    
port = 1921    
max_connections = 2000    
superuser_reserved_connections = 3    
unix_socket_directories = '., /var/run/postgresql, /tmp'    
tcp_keepalives_idle = 60    
tcp_keepalives_interval = 10    
tcp_keepalives_count = 10    
huge_pages = on  # Using large page  
shared_buffers = 300GB    
max_prepared_transactions = 2000    
work_mem = 32MB    
maintenance_work_mem = 2GB    
dynamic_shared_memory_type = posix    
vacuum_cost_delay = 0    
bgwriter_delay = 10ms    
bgwriter_lru_maxpages = 1000    
bgwriter_lru_multiplier = 10.0    
effective_io_concurrency = 0    
max_worker_processes = 128    
max_parallel_workers = 64    
max_parallel_maintenance_workers = 64   
max_parallel_workers_per_gather = 0    
parallel_leader_participation = on    
min_parallel_table_scan_size=0  
min_parallel_index_scan_size=0  
parallel_setup_cost=0  
parallel_tuple_cost=0  
wal_level = minimal      
synchronous_commit = off    
full_page_writes=off       #  zfs provides checksums and copy-on-write, so full_page_writes can be turned off. It can also be turned off if the block device guarantees atomic 8K writes.   
wal_writer_delay = 10ms    
checkpoint_timeout = 30min    
max_wal_size = 600GB    
min_wal_size = 150GB    
checkpoint_completion_target = 0.1    
max_wal_senders = 0    
effective_cache_size = 200GB    
log_destination = 'csvlog'    
logging_collector = on    
log_directory = 'log'    
log_filename = 'postgresql-%a.log'    
log_truncate_on_rotation = on    
log_rotation_age = 1d    
log_rotation_size = 0    
log_checkpoints = on     
log_connections = on    
log_disconnections = on    
log_error_verbosity = verbose     
log_line_prefix = '%m [%p] '    
log_timezone = 'PRC'    
log_autovacuum_min_duration = 0    
autovacuum_max_workers = 16    
autovacuum_freeze_max_age = 1200000000    
autovacuum_multixact_freeze_max_age = 1400000000    
autovacuum_vacuum_cost_delay = 0ms    
vacuum_freeze_table_age = 1150000000    
vacuum_multixact_freeze_table_age = 1150000000    
datestyle = 'iso, mdy'    
timezone = 'PRC'    
lc_messages = 'C'    
lc_monetary = 'C'    
lc_numeric = 'C'    
lc_time = 'C'    
default_text_search_config = 'pg_catalog.english'    
jit = off    
cpu_tuple_cost=0.00018884145574257426      
cpu_index_tuple_cost = 0.00433497085216479990      
cpu_operator_cost = 0.00216748542608239995      
seq_page_cost=0.014329      
random_page_cost = 0.016   
  

Restart the database

pg_ctl restart -m fast  
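
Whether shared_buffers actually landed on huge pages can be checked from /proc/meminfo after the restart:

grep Huge /proc/meminfo    # HugePages_Free should drop by roughly shared_buffers / 2MB once the server is up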

zfs on linux performance issues

In this case vacuum is very slow (the problem can be reproduced with pgbench -i -s 1000).

The vacuum process is in D state; its kernel stack is as follows:

[<ffffffffc0174132>] cv_wait_common+0xb2/0x150 [spl]
[<ffffffffc0174208>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffc073c42b>] zio_wait+0x10b/0x1b0 [zfs]
[<ffffffffc0687124>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
[<ffffffffc06885f2>] dmu_read_uio_dnode+0x52/0x100 [zfs]
[<ffffffffc06886ec>] dmu_read_uio_dbuf+0x4c/0x70 [zfs]
[<ffffffffc07242d5>] zfs_read+0x135/0x3f0 [zfs]
[<ffffffffc0743990>] zpl_read_common_iovec.constprop.9+0x80/0xd0 [zfs]
[<ffffffffc0743aa6>] zpl_aio_read+0xc6/0xf0 [zfs]
[<ffffffff812001ad>] do_sync_read+0x8d/0xd0
[<ffffffff81200bac>] vfs_read+0x9c/0x170
[<ffffffff81201a6f>] SyS_read+0x7f/0xe0
[<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Block device deployment strategy 2 - lvm2, ext4

1. Stop the database and destroy zfs

pg_ctl stop -m immediate  
  
zfs destroy zp1/data01  
zpool destroy zp1  

2. Wipe block device headers

wipefs -f -a /dev/vd[b-q]  
  
  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdb  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdc  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdd  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vde  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdf  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdg  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdh  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdi  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdj  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdk  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdl  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdm  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdn  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdo  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdp  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdq  
  
  
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdk mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdl mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdm mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdn mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdo mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdp mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdq mklabel gpt mkpart primary 1MiB 100%FREE  
  
  
wipefs -f -a /dev/vd[b-q]1  

3. Create PVs

pvcreate /dev/vd[b-q]1  

4. Create the VG

vgcreate -A y -s 128M vgdata01 /dev/vd[b-q]1    

5. Create logical volumes and configure striping

lvcreate -A y -i 16 -I 8 -L 4TiB -n lv03 vgdata01    
lvcreate -A y -i 16 -I 8 -L 220TiB -n lv01 vgdata01    
lvcreate -A y -i 16 -I 8 -l 100%FREE -n lv02 vgdata01    
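
The stripe layout can be verified with standard LVM commands before creating file systems (in addition to the pvs/vgs/lvs output below):

# per-LV segment view, including the number of stripes
lvs --segments vgdata01

# detailed mapping of one LV onto the PVs
lvdisplay -m /dev/vgdata01/lv01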

6. View

[root@pg11-320tb-zfs ~]# pvs  
  PV         VG       Fmt  Attr PSize   PFree  
  /dev/vdb   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdc   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdd   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vde   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdf   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdg   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdh   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdi   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdj   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdk   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdl   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdm   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdn   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdo   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdp   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdq   vgdata01 lvm2 a--  <20.00t    0   
[root@pg11-320tb-zfs ~]# vgs  
  VG       #PV #LV #SN Attr   VSize    VFree  
  vgdata01  16   3   0 wz--n- <320.00t    0   
[root@pg11-320tb-zfs ~]# lvs  
  LV   VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert  
  lv01 vgdata01 -wi-a----- 220.00t                                                      
  lv02 vgdata01 -wi-a----- <96.00t                                                      
  lv03 vgdata01 -wi-a-----   4.00t   

7. Create ext4 file systems and configure striping

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01    
mkfs.ext4 /dev/mapper/vgdata01-lv02 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv02    
mkfs.ext4 /dev/mapper/vgdata01-lv03 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv03    

8. Configure mounts

vi /etc/fstab     
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0  
LABEL=lv02 /data02 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0  
LABEL=lv03 /data03 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0  
    
    
mkdir /data01    
mkdir /data02    
mkdir /data03    
    
    
mount -a    
[root@pg11-320tb-zfs ~]# df -h  
Filesystem                 Size  Used Avail Use% Mounted on  
/dev/vda1                  197G  2.1G  185G   2% /  
devtmpfs                   252G     0  252G   0% /dev  
tmpfs                      252G     0  252G   0% /dev/shm  
tmpfs                      252G  596K  252G   1% /run  
tmpfs                      252G     0  252G   0% /sys/fs/cgroup  
tmpfs                       51G     0   51G   0% /run/user/0  
/dev/mapper/vgdata01-lv01  220T   20K  220T   1% /data01  
/dev/mapper/vgdata01-lv02   96T   20K   96T   1% /data02  
/dev/mapper/vgdata01-lv03  4.0T   89M  4.0T   1% /data03  

9. Create the database data, tablespace, and WAL log directories

[root@pg11-320tb-zfs ~]# mkdir /data01/pg11  
[root@pg11-320tb-zfs ~]# mkdir /data02/pg11  
[root@pg11-320tb-zfs ~]# mkdir /data03/pg11  
[root@pg11-320tb-zfs ~]# chown postgres:postgres /data0*/pg11  

10. Configuring environment variables

su - postgres  
  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=1921  
export PGDATA=/data01/pg11/pg_root$PGPORT  
export LANG=en_US.utf8  
export PGHOME=/usr/pgsql-11  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=$PGDATA  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  

11. Initialize the database

initdb -D $PGDATA -X /data03/pg11/pg_wal1921 -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024   

12. Create tablespaces

mkdir /data01/pg11/tbs1  
mkdir /data02/pg11/tbs2  
  
  
create tablespace tbs1 location '/data01/pg11/tbs1';  
create tablespace tbs2 location '/data02/pg11/tbs2';  

1 trillion tpcb test

Initialize the data

nohup pgbench -i -s 10000000 -I dtg -n --tablespace=tbs1 >./init.log 2>&1 &  

It takes 810688 seconds in total, about 1.233 million rows/s.

999999100000 of 1000000000000 tuples (99%) done (elapsed 810688.06 s, remaining 0.73 s)
999999200000 of 1000000000000 tuples (99%) done (elapsed 810688.14 s, remaining 0.65 s)
999999300000 of 1000000000000 tuples (99%) done (elapsed 810688.22 s, remaining 0.57 s)
999999400000 of 1000000000000 tuples (99%) done (elapsed 810688.30 s, remaining 0.49 s)
999999500000 of 1000000000000 tuples (99%) done (elapsed 810688.38 s, remaining 0.41 s)
999999600000 of 1000000000000 tuples (99%) done (elapsed 810688.46 s, remaining 0.32 s)
999999700000 of 1000000000000 tuples (99%) done (elapsed 810688.54 s, remaining 0.24 s)
999999800000 of 1000000000000 tuples (99%) done (elapsed 810688.61 s, remaining 0.16 s)
999999900000 of 1000000000000 tuples (99%) done (elapsed 810688.69 s, remaining 0.08 s)
1000000000000 of 1000000000000 tuples (100%) done (elapsed 810688.77 s, remaining 0.00 s)
done.
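
From the elapsed time in the log, the load rate works out to roughly 1.23 million rows per second (a quick check):

# 1 trillion rows / 810688 seconds
echo $(( 1000000000000 / 810688 ))    # ~1.23 million rows/s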

The data occupies 120 TB

postgres=# \l+
                                                                 List of databases
   Name    |  Owner   | Encoding  | Collate |   Ctype    |   Access privileges   |  Size  | Tablespace |                Description                 
-----------+----------+-----------+---------+------------+-----------------------+--------+------------+--------------------------------------------
 postgres  | postgres | SQL_ASCII | C       | en_US.utf8 |                       | 120 TB | pg_default | default administrative connection database
 template0 | postgres | SQL_ASCII | C       | en_US.utf8 | =c/postgres          +| 15 MB  | pg_default | unmodifiable empty database
           |          |           |         |            | postgres=CTc/postgres |        |            | 
 template1 | postgres | SQL_ASCII | C       | en_US.utf8 | =c/postgres          +| 15 MB  | pg_default | default template for new databases
           |          |           |         |            | postgres=CTc/postgres |        |            | 
(3 rows)



postgres=# \dt+
                          List of relations
 Schema |       Name       | Type  |  Owner   |  Size   | Description 
--------+------------------+-------+----------+---------+-------------
 public | pgbench_accounts | table | postgres | 120 TB  | 
 public | pgbench_branches | table | postgres | 344 MB  | 
 public | pgbench_history  | table | postgres | 0 bytes | 
 public | pgbench_tellers  | table | postgres | 4201 MB | 
 public | t                | table | postgres | 804 MB  | 
(5 rows)


postgres=# \db+
                                        List of tablespaces
    Name    |  Owner   |      Location       | Access privileges | Options |  Size   | Description 
------------+----------+---------------------+-------------------+---------+---------+-------------
 pg_default | postgres |                     |                   |         | 850 MB  | 
 pg_global  | postgres |                     |                   |         | 2206 kB | 
 tbs1       | postgres | /data01/digoal/tbs1 |                   |         | 120 TB  | 
 tbs2       | postgres | /data02/digoal/tbs2 |                   |         | 213 MB  | 
(4 rows)

Add and drop column tests: completed in seconds

Adding and dropping columns only changes metadata, so it completes in seconds.

In PostgreSQL 11, adding a column with a non-null default value is also done in seconds.

PostgreSQL 11 preview - Adding non-null default values does not require rewrite table - online add column with default value

postgres=# \timing
Timing is on.
postgres=# select * from pgbench_accounts limit 10;
 aid | bid | abalance |                                        filler                                        
-----+-----+----------+--------------------------------------------------------------------------------------
   1 |   1 |        0 |                                                                                     
   2 |   1 |        0 |                                                                                     
   3 |   1 |        0 |                                                                                     
   4 |   1 |        0 |                                                                                     
   5 |   1 |        0 |                                                                                     
   6 |   1 |        0 |                                                                                     
   7 |   1 |        0 |                                                                                     
   8 |   1 |        0 |                                                                                     
   9 |   1 |        0 |                                                                                     
  10 |   1 |        0 |                                                                                     
(10 rows)

Time: 498.051 ms

-- Adding fields in seconds

postgres=# alter table pgbench_accounts add column col1 text;
ALTER TABLE
Time: 1254.611 ms (00:01.255)
  
-- Adding a column with a non-null default also completes in seconds
postgres=# alter table pgbench_accounts add column col2 text default 'hello digoal';
ALTER TABLE
Time: 1253.689 ms (00:01.254)


postgres=# select * from pgbench_accounts limit 10;
 aid | bid | abalance |                                        filler                                        | col1 |     col2     
-----+-----+----------+--------------------------------------------------------------------------------------+------+--------------
   1 |   1 |        0 |                                                                                      |      | hello digoal
   2 |   1 |        0 |                                                                                      |      | hello digoal
   3 |   1 |        0 |                                                                                      |      | hello digoal
   4 |   1 |        0 |                                                                                      |      | hello digoal
   5 |   1 |        0 |                                                                                      |      | hello digoal
   6 |   1 |        0 |                                                                                      |      | hello digoal
   7 |   1 |        0 |                                                                                      |      | hello digoal
   8 |   1 |        0 |                                                                                      |      | hello digoal
   9 |   1 |        0 |                                                                                      |      | hello digoal
  10 |   1 |        0 |                                                                                      |      | hello digoal
(10 rows)

Time: 502.608 ms
postgres=# explain analyze select * from pgbench_accounts limit 10;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.00 rows=10 width=168) (actual time=0.022..0.026 rows=10 loops=1)
   ->  Seq Scan on pgbench_accounts  (cost=0.00..184941625.46 rows=673387096855 width=168) (actual time=0.020..0.022 rows=10 loops=1)
 Planning Time: 0.057 ms
 Execution Time: 0.041 ms
(4 rows)

Time: 0.453 ms

-- Analyze the table
postgres=# analyze pgbench_accounts ;
ANALYZE
Time: 67373.884 ms (01:07.374)

-- Dropping columns takes milliseconds
postgres=# alter table pgbench_accounts drop column col1;
ALTER TABLE
Time: 7.610 ms
postgres=# alter table pgbench_accounts drop column col2;
ALTER TABLE
Time: 0.546 ms

Create index

After the initial data load, create the indexes.

1. Modifying Parallelism

psql  
  
analyze;  
alter table pgbench_accounts set (parallel_workers=64);  
alter table pgbench_tellers set (parallel_workers=64);  
alter table pgbench_branches set (parallel_workers=64);  

2. Creating Index

nohup pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2 >./init_pkey.log 2>&1 &  

64-way parallelism was enabled; resource consumption at the start:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.20    0.00   38.52   16.31    0.00   41.97

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vdb            2783.00  6362.00 3972.00  169.00 49328.00 52360.00    49.11     5.35    1.29    0.93    9.91   0.24  98.90
vdc            2760.00  6361.00 4020.00  167.00 49296.00 51764.00    48.27     5.19    1.23    0.88    9.59   0.24  98.70
vdd            2747.00  6361.00 4010.00  169.00 49332.00 51860.00    48.43     5.18    1.24    0.88    9.64   0.24  98.80
vde            2757.00  6362.00 4032.00  169.00 49344.00 51864.00    48.18     5.13    1.22    0.87    9.54   0.23  98.50
vdf            2732.00  6360.00 4012.00  191.00 49336.00 52532.00    48.47     5.21    1.24    0.85    9.50   0.24  99.10
vdg            2716.00  6361.00 4039.00  191.00 49320.00 52036.00    47.92     5.28    1.25    0.86    9.35   0.23  99.20
vdh            2742.00  6361.00 4038.00  191.00 49340.00 52032.00    47.94     5.46    1.29    0.91    9.26   0.23  98.80
vdi            2749.00  6361.00 4041.00  193.00 49328.00 52544.00    48.12     5.35    1.26    0.88    9.40   0.23  99.20
vdj            2776.00  6385.00 3953.00  169.00 49344.00 52020.00    49.18     5.41    1.31    0.93   10.12   0.24  99.10
vdk            2767.00  6384.00 3999.00  171.00 49328.00 52028.00    48.61     5.52    1.32    0.96    9.76   0.24  99.10
vdl            2775.00  6386.00 3984.00  172.00 49328.00 52032.00    48.78     5.56    1.34    0.97    9.94   0.24  99.10
vdm            2759.00  6385.00 4039.00  172.00 49296.00 52416.00    48.31     5.58    1.32    0.95    9.87   0.23  98.90
vdn            2776.00  6369.00 3967.00  163.00 49352.00 51840.00    49.00     5.48    1.32    0.95   10.33   0.24  99.70
vdo            2776.00  6370.00 3978.00  163.00 49352.00 52220.00    49.06     5.42    1.31    0.93   10.56   0.24  99.30
vdp            2782.00  6370.00 4004.00  162.00 49356.00 51840.00    48.58     5.60    1.34    0.97   10.44   0.24  99.70
vdq            2759.00  6370.00 4033.00  161.00 49352.00 51828.00    48.25     5.61    1.34    0.97   10.48   0.24  99.50
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00 108392.00 104717.00 790740.00 837136.00    15.28  1293.57    5.82    0.94   10.87   0.00 100.90
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


top - 07:41:20 up 12 days, 22:02,  2 users,  load average: 61.15, 26.97, 10.80
Tasks: 607 total,  39 running, 568 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.4 us, 54.3 sy,  0.0 ni, 21.2 id, 21.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 52819500+total,  2192660 free, 33286761+used, 19313472+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 19267680+avail Mem 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                 
65466 digoal    20   0  0.296t  25764   2044 R  68.4  0.0   1:11.05 postgres: parallel worker for PID 65299  
65450 digoal    20   0  0.296t  25756   2076 D  63.2  0.0   1:10.51 postgres: parallel worker for PID 65299  
65460 digoal    20   0  0.296t  25764   2048 R  63.2  0.0   1:10.37 postgres: parallel worker for PID 65299  
65469 digoal    20   0  0.296t  25752   2056 R  63.2  0.0   1:10.48 postgres: parallel worker for PID 65299  
65474 digoal    20   0  0.296t  25764   2052 R  63.2  0.0   1:10.36 postgres: parallel worker for PID 65299  
65478 digoal    20   0  0.296t  25764   2060 R  63.2  0.0   1:10.64 postgres: parallel worker for PID 65299  
65479 digoal    20   0  0.296t  25752   2056 R  63.2  0.0   1:10.47 postgres: parallel worker for PID 65299  
65484 digoal    20   0  0.296t  25760   2056 R  63.2  0.0   1:10.63 postgres: parallel worker for PID 65299  
65485 digoal    20   0  0.296t  25748   2068 R  63.2  0.0   1:11.10 postgres: parallel worker for PID 65299  



Total DISK READ :     834.93 M/s | Total DISK WRITE :    1006.90 M/s
Actual DISK READ:     835.23 M/s | Actual DISK WRITE:     994.87 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                      
65461 be/4 digoal     13.90 M/s   12.74 M/s  0.00 % 51.43 % postgres: parallel worker for PID 65299
65457 be/4 digoal     13.86 M/s   12.95 M/s  0.00 % 51.25 % postgres: parallel worker for PID 65299
65505 be/4 digoal     14.82 M/s   13.05 M/s  0.00 % 51.09 % postgres: parallel worker for PID 65299
65466 be/4 digoal     12.57 M/s   13.48 M/s  0.00 % 50.72 % postgres: parallel worker for PID 65299
65462 be/4 digoal     13.16 M/s   13.23 M/s  0.00 % 50.70 % postgres: parallel worker for PID 65299
65491 be/4 digoal      8.85 M/s   12.99 M/s  0.00 % 50.59 % postgres: parallel worker for PID 65299
65451 be/4 digoal     12.33 M/s   13.48 M/s  0.00 % 50.57 % postgres: parallel worker for PID 65299
65477 be/4 digoal     12.37 M/s   13.20 M/s  0.00 % 50.38 % postgres: parallel worker for PID 65299
65459 be/4 digoal      8.45 M/s   19.33 M/s  0.00 % 50.27 % postgres: parallel worker for PID 65299
65488 be/4 digoal     12.34 M/s   12.74 M/s  0.00 % 50.21 % postgres: parallel worker for PID 65299
65495 be/4 digoal     13.83 M/s   13.26 M/s  0.00 % 50.19 % postgres: parallel worker for PID 65299
65450 be/4 digoal      9.20 M/s   19.45 M/s  0.00 % 50.14 % postgres: parallel worker for PID 65299
65503 be/4 digoal     14.02 M/s   19.66 M/s  0.00 % 50.13 % postgres: parallel worker for PID 65299

Wait events

postgres=# select wait_event,wait_event_type,count(*) from pg_stat_activity where wait_event is not null group by 1,2;
     wait_event      | wait_event_type | count 
---------------------+-----------------+-------
 BufFileRead         | IO              |    59
 BufFileWrite        | IO              |     1
 CheckpointerMain    | Activity        |     1
 BgWriterHibernate   | Activity        |     1
 AutoVacuumMain      | Activity        |     1
 LogicalLauncherMain | Activity        |     1
 WalWriterMain       | Activity        |     1
(7 rows)
  
-[ RECORD 7 ]----+---------------------------------------------------------------------------------
datid            | 13220
datname          | postgres
pid              | 65448
usesysid         | 10
usename          | postgres
application_name | pgbench
client_addr      | 
client_hostname  | 
client_port      | 
backend_start    | 2018-10-02 07:38:46.003833+08
xact_start       | 2018-10-02 07:38:46.003114+08
query_start      | 2018-10-02 07:38:46.003114+08
state_change     | 2018-10-02 07:38:46.00439+08
wait_event_type  | IO
wait_event       | BufFileRead
state            | active
backend_xid      | 
backend_xmin     | 598
query            | alter table pgbench_accounts add primary key (aid) using index tablespace "tbs2"
backend_type     | parallel worker

Temporary space usage is about 19 TB. After the parallel workers finish, the index merge begins.

digoal@pg11-320tb-zfs-> cd $PGDATA/base
digoal@pg11-320tb-zfs-> du -sh *
16M     1
16M     13219
16M     13220
19T     pgsql_tmp

After each parallel worker process completes its task, the sorted runs are merged into the index; merge throughput:

Total DISK READ :     116.21 M/s | Total DISK WRITE :     169.91 M/s
Actual DISK READ:     116.21 M/s | Actual DISK WRITE:     197.28 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                       
65299 be/4 digoal    116.21 M/s  169.91 M/s  0.00 %  8.13 % postgres: postgres postgres [local] ALTER TABLE
65298 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2
51030 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres                 
51032 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: logger
51034 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: checkpointer
51035 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: background writer
51036 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: walwriter
51037 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: autovacuum launcher
51038 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: stats collector
51039 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: logical replication launcher

The index occupies 20 TB

postgres=# \di+
                                      List of relations
 Schema |         Name          | Type  |  Owner   |      Table       |  Size   | Description 
--------+-----------------------+-------+----------+------------------+---------+-------------
 public | pgbench_accounts_pkey | index | postgres | pgbench_accounts | 20 TB   | 
 public | pgbench_branches_pkey | index | postgres | pgbench_branches | 213 MB  | 
 public | pgbench_tellers_pkey  | index | postgres | pgbench_tellers  | 2125 MB | 
(3 rows)

Index creation time

7130 minutes

Start
2018-10-02 07:51:00

End
2018-10-07 06:41:08
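
The 7130 minutes follow directly from the start and end timestamps (a quick check with GNU date):

echo $(( ( $(date -d '2018-10-07 06:41:08' +%s) - $(date -d '2018-10-02 07:51:00' +%s) ) / 60 ))    # 7130 minutes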

Benchmark scripts

Read-only

vi ro.sql  
  

\set aid random_gaussian(1, :range, 10.0)  
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000000

Read-write

vi rw.sql  
  
\set aid random_gaussian(1, :range, 10.0)  
\set bid random(1, 1 * :scale)  
\set tid random(1, 10 * :scale)  
\set delta random(-5000, 5000)  
BEGIN;  
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
END;  
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000000

1. 1 trillion tpcb read-only test - 1 trillion rows active

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 7737610
latency average = 2.977 ms
latency stddev = 35.840 ms
tps = 21492.371917 (including connections establishing)
tps = 21495.359217 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         2.975  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

1. 1 trillion tpcb read-write test - 1 trillion rows active

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 5812634
latency average = 3.963 ms
latency stddev = 48.480 ms
tps = 16143.312370 (including connections establishing)
tps = 16145.557184 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.025  BEGIN;
         3.511  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.063  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.155  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.119  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.047  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.041  END;

2. 1 trillion tpcb read-only test - 100 billion rows active

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 8317050
latency average = 2.770 ms
latency stddev = 34.609 ms
tps = 23101.921465 (including connections establishing)
tps = 23105.640572 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         2.766  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

2. 1 trillion tpcb read-write test - 100 billion rows active

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 6526147
latency average = 3.529 ms
latency stddev = 50.638 ms
tps = 18126.367839 (including connections establishing)
tps = 18134.592726 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.025  BEGIN;
         3.102  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.061  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.159  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.091  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.046  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.041  END;

3. 1 trillion tpcb read-only test - 10 billion rows active

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 212661629
latency average = 0.108 ms
latency stddev = 12.493 ms
tps = 590692.703049 (including connections establishing)
tps = 590774.219034 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid random_gaussian(1, :range, 10.0)
         0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

3. 1 trillion tpcb read-write test - 10 billion rows active

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 27002477
latency average = 0.853 ms
latency stddev = 39.222 ms
tps = 75002.036277 (including connections establishing)
tps = 75012.139249 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.044  BEGIN;
         0.211  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.076  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.198  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.115  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.072  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.134  END;

4. 1 trillion tpcb read-only test - 1 billion rows active

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 374399291
latency average = 0.061 ms
latency stddev = 5.647 ms
tps = 1039962.270864 (including connections establishing)
tps = 1040949.958600 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         0.061  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

4. 1 trillion tpcb read/write test - 1 billion active rows

TPS: 82148

QPS: 410740

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 29574604
latency average = 0.779 ms
latency stddev = 16.197 ms
tps = 82148.432097 (including connections establishing)
tps = 82160.286498 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.043  BEGIN;
         0.144  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.074  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.207  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.106  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.070  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.131  END;

5. 1 trillion tpcb read-only test - 100 million active rows

TPS: 1068052

QPS: 1068052

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 384510720
latency average = 0.060 ms
latency stddev = 4.332 ms
tps = 1068052.373377 (including connections establishing)
tps = 1068206.696327 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         0.059  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

5. 1 trillion tpcb read/write test - 100 million active rows

TPS: 78647

QPS: 393235

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 28314309
latency average = 0.814 ms
latency stddev = 16.406 ms
tps = 78647.191352 (including connections establishing)
tps = 78658.751759 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.043  BEGIN;
         0.184  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.076  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.217  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.096  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.069  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.125  END;

Performance summary

Environment: Aliyun ECS + 320T ESSD

Table size: 120 TB; loading took 810,688 seconds, about 1.233 million rows/s

Index size: 20 TB; creation took 427,800 seconds

Index depth: 4 levels (with BLOCK_SIZE=32K each page holds more index items than an 8K page, so the index on 1 trillion rows has fewer levels than the 100-billion-row index built on 8K blocks)
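
The depth can be checked with the pageinspect extension. A minimal sketch, assuming the primary key index is named pgbench_accounts_pkey (adjust to the actual index name used in your build):

create extension if not exists pageinspect;  
-- bt_metap reads only the B-tree meta page; "level" is the root page's level (leaf pages are level 0)  
select level from bt_metap('pgbench_accounts_pkey');  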

Single table data volume | Active data | Test case | QPS | TPS
1 trillion (tpcb) | 100 million | read-only | 1068052 | 1068052
1 trillion (tpcb) | 1 billion | read-only | 1039962 | 1039962
1 trillion (tpcb) | 10 billion | read-only | 590692 | 590692
1 trillion (tpcb) | 100 billion | read-only | 23101 | 23101
1 trillion (tpcb) | 1000 billion | read-only | 21492 | 21492
1 trillion (tpcb) | 100 million | read/write | 393235 | 78647
1 trillion (tpcb) | 1 billion | read/write | 410740 | 82148
1 trillion (tpcb) | 10 billion | read/write | 375010 | 75002
1 trillion (tpcb) | 100 billion | read/write | 90630 | 18126
1 trillion (tpcb) | 1000 billion | read/write | 80715 | 16143

Adding a column (with a default value) takes 1.25 seconds.

Dropping a column takes 1 millisecond.
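
A minimal sketch of what was measured, using a throw-away column name (col_test is not part of the pgbench schema):

-- PostgreSQL 11 records a constant default in the catalog instead of rewriting the 125 TB table,  
-- so adding the column returns almost immediately  
alter table pgbench_accounts add column col_test int8 default 0;  
  
-- dropping a column only marks it dropped in the catalog, hence roughly 1 ms  
alter table pgbench_accounts drop column col_test;  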

Appendix - partitioning pgbench_accounts, loading test data in parallel, dynamic queries

What are the problems with a single 1-trillion-row table?

1. At 125 TB, index creation takes a long time. PostgreSQL 11 introduces parallel index builds to address this.

2. At 125 TB, garbage collection (VACUUM) takes longer. The zheap storage engine planned for PG 12 would eliminate this completely.

3. At 125 TB, FREEZE is time-consuming and may not even finish within the 2-billion-transaction window; a future PostgreSQL version with XIDs wider than 32 bits would solve this completely. Freeze pressure can be watched with the queries after this list.

4. A 125 TB table must live in a single directory, which may run into file system limits (inodes, capacity, etc.).

5. A 125 TB table is inconvenient to clean up. If the data ages along a time dimension, a partitioned table (for example managed with pg_pathman) handles hot and cold data much better.
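
Standard catalog queries for watching freeze pressure (not part of the original test scripts):

-- transaction-id age per database; freezing must complete long before it approaches 2 billion  
select datname, age(datfrozenxid) from pg_database order by 2 desc;  
  
-- the tables with the oldest relfrozenxid in the current database  
select relname, age(relfrozenxid) from pg_class where relkind = 'r' order by 2 desc limit 10;  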

Converting the pgbench schema to partitioned tables:

1. pg_pathman is recommended because its performance overhead is low; the built-in partitioning still has performance problems. See:

PostgreSQL 9.x, 10, 11 hash partition table usage example

PostgreSQL 10 built-in partition vs pg_pathman perf profiling

Partition Table Lock Granularity Difference - pg_pathman VS native partition table

PostgreSQL queries involving too many partitions - performance diagnosis and optimization (large numbers of BIND, spin lock, SLEEP processes)

When using the built-in partitioning, dynamic SQL is recommended to avoid the BIND problem; a sketch follows.
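
A minimal sketch of the dynamic-SQL approach (the function name and signature are illustrative, not from the original article): embedding the literal key lets the planner prune to the single matching partition at plan time, instead of a prepared statement binding a generic plan across all 4096 partitions.

create or replace function get_abalance(p_aid int8) returns int as $$  
declare  
  v_abalance int;  
begin  
  -- build the statement with the literal aid; each call is planned against only the matching partition  
  execute format('select abalance from pgbench_accounts where aid = %s', p_aid) into v_abalance;  
  return v_abalance;  
end;  
$$ language plpgsql strict;  

In a pgbench script the transaction then simply calls select get_abalance(:aid);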

Partition demo

PostgreSQL pgbench tpcb data generation and SQL partial source code interpretation

"PostgreSQL pgbench tpcb massive database testing-partition table testing optimization"

Loading data

1. Create the pgbench tables (structure only)

pgbench -i -I dt --tablespace=tbs1 -s 10000000  

2. Partitioning

create table p (like pgbench_accounts) partition by RANGE ( aid ) tablespace tbs1;  
  
do language plpgsql $$  
declare  
  -- 244140625 rows per partition * 4096 partitions = 1 trillion rows  
  i_rows_perpartition int8 := 244140625;  
begin  
  -- i runs from 0 to 4096, so one extra trailing partition is created as headroom beyond the last full partition  
  for i in 0..4096 loop  
    execute format ('create table pgbench_accounts%s partition of p for values from (%s) to (%s) tablespace tbs1', i, i*i_rows_perpartition, (i+1)*i_rows_perpartition);  
  end loop;  
end;  
$$;  
  
drop table pgbench_accounts;  
  
alter table p rename to pgbench_accounts;  
  
-- alter table pgbench_accounts add constraint pk_pgbench_accounts_aid primary key (aid) using index tablespace tbs2;  
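
A sanity check after the rename (an ordinary catalog query, not in the original script):

-- number of partitions attached to pgbench_accounts; the loop above creates 4097 (i = 0..4096)  
select count(*) from pg_inherits where inhparent = 'pgbench_accounts'::regclass;  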

3. Task table (one task per partition)

drop table task;  
create table task(id int primary key);  
insert into task select i from generate_series(0,4095) t(i);  

4. Seed records (the aids for one partition)

create table init_accounts(aid int8);  
insert into init_accounts select generate_series(0,244140624);  

5. UDF for parallel loading

create or replace function tpcb_init_accounts() returns void as $$  
declare  
  v_id int;  
begin  
  -- grab one pending partition id; SKIP LOCKED lets all 64 workers proceed without blocking one another  
  with tmp as (select * from task limit 1 for update skip locked),  
    tmp1 as (delete from task using tmp where task.id=tmp.id)  
    select id into v_id from tmp;  
    
  if found then  
    -- fill partition v_id: shift the 244140625 seed aids by v_id*244140625 and derive bid from aid  
    execute format ('insert into pgbench_accounts%s select aid+%s*244140625::int8, ((aid+%s*244140625::int8)-1)/100000 + 1, 0 from init_accounts on conflict do nothing', v_id, v_id, v_id);  
  end if;  
end;  
$$ language plpgsql strict;  

6. Parallel loading of data

vi test.sql  
select tpcb_init_accounts();  
  
nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &  
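
Progress can be watched from another session; these are ordinary queries against the objects created above, not part of the original script:

-- partitions still waiting to be loaded (starts at 4096, reaches 0 when loading is done)  
select count(*) from task;  
  
-- rows loaded into one partition (244140625 when that partition is complete)  
select count(*) from pgbench_accounts0;  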

Initializing the indexes

1. Task table

drop table task;  
create table task(id int primary key);  
insert into task select i from generate_series(0,4095) t(i);  

2. UDF for parallel index creation

create or replace function tpcb_init_accounts_pkey() returns void as $$  
declare  
  v_id int;  
begin  
  with tmp as (select * from task limit 1 for update skip locked),  
    tmp1 as (delete from task using tmp where task.id=tmp.id)  
    select id into v_id from tmp;  
    
  if found then  
    -- gather statistics for the partition, then build its primary key in tablespace tbs2  
    execute format ('analyze pgbench_accounts%s', v_id);  
    execute format ('alter table pgbench_accounts%s add constraint pk_pgbench_accounts%s_aid primary key (aid) using index tablespace tbs2', v_id, v_id);  
  end if;  
end;  
$$ language plpgsql strict;  

3. Parallel Index Creation

vi test.sql  
select tpcb_init_accounts_pkey();  
  
nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &  
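
A quick way to confirm every task was processed (uses the index naming from the UDF above):

-- one primary key per partition; expect 4096 once all tasks are done  
select count(*) from pg_indexes where indexname ~ '^pk_pgbench_accounts[0-9]+_aid$';  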

Summary

1. With the default 8K block size, a single table is limited to 32 TB. (The block number inside a ctid is a 32-bit address, so with 8K blocks the maximum table size is 32 TB. The 1-trillion-row single table tested here exceeds 32 TB, so a larger BLOCK SIZE is needed; 32K raises the single-table limit to 256 TB.)

Pass --with-blocksize= at compile time:

./configure --with-blocksize=32

2. How can such a large database be backed up efficiently and restored at a point in time?

Full backup: 1. Take a ZFS snapshot and send it to a backup machine (a 10 GbE network can saturate the NIC). 2. Full backup with pg_basebackup. 3. Full backup with pg_rman. 4. Full backup via cloud disk snapshots.

Incremental backup: 1. Take a ZFS snapshot and send only the incremental stream to the backup machine. 2. pg_basebackup can only take full backups. 3. Incremental backup with pg_rman (it identifies the blocks modified since the last backup by their block LSN). 4. Incremental backup via cloud disk snapshots.

WAL archiving: back up the WAL files to the archive.

Point-in-time recovery: 1. Clone a ZFS snapshot, then replay archived WAL to the target time. 2. Restore a full backup, then replay archived WAL to the target time. 3. Restore full + incremental backups, then replay archived WAL to the target time.

3. In this tpcb test with 64 concurrent clients, BIND consumed noticeably more time during the first ten seconds of each run.

4. Partitioning large tables with pg_pathman is recommended. How large does a table have to be before partitioning is worthwhile? See:

"HTAP database PostgreSQL scenarios and performance testing 45 - (OLTP) linear relationship between data volume and performance (1 billion + no attenuation), and how large a single table needs partitioning"

PostgreSQL 9.5 + Efficient Partition Table Implementation - pg_pathman

Reference resources

PostgreSQL 11 tpcc Test (103,000 tpmC on ECS) - use sysbench-tpcc by Percona-Lab

(TPC-H test SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen

PostgreSQL 10 billion tpcb performance on ECS

PostgreSQL on Aliyun ECS + ESSD - 100 billion tpcb, 1000W tpcc test

