Tags
PostgreSQL, pgbench, tpcb, tpcc, tpch, lvm2, zfs, stripes
Background
Several recent PostgreSQL OLTP and OLAP tests:
PostgreSQL 11 tpcc Test (103,000 tpmC on ECS) - use sysbench-tpcc by Percona-Lab
(TPC-H test SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen
PostgreSQL 10 billion tpcb performance on ECS
PostgreSQL on Aliyun ECS + ESSD - 100 billion tpcb, 1000W tpcc test
Coverage:
1. SF=10, SF=200 TPCH
2. 1000W (10 million) TPCC
3. 10 billion TPCB
4. 100 billion TPCB
5. 1 trillion TPCB (about 125 TB in a single table) - what this article tests
This test uses 16 ESSD cloud disks (20 TB each). Two file systems are compared: ZFS and EXT4.
Environment
1. ECS, CentOS 7.4 x64
2. CPU
lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz Stepping: 4 CPU MHz: 2499.996 BogoMIPS: 4999.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 33792K NUMA node0 CPU(s): 0-63 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
3. Kernel
uname -a Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
4. Memory
free -g total used free shared buff/cache available Mem: 503 2 500 0 0 498 Swap: 0 0 0
5. Clock source
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
6. Block devices
lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT vda 253:0 0 200G 0 disk └─vda1 253:1 0 200G 0 part / vdb 253:16 0 20T 0 disk vdc 253:32 0 20T 0 disk vdd 253:48 0 20T 0 disk vde 253:64 0 20T 0 disk vdf 253:80 0 20T 0 disk vdg 253:96 0 20T 0 disk vdh 253:112 0 20T 0 disk vdi 253:128 0 20T 0 disk vdj 253:144 0 20T 0 disk vdk 253:160 0 20T 0 disk vdl 253:176 0 20T 0 disk vdm 253:192 0 20T 0 disk vdn 253:208 0 20T 0 disk vdo 253:224 0 20T 0 disk vdp 253:240 0 20T 0 disk vdq 253:256 0 20T 0 disk
Configuring OS parameters of ECS virtual machine
1. Kernel parameters
vi /etc/sysctl.conf # add by digoal.zhou fs.aio-max-nr = 1048576 fs.file-max = 76724600 # Optional: kernel.core_pattern=/data01/corefiles/core_%e_%u_%t_%s.%p # / Data 01/corefiles is built beforehand, with 777 permissions. If it is a soft link, the corresponding directory is changed to 777. kernel.sem = 4096 2147483647 2147483646 512000 # The semaphores, ipcs-l or-u view, need 17 semaphores for each group of 16 processes. kernel.shmall = 107374182 # All shared memory segments are limited in size (80% of recommended memory), in pages. kernel.shmmax = 274877906944 # Maximum single shared memory segment size (recommended half of memory). Version 9.2 has significantly reduced the use of shared memory in bytes. kernel.shmmni = 819200 # How many shared memory segments can be generated altogether, and at least two shared memory segments per PG database cluster net.core.netdev_max_backlog = 10000 net.core.rmem_default = 262144 # The default setting of the socket receive buffer in bytes. net.core.rmem_max = 4194304 # The maximum receive socket buffer size in bytes net.core.wmem_default = 262144 # The default setting (in bytes) of the socket send buffer. net.core.wmem_max = 4194304 # The maximum send socket buffer size in bytes. net.core.somaxconn = 4096 net.ipv4.tcp_max_syn_backlog = 4096 net.ipv4.tcp_keepalive_intvl = 20 net.ipv4.tcp_keepalive_probes = 3 net.ipv4.tcp_keepalive_time = 60 net.ipv4.tcp_mem = 8388608 12582912 16777216 net.ipv4.tcp_fin_timeout = 5 net.ipv4.tcp_synack_retries = 2 net.ipv4.tcp_syncookies = 1 # Open SYN Cookies. When SYN waiting queue overflow occurs, cookie s are enabled to prevent a small number of SYN attacks. net.ipv4.tcp_timestamps = 1 # Reduce time_wait net.ipv4.tcp_tw_recycle = 0 # If = 1, the quick recovery of TIME-WAIT sockets in TCP connections is turned on, but the NAT environment may cause the connection to fail. It is recommended that the server close it. net.ipv4.tcp_tw_reuse = 1 # Open reuse. Allow TIME-WAIT sockets to be reused for new TCP connections net.ipv4.tcp_max_tw_buckets = 262144 net.ipv4.tcp_rmem = 8192 87380 16777216 net.ipv4.tcp_wmem = 8192 65536 16777216 net.nf_conntrack_max = 1200000 net.netfilter.nf_conntrack_max = 1200000 vm.dirty_background_bytes = 409600000 # When the dirty page reaches this value, the dirty page before (dirty_expire_centisecs/100) is automatically brushed to disk by the pdflush (or other) scheduling process in the background of the system. # The default is 10%. Large memory machines recommend adjusting to specify directly how many bytes vm.dirty_expire_centisecs = 3000 # Dirty pages older than this will be brushed to disk. 3000 means 30 seconds. vm.dirty_ratio = 95 # If the system process brushes dirty pages too slowly, making the system dirty pages more than 95% of memory, then if the user process writes to disk (such as fsync, fdatasync, etc.), it needs to actively brush out the system dirty pages. # Effectively prevent user processes from brushing dirty pages. It is very effective in the case of single machine with multiple instances and using CGROUP to restrict single instance IOPS. vm.dirty_writeback_centisecs = 100 # The wake-up interval for a pdflush (or other) backstage dirty page brushing process is 100 for one second. vm.swappiness = 0 # No swap partition vm.mmap_min_addr = 65536 vm.overcommit_memory = 0 # When allocating memory, a small amount of over malloc is allowed. If set to 1, there is always enough memory, and the test environment with less memory can use 1. 
vm.overcommit_ratio = 90 # When overcommit_memory = 2, it is used to participate in calculating the allowable allocated memory size. vm.swappiness = 0 # Close swap partition vm.zone_reclaim_mode = 0 # Disable numa, or in vmlinux. net.ipv4.ip_local_port_range = 40000 65535 # Locally Automated TCP, UDP Port Number Range fs.nr_open=20480000 # The upper limit of file handle allowed to open by a single process # Note the following parameters # vm.extra_free_kbytes = 4096000 # vm.min_free_kbytes = 2097152 # Vm.min_free_kbytes recommends allocating 1G vm.min_free_kbytes per 32G of memory # If it is a small memory machine, the above two values are not recommended. # vm.nr_hugepages = 66536 # It is recommended to use large page size / proc / meminfo Huge page size when shared buffer settings exceed 64GB # vm.lowmem_reserve_ratio = 1 1 1 # For memory larger than 64G, it is recommended to set it, otherwise the default value is 256 256 32.
2. Resource constraints
vi /etc/security/limits.conf # If nofile exceeds 1048 576, you must first set fs.nr_open of sysctl to a larger value and then continue to set nofile after it takes effect. * soft nofile 1024000 * hard nofile 1024000 * soft nproc unlimited * hard nproc unlimited * soft core unlimited * hard core unlimited * soft memlock unlimited * hard memlock unlimited
3. Disable transparent huge pages and set the tsc clock source (optional)
vi /etc/rc.local
touch /var/lock/subsys/local
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
# Clock source
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
su - postgres -c "pg_ctl start"
Deployment of PostgreSQL 11
rpm -ivh https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex openjade bzip2 git iotop lvm2 perf
yum install -y postgresql11*
Block device deployment strategy 1 - zfs
zfsonlinux
PostgreSQL on ECS Efficient Continuous Backup Design - By ZFS on Linux
PostgreSQL OLTP on ZFS Performance Optimization
1. zfs yum configuration
yum install -y http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
2. Install the kernel-devel package matching the current kernel
uname -a Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
rpm -ivh http://vault.centos.org/7.4.1708/updates/x86_64/Packages/kernel-devel-3.10.0-693.2.2.el7.x86_64.rpm
3. Installation of zfs
yum install -y zfs
Check the installation logs for errors; normally there are none.
Test availability
modprobe zfs
zfs get -o all
The zfs module will be loaded automatically at system startup.
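A quick sanity check that the module is actually present and loaded (not part of the original steps, just a verification sketch):
lsmod | grep zfs
modinfo zfs | head -n 3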
Create zpool
ESSD already stores three replicas at the storage layer, so there is no need for ZFS RAID; a plain striped pool is sufficient.
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdk mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdl mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdm mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdn mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdo mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdp mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdq mklabel gpt mkpart primary 1MiB 100%FREE
zpool create zp1 -f -o ashift=13 vdb1 vdc1 vdd1 vde1 vdf1 vdg1 vdh1 vdi1 vdj1 vdk1 vdl1 vdm1 vdn1 vdo1 vdp1 vdq1
zpool get all zp1 NAME PROPERTY VALUE SOURCE zp1 size 318T - zp1 capacity 0% - zp1 altroot - default zp1 health ONLINE - zp1 guid 12407519490197584982 - zp1 version - default zp1 bootfs - default zp1 delegation on default zp1 autoreplace off default zp1 cachefile - default zp1 failmode wait default zp1 listsnapshots off default zp1 autoexpand off default zp1 dedupditto 0 default zp1 dedupratio 1.00x - zp1 free 318T - zp1 allocated 960K - zp1 readonly off - zp1 ashift 13 local zp1 comment - default zp1 expandsize - - zp1 freeing 0 - zp1 fragmentation 0% - zp1 leaked 0 - zp1 multihost off default zp1 feature@async_destroy enabled local zp1 feature@empty_bpobj enabled local zp1 feature@lz4_compress active local zp1 feature@multi_vdev_crash_dump enabled local zp1 feature@spacemap_histogram active local zp1 feature@enabled_txg active local zp1 feature@hole_birth active local zp1 feature@extensible_dataset active local zp1 feature@embedded_data active local zp1 feature@bookmarks enabled local zp1 feature@filesystem_limits enabled local zp1 feature@large_blocks enabled local zp1 feature@large_dnode enabled local zp1 feature@sha512 enabled local zp1 feature@skein enabled local zp1 feature@edonr enabled local zp1 feature@userobj_accounting active local
Create the zfs file system
zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off -o primarycache=metadata -o logbias=throughput -o secondarycache=none zp1/data01
zfs set canmount=off zp1
Optimize two parameters
cd /sys/module/zfs/parameters/
echo 1 > zfs_prefetch_disable
echo 15 > zfs_arc_shrink_shift
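These echo settings do not survive a reboot. One way to persist them (an assumption, not part of the original setup) is a modprobe options file that is read when the zfs module loads:
cat > /etc/modprobe.d/zfs.conf <<EOF
options zfs zfs_prefetch_disable=1
options zfs zfs_arc_shrink_shift=15
EOF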
View the current parameters
zfs get all zp1/data01 NAME PROPERTY VALUE SOURCE zp1/data01 type filesystem - zp1/data01 creation Wed Sep 19 10:26 2018 - zp1/data01 used 192K - zp1/data01 available 308T - zp1/data01 referenced 192K - zp1/data01 compressratio 1.00x - zp1/data01 mounted yes - zp1/data01 quota none default zp1/data01 reservation none default zp1/data01 recordsize 8K local zp1/data01 mountpoint /data01 local zp1/data01 sharenfs off default zp1/data01 checksum on default zp1/data01 compression off default zp1/data01 atime off local zp1/data01 devices on default zp1/data01 exec on default zp1/data01 setuid on default zp1/data01 readonly off default zp1/data01 zoned off default zp1/data01 snapdir hidden default zp1/data01 aclinherit restricted default zp1/data01 createtxg 81 - zp1/data01 canmount on default zp1/data01 xattr on default zp1/data01 copies 1 default zp1/data01 version 5 - zp1/data01 utf8only off - zp1/data01 normalization none - zp1/data01 casesensitivity sensitive - zp1/data01 vscan off default zp1/data01 nbmand off default zp1/data01 sharesmb off default zp1/data01 refquota none default zp1/data01 refreservation none default zp1/data01 guid 3373300831209850945 - zp1/data01 primarycache metadata local zp1/data01 secondarycache none default zp1/data01 usedbysnapshots 0B - zp1/data01 usedbydataset 192K - zp1/data01 usedbychildren 0B - zp1/data01 usedbyrefreservation 0B - zp1/data01 logbias throughput local zp1/data01 dedup off default zp1/data01 mlslabel none default zp1/data01 sync standard default zp1/data01 dnodesize legacy default zp1/data01 refcompressratio 1.00x - zp1/data01 written 192K - zp1/data01 logicalused 76K - zp1/data01 logicalreferenced 76K - zp1/data01 volmode default default zp1/data01 filesystem_limit none default zp1/data01 snapshot_limit none default zp1/data01 filesystem_count none default zp1/data01 snapshot_count none default zp1/data01 snapdev hidden default zp1/data01 acltype off default zp1/data01 context none default zp1/data01 fscontext none default zp1/data01 defcontext none default zp1/data01 rootcontext none default zp1/data01 relatime off default zp1/data01 redundant_metadata all default zp1/data01 overlay off default
Initialize the database
1. Directory
mkdir /data01/pg11
chown postgres:postgres /data01/pg11
2. Environment variables
su - postgres vi .bash_profile export PS1="$USER@`/bin/hostname -s`-> " export PGPORT=1921 export PGDATA=/data01/pg11/pg_root$PGPORT export LANG=en_US.utf8 export PGHOME=/usr/pgsql-11 export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH export DATE=`date +"%Y%m%d%H%M"` export PATH=$PGHOME/bin:$PATH:. export MANPATH=$PGHOME/share/man:$MANPATH export PGHOST=$PGDATA export PGUSER=postgres export PGDATABASE=postgres alias rm='rm -i' alias ll='ls -lh' unalias vi
3. Initialization
initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024
Huge page configuration
With primarycache=metadata, ZFS caches only metadata in the ARC, so it is recommended to configure a large PostgreSQL shared_buffers and back it with huge pages.
PostgreSQL Huge Page Use Suggestions - Large Memory Host, Example Note
300 GB / 2 MB = 153600 pages; a little headroom is added below.
sysctl -w vm.nr_hugepages=159600
echo "vm.nr_hugepages=159600" >> /etc/sysctl.conf
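A minimal sketch of the sizing arithmetic (variable names are illustrative): the minimum page count is shared_buffers divided by the huge page size, and the 159600 configured above is that minimum plus headroom for other shared memory segments.
sb_gb=300                            # shared_buffers in GB, see postgresql.auto.conf below
hp_mb=2                              # huge page size in MB, from: grep Hugepagesize /proc/meminfo
echo $(( sb_gb * 1024 / hp_mb ))     # 153600 pages; add headroom, e.g. 159600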
postgresql.auto.conf
listen_addresses = '0.0.0.0' port = 1921 max_connections = 2000 superuser_reserved_connections = 3 unix_socket_directories = '., /var/run/postgresql, /tmp' tcp_keepalives_idle = 60 tcp_keepalives_interval = 10 tcp_keepalives_count = 10 huge_pages = on # Using large page shared_buffers = 300GB max_prepared_transactions = 2000 work_mem = 32MB maintenance_work_mem = 2GB dynamic_shared_memory_type = posix vacuum_cost_delay = 0 bgwriter_delay = 10ms bgwriter_lru_maxpages = 1000 bgwriter_lru_multiplier = 10.0 effective_io_concurrency = 0 max_worker_processes = 128 max_parallel_workers = 64 max_parallel_maintenance_workers = 64 max_parallel_workers_per_gather = 0 parallel_leader_participation = on min_parallel_table_scan_size=0 min_parallel_index_scan_size=0 parallel_setup_cost=0 parallel_tuple_cost=0 wal_level = minimal synchronous_commit = off full_page_writes=off # zfs has checksum, cow. Close fpw. If BLOCKDEV can guarantee 8K atom writing, it can also be closed. wal_writer_delay = 10ms checkpoint_timeout = 30min max_wal_size = 600GB min_wal_size = 150GB checkpoint_completion_target = 0.1 max_wal_senders = 0 effective_cache_size = 200GB log_destination = 'csvlog' logging_collector = on log_directory = 'log' log_filename = 'postgresql-%a.log' log_truncate_on_rotation = on log_rotation_age = 1d log_rotation_size = 0 log_checkpoints = on log_connections = on log_disconnections = on log_error_verbosity = verbose log_line_prefix = '%m [%p] ' log_timezone = 'PRC' log_autovacuum_min_duration = 0 autovacuum_max_workers = 16 autovacuum_freeze_max_age = 1200000000 autovacuum_multixact_freeze_max_age = 1400000000 autovacuum_vacuum_cost_delay = 0ms vacuum_freeze_table_age = 1150000000 vacuum_multixact_freeze_table_age = 1150000000 datestyle = 'iso, mdy' timezone = 'PRC' lc_messages = 'C' lc_monetary = 'C' lc_numeric = 'C' lc_time = 'C' default_text_search_config = 'pg_catalog.english' jit = off cpu_tuple_cost=0.00018884145574257426 cpu_index_tuple_cost = 0.00433497085216479990 cpu_operator_cost = 0.00216748542608239995 seq_page_cost=0.014329 random_page_cost = 0.016
Restart the database
pg_ctl restart -m fast
ZFS on Linux performance issue
In this environment vacuum is very slow; the problem can be reproduced with pgbench -i -s 1000.
The vacuum process sits in D state, with the following kernel stack:
[<ffffffffc0174132>] cv_wait_common+0xb2/0x150 [spl] [<ffffffffc0174208>] __cv_wait_io+0x18/0x20 [spl] [<ffffffffc073c42b>] zio_wait+0x10b/0x1b0 [zfs] [<ffffffffc0687124>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs] [<ffffffffc06885f2>] dmu_read_uio_dnode+0x52/0x100 [zfs] [<ffffffffc06886ec>] dmu_read_uio_dbuf+0x4c/0x70 [zfs] [<ffffffffc07242d5>] zfs_read+0x135/0x3f0 [zfs] [<ffffffffc0743990>] zpl_read_common_iovec.constprop.9+0x80/0xd0 [zfs] [<ffffffffc0743aa6>] zpl_aio_read+0xc6/0xf0 [zfs] [<ffffffff812001ad>] do_sync_read+0x8d/0xd0 [<ffffffff81200bac>] vfs_read+0x9c/0x170 [<ffffffff81201a6f>] SyS_read+0x7f/0xe0 [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff
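For reference, a stack like the one above can be captured as follows (the PID is a placeholder; pick any process stuck in D state):
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'    # find processes in uninterruptible sleep
cat /proc/12345/stack                             # dump the kernel stack of one of them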
Block device deployment strategy 2 - lvm2, ext4
1. Stop the database and destroy zfs
pg_ctl stop -m immediate
zfs destroy zp1/data01
zpool destroy zp1
2. Wipe block device headers and repartition
wipefs -f -a /dev/vd[b-q] dd bs=1024 count=1000 if=/dev/zero of=/dev/vdb dd bs=1024 count=1000 if=/dev/zero of=/dev/vdc dd bs=1024 count=1000 if=/dev/zero of=/dev/vdd dd bs=1024 count=1000 if=/dev/zero of=/dev/vde dd bs=1024 count=1000 if=/dev/zero of=/dev/vdf dd bs=1024 count=1000 if=/dev/zero of=/dev/vdg dd bs=1024 count=1000 if=/dev/zero of=/dev/vdh dd bs=1024 count=1000 if=/dev/zero of=/dev/vdi dd bs=1024 count=1000 if=/dev/zero of=/dev/vdj dd bs=1024 count=1000 if=/dev/zero of=/dev/vdk dd bs=1024 count=1000 if=/dev/zero of=/dev/vdl dd bs=1024 count=1000 if=/dev/zero of=/dev/vdm dd bs=1024 count=1000 if=/dev/zero of=/dev/vdn dd bs=1024 count=1000 if=/dev/zero of=/dev/vdo dd bs=1024 count=1000 if=/dev/zero of=/dev/vdp dd bs=1024 count=1000 if=/dev/zero of=/dev/vdq parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdk mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdl mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdm mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdn mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdo mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdp mklabel gpt mkpart primary 1MiB 100%FREE parted -a optimal -s /dev/vdq mklabel gpt mkpart primary 1MiB 100%FREE wipefs -f -a /dev/vd[b-q]1
3. Create PVs
pvcreate /dev/vd[b-q]1
4. Create the VG
vgcreate -A y -s 128M vgdata01 /dev/vd[b-q]1
5. Create logical volumes and configure stripes
lvcreate -A y -i 16 -I 8 -L 4TiB -n lv03 vgdata01
lvcreate -A y -i 16 -I 8 -L 220TiB -n lv01 vgdata01
lvcreate -A y -i 16 -I 8 -l 100%FREE -n lv02 vgdata01
6. View
[root@pg11-320tb-zfs ~]# pvs PV VG Fmt Attr PSize PFree /dev/vdb vgdata01 lvm2 a-- <20.00t 0 /dev/vdc vgdata01 lvm2 a-- <20.00t 0 /dev/vdd vgdata01 lvm2 a-- <20.00t 0 /dev/vde vgdata01 lvm2 a-- <20.00t 0 /dev/vdf vgdata01 lvm2 a-- <20.00t 0 /dev/vdg vgdata01 lvm2 a-- <20.00t 0 /dev/vdh vgdata01 lvm2 a-- <20.00t 0 /dev/vdi vgdata01 lvm2 a-- <20.00t 0 /dev/vdj vgdata01 lvm2 a-- <20.00t 0 /dev/vdk vgdata01 lvm2 a-- <20.00t 0 /dev/vdl vgdata01 lvm2 a-- <20.00t 0 /dev/vdm vgdata01 lvm2 a-- <20.00t 0 /dev/vdn vgdata01 lvm2 a-- <20.00t 0 /dev/vdo vgdata01 lvm2 a-- <20.00t 0 /dev/vdp vgdata01 lvm2 a-- <20.00t 0 /dev/vdq vgdata01 lvm2 a-- <20.00t 0 [root@pg11-320tb-zfs ~]# vgs VG #PV #LV #SN Attr VSize VFree vgdata01 16 3 0 wz--n- <320.00t 0 [root@pg11-320tb-zfs ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert lv01 vgdata01 -wi-a----- 220.00t lv02 vgdata01 -wi-a----- <96.00t lv03 vgdata01 -wi-a----- 4.00t
7. Create ext4 file system and configure stripes
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01
mkfs.ext4 /dev/mapper/vgdata01-lv02 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv02
mkfs.ext4 /dev/mapper/vgdata01-lv03 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv03
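For reference, the stride/stripe_width values follow from the LVM striping above (16 disks, -I 8, i.e. 8 KiB written to each disk per stripe) and the 4 KiB ext4 block size:
# stride       = per-disk stripe size / ext4 block size = 8 KiB / 4 KiB = 2
# stripe_width = stride * number of striped disks       = 2 * 16        = 32
echo $(( 8 / 4 ))  $(( 8 / 4 * 16 ))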
8. Configure mounts
vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
LABEL=lv02 /data02 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
LABEL=lv03 /data03 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
mkdir /data01
mkdir /data02
mkdir /data03
mount -a
[root@pg11-320tb-zfs ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 197G 2.1G 185G 2% / devtmpfs 252G 0 252G 0% /dev tmpfs 252G 0 252G 0% /dev/shm tmpfs 252G 596K 252G 1% /run tmpfs 252G 0 252G 0% /sys/fs/cgroup tmpfs 51G 0 51G 0% /run/user/0 /dev/mapper/vgdata01-lv01 220T 20K 220T 1% /data01 /dev/mapper/vgdata01-lv02 96T 20K 96T 1% /data02 /dev/mapper/vgdata01-lv03 4.0T 89M 4.0T 1% /data03
9. Create the database data, tablespace, and WAL directories
[root@pg11-320tb-zfs ~]# mkdir /data01/pg11 [root@pg11-320tb-zfs ~]# mkdir /data02/pg11 [root@pg11-320tb-zfs ~]# mkdir /data03/pg11 [root@pg11-320tb-zfs ~]# chown postgres:postgres /data0*/pg11
10. Configuring environment variables
su - postgres export PS1="$USER@`/bin/hostname -s`-> " export PGPORT=1921 export PGDATA=/data01/pg11/pg_root$PGPORT export LANG=en_US.utf8 export PGHOME=/usr/pgsql-11 export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH export DATE=`date +"%Y%m%d%H%M"` export PATH=$PGHOME/bin:$PATH:. export MANPATH=$PGHOME/share/man:$MANPATH export PGHOST=$PGDATA export PGUSER=postgres export PGDATABASE=postgres alias rm='rm -i' alias ll='ls -lh' unalias vi
11. Initialize the database
initdb -D $PGDATA -X /data03/pg11/pg_wal1921 -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024
12. Create tablespaces
mkdir /data01/pg11/tbs1
mkdir /data02/pg11/tbs2
create tablespace tbs1 location '/data01/pg11/tbs1';
create tablespace tbs2 location '/data02/pg11/tbs2';
1 trillion tpcb test
Initialize data
nohup pgbench -i -s 10000000 -I dtg -n --tablespace=tbs1 >./init.log 2>&1 &
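For reference, the -I (init steps) letters used in this article have the following meaning in pgbench 11 (only the steps actually used here are listed):
# d = drop any existing pgbench tables
# t = create the pgbench tables
# g = generate and load the data
# p = create the primary keys (run as a separate step later, after the load)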
It takes 810,688 seconds, about 1.233 million rows/s.
999999100000 of 1000000000000 tuples (99%) done (elapsed 810688.06 s, remaining 0.73 s) 999999200000 of 1000000000000 tuples (99%) done (elapsed 810688.14 s, remaining 0.65 s) 999999300000 of 1000000000000 tuples (99%) done (elapsed 810688.22 s, remaining 0.57 s) 999999400000 of 1000000000000 tuples (99%) done (elapsed 810688.30 s, remaining 0.49 s) 999999500000 of 1000000000000 tuples (99%) done (elapsed 810688.38 s, remaining 0.41 s) 999999600000 of 1000000000000 tuples (99%) done (elapsed 810688.46 s, remaining 0.32 s) 999999700000 of 1000000000000 tuples (99%) done (elapsed 810688.54 s, remaining 0.24 s) 999999800000 of 1000000000000 tuples (99%) done (elapsed 810688.61 s, remaining 0.16 s) 999999900000 of 1000000000000 tuples (99%) done (elapsed 810688.69 s, remaining 0.08 s) 1000000000000 of 1000000000000 tuples (100%) done (elapsed 810688.77 s, remaining 0.00 s) done.
Data size: 120 TB
postgres=# \l+ List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+-----------+---------+------------+-----------------------+--------+------------+-------------------------------------------- postgres | postgres | SQL_ASCII | C | en_US.utf8 | | 120 TB | pg_default | default administrative connection database template0 | postgres | SQL_ASCII | C | en_US.utf8 | =c/postgres +| 15 MB | pg_default | unmodifiable empty database | | | | | postgres=CTc/postgres | | | template1 | postgres | SQL_ASCII | C | en_US.utf8 | =c/postgres +| 15 MB | pg_default | default template for new databases | | | | | postgres=CTc/postgres | | | (3 rows) postgres=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+------------------+-------+----------+---------+------------- public | pgbench_accounts | table | postgres | 120 TB | public | pgbench_branches | table | postgres | 344 MB | public | pgbench_history | table | postgres | 0 bytes | public | pgbench_tellers | table | postgres | 4201 MB | public | t | table | postgres | 804 MB | (5 rows) postgres=# \db+ List of tablespaces Name | Owner | Location | Access privileges | Options | Size | Description ------------+----------+---------------------+-------------------+---------+---------+------------- pg_default | postgres | | | | 850 MB | pg_global | postgres | | | | 2206 kB | tbs1 | postgres | /data01/digoal/tbs1 | | | 120 TB | tbs2 | postgres | /data02/digoal/tbs2 | | | 213 MB | (4 rows)
Add/drop column test (completes in seconds)
Adding and dropping columns only changes metadata, so it completes in seconds.
In PostgreSQL 11, adding a column with a non-null default value is also done in seconds.
postgres=# \timing Timing is on. postgres=# select * from pgbench_accounts limit 10; aid | bid | abalance | filler -----+-----+----------+-------------------------------------------------------------------------------------- 1 | 1 | 0 | 2 | 1 | 0 | 3 | 1 | 0 | 4 | 1 | 0 | 5 | 1 | 0 | 6 | 1 | 0 | 7 | 1 | 0 | 8 | 1 | 0 | 9 | 1 | 0 | 10 | 1 | 0 | (10 rows) Time: 498.051 ms -- Adding fields in seconds postgres=# alter table pgbench_accounts add column col1 text; ALTER TABLE Time: 1254.611 ms (00:01.255) -- Adding non-null default value fields at second level postgres=# alter table pgbench_accounts add column col2 text default 'hello digoal'; ALTER TABLE Time: 1253.689 ms (00:01.254) postgres=# select * from pgbench_accounts limit 10; aid | bid | abalance | filler | col1 | col2 -----+-----+----------+--------------------------------------------------------------------------------------+------+-------------- 1 | 1 | 0 | | | hello digoal 2 | 1 | 0 | | | hello digoal 3 | 1 | 0 | | | hello digoal 4 | 1 | 0 | | | hello digoal 5 | 1 | 0 | | | hello digoal 6 | 1 | 0 | | | hello digoal 7 | 1 | 0 | | | hello digoal 8 | 1 | 0 | | | hello digoal 9 | 1 | 0 | | | hello digoal 10 | 1 | 0 | | | hello digoal (10 rows) Time: 502.608 ms postgres=# explain analyze select * from pgbench_accounts limit 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..0.00 rows=10 width=168) (actual time=0.022..0.026 rows=10 loops=1) -> Seq Scan on pgbench_accounts (cost=0.00..184941625.46 rows=673387096855 width=168) (actual time=0.020..0.022 rows=10 loops=1) Planning Time: 0.057 ms Execution Time: 0.041 ms (4 rows) Time: 0.453 ms -- Analysis table postgres=# analyze pgbench_accounts ; ANALYZE Time: 67373.884 ms (01:07.374) //Delete fields in milliseconds postgres=# alter table pgbench_accounts drop column col1; ALTER TABLE Time: 7.610 ms postgres=# alter table pgbench_accounts drop column col2; ALTER TABLE Time: 0.546 ms
Create index
After loading initialization data, create an index
1. Modify parallelism
psql
analyze;
alter table pgbench_accounts set (parallel_workers=64);
alter table pgbench_tellers set (parallel_workers=64);
alter table pgbench_branches set (parallel_workers=64);
2. Create the index
nohup pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2 >./init_pkey.log 2>&1 &
64 parallel workers were launched; resource consumption at the start:
avg-cpu: %user %nice %system %iowait %steal %idle 3.20 0.00 38.52 16.31 0.00 41.97 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 vdb 2783.00 6362.00 3972.00 169.00 49328.00 52360.00 49.11 5.35 1.29 0.93 9.91 0.24 98.90 vdc 2760.00 6361.00 4020.00 167.00 49296.00 51764.00 48.27 5.19 1.23 0.88 9.59 0.24 98.70 vdd 2747.00 6361.00 4010.00 169.00 49332.00 51860.00 48.43 5.18 1.24 0.88 9.64 0.24 98.80 vde 2757.00 6362.00 4032.00 169.00 49344.00 51864.00 48.18 5.13 1.22 0.87 9.54 0.23 98.50 vdf 2732.00 6360.00 4012.00 191.00 49336.00 52532.00 48.47 5.21 1.24 0.85 9.50 0.24 99.10 vdg 2716.00 6361.00 4039.00 191.00 49320.00 52036.00 47.92 5.28 1.25 0.86 9.35 0.23 99.20 vdh 2742.00 6361.00 4038.00 191.00 49340.00 52032.00 47.94 5.46 1.29 0.91 9.26 0.23 98.80 vdi 2749.00 6361.00 4041.00 193.00 49328.00 52544.00 48.12 5.35 1.26 0.88 9.40 0.23 99.20 vdj 2776.00 6385.00 3953.00 169.00 49344.00 52020.00 49.18 5.41 1.31 0.93 10.12 0.24 99.10 vdk 2767.00 6384.00 3999.00 171.00 49328.00 52028.00 48.61 5.52 1.32 0.96 9.76 0.24 99.10 vdl 2775.00 6386.00 3984.00 172.00 49328.00 52032.00 48.78 5.56 1.34 0.97 9.94 0.24 99.10 vdm 2759.00 6385.00 4039.00 172.00 49296.00 52416.00 48.31 5.58 1.32 0.95 9.87 0.23 98.90 vdn 2776.00 6369.00 3967.00 163.00 49352.00 51840.00 49.00 5.48 1.32 0.95 10.33 0.24 99.70 vdo 2776.00 6370.00 3978.00 163.00 49352.00 52220.00 49.06 5.42 1.31 0.93 10.56 0.24 99.30 vdp 2782.00 6370.00 4004.00 162.00 49356.00 51840.00 48.58 5.60 1.34 0.97 10.44 0.24 99.70 vdq 2759.00 6370.00 4033.00 161.00 49352.00 51828.00 48.25 5.61 1.34 0.97 10.48 0.24 99.50 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 108392.00 104717.00 790740.00 837136.00 15.28 1293.57 5.82 0.94 10.87 0.00 100.90 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 top - 07:41:20 up 12 days, 22:02, 2 users, load average: 61.15, 26.97, 10.80 Tasks: 607 total, 39 running, 568 sleeping, 0 stopped, 0 zombie %Cpu(s): 3.4 us, 54.3 sy, 0.0 ni, 21.2 id, 21.1 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 52819500+total, 2192660 free, 33286761+used, 19313472+buff/cache KiB Swap: 0 total, 0 free, 0 used. 
19267680+avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 65466 digoal 20 0 0.296t 25764 2044 R 68.4 0.0 1:11.05 postgres: parallel worker for PID 65299 65450 digoal 20 0 0.296t 25756 2076 D 63.2 0.0 1:10.51 postgres: parallel worker for PID 65299 65460 digoal 20 0 0.296t 25764 2048 R 63.2 0.0 1:10.37 postgres: parallel worker for PID 65299 65469 digoal 20 0 0.296t 25752 2056 R 63.2 0.0 1:10.48 postgres: parallel worker for PID 65299 65474 digoal 20 0 0.296t 25764 2052 R 63.2 0.0 1:10.36 postgres: parallel worker for PID 65299 65478 digoal 20 0 0.296t 25764 2060 R 63.2 0.0 1:10.64 postgres: parallel worker for PID 65299 65479 digoal 20 0 0.296t 25752 2056 R 63.2 0.0 1:10.47 postgres: parallel worker for PID 65299 65484 digoal 20 0 0.296t 25760 2056 R 63.2 0.0 1:10.63 postgres: parallel worker for PID 65299 65485 digoal 20 0 0.296t 25748 2068 R 63.2 0.0 1:11.10 postgres: parallel worker for PID 65299 Total DISK READ : 834.93 M/s | Total DISK WRITE : 1006.90 M/s Actual DISK READ: 835.23 M/s | Actual DISK WRITE: 994.87 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 65461 be/4 digoal 13.90 M/s 12.74 M/s 0.00 % 51.43 % postgres: parallel worker for PID 65299 65457 be/4 digoal 13.86 M/s 12.95 M/s 0.00 % 51.25 % postgres: parallel worker for PID 65299 65505 be/4 digoal 14.82 M/s 13.05 M/s 0.00 % 51.09 % postgres: parallel worker for PID 65299 65466 be/4 digoal 12.57 M/s 13.48 M/s 0.00 % 50.72 % postgres: parallel worker for PID 65299 65462 be/4 digoal 13.16 M/s 13.23 M/s 0.00 % 50.70 % postgres: parallel worker for PID 65299 65491 be/4 digoal 8.85 M/s 12.99 M/s 0.00 % 50.59 % postgres: parallel worker for PID 65299 65451 be/4 digoal 12.33 M/s 13.48 M/s 0.00 % 50.57 % postgres: parallel worker for PID 65299 65477 be/4 digoal 12.37 M/s 13.20 M/s 0.00 % 50.38 % postgres: parallel worker for PID 65299 65459 be/4 digoal 8.45 M/s 19.33 M/s 0.00 % 50.27 % postgres: parallel worker for PID 65299 65488 be/4 digoal 12.34 M/s 12.74 M/s 0.00 % 50.21 % postgres: parallel worker for PID 65299 65495 be/4 digoal 13.83 M/s 13.26 M/s 0.00 % 50.19 % postgres: parallel worker for PID 65299 65450 be/4 digoal 9.20 M/s 19.45 M/s 0.00 % 50.14 % postgres: parallel worker for PID 65299 65503 be/4 digoal 14.02 M/s 19.66 M/s 0.00 % 50.13 % postgres: parallel worker for PID 65299
Wait events
postgres=# select wait_event,wait_event_type,count(*) from pg_stat_activity where wait_event is not null group by 1,2; wait_event | wait_event_type | count ---------------------+-----------------+------- BufFileRead | IO | 59 BufFileWrite | IO | 1 CheckpointerMain | Activity | 1 BgWriterHibernate | Activity | 1 AutoVacuumMain | Activity | 1 LogicalLauncherMain | Activity | 1 WalWriterMain | Activity | 1 (7 rows) -[ RECORD 7 ]----+--------------------------------------------------------------------------------- datid | 13220 datname | postgres pid | 65448 usesysid | 10 usename | postgres application_name | pgbench client_addr | client_hostname | client_port | backend_start | 2018-10-02 07:38:46.003833+08 xact_start | 2018-10-02 07:38:46.003114+08 query_start | 2018-10-02 07:38:46.003114+08 state_change | 2018-10-02 07:38:46.00439+08 wait_event_type | IO wait_event | BufFileRead state | active backend_xid | backend_xmin | 598 query | alter table pgbench_accounts add primary key (aid) using index tablespace "tbs2" backend_type | parallel worker
Temporary space usage is about 19 TB. After the workers finish their sorts, the index merge begins.
digoal@pg11-320tb-zfs-> cd $PGDATA/base digoal@pg11-320tb-zfs-> du -sh * 16M 1 16M 13219 16M 13220 19T pgsql_tmp
After the parallel workers complete their tasks, the leader merges the sorted runs into the index; merge throughput:
Total DISK READ : 116.21 M/s | Total DISK WRITE : 169.91 M/s Actual DISK READ: 116.21 M/s | Actual DISK WRITE: 197.28 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 65299 be/4 digoal 116.21 M/s 169.91 M/s 0.00 % 8.13 % postgres: postgres postgres [local] ALTER TABLE 65298 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2 51030 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres 51032 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: logger 51034 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: checkpointer 51035 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: background writer 51036 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: walwriter 51037 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: autovacuum launcher 51038 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: stats collector 51039 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: logical replication launcher
Index size: 20 TB
postgres=# \di+ List of relations Schema | Name | Type | Owner | Table | Size | Description --------+-----------------------+-------+----------+------------------+---------+------------- public | pgbench_accounts_pkey | index | postgres | pgbench_accounts | 20 TB | public | pgbench_branches_pkey | index | postgres | pgbench_branches | 213 MB | public | pgbench_tellers_pkey | index | postgres | pgbench_tellers | 2125 MB | (3 rows)
Index creation time
7130 minutes (about 427,800 seconds)
Start: 2018-10-02 07:51:00, end: 2018-10-07 06:41:08
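For reference, the elapsed time can be recomputed from those timestamps:
start=$(date -d '2018-10-02 07:51:00' +%s)
end=$(date -d '2018-10-07 06:41:08' +%s)
echo "$(( end - start )) seconds, $(( (end - start) / 60 )) minutes"   # 427808 seconds, about 7130 minutes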
Benchmark scripts
Read-only
vi ro.sql
\set aid random_gaussian(1, :range, 10.0)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000000
Read and write
vi rw.sql
\set aid random_gaussian(1, :range, 10.0)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000000
1. 1 trillion tpcb read-only test - 1 trillion rows active
TPS: 21492
QPS: 21492
transaction type: ./ro.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 7737610 latency average = 2.977 ms latency stddev = 35.840 ms tps = 21492.371917 (including connections establishing) tps = 21495.359217 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set aid random_gaussian(1, :range, 10.0) 2.975 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
1. 1 trillion tpcb read-write test - 1 trillion rows active
TPS: 16143
QPS: 80715
transaction type: ./rw.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 5812634 latency average = 3.963 ms latency stddev = 48.480 ms tps = 16143.312370 (including connections establishing) tps = 16145.557184 (excluding connections establishing) statement latencies in milliseconds: 0.003 \set aid random_gaussian(1, :range, 10.0) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.025 BEGIN; 3.511 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.063 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.155 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.119 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.047 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.041 END;
2. 1 trillion tpcb read-only test - 100 billion rows active
TPS: 23101
QPS: 23101
transaction type: ./ro.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 8317050 latency average = 2.770 ms latency stddev = 34.609 ms tps = 23101.921465 (including connections establishing) tps = 23105.640572 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set aid random_gaussian(1, :range, 10.0) 2.766 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
2. 1 trillion tpcb read-write test - 100 billion rows active
TPS: 18126
QPS: 90630
transaction type: ./rw.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 6526147 latency average = 3.529 ms latency stddev = 50.638 ms tps = 18126.367839 (including connections establishing) tps = 18134.592726 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set aid random_gaussian(1, :range, 10.0) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.025 BEGIN; 3.102 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.061 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.159 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.091 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.046 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.041 END;
3. 1 trillion tpcb read-only test - 10 billion rows active
TPS: 590692
QPS: 590692
transaction type: ./ro.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 212661629 latency average = 0.108 ms latency stddev = 12.493 ms tps = 590692.703049 (including connections establishing) tps = 590774.219034 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set aid random_gaussian(1, :range, 10.0) 0.107 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
3. 1 trillion tpcb read-write test - 10 billion rows active
TPS: 75002
QPS: 375010
transaction type: ./rw.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 27002477 latency average = 0.853 ms latency stddev = 39.222 ms tps = 75002.036277 (including connections establishing) tps = 75012.139249 (excluding connections establishing) statement latencies in milliseconds: 0.003 \set aid random_gaussian(1, :range, 10.0) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.001 \set delta random(-5000, 5000) 0.044 BEGIN; 0.211 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.076 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.198 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.115 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.072 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.134 END;
4. 1 trillion tpcb read-only test - 1 billion rows active
TPS: 1039962
QPS: 1039962
transaction type: ./ro.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 374399291 latency average = 0.061 ms latency stddev = 5.647 ms tps = 1039962.270864 (including connections establishing) tps = 1040949.958600 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set aid random_gaussian(1, :range, 10.0) 0.061 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
4. 1 trillion tpcb read-write test - 1 billion rows active
TPS: 82148
QPS: 410740
transaction type: ./rw.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 29574604 latency average = 0.779 ms latency stddev = 16.197 ms tps = 82148.432097 (including connections establishing) tps = 82160.286498 (excluding connections establishing) statement latencies in milliseconds: 0.003 \set aid random_gaussian(1, :range, 10.0) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.001 \set delta random(-5000, 5000) 0.043 BEGIN; 0.144 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.074 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.207 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.106 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.070 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.131 END;
5. 1 trillion tpcb read-only test - 100 million rows active
TPS: 1068052
QPS: 1068052
transaction type: ./ro.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 384510720 latency average = 0.060 ms latency stddev = 4.332 ms tps = 1068052.373377 (including connections establishing) tps = 1068206.696327 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set aid random_gaussian(1, :range, 10.0) 0.059 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
5. 1 trillion tpcb read-write test - 100 million rows active
TPS: 78647
QPS: 393235
transaction type: ./rw.sql scaling factor: 1 query mode: prepared number of clients: 64 number of threads: 64 duration: 360 s number of transactions actually processed: 28314309 latency average = 0.814 ms latency stddev = 16.406 ms tps = 78647.191352 (including connections establishing) tps = 78658.751759 (excluding connections establishing) statement latencies in milliseconds: 0.003 \set aid random_gaussian(1, :range, 10.0) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.001 \set delta random(-5000, 5000) 0.043 BEGIN; 0.184 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.076 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.217 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.096 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.069 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.125 END;
Performance summary
Environment: Aliyun ECS + 320T ESSD
Table size: 120 TB; load time 810,688 seconds, about 1.233 million rows/s
Index size: 20 TB; creation time about 427,800 seconds
Index depth: level 4 (BLOCK_SIZE=32K; each page holds more index entries than an 8K page, so the 1-trillion-row index has fewer levels than the earlier 100-billion-row test that used 8K blocks). A verification sketch using pageinspect appears at the end of this summary.
Single Table Data Volume | TEST CASE | QPS | TPS |
---|---|---|---|
1 trillion | tpcb active data 100 million read-only | 1068052 | 1068052 |
1 trillion | tpcb active data 1 billion read-only | 1039962 | 1039962 |
1 trillion | tpcb active data 10 billion read-only | 590692 | 590692 |
1 trillion | tpcb active data 100 billion read-only | 23101 | 23101 |
1 trillion | tpcb active data 1 trillion read-only | 21492 | 21492 |
1 trillion | tpcb active data 100 million read and write | 393235 | 78647 |
1 trillion | tpcb active data 1 billion read and write | 410740 | 82148 |
1 trillion | tpcb active data 10 billion read and write | 375010 | 75002 |
1 trillion | tpcb active data 100 billion read and write | 90630 | 18126 |
1 trillion | tpcb active data 1 trillion read and write | 80715 | 16143 |
Adding a column (including one with a default value) takes about 1.25 seconds.
Dropping a column takes a few milliseconds.
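The index depth quoted above can be checked with the pageinspect extension; a minimal sketch, assuming pageinspect is available in this installation:
create extension if not exists pageinspect;
select level from bt_metap('pgbench_accounts_pkey');  -- level = btree height above the leaf level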
Appendix - partitioning pgbench_accounts, loading test data in parallel, dynamic SQL
What are the problems with a single 1-trillion-row table?
1. At 125 TB, index creation takes a long time. PostgreSQL 11 introduces parallel index creation, which addresses this.
2. At 125 TB, garbage collection takes a long time. The zheap storage engine planned for PostgreSQL 12 would eliminate this problem.
3. At 125 TB, FREEZE is time-consuming and may not even finish within the 2-billion-transaction window. A future PostgreSQL version with XIDs wider than 32 bits will solve this completely.
4. A 125 TB table must live in a single directory, which may hit file system limits (inodes, capacity, etc.).
5. A 125 TB table is inconvenient for data purging. If data ages along a time dimension, a partitioned table (for example with pg_pathman) manages hot and cold data much better.
Converting pgbench to partitioned tables:
1. pg_pathman is recommended because its performance overhead is low; the built-in partitioning still has performance problems.
PostgreSQL 9.x, 10, 11 hash partition table usage example
PostgreSQL 10 built-in partition vs pg_pathman perf profiling
Partition Table Lock Granularity Difference - pg_pathman VS native partition table
When using built-in partitioning, dynamic SQL is recommended to avoid the BIND (generic plan) problem; a minimal sketch follows.
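A minimal sketch of the dynamic SQL approach (the function name get_abalance is illustrative, not from the original): the query text is built with format() and run with EXECUTE, so each call is planned with the literal aid value and the target partition can be pruned at plan time instead of going through a generic bound plan.
create or replace function get_abalance(p_aid int8) returns int as $$
declare
  v_abalance int;
begin
  -- build the statement text per call; no prepared/generic plan is cached
  execute format('select abalance from pgbench_accounts where aid = %s', p_aid) into v_abalance;
  return v_abalance;
end;
$$ language plpgsql strict;

select get_abalance(1);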
Partition demo
PostgreSQL pgbench tpcb data generation and SQL partial source code interpretation
"PostgreSQL pgbench tpcb massive database testing-partition table testing optimization"
Loading data
1. Tables
pgbench -i -I dt --tablespace=tbs1 -s 10000000
2. Partitioning
create table p (like pgbench_accounts) partition by RANGE ( aid ) tablespace tbs1;

do language plpgsql $$
declare
  i_rows_perpartition int8 := 244140625;
begin
  for i in 0..4096 loop
    execute format ('create table pgbench_accounts%s partition of p for values from (%s) to (%s) tablespace tbs1', i, i*i_rows_perpartition, (i+1)*i_rows_perpartition);
  end loop;
end;
$$;

drop table pgbench_accounts;
alter table p rename to pgbench_accounts;

-- alter table pgbench_accounts add constraint pk_pgbench_accounts_aid primary key (aid) using index tablespace tbs2;
3. Loading task table
drop table task;
create table task(id int primary key);
insert into task select i from generate_series(0,4095) t(i);
4. Initialization records
create table init_accounts(aid int8);
insert into init_accounts select generate_series(0,244140624);
5. Parallel loading UDF
create or replace function tpcb_init_accounts() returns void as $$
declare
  v_id int;
begin
  with tmp as (select * from task limit 1 for update skip locked),
       tmp1 as (delete from task using tmp where task.id=tmp.id)
  select id into v_id from tmp;
  if found then
    execute format ('insert into pgbench_accounts%s select aid+%s*244140625::int8, ((aid+%s*244140625::int8)-1)/100000 + 1, 0 from init_accounts on conflict do nothing', v_id, v_id, v_id);
  end if;
end;
$$ language plpgsql strict;
6. Parallel loading of data
vi test.sql
select tpcb_init_accounts();

nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &
Initialize indexes
1. Task table
drop table task;
create table task(id int primary key);
insert into task select i from generate_series(0,4095) t(i);
2. Parallel index creation UDF
create or replace function tpcb_init_accounts_pkey() returns void as $$
declare
  v_id int;
begin
  with tmp as (select * from task limit 1 for update skip locked),
       tmp1 as (delete from task using tmp where task.id=tmp.id)
  select id into v_id from tmp;
  if found then
    execute format ('analyze pgbench_accounts%s', v_id);
    execute format ('alter table pgbench_accounts%s add constraint pk_pgbench_accounts%s_aid primary key (aid) using index tablespace tbs2', v_id, v_id);
  end if;
end;
$$ language plpgsql strict;
3. Create indexes in parallel
vi test.sql
select tpcb_init_accounts_pkey();

nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &
Summary
1. With the default 8K block size, a single table is limited to 32 TB (the block number in a ctid is a 32-bit address, so 2^32 blocks x 8K = 32 TB). The 1-trillion-row table tested here exceeds 32 TB, so a larger BLOCK SIZE must be chosen; with 32K blocks a single table can reach 128 TB.
Specify the block size at compile time: ./configure --with-blocksize=32
2. How can such a large database be backed up efficiently and restored at a point in time?
Full backup: 1. ZFS snapshot, then send the snapshot to a backup machine (a 10 Gb network can saturate the NIC bandwidth). 2. Full backup with pg_basebackup. 3. Full backup with pg_rman. 4. Full backup with cloud disk snapshots.
Incremental backup: 1. ZFS snapshot, sending only the snapshot increment to the backup machine (a send/receive sketch appears at the end of this summary). 2. pg_basebackup can only take full backups. 3. Incremental backup with pg_rman (blocks modified since the last backup are identified by their LSN). 4. Incremental backup with cloud disk snapshots.
WAL archiving: back up the archived WAL files.
Point-in-time recovery: 1. ZFS snapshot clone + WAL archive replay to the target time. 2. Full restore + archive replay to the target time. 3. Full + incremental restore + archive replay to the target time.
3. In this tpcb test with 64 concurrent clients, BIND consumed noticeably more time during the first ten or so seconds.
4. Partitioning large tables with pg_pathman is recommended. How large does a table need to be before it should be partitioned? See:
PostgreSQL 9.5 + Efficient Partition Table Implementation - pg_pathman
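A minimal ZFS full + incremental send/receive sketch for backup option 1 above (the backup host name, target dataset, and snapshot names are illustrative):
# full backup: snapshot, then stream the whole dataset to the backup machine
zfs snapshot zp1/data01@full_20181002
zfs send zp1/data01@full_20181002 | ssh backuphost zfs recv backup/data01
# incremental backup: stream only the blocks changed since the previous snapshot
zfs snapshot zp1/data01@incr_20181003
zfs send -i @full_20181002 zp1/data01@incr_20181003 | ssh backuphost zfs recv backup/data01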
Reference resources
PostgreSQL 11 tpcc Test (103,000 tpmC on ECS) - use sysbench-tpcc by Percona-Lab
(TPC-H test SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen
PostgreSQL 10 billion tpcb performance on ECS
PostgreSQL on Aliyun ECS + ESSD - 100 billion tpcb, 1000W tpcc test