PostgreSQL on Linux Best Deployment Manual

Installation of common packages

# yum -y install coreutils glib2 lrzsz mpstat dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex  openjade bzip2

Configuring OS Kernel Parameters

1. sysctl

Note that some parameters are configured according to memory size (explained)

The meaning is detailed.

"Operating System Kernel Parameters that DBA Can't Know"

# vi /etc/sysctl.conf

# add by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
kernel.core_pattern= /data01/corefiles/core_%e_%u_%t_%s.%p         
# / Data 01/corefiles was built beforehand with 777 privileges. If it is a soft link, the corresponding directory should be changed to 777.
kernel.sem = 4096 2147483647 2147483646 512000    
# The semaphores, ipcs-l or-u view, need 17 semaphores for each group of 16 processes.
kernel.shmall = 107374182      
# Increase or decrease restrictions on all shared memory segments (80% of recommended memory)
kernel.shmmax = 274877906944   
# Maximum single shared memory segment size (recommended half of memory). Version 9.2 has significantly reduced the use of shared memory
kernel.shmmni = 819200         
# How many shared memory segments can be generated altogether, and at least two shared memory segments per PG database cluster
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144       
# The default setting of the socket receive buffer in bytes.
net.core.rmem_max = 4194304          
# The maximum receive socket buffer size in bytes
net.core.wmem_default = 262144       
# The default setting (in bytes) of the socket send buffer.
net.core.wmem_max = 4194304          
# The maximum send socket buffer size in bytes.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1    
# Open SYN Cookies. When SYN waiting queue overflow occurs, cookie s are enabled to prevent a small number of SYN attacks.
net.ipv4.tcp_timestamps = 1    
# Reduce time_wait
net.ipv4.tcp_tw_recycle = 0    
# If = 1, the rapid recovery of TIME-WAIT sockets in TCP connections is turned on, but the NAT environment may cause the connection to fail. It is recommended that the server close it.
net.ipv4.tcp_tw_reuse = 1      
# Open reuse. Allow TIME-WAIT sockets to be reused for new TCP connections
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000       
#  When the dirty pages of the system reach this value, the pdflush (or other) scheduling process automatically brushes the dirty pages to disk (dirty_expire_centisecs/100) seconds ago.
vm.dirty_expire_centisecs = 3000             
#  Dirty pages older than this will be brushed to disk. 3000 means 30 seconds.
vm.dirty_ratio = 95                          
#  If the system process brushes dirty pages too slowly, making the system dirty pages more than 95% of memory, then the user process needs to actively brush out the system dirty pages if it has the operation of writing to disk (such as fsync, fdatasync, etc.).
#  Effectively prevent user processes from brushing dirty pages. It is very effective in the case of single machine with multiple instances and using CGROUP to restrict single instance IOPS.  
vm.dirty_writeback_centisecs = 100            
#  The wake-up interval for a pdflush (or other) backstage dirty page brushing process is 100 for one second.
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0     
#  When allocating memory, a small amount of over malloc is allowed. If set to 1, there is always enough memory, and the test environment with less memory can use 1.  
vm.overcommit_ratio = 90     
#  When overcommit_memory = 2, it is used to participate in calculating the allowable allocated memory size.
vm.swappiness = 0            
#  Close switching partitions
vm.zone_reclaim_mode = 0     
# Disable numa, or in vmlinux. 
net.ipv4.ip_local_port_range = 40000 65535    
# Locally Automated TCP, UDP Port Number Range
fs.nr_open=20480000
# The upper limit of file handle allowed to open by a single process

# Note the following parameters
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 2097152
# If it is a small memory machine, the above two values are not recommended.
# vm.nr_hugepages = 66536    
#  It is recommended to use large page size / proc / meminfo Huge page size when shared buffer settings exceed 64GB
# vm.lowmem_reserve_ratio = 1 1 1
# For memory larger than 64G, it is recommended to set it, otherwise the default value is 256 256 32.

2. Effective configuration

sysctl -p

Configuring OS resource constraints

# vi /etc/security/limits.conf

# If nofile exceeds 1048 576, you must first set fs.nr_open of sysctl to a larger value and then continue to set nofile after it takes effect.

* soft    nofile  1024000
* hard    nofile  1024000
* soft    nproc   unlimited
* hard    nproc   unlimited
* soft    core    unlimited
* hard    core    unlimited
* soft    memlock unlimited
* hard    memlock unlimited

It's better to focus on the file contents in the / etc/security/limits.d directory, which overrides the configurations of limits.conf.

For ulimit of existing processes, see / proc/pid/limits, for example

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            10485760             unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             11286                11286                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       11286                11286                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

If you want to start other processes, it is recommended to exit SHELL and go in again to confirm that the ulimit environment configuration is in effect and restart.

Configuring OS Firewall

(Suggested by business scenario settings, I'll clear it here first)

iptables -F

Configuration example

# Private network segment
-A INPUT -s 192.168.0.0/16 -j ACCEPT
-A INPUT -s 10.0.0.0/8 -j ACCEPT
-A INPUT -s 172.16.0.0/16 -j ACCEPT

selinux

If there is no need for this, it is recommended to disable it.

# vi /etc/sysconfig/selinux 

SELINUX=disabled
SELINUXTYPE=targeted

Turn off unnecessary OS services

chkconfig --list|grep on  
Turn off unnecessary, for example 
chkconfig iscsi off

Deployment of file systems

Pay attention to SSD alignment, prolong life and avoid writing enlargement.

parted -s /dev/sda mklabel gpt
parted -s /dev/sda mkpart primary 1MiB 100%

Formatting (if you choose ext4)

mkfs.ext4 /dev/sda1 -m 0 -O extent,uninit_bg -E lazy_itable_init=1 -T largefile -L u01

The recommended ext4 mount option

# vi /etc/fstab

LABEL=u01 /u01     ext4        defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback    0 0

# mkdir /u01
# mount -a

Why do you need data=writeback?

It is recommended that pg_xlog be placed in a separate IOPS block device with good performance.

Set SSD Disk Scheduling to deadline

If it's not SSD, CFQ is still used, otherwise DEADLINE is recommended.

Temporary settings (such as sda disks)

echo deadline > /sys/block/sda/queue/scheduler

Permanent settings

Editing grub file to modify block device scheduling strategy

vi /boot/grub.conf

elevator=deadline

Note that if there are both mechanical disks and SSD s, you can use / etc/rc.local to modify the specified disk to the corresponding scheduling strategy.

Close transparent pages, numa

Add the previous default IO scheduling, as follows

vi /boot/grub.conf

elevator=deadline numa=off transparent_hugepage=never 

Compiler

It is recommended to use a newer compiler to install gcc 6.2.0 reference

PostgreSQL clang vs gcc Compilation

If installed, it can be distributed to different machines.

cd ~
tar -jxvf gcc6.2.0.tar.bz2
tar -jxvf python2.7.12.tar.bz2


# vi /etc/ld.so.conf

/home/digoal/gcc6.2.0/lib
/home/digoal/gcc6.2.0/lib64
/home/digoal/python2.7.12/lib

# ldconfig

environment variable

# vi ~/env_pg.sh

export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=$1
export PGDATA=/$2/digoal/pg_root$PGPORT
export LANG=en_US.utf8
export PGHOME=/home/digoal/pgsql9.6
export LD_LIBRARY_PATH=/home/digoal/gcc6.2.0/lib:/home/digoal/gcc6.2.0/lib64:/home/digoal/python2.7.12/lib:$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export PATH=/home/digoal/gcc6.2.0/bin:/home/digoal/python2.7.12/bin:/home/digoal/cmake3.6.3/bin:$PGHOME/bin:$PATH:.
export DATE=`date +"%Y%m%d%H%M"`
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

icc, clang

If you want to compile PostgreSQL using ICC or clang, please refer to

Compiling PostgreSQL with intel Compiler icc

PostgreSQL clang vs gcc Compilation

Compile PostgreSQL

Recommended use of NAMED_POSIX_SEMAPHORES

src/backend/port/posix_sema.c

create sem : 
named :
                mySem = sem_open(semname, O_CREAT | O_EXCL,
                                                 (mode_t) IPCProtection, (unsigned) 1);


unamed :
/*
 * PosixSemaphoreCreate
 *
 * Attempt to create a new unnamed semaphore.
 */
static void
PosixSemaphoreCreate(sem_t * sem)
{
        if (sem_init(sem, 1, 1) < 0)
                elog(FATAL, "sem_init failed: %m");
}


remove sem : 

#ifdef USE_NAMED_POSIX_SEMAPHORES
        /* Got to use sem_close for named semaphores */
        if (sem_close(sem) < 0)
                elog(LOG, "sem_close failed: %m");
#else
        /* Got to use sem_destroy for unnamed semaphores */
        if (sem_destroy(sem) < 0)
                elog(LOG, "sem_destroy failed: %m");
#endif

Compile item

. ~/env_pg.sh 1921 u01

cd postgresql-9.6.1
export USE_NAMED_POSIX_SEMAPHORES=1
LIBS=-lpthread CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O3 -flto" ./configure --prefix=/home/digoal/pgsql9.6
LIBS=-lpthread CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O3 -flto" make world -j 64
LIBS=-lpthread CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O3 -flto" make install-world

If you are a development environment and need to debug, it is recommended to compile this way.

cd postgresql-9.6.1
export USE_NAMED_POSIX_SEMAPHORES=1
LIBS=-lpthread CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O0 -flto -g -ggdb -fno-omit-frame-pointer" ./configure --prefix=/home/digoal/pgsql9.6 --enable-cassert
LIBS=-lpthread CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O0 -flto -g -ggdb -fno-omit-frame-pointer" make world -j 64
LIBS=-lpthread CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O0 -flto -g -ggdb -fno-omit-frame-pointer" make install-world

Initialize database cluster

pg_xlog is recommended for the best partition in IOPS.

. ~/env_pg.sh 1921 u01
initdb -D $PGDATA -E UTF8 --locale=C -U postgres -X /u02/digoal/pg_xlog$PGPORT

Configure postgresql.conf

Take PostgreSQL 9.6, 512G memory host as an example

Best to the end of the file can be repeated with the end as a valid value.  
  
$ vi postgresql.conf

listen_addresses = '0.0.0.0'
port = 1921
max_connections = 5000
unix_socket_directories = '.'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 128GB                      # 1/4 Host Memory
maintenance_work_mem = 2GB                  # Min (2G, (1/4 host memory)/autovacuum_max_workers)
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
bgwriter_flush_after = 0                    # IO is a good machine, no need to consider smooth scheduling
max_worker_processes = 128
max_parallel_workers_per_gather = 0         #  If parallel queries are required, set to greater than 1 and do not recommend exceeding host cores-2
old_snapshot_threshold = -1
backend_flush_after = 0  # IO good machine, do not need to consider smooth scheduling, otherwise recommend 128 ~ 256 kB
wal_level = replica
synchronous_commit = off
full_page_writes = on   # Block devices that support atomic writing over BLOCK_SIZE can be turned off after alignment. Or the file system that supports cow can be shut down.
wal_buffers = 1GB       # min( 2047MB, shared_buffers/32 ) = 512MB
wal_writer_delay = 10ms
wal_writer_flush_after = 0  # IO good machine, do not need to consider smooth scheduling, otherwise recommend 128 ~ 256 kB
checkpoint_timeout = 30min  # Frequent checkpoints are not recommended, otherwise XLOG will generate a lot of FULL PAGE WRITE(when full_page_writes=on).
max_wal_size = 256GB       # The recommendation is twice that of SHARED BUFFER
min_wal_size = 64GB        # max_wal_size/4
checkpoint_completion_target = 0.05          # When the hard disk is good, the checkpoint can be quickly terminated and the consistent state can be quickly reached when the hard disk is restored. Otherwise, recommendations are 0.5-0.9
checkpoint_flush_after = 0                   # IO good machine, do not need to consider smooth scheduling, otherwise recommend 128 ~ 256 kB
archive_mode = on
archive_command = '/bin/date'      #  Later revisions, such as'test! - f/disk1/digoal/arch/%f & & cp%p/disk1/digoal/arch/%f'
max_wal_senders = 8
random_page_cost = 1.3  # IO is a good machine without considering the cost differences between discrete and sequential scans
parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_relation_size = 0
effective_cache_size = 300GB                          # Watch and do, deduct the session connection RSS, shared buffer, autovacuum worker, and the rest is the ACHE available to OS.
force_parallel_mode = off
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
vacuum_defer_cleanup_age = 0
hot_standby_feedback = off                             # It is recommended to close down in order to avoid the expansion of the main repository caused by the warehouse manager's transaction.
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 16                            # CPU core, and IO good case, can be more, but note that 16*autovacuum mem, will consume more memory, so memory also has a foundation.  
autovacuum_naptime = 45s                               # It is recommended that the frequency should not be too high, otherwise there will be more XLOG due to vacuum.
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.1
autovacuum_freeze_max_age = 1600000000
autovacuum_multixact_freeze_max_age = 1600000000
vacuum_freeze_table_age = 1500000000
vacuum_multixact_freeze_table_age = 1500000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries='pg_stat_statements'

## If your database has many very small files (such as hundreds of thousands of tables, indexes, etc.), and each table will be accessed, it is recommended that FD set up more files to avoid the need for the process to open and close files.
## But don't be larger than ulimit-n (open files) set by the system in the previous section.
max_files_per_process=655360

Configure pg_hba.conf

Avoid unnecessary access and open access. It is recommended that password access be used.

$ vi pg_hba.conf

host replication xx 0.0.0.0/0 md5  # Stream replication

host all postgres 0.0.0.0/0 reject # Deny Super User Logon from Network
host all all 0.0.0.0/0 md5  # Other users login

Start the database

pg_ctl start

Keywords: PostgreSQL Database socket SELinux

Added by hazel999 on Fri, 17 May 2019 02:29:34 +0300