Background
In some scenarios you need to load a large amount of data into a PostgreSQL database quickly, for example during a database migration or for SQL log analysis. How many ways are there to insert data quickly into PostgreSQL? How efficient is each scheme? And how can the loading be tuned to go faster?
Scenario
SQL log analysis is a tool that collects JDBC logs, analyzes the SQL, and reports the analysis results. In the analysis phase, a large number of JDBC logs must be parsed and the resulting structured records loaded into the database for later processing. This analysis phase is used as the experimental scenario: each run starts with parsing the JDBC logs (several files) and ends when the structured data is fully loaded (including index creation), so that the data-loading efficiency of the different schemes can be compared.
Environment preparation
- Database environment
Name | Value |
---|---|
Operating system | CentOS 6.5 |
CPU | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 logical cores |
Memory | 316 GB |
Disk | RAID 10, write speed 1 GB/s |
Database version | PostgreSQL 9.5.4 |
Database memory parameters | shared_buffers: 30GB, work_mem: 4MB, maintenance_work_mem: 64MB |
Database CPU parameters | max_worker_processes: 16 |
- Table creation statements

```sql
-- No primary key: C_BH is not used for queries, so the primary key is
-- dropped first to speed up inserts
drop table if exists T_JDBC_SQL_RECORD;
create table T_JDBC_SQL_RECORD (
  C_BH VARCHAR(32),
  C_BH_PARSE VARCHAR(32) NULL,
  C_BH_GROUP VARCHAR(32) NULL,
  C_BH_SQL VARCHAR(32) NULL,
  DT_ZXSJ TIMESTAMP NULL,
  N_RUNTIME INT NULL,
  C_RZLJ VARCHAR(600) NULL,
  N_STARTLINE INT NULL,
  N_ENDLINE INT NULL,
  N_SQLTYPE INT NULL,
  N_SQLCOMPLEX INT NULL,
  C_IP VARCHAR(100) NULL,
  C_PORT VARCHAR(100) NULL,
  C_XTBS VARCHAR(100) NULL,
  N_CHECKSTATUS INT default 0,
  N_SQL_LENGTH INT NULL,
  N_SQL_BYTE INT NULL,
  N_5MIN INT NULL,
  C_METHOD VARCHAR(600) NULL,
  C_PSSQL_HASH VARCHAR(300) NULL,
  N_IS_BATCH INT,
  N_RESULTSET INT
);

drop table if exists T_JDBC_SQL_CONTENT;
CREATE TABLE T_JDBC_SQL_CONTENT (
  C_BH VARCHAR(32) NOT NULL,
  C_PSSQL_HASH VARCHAR(300) NULL,
  C_SQL_TEXT varchar NULL,
  C_PSSQL_TEXT varchar NULL
);
```
- Index creation statements

```sql
create index i_jdbc_sql_record_zh01 on t_jdbc_sql_record(c_bh_group, dt_zxsj, N_CHECKSTATUS, C_PSSQL_HASH);
create index i_jdbc_sql_record_pshash on t_jdbc_sql_record(c_pssql_hash);
create index i_jdbc_sql_content_pshash on t_jdbc_sql_content(c_pssql_hash);
alter table t_jdbc_sql_content add constraint t_jdbc_sql_content_pkey primary key (C_BH);
```
- Asynchronous commit and unlogged tables

```sql
-- Asynchronous commit; reload the configuration after changing it
alter system set synchronous_commit to off;

-- Unlogged tables
create unlogged table t_jdbc_sql_record ...
create unlogged table t_jdbc_sql_content ...
```
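For a table that already exists, PostgreSQL 9.5 and later can also switch WAL logging off and on in place, as an alternative to recreating the table with `create unlogged table`. A sketch, using one of the tables above:

```sql
-- Switch an existing table to unlogged before the bulk load
ALTER TABLE t_jdbc_sql_record SET UNLOGGED;

-- ... perform the bulk load ...

-- Switch back to a normal, WAL-protected table afterwards.
-- Note: SET LOGGED rewrites the whole table and writes it to WAL,
-- so this step itself takes time proportional to the table size.
ALTER TABLE t_jdbc_sql_record SET LOGGED;
```

Bear in mind that an unlogged table is truncated after a crash and is not replicated, so this trick only suits data that can be reloaded.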
- JDBC log volume
19 JDBC log files, 2 GB of logs in total, about 6 million records
Scheme setup
Scheme name | Scheme description |
---|---|
Scheme I | Create the structured tables and their indexes, then load data with multi-threaded single-row inserts |
Scheme II | Create the structured tables and their indexes, then load data with multi-threaded batch inserts |
Scheme III | Create the structured tables and their indexes, set the database to asynchronous commit, then load data with multi-threaded batch inserts |
Scheme IV | Create the structured tables, set the database to asynchronous commit, load data with multi-threaded batch inserts, then create the indexes |
Scheme V | Create the structured tables and their indexes, make the tables unlogged, then load data with multi-threaded batch inserts |
Scheme VI | Create the structured tables as unlogged tables, load data with multi-threaded batch inserts, then create the indexes |
Scheme VII | Create the structured tables, load data with multi-threaded batch inserts, then create the indexes |
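The difference between scheme I and the batch-insert schemes is whether each record gets its own INSERT statement or many records share one multi-row INSERT. A minimal Python sketch of the two statement shapes (the column names match the experiment's table, but the SQL is only constructed here, not executed; a real loader would hand these strings and parameter lists to JDBC or psycopg2):

```python
# Scheme I: one INSERT statement (and one round trip) per record.
def single_insert(row):
    return ("INSERT INTO t_jdbc_sql_record (c_bh, n_runtime) VALUES (%s, %s)",
            list(row))

# Schemes II-VII: one INSERT statement carrying many VALUES tuples,
# amortizing parsing, planning and commit overhead across the batch.
def batch_insert(rows):
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    params = [value for row in rows for value in row]
    return ("INSERT INTO t_jdbc_sql_record (c_bh, n_runtime) "
            f"VALUES {placeholders}", params)

rows = [("bh001", 12), ("bh002", 7), ("bh003", 31)]
sql, params = batch_insert(rows)
print(sql)     # one INSERT with three value tuples
print(params)  # flattened parameter list
```

In the experiment each loader thread works through its share of the parsed records in batches like this, committing per batch rather than per row.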
Experimental results
In each run, the volume of JDBC logs parsed, the parsing code, and the middleware environment were kept unchanged; only the process order and database parameters were adjusted.
Run | Scheme I | Scheme II | Scheme III | Scheme IV | Scheme V | Scheme VI | Scheme VII |
---|---|---|---|---|---|---|---|
Run 1 | 3596s | 2043s | 1164s | 779s | 545s | 528s | 1192s |
Run 2 | 4092s | 2068s | 1283s | 843s | 528s | 528s | 1227s |
Run 3 | 3891s | 2177s | 1378s | 858s | 536s | 537s | 1248s |
Average | 3859s | 2096s | 1275s | 826s | 536s | 531s | 1222s |
Result analysis
- Scheme I vs. Scheme II: database parameters and process order unchanged
  - Scheme I: single-row insert commits, 3859 s
  - Scheme II: batch insert commits, 2096 s
- Scheme II vs. Scheme III vs. Scheme V: same process order (create tables -> create indexes -> multi-threaded batch insert)
  - Scheme II: synchronous commit (waits for the WAL flush), 2096 s
  - Scheme III: asynchronous commit (does not wait for the WAL flush), 1275 s
  - Scheme V: no WAL written at all (unlogged tables), 536 s
- Scheme II vs. Scheme VII: both with synchronous commit
  - Scheme II: indexes created before the data is inserted, 2096 s
  - Scheme VII: indexes created after the data is inserted, 1222 s
- Scheme III vs. Scheme IV: both with asynchronous commit
  - Scheme III: indexes created before the data is inserted, 1275 s
  - Scheme IV: indexes created after the data is inserted, 826 s
- Scheme V vs. Scheme VI: both without WAL logging
  - Scheme V: indexes created before the data is inserted, 536 s
  - Scheme VI: indexes created after the data is inserted, 531 s
Summary
In this scenario:
- Batch inserts reduce load time by about 45% compared with single-row inserts (2096 s vs. 3859 s)
- Asynchronous commit reduces load time by about 40% compared with synchronous commit
- Skipping WAL entirely (unlogged tables) reduces load time by about 75% compared with synchronous commit
- With WAL and synchronous commit, creating the indexes after the load is about 40% faster than creating them first
- With WAL and asynchronous commit, creating the indexes after the load is about 35% faster than creating them first
- Without WAL, creating the indexes after the load is only marginally faster than creating them first
The fastest combination for loading data is:
unlogged tables + multi-threaded batch inserts + creating the indexes after the load
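A sketch of this fastest pipeline (scheme VI), reusing the DDL from the environment section; the batch INSERTs in step 2 are issued by the multi-threaded loader:

```sql
-- 1. Create the target tables as unlogged (no WAL) and without indexes
create unlogged table t_jdbc_sql_record ( ... );
create unlogged table t_jdbc_sql_content ( ... );

-- 2. Load the data: each loader thread commits multi-row batch INSERTs

-- 3. Create the indexes only after the load has finished
create index i_jdbc_sql_record_zh01 on t_jdbc_sql_record(c_bh_group, dt_zxsj, N_CHECKSTATUS, C_PSSQL_HASH);
create index i_jdbc_sql_record_pshash on t_jdbc_sql_record(c_pssql_hash);
create index i_jdbc_sql_content_pshash on t_jdbc_sql_content(c_pssql_hash);
alter table t_jdbc_sql_content add constraint t_jdbc_sql_content_pkey primary key (C_BH);
```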
Conjecture:
During the insert phase, index maintenance accounts for roughly 35% to 40% of the total time, and most of that is spent on WAL persistence for the index.
Other observations:
Some other metrics collected during the experiments still need further analysis; for example, the database's write IO never exceeded 100 MB/s under any of the schemes.