Research on fast INSERT of large data volumes into a PostgreSQL database

Background

In some application scenarios, a large amount of data must be loaded into a PostgreSQL database quickly, for example during database migration or SQL log analysis. How many ways are there to insert data quickly in PostgreSQL? How efficient is each scheme? And how can the process be tuned to load data even faster?

Scenario setup

SQL log analysis is a tool that collects JDBC logs, analyzes the SQL, and reports the analysis results. In its analysis phase, it parses a large number of JDBC logs and loads the parsed, structured results into the database for subsequent processing. This analysis phase serves as the experimental scenario: each run starts with parsing the JDBC logs (multiple files) and ends when the structured data is fully loaded (including index creation), so the data loading efficiency of the different schemes can be compared.

Environment preparation

  • Database environment

Name | Value
Operating system | CentOS 6.5
CPU | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 logical cores
Memory | 316 GB
Disk | RAID 10, write speed 1 GB/s
Database version | PostgreSQL 9.5.4
Database memory parameters | shared_buffers: 30GB, work_mem: 4MB, maintenance_work_mem: 64MB
Database CPU parameters | max_worker_processes: 16
  • Create table statement
drop table if exists T_JDBC_SQL_RECORD ;
--No primary key: C_BH is not used for queries, so the primary key is dropped up front to speed up inserts
create table T_JDBC_SQL_RECORD (
C_BH VARCHAR(32) ,
C_BH_PARSE VARCHAR(32) NULL,
C_BH_GROUP VARCHAR(32) NULL,
C_BH_SQL VARCHAR(32) NULL,
DT_ZXSJ TIMESTAMP NULL,
N_RUNTIME INT NULL,
C_RZLJ VARCHAR(600) NULL,
N_STARTLINE INT NULL,
N_ENDLINE INT NULL,
N_SQLTYPE INT NULL,
N_SQLCOMPLEX INT NULL,
C_IP VARCHAR(100) NULL,
C_PORT VARCHAR(100) NULL,
C_XTBS VARCHAR(100) NULL,
N_CHECKSTATUS INT  default 0,
N_SQL_LENGTH INT NULL,
N_SQL_BYTE INT NULL,
N_5MIN INT NULL,
C_METHOD VARCHAR(600) NULL,
C_PSSQL_HASH VARCHAR(300) NULL,
N_IS_BATCH INT,
N_RESULTSET INT
);

drop table if exists T_JDBC_SQL_CONTENT ;
CREATE TABLE T_JDBC_SQL_CONTENT (
C_BH VARCHAR(32) NOT NULL,
C_PSSQL_HASH VARCHAR(300) NULL,
C_SQL_TEXT varchar NULL,
C_PSSQL_TEXT varchar NULL
);
  • Index statements
create index i_jdbc_sql_record_zh01 on t_jdbc_sql_record(c_bh_group,dt_zxsj,N_CHECKSTATUS,C_PSSQL_HASH);
create index i_jdbc_sql_record_pshash on t_jdbc_sql_record(c_pssql_hash);
create index i_jdbc_sql_content_pshash on t_jdbc_sql_content(c_pssql_hash);
alter table t_jdbc_sql_content add constraint t_jdbc_sql_content_pkey primary key (C_BH);
  • Asynchronous commit and unlogged tables (a session-level alternative is sketched at the end of this section)
 -- Asynchronous commit; reload the configuration (or restart the database) after the change
 alter system set synchronous_commit to off;
 -- Unlogged tables
 create unlogged table t_jdbc_sql_record 
 ...
 create unlogged table t_jdbc_sql_content 
 ...
  • JDBC log volume
    19 JDBC log files, 2 GB of logs in total, 6 million records
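
As noted above, the ALTER SYSTEM change is instance-wide and needs a configuration reload. synchronous_commit is also a session-level parameter, so asynchronous commit can instead be enabled only for the loading connections. A minimal JDBC sketch of that alternative, assuming a hypothetical local connection (the experiments themselves used the instance-wide setting):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AsyncCommitSession {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; adjust to the target environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/test", "postgres", "postgres");
             Statement st = conn.createStatement()) {
            // Per-session asynchronous commit: commits on this connection
            // return before the WAL is flushed to disk, without touching
            // the instance-wide setting or restarting the server.
            st.execute("SET synchronous_commit TO off");
            // ... run the loading workload on this connection ...
        }
    }
}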

Scheme setup

Scheme name | Scheme description
Scheme I | Create the structured tables and their indexes; load data with multi-threaded single-row INSERTs
Scheme II | Create the structured tables and their indexes; load data with multi-threaded batch INSERTs
Scheme III | Create the structured tables and their indexes; set the database to asynchronous commit; load data with multi-threaded batch INSERTs
Scheme IV | Create the structured tables; set the database to asynchronous commit; load data with multi-threaded batch INSERTs; create the indexes afterwards
Scheme V | Create the structured tables and their indexes as unlogged tables; load data with multi-threaded batch INSERTs
Scheme VI | Create the structured tables as unlogged tables; load data with multi-threaded batch INSERTs; create the indexes afterwards
Scheme VII | Create the structured tables; load data with multi-threaded batch INSERTs; create the indexes afterwards
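
Every scheme except Scheme I uses the same loading core: several worker threads, each with its own connection, queueing rows with JDBC batches and committing once per batch. The following is a minimal sketch of that core, assuming hypothetical connection settings, batch size, and thread count; the dummy SqlRecord type and the generated partitions stand in for the entries parsed from the JDBC logs, and only four of the table's columns are shown:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchLoader {
    // Hypothetical connection settings and sizing; adjust to the environment.
    static final String URL = "jdbc:postgresql://localhost:5432/test";
    static final String USER = "postgres";
    static final String PASSWORD = "postgres";
    static final int BATCH_SIZE = 1000; // rows per executeBatch()/commit
    static final int THREADS = 16;      // one connection per worker thread

    // Simplified record; the real tool fills all columns parsed from the logs.
    static class SqlRecord {
        String cBh; String cBhParse; Timestamp dtZxsj; int nRuntime;
        SqlRecord(String cBh, String cBhParse, Timestamp dtZxsj, int nRuntime) {
            this.cBh = cBh; this.cBhParse = cBhParse;
            this.dtZxsj = dtZxsj; this.nRuntime = nRuntime;
        }
    }

    public static void main(String[] args) {
        // Dummy partitions standing in for the parsed JDBC log files.
        List<List<SqlRecord>> partitions = new ArrayList<>();
        for (int p = 0; p < THREADS; p++) {
            List<SqlRecord> part = new ArrayList<>();
            for (int i = 0; i < 10000; i++) {
                part.add(new SqlRecord(p + "-" + i, null,
                        new Timestamp(System.currentTimeMillis()), i % 100));
            }
            partitions.add(part);
        }
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (List<SqlRecord> part : partitions) {
            pool.submit(() -> load(part));
        }
        pool.shutdown();
    }

    static void load(List<SqlRecord> records) {
        String sql = "insert into t_jdbc_sql_record"
                + " (c_bh, c_bh_parse, dt_zxsj, n_runtime) values (?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false); // commit once per batch, not per row
            int n = 0;
            for (SqlRecord r : records) {
                ps.setString(1, r.cBh);
                ps.setString(2, r.cBhParse);
                ps.setTimestamp(3, r.dtZxsj);
                ps.setInt(4, r.nRuntime);
                ps.addBatch();          // queue the row client-side
                if (++n % BATCH_SIZE == 0) {
                    ps.executeBatch();  // send the whole batch in one round trip
                    conn.commit();
                }
            }
            ps.executeBatch();          // flush the trailing partial batch
            conn.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Scheme I differs only in the load step: each row is sent and committed on its own (an executeUpdate per row), which is exactly what the results below penalize.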

Experimental results

In every run, the amount of JDBC logs parsed, the parsing code, and the middleware environment stay the same; only the process order and the database parameters are adjusted.

Run | Scheme I | Scheme II | Scheme III | Scheme IV | Scheme V | Scheme VI | Scheme VII
First | 3596s | 2043s | 1164s | 779s | 545s | 528s | 1192s
Second | 4092s | 2068s | 1283s | 843s | 528s | 528s | 1227s
Third | 3891s | 2177s | 1378s | 858s | 536s | 537s | 1248s
Average | 3859s | 2096s | 1275s | 826s | 536s | 531s | 1222s

Result analysis

  • Scheme I vs. Scheme II: same database parameters, same process order

    • Scheme I: single-row INSERT commits, 3859 seconds
    • Scheme II: batch INSERT commits, 2096 seconds
  • Scheme II vs. Scheme III vs. Scheme V: same process order (create tables -> create indexes -> multi-threaded batch INSERT)

    • Scheme II: synchronous commit (waits for the WAL to be flushed), 2096 seconds
    • Scheme III: asynchronous commit (does not wait for the WAL flush), 1275 seconds
    • Scheme V: no WAL written at all (unlogged tables), 536 seconds
  • Scheme II vs. Scheme VII: both synchronous commit (the load-then-index flow is sketched after this list)

    • Scheme II: indexes created before the data is inserted, 2096 seconds
    • Scheme VII: indexes created after the data is inserted, 1222 seconds
  • Scheme III vs. Scheme IV: both asynchronous commit

    • Scheme III: indexes created before the data is inserted, 1275 seconds
    • Scheme IV: indexes created after the data is inserted, 826 seconds
  • Scheme V vs. Scheme VI: both unlogged, no WAL

    • Scheme V: indexes created before the data is inserted, 536 seconds
    • Scheme VI: indexes created after the data is inserted, 531 seconds
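
For the "index afterwards" schemes (IV, VI, VII), only the step order changes: the tables are created without indexes, the load runs, and the index DDL from the environment section is executed once at the end. A minimal sketch of that final step, again with hypothetical connection settings:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadThenIndex {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; the index DDL is taken verbatim
        // from the "Index statements" section above.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/test", "postgres", "postgres");
             Statement st = conn.createStatement()) {
            // Step 1: load all rows first (multi-threaded batch INSERT as
            // sketched earlier), so no index has to be maintained per row.
            // Step 2: build each index once over the finished table.
            st.execute("create index i_jdbc_sql_record_zh01 on t_jdbc_sql_record"
                    + "(c_bh_group, dt_zxsj, n_checkstatus, c_pssql_hash)");
            st.execute("create index i_jdbc_sql_record_pshash"
                    + " on t_jdbc_sql_record(c_pssql_hash)");
            st.execute("create index i_jdbc_sql_content_pshash"
                    + " on t_jdbc_sql_content(c_pssql_hash)");
            st.execute("alter table t_jdbc_sql_content add constraint"
                    + " t_jdbc_sql_content_pkey primary key (c_bh)");
        }
    }
}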

Summary

In this scenario:

  • Batch commits are about 45% faster than single-row commits (3859s vs. 2096s)
  • Asynchronous commit is about 40% faster than synchronous commit (2096s vs. 1275s)
  • Skipping the WAL entirely is about 75% faster than synchronous commit (2096s vs. 536s)
  • With WAL and synchronous commit, indexing after the load is about 40% faster than indexing first (2096s vs. 1222s)
  • With WAL and asynchronous commit, indexing after the load is about 35% faster than indexing first (1275s vs. 826s)
  • Without WAL, indexing after the load is only marginally faster than indexing first (536s vs. 531s)

The fastest combination for loading data is:
unlogged tables + multi-threaded batch INSERT + building indexes after the load

Conjecture:
During the insert phase, maintaining the indexes accounts for roughly 35% to 40% of the total time, and that cost is mainly spent on WAL persistence.

Other observations:
Some additional metrics were collected during the experiments; for example, the database's write IO never exceeded 100 MB/s under any scheme. This still needs further analysis.

