Background
In some scenarios you need to load a large amount of data into a PostgreSQL database quickly, for example during a database migration or for SQL log analysis. How many ways are there to insert data quickly into PostgreSQL? How efficient is each scheme? And how can the loading be tuned to go faster?
Scenario
SQL log analysis is a tool that collects JDBC logs, analyzes the SQL, and reports the analysis results. In the analysis phase, a large number of JDBC logs must be parsed and the resulting structured records loaded into the database for later processing. This analysis phase is used as the experimental scenario: each run starts with parsing the JDBC logs (several files) and ends when the structured data is fully loaded (including index creation), so that the data-loading efficiency of the different schemes can be compared.
Environment preparation
- Database environment
Name | Value |
---|---|
Operating system | CentOS 6.5 |
CPU | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 logical cores |
Memory | 316 GB |
Disk | RAID 10, write speed 1 GB/s |
Database version | PostgreSQL 9.5.4 |
Database memory parameters | shared_buffers: 30GB, work_mem: 4MB, maintenance_work_mem: 64MB |
Database CPU parameters | max_worker_processes: 16 |
- Table creation statements

```sql
-- No primary key: C_BH is not used for queries, so the primary key is
-- dropped first to speed up inserts
drop table if exists T_JDBC_SQL_RECORD;
create table T_JDBC_SQL_RECORD (
  C_BH VARCHAR(32),
  C_BH_PARSE VARCHAR(32) NULL,
  C_BH_GROUP VARCHAR(32) NULL,
  C_BH_SQL VARCHAR(32) NULL,
  DT_ZXSJ TIMESTAMP NULL,
  N_RUNTIME INT NULL,
  C_RZLJ VARCHAR(600) NULL,
  N_STARTLINE INT NULL,
  N_ENDLINE INT NULL,
  N_SQLTYPE INT NULL,
  N_SQLCOMPLEX INT NULL,
  C_IP VARCHAR(100) NULL,
  C_PORT VARCHAR(100) NULL,
  C_XTBS VARCHAR(100) NULL,
  N_CHECKSTATUS INT default 0,
  N_SQL_LENGTH INT NULL,
  N_SQL_BYTE INT NULL,
  N_5MIN INT NULL,
  C_METHOD VARCHAR(600) NULL,
  C_PSSQL_HASH VARCHAR(300) NULL,
  N_IS_BATCH INT,
  N_RESULTSET INT
);

drop table if exists T_JDBC_SQL_CONTENT;
CREATE TABLE T_JDBC_SQL_CONTENT (
  C_BH VARCHAR(32) NOT NULL,
  C_PSSQL_HASH VARCHAR(300) NULL,
  C_SQL_TEXT varchar NULL,
  C_PSSQL_TEXT varchar NULL
);
```
- Index creation statements

```sql
create index i_jdbc_sql_record_zh01 on t_jdbc_sql_record(c_bh_group, dt_zxsj, N_CHECKSTATUS, C_PSSQL_HASH);
create index i_jdbc_sql_record_pshash on t_jdbc_sql_record(c_pssql_hash);
create index i_jdbc_sql_content_pshash on t_jdbc_sql_content(c_pssql_hash);
alter table t_jdbc_sql_content add constraint t_jdbc_sql_content_pkey primary key (C_BH);
```
- Asynchronous commit and unlogged tables

```sql
-- Asynchronous commit; reload the configuration after changing it
alter system set synchronous_commit to off;

-- Unlogged tables
create unlogged table t_jdbc_sql_record ...
create unlogged table t_jdbc_sql_content ...
```
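For a table that already exists, PostgreSQL 9.5 and later can also switch WAL logging off and on in place, as an alternative to recreating the table with `create unlogged table`. A sketch, using one of the tables above:

```sql
-- Switch an existing table to unlogged before the bulk load
ALTER TABLE t_jdbc_sql_record SET UNLOGGED;

-- ... perform the bulk load ...

-- Switch back to a normal, WAL-protected table afterwards.
-- Note: SET LOGGED rewrites the whole table and writes it to WAL,
-- so this step itself takes time proportional to the table size.
ALTER TABLE t_jdbc_sql_record SET LOGGED;
```

Bear in mind that an unlogged table is truncated after a crash and is not replicated, so this trick only suits data that can be reloaded.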
- JDBC log volume
19 JDBC log files, 2 GB of logs in total, about 6 million records
Scheme setup
Scheme name | Scheme description |
---|---|
Scheme I | Create the structured tables and their indexes, then load data with multi-threaded single-row inserts |
Scheme II | Create the structured tables and their indexes, then load data with multi-threaded batch inserts |
Scheme III | Create the structured tables and their indexes, set the database to asynchronous commit, then load data with multi-threaded batch inserts |
Scheme IV | Create the structured tables, set the database to asynchronous commit, load data with multi-threaded batch inserts, then create the indexes |
Scheme V | Create the structured tables and their indexes, make the tables unlogged, then load data with multi-threaded batch inserts |
Scheme VI | Create the structured tables as unlogged tables, load data with multi-threaded batch inserts, then create the indexes |
Scheme VII | Create the structured tables, load data with multi-threaded batch inserts, then create the indexes |
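The difference between scheme I and the batch-insert schemes is whether each record gets its own INSERT statement or many records share one multi-row INSERT. A minimal Python sketch of the two statement shapes (the column names match the experiment's table, but the SQL is only constructed here, not executed; a real loader would hand these strings and parameter lists to JDBC or psycopg2):

```python
# Scheme I: one INSERT statement (and one round trip) per record.
def single_insert(row):
    return ("INSERT INTO t_jdbc_sql_record (c_bh, n_runtime) VALUES (%s, %s)",
            list(row))

# Schemes II-VII: one INSERT statement carrying many VALUES tuples,
# amortizing parsing, planning and commit overhead across the batch.
def batch_insert(rows):
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    params = [value for row in rows for value in row]
    return ("INSERT INTO t_jdbc_sql_record (c_bh, n_runtime) "
            f"VALUES {placeholders}", params)

rows = [("bh001", 12), ("bh002", 7), ("bh003", 31)]
sql, params = batch_insert(rows)
print(sql)     # one INSERT with three value tuples
print(params)  # flattened parameter list
```

In the experiment each loader thread works through its share of the parsed records in batches like this, committing per batch rather than per row.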
Experimental results
In each run, the volume of JDBC logs parsed, the parsing code, and the middleware environment were kept unchanged; only the process order and database parameters were adjusted.
Run | Scheme I | Scheme II | Scheme III | Scheme IV | Scheme V | Scheme VI | Scheme VII |
---|---|---|---|---|---|---|---|
Run 1 | 3596s | 2043s | 1164s | 779s | 545s | 528s | 1192s |
Run 2 | 4092s | 2068s | 1283s | 843s | 528s | 528s | 1227s |
Run 3 | 3891s | 2177s | 1378s | 858s | 536s | 537s | 1248s |
Average | 3859s | 2096s | 1275s | 826s | 536s | 531s | 1222s |
Result analysis
- Scheme I vs. Scheme II: database parameters and process order unchanged
  - Scheme I: single-row insert commits, 3859 s
  - Scheme II: batch insert commits, 2096 s
- Scheme II vs. Scheme III vs. Scheme V: same process order (create tables -> create indexes -> multi-threaded batch insert)
  - Scheme II: synchronous commit (waits for the WAL flush), 2096 s
  - Scheme III: asynchronous commit (does not wait for the WAL flush), 1275 s
  - Scheme V: no WAL written at all (unlogged tables), 536 s
- Scheme II vs. Scheme VII: both with synchronous commit
  - Scheme II: indexes created before the data is inserted, 2096 s
  - Scheme VII: indexes created after the data is inserted, 1222 s
- Scheme III vs. Scheme IV: both with asynchronous commit
  - Scheme III: indexes created before the data is inserted, 1275 s
  - Scheme IV: indexes created after the data is inserted, 826 s
- Scheme V vs. Scheme VI: both without WAL logging
  - Scheme V: indexes created before the data is inserted, 536 s
  - Scheme VI: indexes created after the data is inserted, 531 s
Summary
In this scenario:
- Batch inserts reduce load time by about 45% compared with single-row inserts (2096 s vs. 3859 s)
- Asynchronous commit reduces load time by about 40% compared with synchronous commit
- Skipping WAL entirely (unlogged tables) reduces load time by about 75% compared with synchronous commit
- With WAL and synchronous commit, creating the indexes after the load is about 40% faster than creating them first
- With WAL and asynchronous commit, creating the indexes after the load is about 35% faster than creating them first
- Without WAL, creating the indexes after the load is only marginally faster than creating them first
The fastest combination for loading data is:
unlogged tables + multi-threaded batch inserts + creating the indexes after the load
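A sketch of this fastest pipeline (scheme VI), reusing the DDL from the environment section; the batch INSERTs in step 2 are issued by the multi-threaded loader:

```sql
-- 1. Create the target tables as unlogged (no WAL) and without indexes
create unlogged table t_jdbc_sql_record ( ... );
create unlogged table t_jdbc_sql_content ( ... );

-- 2. Load the data: each loader thread commits multi-row batch INSERTs

-- 3. Create the indexes only after the load has finished
create index i_jdbc_sql_record_zh01 on t_jdbc_sql_record(c_bh_group, dt_zxsj, N_CHECKSTATUS, C_PSSQL_HASH);
create index i_jdbc_sql_record_pshash on t_jdbc_sql_record(c_pssql_hash);
create index i_jdbc_sql_content_pshash on t_jdbc_sql_content(c_pssql_hash);
alter table t_jdbc_sql_content add constraint t_jdbc_sql_content_pkey primary key (C_BH);
```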
Conjecture:
During the insert phase, index maintenance accounts for roughly 35% to 40% of the total time, and most of that is spent on WAL persistence for the index.
Other observations:
Some other metrics collected during the experiments still need further analysis; for example, the database's write IO never exceeded 100 MB/s under any of the schemes.