Several ways to improve the efficiency of Sqlite mass importing data

preface

SQLite database has been more and more applied to small and medium-sized applications because of its simplicity, flexibility, light weight and open source. Some people even say that SQLite can be used to replace file reading and writing in c language. Therefore, when I recently wrote the program on remote sensing data processing, I also introduced SQLite to improve the structure of data and the processing capacity of big data (SQLite supports up to 2PB of data). But at first, I found that the insertion efficiency of directly using SQL statements was simply appalling. Later, after constantly checking documents and materials, we found a fast way of "data insertion". Taking inserting data as an example, this paper integrates various methods to improve the efficiency of SQLite on the Internet and in reference books, and gives a complete method to improve the efficiency of SQLite data insertion.

 

1 data

The computer I use is Win7 64 bit system, compiled with VC2010, SQLIte version is 3.7.15.2, the computer CPU is the second generation i3 processor, and the memory is 6G.

Before the experiment, create a table to insert data:

create table t1 (id integer , x integer , y integer, weight real)  

2. Slow speed - the roughest way

The functions that directly execute SQL in SQLite API are:
int sqlite3_exec(  sqlite3*,    const char *sql,   int (*callback)(void*,int,char**,char**),   void *,   char **errmsg)  

Directly use the string of the INSERT statement to INSERT. Part of the program code (see the following for the complete code), as follows

 

for(int i=0;i<nCount;++i)  
{  
     std::stringstream ssm;  
     ssm<<"insert into t1 values("<<i<<","<<i*2<<","<<i/2<<","<<i*i<<")";  
     sqlite3_exec(db,ssm.str().c_str(),0,0,0);  
}  

 

This program runs too slowly. I don't have time to wait. It's estimated that it's basically 7.826 / s

3. Medium speed - open transaction explicitly

The so-called "transaction" refers to a set of SQL commands, which are either executed together or not executed at all. In SQLite, sqlite3_exec() is called every time Function, a transaction will be implicitly opened. If a piece of data is inserted, the function will be called once, and the transaction will be opened and closed repeatedly, which will increase the amount of Io. If the transaction is explicitly opened before inserting data and submitted together after inserting, the IO efficiency will be greatly improved and the data insertion speed will be faster.

To start a transaction, just add a command to start and submit the transaction before and after the above code:

 

sqlite3_exec(db,"begin;",0,0,0);  
for(int i=0;i<nCount;++i)  
{  
    std::stringstream ssm;  
    ssm<<"insert into t1 values("<<i<<","<<i*2<<","<<i/2<<","<<i*i<<")";  
    sqlite3_exec(db,ssm.str().c_str(),0,0,0);  
}  
sqlite3_exec(db,"commit;",0,0,0); 

 

After the transaction is explicitly opened, the program runs much faster, and the estimation efficiency reaches 34095 pieces / s, which is about 5000 times higher than the original method.

4. High speed - write synchronous

I want to use a remote sensing processing algorithm to process 10000 * 10000 images. In the middle, I need to insert 100000000 pieces of data into the database. If the speed after starting the transaction is 34095 pieces / s, 100000000 ÷ 34095 = 2932 seconds = 48.9 minutes is still unacceptable, so I then find a way to improve the speed. Finally, in the materials about explaining SQLite configuration, I saw the "write synchronization" option.

In SQLite, the parameters of database configuration are implemented by pragma, and the synchronous option has three optional states: full, normal and off. This blog as well as Official documents The settings of these three parameters are described in detail. In short, full is the slowest to write, but it ensures that the data is safe from power failure and system crash. off can speed up some operations of the database, but if the system crashes or power failure, the database may be damaged.

In SQLite3, the default value of this option is full. If we change it to off before inserting data again, the efficiency will be improved. If you only treat SQLite as a temporary database, you don't need to set it to full at all. In the code, the setting method is to insert the following statement directly after opening the database:

sqlite3_exec(db,"PRAGMA synchronous = OFF; ",0,0,0); 

At this time, after testing, the insertion speed has become 41851 pieces / s, that is, it takes 2389 seconds to insert 100000000 pieces of data = 39.8 minutes

5. Extreme speed - execution preparation

Although the speed increases slightly after the write synchronization is set to off, it is still slow. Once again, I am on the road of finding ways to improve the efficiency of SQLite insertion. Finally, I found that SQLite can execute SQL statements in two ways: one is to use the function SQLite3 mentioned above_ Exec(), which directly calls the string containing the SQL statement; Another method is "execution preparation" (similar to stored procedure), that is, compile the SQL statement first, and then execute it step by step (or line by line). If the former is adopted, even if the transaction is started, SQLite still needs to perform "lexical analysis" and "syntax analysis" on each SQL statement in the loop, which is a waste of time for the operation of inserting a large amount of data at the same time. Therefore, in order to further improve the insertion efficiency, the latter should be used.

"Implementation preparation" is mainly divided into three steps:

1. Call function:

int sqlite3_prepare_v2( sqlite3 *db,  const char *zSql,  int nByte,  sqlite3_stmt **ppStmt,  const char **pzTail);

And declare a pointer to SQLite3_ Pointer to stmt object. This function compiles the parameterized SQL statement zSql and stores the compiled state in ppStmt.

2. Call the function sqlite3_step(), this function is to execute one step (in this case, insert a row). If the function returns SQLite_ROW indicates that the execution is still continuing, otherwise it indicates that all operations have been performed;

3. Call the function sqlite3_finalize(), close the statement.

For the specific syntax of the API for execution preparation, see Official documents . The c + + code prepared for execution in this article is as follows:

 

sqlite3_exec(db,"begin;",0,0,0);  
sqlite3_stmt *stmt;  
const char* sql = "insert into t1 values(?,?,?,?)";  
sqlite3_prepare_v2(db,sql,strlen(sql),&stmt,0);  
  
for(int i=0;i<nCount;++i)  
{         
    sqlite3_reset(stmt);  
    sqlite3_bind_int(stmt,1,i);  
    sqlite3_bind_int(stmt,1,i*2);  
    sqlite3_bind_int(stmt,1,i/2);  
    sqlite3_bind_double(stmt,1,i*i);  
}  
sqlite3_finalize(stmt);  
sqlite3_exec(db,"commit;",0,0,0);  

 

At this time, the test data insertion efficiency is 265816 pieces / s, that is, it takes 376 seconds to insert 100000000 pieces of data = 6.27 minutes. This speed is already very satisfactory.

 

To sum up, the fastest way for SQLite to insert data is: transaction + turn off write synchronization + execution preparation (stored procedure). If there are requirements for database security, turn on write synchronization.

6. Use WAL mode

   (1).WAL:Write Ahead Logging, which is a mechanism used to implement atomic transactions in the database. It was introduced after version 3.7.0

   (2).WAL mode has two main advantages:

a. read and write can be completely concurrent without blocking each other (but they can't be concurrent between writes)

b.WAL has better performance in most cases (because there is no need to write two files every time)

   (3). Principle of rollback journal mechanism: before modifying the data in the database, first back up the data in the page where the modification is located in another place, and then write the modification to the data; If the transaction fails, copy back the backup data and undo the modification; If the transaction is successful, delete the backup and commit the modification.

   (4). Principle of wal mechanism: the modification is not directly written to the database file, but to another file called wal. If the transaction fails, the file in wal will be ignored and the modification will be revoked; If the transaction succeeds, it will be written back to the database file at a later time to commit the changes.

The performance difference is mainly due to each transaction submission. Wal only needs to write the updated log to disk. In the delete mode, first copy the original data to the log file and perform fsync, and then write the modified page to disk. At the same time, fsync is also required to ensure that the data falls to disk and clear the log file. Therefore, write transactions only need one fsync in wal mode and are written sequentially, while in delete mode, at least two fsyncs (logs and data) are required, and the updated data is discretely distributed in multiple pages, so multiple fsyncs may be required.

WAL uses shared memory technology, so all read and write processes must be on the same machine

How to turn on WAL mode:

 View Code

On the basis of the above, it takes an average of 4.324 seconds to perform five operations after using WAL mode

7. In memory database

In addition, if the data does not need to be saved for a long time, you can use the memory database of sqlite to replace the file database

How to open sqlite memory database:

 View Code

5 times of execution, average time: 4.052 seconds

However, the memory database has the following disadvantages:

  (1). The database will disappear after power failure or program crash
  (2). The database in memory cannot be accessed by other processes

  (3). It does not support read-write mutually exclusive processing like on the hard disk. You need to lock it yourself

8 summary

reference material:

1. Official SQLite documents: http://www.sqlite.org/docs.html

2. Solving the problem of slow data insertion in sqlite3: http://blog.csdn.net/victoryknight/article/details/7461703

3. The Definitive Guide to SQLite published by Apress: http://www.apress.com/9781430232254 (this is a good book)

Final complete code attached:

 

 

#include <iostream>  
#include <string>  
#include <sstream>  
#include <time.h>  
#include "sqlite3.h"  
  
const int nCount = 500000;  
      
int main (int argc,char** argv)  
{  
    sqlite3* db;  
    sqlite3_open("testdb.db" ,&db);  
    sqlite3_exec(db,"PRAGMA synchronous = OFF; ",0,0,0);  
    sqlite3_exec(db,"drop table if exists t1",0,0,0);  
    sqlite3_exec(db,"create table t1(id integer,x integer,y integer ,weight real)",0,0,0);  
    clock_t t1 = clock();  
      
    sqlite3_exec(db,"begin;",0,0,0);  
    sqlite3_stmt *stmt;  
    const char* sql = "insert into t1 values(?,?,?,?)";  
    sqlite3_prepare_v2(db,sql,strlen(sql),&stmt,0);  
      
    for(int i=0;i<nCount;++i)  
    {  
        // std::stringstream ssm;  
        // ssm<<"insert into t1 values("<<i<<","<<i*2<<","<<i/2<<","<<i*i<<")";  
        // sqlite3_exec(db,ssm.str().c_str(),0,0,0);  
        sqlite3_reset(stmt);  
        sqlite3_bind_int(stmt,1,i);  
        sqlite3_bind_int(stmt,2,i*2);  
        sqlite3_bind_int(stmt,3,i/2);  
        sqlite3_bind_double(stmt,4,i*i);  
        sqlite3_step(stmt);  
    }  
    sqlite3_finalize(stmt);  
    sqlite3_exec(db,"commit;",0,0,0);  
    clock_t t2 = clock();  
      
    sqlite3_close(db);  
      
    std::cout<<"cost tima: "<<(t2-t1)/1000.<<"s"<<std::endl;  
      
    return 0;  
}  

 

 

Keywords: Sqlite3

Added by Courbois on Sun, 30 Jan 2022 20:00:19 +0200