Machine quantitative analysis -- data acquisition, preprocessing and modeling

This series introduces a relatively simple yet complete quantitative framework. It is based on modern portfolio theory and uses mainstream machine learning algorithms for analysis, with the aim of broadening your ideas about quantitative investment and helping you build a scientific and reasonable investment strategy.

As the first part of the series, this article follows the order of the analysis and calculation process and covers three topics: data acquisition, data preprocessing, and modeling with the SVM algorithm.

>>Data acquisition<<

The quantitative framework in this series runs its computations locally. Why local computing? Compared with fetching data online for every analysis and calculation, local computing has the following advantages:

  • 1. Stability - the analysis process is not interrupted by an unstable network.
  • 2. Speed - reading data locally is faster than fetching it online. This matters especially when a machine learning algorithm needs a large training set or many training iterations.
  • 3. Reusability - both the basic market data and the processed data are saved locally, which makes later result analysis and strategy optimization more convenient.

The first step in local computing is to collect the required basic data into a local database. The database used in the example source code of this article is MySQL 5.5. The data source is the tushare pro interface.

We now fetch the daily quotes of a few specific stocks. Part of the code is as follows:

import datetime
import pymysql
import tushare as ts

# Get the tushare pro connection (the token must already be registered, e.g. via ts.set_token)
pro = ts.pro_api()
# Set the start date and end date for obtaining daily quotes, where the end date is set to yesterday.
start_dt = '20100101'
time_temp = datetime.datetime.now() - datetime.timedelta(days=1)
end_dt = time_temp.strftime('%Y%m%d')
# Establish database connection and eliminate the parts that have been put into storage
db = pymysql.connect(host='', user='root', passwd='admin', db='stock', charset='utf8')
cursor = db.cursor()
# Set the stock pool to get data
stock_pool = ['603912.SH','300666.SZ','300618.SZ','002049.SZ','300672.SZ']
total = len(stock_pool)
# Loop over the pool and fetch the daily quotes of each stock
for i, code in enumerate(stock_pool):
    df = pro.daily(ts_code=code, start_date=start_dt, end_date=end_dt)
    # Print progress
    print('Seq: ' + str(i + 1) + ' of ' + str(total) + '   Code: ' + str(code))

The comments in the code above explain what each line does. In practice, the data acquisition program mainly takes three parameters: the start date of the quotes, the end date, and the stock code pool.
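As a minimal sketch of the date handling, the helper below computes the two date parameters in the 'YYYYMMDD' form used in the code above. The function name `quote_date_range` and its `today` parameter are hypothetical, introduced here only for illustration; they are not part of the original source code.

```python
import datetime

def quote_date_range(start_dt='20100101', today=None):
    """Return (start_dt, end_dt) as 'YYYYMMDD' strings, with end_dt set to yesterday."""
    if today is None:
        today = datetime.date.today()
    end_dt = (today - datetime.timedelta(days=1)).strftime('%Y%m%d')
    return start_dt, end_dt

# A fixed "today" makes the result reproducible
print(quote_date_range(today=datetime.date(2018, 6, 12)))
```

Passing `today` explicitly is only a convenience for testing; called without arguments, the helper uses the current date.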

After we get the data, we need to write it to the local database. This part of the code uses SQL and requires the corresponding table to be created in the database in advance. The table configuration and table structure are as follows:

Database name: stock; table name: stock_all

Here state_dt and stock_code together form the primary key and index. The format of state_dt is 'yyyy-MM-dd' (for example, '2018-06-11'). This date format is easy to query, and MySQL can compare such values directly by order.
(For the complete data acquisition code, see the source file.)
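The storage step can be sketched as follows. To keep the example runnable without a database server, an in-memory SQLite database stands in for MySQL. Only state_dt, stock_code and the table name stock_all come from the article. The remaining column names, the sample rows and the helper `to_state_dt` are assumptions made for illustration.

```python
import sqlite3

def to_state_dt(trade_date):
    """Convert a 'YYYYMMDD' string into the 'yyyy-MM-dd' form used for state_dt."""
    return '%s-%s-%s' % (trade_date[0:4], trade_date[4:6], trade_date[6:8])

# In-memory SQLite stands in for the local MySQL database in this sketch.
db = sqlite3.connect(':memory:')
db.execute("""
    CREATE TABLE stock_all (
        state_dt   TEXT NOT NULL,
        stock_code TEXT NOT NULL,
        open REAL, close REAL, high REAL, low REAL, vol REAL, amount REAL,
        PRIMARY KEY (state_dt, stock_code)
    )
""")

# Made-up sample rows: (date, code, open, close, high, low, volume, turnover)
rows = [
    ('20180608', '603912.SH', 10.0, 10.2, 10.3, 9.9, 1000.0, 10100.0),
    ('20180611', '603912.SH', 10.2, 10.1, 10.4, 10.0, 1200.0, 12200.0),
    ('20180612', '603912.SH', 10.1, 10.5, 10.6, 10.1, 1500.0, 15500.0),
    ('20180612', '603912.SH', 10.1, 10.5, 10.6, 10.1, 1500.0, 15500.0),  # duplicate
]
for r in rows:
    # The composite primary key rejects rows that are already stored.
    # (SQLite syntax; MySQL spells this INSERT IGNORE and pymysql uses %s placeholders.)
    db.execute("INSERT OR IGNORE INTO stock_all VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
               (to_state_dt(r[0]), r[1]) + r[2:])
db.commit()

# Because state_dt is 'yyyy-MM-dd', plain string comparison selects a date range.
selected = db.execute(
    "SELECT state_dt, close FROM stock_all "
    "WHERE stock_code = ? AND state_dt >= ? AND state_dt <= ? ORDER BY state_dt ASC",
    ('603912.SH', '2018-06-09', '2018-06-12')).fetchall()
print(selected)
```

The same idea carries over to MySQL: with state_dt and stock_code as the primary key, re-running the download does not duplicate rows that are already in storage.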

>>Data preprocessing<<

Whether for a quantitative strategy or a simple machine learning project, data preprocessing is a very important step. From a machine learning perspective, it mainly includes data cleaning, sorting, handling of missing or abnormal values, statistical analysis, correlation analysis, principal component analysis (PCA), normalization, and so on. The preprocessing introduced in this article is relatively simple: it only integrates the daily quote data already stored in the local database into a training set for subsequent machine learning modeling and training.

Before looking at the example code, we need to think about one question first. When we use a supervised learning algorithm to model an individual stock, what is our input data and what is our expected output data?

The answer to this question varies from person to person and from strategy to strategy. The question itself is the process of turning a market problem into a mathematical problem, and it relies on the quant's own knowledge and understanding of the market.

Back to the point: in this example we analyze the simplest possible data. The input is the basic daily quote of an individual stock, and the output is whether the stock price rose or fell compared with the previous trading day. Simply put, we feed today's basic quote into the model and let it predict whether the stock price will rise or fall tomorrow.

In terms of implementation, we take an object-oriented approach and encapsulate the whole preprocessing process and its results in a class. Each time we create an instance of the class, we get a training set for a specific set of conditions. The example code is as follows:

import pymysql

class data_collect(object):

    def __init__(self, in_code,start_dt,end_dt):
        ans = self.collectDATA(in_code,start_dt,end_dt)

    def collectDATA(self,in_code,start_dt,end_dt):
        # Establish a database connection and fetch the basic daily quotes (opening price, closing price, highest price, lowest price, trading volume and turnover)
        db = pymysql.connect(host='', user='root', passwd='admin', db='stock', charset='utf8')
        cursor = db.cursor()
        sql_done_set = "SELECT * FROM stock_all a where stock_code = '%s' and state_dt >= '%s' and state_dt <= '%s' order by state_dt asc" % (in_code, start_dt, end_dt)
        # Run the query before fetching its result
        cursor.execute(sql_done_set)
        done_set = cursor.fetchall()
        if len(done_set) == 0:
            raise Exception('no quotes found for %s between %s and %s' % (in_code, start_dt, end_dt))
        self.date_seq = []
        self.open_list = []
        self.close_list = []
        self.high_list = []
        self.low_list = []
        self.vol_list = []
        self.amount_list = []
        for i in range(len(done_set)):
            pass  # appending each row's fields to the lists above is omitted in this excerpt
        # Integrate daily quotes into the training set (self.train is the input set, self.target is the output set, and self.test_case is the single test input for the end_dt day)
        self.data_train = []
        self.data_target = []

In the end, an instance of the class holds three pieces of data:

  • 1. self.train: the input data of the training set (held in self.data_train in the excerpt above). In this example, it is the basic daily quotes.
  • 2. self.target: the output data of the training set (held in self.data_target in the excerpt above). In this example, it is the price move relative to the previous day: 1 for a rise and 0 otherwise. The rows are aligned so that the self.train data of trading day t corresponds to the price move on day t+1.
  • 3. self.test_case: the basic quote data of the last trading day t (the end_dt day), used as input to predict the next day's move once the model has been trained.
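The alignment between day t and day t+1 can be sketched in a few lines. This is an illustration under stated assumptions, not the code of the class above: the function name `build_training_set`, the row layout [open, close, high, low, vol, amount] and the sample numbers are all made up for the example.

```python
def build_training_set(quotes):
    """Build (train, target, test_case) from daily quotes ordered oldest first.

    train[k]  - the quote row of trading day t
    target[k] - 1 if the close of day t+1 is higher than the close of day t, else 0
    test_case - the quote row of the last day, whose next-day move is still unknown
    """
    CLOSE = 1  # position of the closing price inside a row (assumed layout)
    if len(quotes) < 2:
        raise ValueError('at least two trading days are needed')
    train, target = [], []
    for t in range(len(quotes) - 1):
        train.append(list(quotes[t]))
        target.append(1 if quotes[t + 1][CLOSE] > quotes[t][CLOSE] else 0)
    test_case = [list(quotes[-1])]
    return train, target, test_case

# Made-up rows: [open, close, high, low, vol, amount]
quotes = [
    [10.0, 10.2, 10.3, 9.9, 1000.0, 10100.0],
    [10.2, 10.1, 10.4, 10.0, 1200.0, 12200.0],
    [10.1, 10.5, 10.6, 10.1, 1500.0, 15500.0],
    [10.5, 10.5, 10.7, 10.4, 1100.0, 11500.0],
]
train, target, test_case = build_training_set(quotes)
print(target)  # [0, 1, 0]
```

Note that an unchanged close counts as "no rise" (label 0), matching the rule in item 2. The last day never appears in train because its label is not known yet; it only serves as test_case.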

(For the complete data preprocessing code, see the source file.)

>>SVM modeling<<

There are many supervised learning algorithms in machine learning, and SVM is a common one. This example uses the SVM algorithm for modeling. The theory behind SVM is not covered in detail in this article; what follows is only an introduction to modeling from a practical point of view.

First, here is the code for modeling, training and prediction:

model = svm.SVC()               # modeling
model.fit(train, target)        # train
ans2 = model.predict(test_case) # forecast

Three lines of code, reminiscent of the old joke about how many steps it takes to put an elephant into a refrigerator.

However, this also shows the strength of Python in data mining: it is simple, convenient and easy to use.

The machine learning framework used in this example is scikit-learn, a very powerful algorithm library. Readers who are familiar with the algorithm's principles can consult the official API documentation, modify the model parameters and optimize the model further. You can also try other algorithms, such as decision trees, logistic regression or naive Bayes.
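As one possible starting point for such experiments, the sketch below sets the SVC parameters explicitly and standardizes the features first. The scaling step is an addition of this sketch rather than part of the article's code; it is a common choice when features such as prices and volumes have very different magnitudes. The toy data is made up.

```python
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two features per sample, label 1 = "rise", 0 = "no rise"
train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
target = [0, 0, 0, 1, 1, 1]
test_case = [[0.5, 0.5], [10.5, 10.5]]

# Standardize each feature, then fit an RBF-kernel SVM with explicit parameters
model = make_pipeline(StandardScaler(),
                      svm.SVC(kernel='rbf', C=1.0, gamma='scale'))
model.fit(train, target)
ans2 = model.predict(test_case)
print(ans2)
```

Parameters such as kernel, C and gamma are the usual first candidates for tuning. Swapping svm.SVC for another scikit-learn classifier leaves the fit and predict calls unchanged.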

(For the complete SVM modeling code, see the source file.)

Finally, although we have successfully built a model and made predictions, two main problems remain: 1. How much predictive power does the model have, or in other words, how should the quality of a model be evaluated? 2. How can the model be combined with position management? What are the risks, and how can they be quantified?

Stay tuned for the next installment: [machine quantitative analysis (II) - building a portfolio]

Added by deveed on Wed, 29 Dec 2021 17:42:05 +0200