Quantitative Investment from Scratch series - 12: Daily statistics of the Shanghai Futures Exchange

At present, the five domestic futures exchanges all publish the historical data of their futures products on their official websites. Technically, each one implements this differently, so capturing the data calls for different methods: parsing JSON, XML, HTML, and TSV. Crawling the data of these exchanges is therefore a good practice topic for learning web crawling, and data parsing in particular. Here we start with the daily statistics of the Shanghai Futures Exchange.
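As a rough illustration of how the parsing step differs by format, the sketch below reads the same record from JSON, XML, and TSV using only the standard library. The payloads and function names are hypothetical and do not reflect any exchange's real response; HTML is left out because it is usually handled with a separate parser.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads: the same record expressed in three formats.
JSON_TEXT = '{"rows": [{"product": "cu", "close": "69000"}]}'
XML_TEXT = '<rows><row product="cu" close="69000"/></rows>'
TSV_TEXT = 'product\tclose\ncu\t69000\n'


def parse_json(text):
    # json.loads turns the text into nested dicts and lists
    return json.loads(text)['rows']


def parse_xml(text):
    # Each <row> element carries its fields as attributes
    return [dict(row.attrib) for row in ET.fromstring(text).iter('row')]


def parse_tsv(text):
    # TSV is CSV with a tab delimiter; the first line is the header
    return [dict(r) for r in csv.DictReader(io.StringIO(text), delimiter='\t')]


print(parse_json(JSON_TEXT))
print(parse_xml(XML_TEXT))
print(parse_tsv(TSV_TEXT))
```

All three functions return the same list of dicts, which is why the rest of a crawler can stay identical once the parsing step is isolated.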

Press F12 in the browser to open the developer tools, and you can see that the data acquisition interface of the Shanghai Futures Exchange is:

GET	http://www.shfe.com.cn/data/dailydata/kx/kx20210826.dat

The returned data is in JSON format.
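The date is embedded in the file name of that URL, so the address for any trading day can be derived from it. A minimal sketch, assuming the pattern holds for other dates (the helper name and constant are hypothetical):

```python
from datetime import date

# URL pattern taken from the request shown above.
SHFE_DAILY_URL = 'http://www.shfe.com.cn/data/dailydata/kx/kx{trade_date}.dat'


def build_shfe_daily_url(trade_date):
    """Return the daily-statistics URL for a date object or a 'YYYYMMDD' string."""
    if isinstance(trade_date, date):
        trade_date = trade_date.strftime('%Y%m%d')
    return SHFE_DAILY_URL.format(trade_date=trade_date)


print(build_shfe_daily_url(date(2021, 8, 26)))
# prints http://www.shfe.com.cn/data/dailydata/kx/kx20210826.dat
```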

The implementation still follows the framework used throughout this series: it supports both full acquisition and incremental acquisition, and uses a thread pool to speed up access. Previous articles in this series have explained this, so I won't repeat it here.
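For readers who have not seen the earlier articles, the idea can be sketched roughly as follows. This is not the series' framework: `fetch_one`, `fetch_all`, and `pending_dates` are hypothetical stand-ins, and the fetch function fakes the HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_one(trade_date):
    # Stand-in for the real HTTP request; one date fails to show error handling.
    if trade_date == '20210102':
        raise ValueError('no data')
    return f'data for {trade_date}'


def pending_dates(all_dates, last_saved=None):
    """Full acquisition when last_saved is None, otherwise only newer dates."""
    return [d for d in all_dates if last_saved is None or d > last_saved]


def fetch_all(trade_dates, max_workers=4):
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers) as executor:
        # Map each future back to its date so failures can be reported per day
        future_to_date = {executor.submit(fetch_one, d): d for d in trade_dates}
        for future in as_completed(future_to_date):
            d = future_to_date[future]
            try:
                results[d] = future.result()
            except Exception as ex:
                failures[d] = ex
    return results, failures


dates = ['20210101', '20210102', '20210103']
results, failures = fetch_all(pending_dates(dates))
print(sorted(results), list(failures))
```

Because each day is an independent request, one failed date does not stop the others; it is simply recorded and can be retried later.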

JSON can stand in for many other data formats, and it is popular for a reason: it is reasonably readable for people and very friendly to parse in programs. A single statement basically turns it into a map data structure, which in Python is a dict.

json_value = json.loads(r.content.decode())
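To make the result of that statement concrete, here is a small sketch run against a hypothetical, heavily trimmed response body. Only the key names are taken from the code in this article; the values are made up.

```python
import json

# Hypothetical response body as bytes, the way r.content would deliver it.
raw = b'{"o_curinstrument": [{"PRODUCTID": "cu_f    ", "OPENPRICE": 69000, "VOLUME": 12345}]}'

json_value = json.loads(raw.decode())   # bytes -> str -> dict
first = json_value['o_curinstrument'][0]

print(type(json_value).__name__)        # dict
print(first['PRODUCTID'].strip(), first['OPENPRICE'])
```

Nested objects become dicts and arrays become lists, so each record can be read with ordinary key and index access.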

The only remaining issue worth mentioning is that testing showed the old and new data are not fully compatible. Relatively new data contains a TURNOVER field (the transaction turnover, separate from the VOLUME field), while the old data has no such field.

So that the program does not fail and all records can be saved to the same database table, the simplest approach is used here: if the server does not return the TURNOVER field, store None in its place.

                if 'TURNOVER' in data:
                    turnover = data['TURNOVER']
                else:
                    turnover = None
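The same fallback can also be written with `dict.get`, which returns None when the key is absent. A minimal sketch with hypothetical records (the helper name is made up for illustration):

```python
def extract_turnover(record):
    # dict.get returns None for a missing key, matching the if/else above
    return record.get('TURNOVER')


old_record = {'VOLUME': 100}                      # hypothetical old-format row
new_record = {'VOLUME': 100, 'TURNOVER': 5000.0}  # hypothetical new-format row

print(extract_turnover(old_record), extract_turnover(new_record))
```

Either form works; the explicit if/else makes the compatibility handling easier to spot, while `dict.get` is shorter.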

The complete code is as follows:

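import json
import multiprocessing
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd
import requests

# AbstractDataRetriever, StockCalendar and today() come from the framework
# built in earlier articles of this series; import them from wherever you keep it.


class ShfeDaily(AbstractDataRetriever):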
    def __init__(self):
        super().__init__('futures_shfe_daily')

    def _full(self, **kwargs):
        self._get_data_list('20110101', today())

    def _delta(self, **kwargs):
        df_origin = self.query(fields='max(report_date)')
        if df_origin.empty or df_origin.iat[0, 0] is None:
            self._get_data_list('20110101', today())
        else:
            self._get_data_list(df_origin.iat[0, 0], today())

    def _get_data_list(self, start_date, end_date, max_worker=multiprocessing.cpu_count() * 2):
        df_cal_date = StockCalendar().query(
            fields='cal_date',
            where=f'`exchange`=\'shfe\' and is_open=\'1\' and cal_date >\'{start_date}\' and cal_date <= \'{end_date}\'',
            order_by='cal_date')

        with ThreadPoolExecutor(max_worker) as executor:
            future_to_date = \
                {executor.submit(self._get_daily_data, trade_date=row['cal_date']): row
                 for index, row in df_cal_date.iterrows()}
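            for future in as_completed(future_to_date):
                row = future_to_date[future]
                try:
                    # result() re-raises any exception from the worker thread;
                    # the data itself is already saved inside _get_daily_data
                    future.result()
                except Exception as ex:
                    self.logger.error(f"failed to retrieve {row['cal_date']}")
                    self.logger.exception(ex)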

    def _get_daily_data(self, trade_date):
        shfe_url = f'http://www.shfe.com.cn/data/dailydata/kx/kx{trade_date}.dat'
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Accept': '*/*',
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2'
        }
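        # A timeout keeps a stalled request from blocking a worker thread (30 s is an arbitrary choice)
        r = requests.get(shfe_url, headers=headers, timeout=30)
        r.raise_for_status()  # fail early on HTTP errors instead of a confusing JSON parse error
        json_value = json.loads(r.content.decode())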

        columns = ['productid', 'name', 'deliverymonth', 'open', 'high', 'low', 'close', 'settlement', 'zd1_chg',
                   'zd2_chg', 'volume', 'turnover', 'openinterest', 'openinterestchg', 'report_date']

        # Collect rows in a list and build the DataFrame once:
        # DataFrame.append was removed in pandas 2.0 and was slow when called in a loop
        rows = []
        for data in json_value['o_curinstrument']:
            # Skip rows whose opening price is empty
            if data['OPENPRICE'] != '':
                # Old data has no TURNOVER field, so fall back to None
                if 'TURNOVER' in data:
                    turnover = data['TURNOVER']
                else:
                    turnover = None
                rows.append(
                    {'productid': data['PRODUCTID'].strip(), 'name': data['PRODUCTNAME'].strip(),
                     'deliverymonth': data['DELIVERYMONTH'],
                     'open': data['OPENPRICE'], 'high': data['HIGHESTPRICE'], 'low': data['LOWESTPRICE'],
                     'close': data['CLOSEPRICE'], 'settlement': data['SETTLEMENTPRICE'], 'zd1_chg': data['ZD1_CHG'],
                     'zd2_chg': data['ZD2_CHG'], 'volume': data['VOLUME'], 'turnover': turnover,
                     'openinterest': data['OPENINTEREST'], 'openinterestchg': data['OPENINTERESTCHG'],
                     'report_date': trade_date})

        df = pd.DataFrame(rows, columns=columns)
        self._save(df)


if __name__ == '__main__':
    ShfeDaily().retrieve()

Keywords: Python crawler

Added by amsgwp on Sun, 19 Dec 2021 02:20:01 +0200