Crawling the subscription and redemption list of a Minsheng Jiayin ETF fund (Minsheng Jiayin CSI 300ETF)


In April, I joined a data-focused artificial intelligence company in Nanjing as a Python crawler intern. Besides writing Python crawler scripts, the position also involves text annotation. Text annotation can be seen as sorting and cleaning data by hand. It is tedious, but it is also an essential step.

Over the past half month or so, I have started working with crawlers for json files. I was still a beginner before, and had only dealt with pages whose HTML can be matched directly with xpath. My current job is the first time I have had to crawl json files. Here I take the fund Minsheng Jiayin CSI 300ETF as an example and crawl its subscription and redemption list by analyzing its json file.


As I understand it, generally only funds whose names end in the three letters ETF have a "subscription and redemption list", and here only the fund "Minsheng Jiayin CSI 300ETF" meets that requirement, as shown in the figure below:

Since only one fund qualifies, there is no need to write a page-turning function this time; crawling this single page is enough, as shown in the figure:


We need to crawl the following fields: fund code, announcement date, held stock name, held stock code, held stock quantity, cash substitution flag, subscription premium ratio, redemption discount ratio and substitution amount. All of them appear on this page.

So I wrote the following code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Referer': '',
}
base_url = ""
# Fetch the page and decode the raw bytes into text
html = requests.get(url=base_url, headers=headers).content.decode()
print(html)

Running it produces the following output:

If you press Ctrl+F and search the output for keywords such as the fund name, you will find no matching HTML at all, so the table data must be rendered through Ajax. I pressed F12 in the browser and checked the Doc entry in the Network panel, which confirmed my guess. I then looked under XHR in the Network panel and finally found that all the data I wanted was in this json file, as shown in the following two figures:
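
The Ctrl+F check can also be done in code. The following is only a minimal sketch: the helper name `missing_keywords` and the sample HTML are made up for illustration and are not taken from the real page.

```python
def missing_keywords(html, keywords):
    """Return the keywords that do not occur anywhere in the raw HTML."""
    return [kw for kw in keywords if kw not in html]


# Hypothetical static page: the table container is empty because a script
# fills it in later through an Ajax request.
sample_html = (
    "<html><body><div id='pcf-table'></div>"
    "<script src='app.js'></script></body></html>"
)

missing = missing_keywords(sample_html, ["stockname", "300ETF"])
print(missing)
```

If every keyword you expect comes back as missing, the data is probably loaded by a separate request, and the XHR entries in the Network panel are the next place to look.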

So now all that is needed is to crawl the json file, and I wrote the following code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Referer': '',
}
base_url = ""
# Request the json endpoint and parse the response body into a dict
json = requests.get(url=base_url, headers=headers).json()
print(json)

I ran it happily, but got the following result:

C:\Users\ABC\AppData\Local\Programs\Python\Python38\python.exe C:/Users/ABC/PycharmProjects/pythonProject/
{'error_no': '-10004', 'error_info': 'call BUS The function number of the interface cannot be empty'}

Process finished with exit code 0
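
Note that the process exits with code 0 even though the server returned an error, so the failure is easy to miss. One way to catch it is sketched below. The helper name `ensure_ok` is hypothetical, and the rule that a successful response carries either no `error_no` or the value `"0"` is my assumption, not something confirmed by the interface.

```python
def ensure_ok(payload):
    """Raise if the payload carries a non-zero error_no; otherwise return it."""
    # Assumption: success means error_no is absent or equal to "0".
    error_no = str(payload.get("error_no", "0"))
    if error_no != "0":
        raise RuntimeError(f"API error {error_no}: {payload.get('error_info')}")
    return payload


# The error payload shown in the output above
bad = {
    "error_no": "-10004",
    "error_info": "call BUS The function number of the interface cannot be empty",
}

try:
    ensure_ok(bad)
    caught = None
except RuntimeError as exc:
    caught = str(exc)

print(caught)
```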

On reflection, I realized that I had not passed any parameters, so I looked up the parameters the request needs, as shown in the figure below:

So the url is changed to

Following this approach, the data can be crawled!
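
As a rough illustration of attaching parameters to the request URL, here is a sketch using only the standard library. The endpoint `https://example.com/api` and the parameter names `funcNo` and `fundcode` are placeholders I made up; the real names have to be read from the request in the Network panel.

```python
from urllib.parse import urlencode

# Placeholder endpoint and parameters (assumptions, not the real interface)
endpoint = "https://example.com/api"
params = {"funcNo": "1000", "fundcode": "000000"}

# urlencode turns the dict into a query string such as "a=1&b=2"
url = endpoint + "?" + urlencode(params)
print(url)
```

With requests, the same dict can be passed as `requests.get(endpoint, params=params)`, which builds the query string for you.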

Complete code

import requests

# Template record; the keys are the target field names
data = {
    "decel_date": "",  # Announcement date
    # "trade_date": "",  # Transaction date
    # "sub_redem_sec_type": "",  # Category of redemption application components
    "fund_code": "",  # Fund transaction code
    "sec_code": "",  # Component securities publication code
    "sec_name": "",  # Securities abbreviation
    "sec_quantity": "",  # Number of securities: shares / lots / grams
    "cash_substitute_sign": "",  # Cash substitution flag: 0 - prohibited; 1 - allowed; 2 - must; 3 - reimbursement
    # "cash_substitute_ratio": "",  # Cash substitution premium ratio
    "fixed_substitute_amount": "",  # Fixed substitution amount
    "sub_replace_amt": "",  # Subscription substitution amount
    "redem_replace_amt": "",  # Redemption substitution amount
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Referer': '',
}
base_url = ""
json = requests.get(url=base_url, headers=headers).json()
# print(json)

# The rows of the list sit under dataList[0]['data']
items = json.get('dataList')[0].get('data')
for item in items:
    data['fund_code'] = item['fundcode']
    data['decel_date'] = item['tradingday']
    data['sec_code'] = item['stockcode']
    data['sec_name'] = item['stockname']
    data['sec_quantity'] = item['stocknum']
    data['cash_substitute_sign'] = item['cashflag']
    data['sub_replace_amt'] = item['redemptiondiscountrate']
    data['redem_replace_amt'] = item['cashratio']
    data['fixed_substitute_amount'] = item['substituteamount']
    # Print each record as it is filled in
    print(data)
Operation results


