Data:
Link: https://pan.baidu.com/s/1XGfITsSWdP_c1_wFepUs0A
Extraction code: b880
Special treatment:
Because the prices of different contracts can differ greatly around a month change (contract roll), the yield must never be computed across two different contracts at the roll point. The data are therefore processed as follows:
Processing logic:
1). On the month-change day (i.e. the first day on which the new contract becomes the main contract), data from both the previous main contract and the new main contract are needed
2). Yield on the month-change day = (closing price of the previous main contract on the month-change day - closing price of the previous main contract on the previous day) / closing price of the previous main contract on the previous day
3). Yield on the day after the month change = (closing price of the new main contract on the day after the month change - closing price of the new main contract on the month-change day) / closing price of the new main contract on the month-change day (a numeric sketch follows this list)
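As a numeric sketch of the two rules above (all prices are hypothetical, purely for illustration):

# Hypothetical closing prices around a roll on day T (not real data)
old_close_prev_day = 4000.0   # previous main contract, day T-1
old_close_roll_day = 4040.0   # previous main contract, day T (the month-change day)
new_close_roll_day = 4100.0   # new main contract, day T
new_close_next_day = 4141.0   # new main contract, day T+1

# Rule 2): the roll-day yield stays within the previous main contract
yield_roll_day = (old_close_roll_day - old_close_prev_day) / old_close_prev_day   # 0.01
# Rule 3): the next-day yield stays within the new main contract
yield_next_day = (new_close_next_day - new_close_roll_day) / new_close_roll_day   # 0.01
print(yield_roll_day, yield_next_day)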
Calculation method:
1). Add a pre_ticker column to the new table (res_pd): it is the ticker column shifted down by one row, so that on a month-change day pre_ticker holds the previous main contract
2). Group by the ticker column and take the first row of each group to obtain a new table three_df
3). Remove the first row of three_df, because it is the first contract in the history and not a month-change day
4). For each row of three_df, look up in the big table all_df the close of the contract given by pre_ticker on that row's date, and overwrite the close column of three_df with it
5). Sort three_df by date in ascending order
6). Add a counter column to three_df filled with increasing integers
7). Add a counter column to res_pd filled with increasing integers
8). Merge res_pd and three_df into one table four_df
9). Sort four_df by date and counter in ascending order, with date taking priority over counter
10). Calculate the yield column
11). Deduplicate on the date column, keeping the first occurrence of each date (a toy sketch of steps 6)-11) follows below)
At this point the yield of the main contract has been calculated
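The following toy sketch (with hypothetical prices, not the real contract data) illustrates why the counter column makes the three_df row win on a month-change date after sorting and deduplicating:

import pandas as pd

# One ordinary day (2021-01-04) and one month-change day (2021-01-05); prices are made up
res_pd_toy = pd.DataFrame({'date': pd.to_datetime(['2021-01-04', '2021-01-05']),
                           'close': [4000.0, 4100.0]})   # new main contract close on the roll day
three_df_toy = pd.DataFrame({'date': pd.to_datetime(['2021-01-05']),
                             'close': [4040.0]})          # previous main contract close on the roll day
three_df_toy['counter'] = range(len(three_df_toy))
res_pd_toy['counter'] = range(len(res_pd_toy))

four_df_toy = pd.concat([three_df_toy, res_pd_toy]).reset_index(drop=True)
# date first, then counter: on the roll day the three_df row (smaller counter) sorts before the res_pd row
four_df_toy.sort_values(['date', 'counter'], ascending=[True, True], inplace=True)
four_df_toy['pe'] = (four_df_toy['close'] - four_df_toy['close'].shift(1)) / four_df_toy['close'].shift(1)
# Keeping the first row per date keeps the three_df row, i.e. the yield computed within the old contract
four_df_toy.drop_duplicates(subset='date', keep='first', inplace=True)
print(four_df_toy)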
Calculation process (Jupyter notebook):
import os
import numpy as np
import pandas as pd

# The if_list folder contains the daily data file of each contract
dir = r'./if_list/'
file_list = os.listdir(dir)

# Read every contract file and concatenate them into one big table all_df
all_df = None
for file_name in file_list:
    ticker = file_name[0:-4]
    file_path = dir + file_name
    df = pd.read_csv(file_path, encoding='utf-8', header=0,
                     names=['date', 'open', 'close', 'low', 'high', 'volume', 'money', 'factor',
                            'high_limit', 'low_limit', 'avg', 'pre_close', 'paused', 'open_interest'])
    df['ticker'] = ticker
    if all_df is None:
        all_df = df
    else:
        all_df = pd.concat([all_df, df])
# Sort by volume and open interest in descending order (volume takes priority over open interest)
# and return the first row after sorting
def get_biggest_openinterest(x):
    df = x.sort_values(['volume', 'open_interest'], ascending=[False, False])
    return df.iloc[0, :]
# Group the big table all_df by date and, for each date, take the row with the largest
# volume / open interest, forming a new table res_pd
all_df['date'] = pd.to_datetime(all_df['date'])
all_df.sort_values(by='date', inplace=True)
res_pd = all_df.groupby('date').apply(get_biggest_openinterest)
# Check that the main contract sequence is continuous; it usually is, and any discontinuity is handled manually.
# Method: collect the dates of each ticker group into a list, then verify that every date of a group is earlier
# than the first date of the next group; any violation is printed
res_pd.sort_index(inplace=True)
res_group = res_pd.groupby('ticker')
results_list = []
for name, group in res_group:
    results_list.append(group['date'].tolist())
for i in range(len(results_list) - 1):
    pre_list = results_list[i]
    now_first = results_list[i + 1][0]
    for item in pre_list:
        if item >= now_first:
            print(item, now_first)
            break
# Shift the ticker column down one row so that, on a month-change-day row, the pre_ticker
# field holds the previous main contract
res_pd['pre_ticker'] = res_pd['ticker'].shift(1)

# Group res_pd by contract and take the first row of each group to form a new table three_df;
# except for its first row, every row of three_df is a month-change day
three_df = res_pd.groupby('ticker').head(1)
three_df = three_df.iloc[1:, :]
three_df.head()
# Add a new field origi_close to keep the original close values; the calculation below
# overwrites close
three_df['origi_close'] = three_df['close']

# For each row of three_df, fill the close field with the closing price of the previous
# main contract on the month-change day
all_df['date'] = pd.to_datetime(all_df['date'])
for index, row in three_df.iterrows():
    row_res = all_df.loc[(all_df['date'] == row['date']) & (all_df['ticker'].str.startswith(row['pre_ticker']))]
    res_list = row_res['close'].values.tolist()
    if len(res_list) <= 0:
        continue
    three_df.at[index, 'close'] = res_list[0]

three_df.sort_index(inplace=True)
three_df = three_df.loc[:, ['date', 'close', 'volume', 'open_interest', 'ticker', 'pre_ticker', 'origi_close']]
res_df = res_pd.loc[:, ['date', 'close', 'volume', 'open_interest', 'ticker', 'pre_ticker']]
res_df.sort_index(inplace=True)

# When three_df and res_df are merged, every month-change day appears twice. Only the row
# coming from three_df should be kept, which is achieved through the counter column
three_df['counter'] = range(len(three_df))
res_df['counter'] = range(len(res_df))

# Concatenate the two tables into a new table four_df
four_df = pd.concat([three_df, res_df])

# date and counter are sorted next; date cannot be sorted together with a value column
# while it is the index, so reset the index
four_df.reset_index(drop=True, inplace=True)

# Sort four_df in ascending order, date first and then counter. On a month-change day the
# counter of the three_df row is always smaller than that of the corresponding res_df row,
# so the three_df row always comes first after the ascending sort
four_df.sort_values(['date', 'counter'], ascending=[True, True], inplace=True)

# Calculate yield
four_df['pe'] = (four_df['close'] - four_df['close'].shift(1)) / four_df['close'].shift(1)

# Remove the duplicate rows on month-change days, keeping only the valid (first) row
four_df.drop_duplicates(subset='date', keep='first', inplace=True)
At this point, four_df contains the final required yield data of the main contract.
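As an optional sanity check (not part of the original notebook), the daily yields in the pe column can be compounded into a cumulative return curve for the continuous main contract:

# The first pe value is NaN (no previous close), so treat it as 0 before compounding
four_df = four_df.sort_values('date')
cum_return = (1 + four_df['pe'].fillna(0)).cumprod() - 1
print(cum_return.tail())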