The 3rd Xiamen International Bank "Digital Innovation Finance Cup" Financial Marketing Modeling Competition: Baseline
1. Competition background
With the development of technology, banks have built rich and diverse customer touchpoints both online and offline to meet customers' needs for everyday business handling and channel transactions. Facing a large customer base, banks need a more comprehensive and accurate insight into customers' financial needs. In the actual development of financial products, we need to understand how attractive different products are to different customer groups, so as to identify the target customers and carry out targeted marketing.
This competition provides customer behavior, asset information, and product transaction information from real business scenarios as the modeling material. On the one hand, it is meant to showcase each contestant's practical data mining skills; on the other hand, contestants are required to propose corresponding marketing solutions based on their modeling results in the semi-finals, fully reflecting the value of data analysis.
2. Competition description
(1) Competition task
The task of this competition is to predict the probability that customers will purchase various certificate-of-deposit financial products, and to use the prediction results as the basis for a marketing plan.
(2) Data usage rules
No external data may be used in this competition. The data provided has been desensitized, and some continuous fields (such as interest rates, prices, and amounts) have undergone a linear transformation; this does not affect modeling or the model's prediction results.
(3) A/B list rules
The preliminary round uses an A/B leaderboard and lasts one and a half months in total. During the first month the leaderboard shows the results of list A (with public and private lists, split 6:4); during the last half month it switches to list B (also with public and private lists) and shows list B results, taking each contestant's highest submitted score. Final preliminary score = 0.3 × list A + 0.7 × list B.
3. Evaluation metric
1. The preliminary round uses the A/B leaderboard system, and the final preliminary score = 0.3 × (F2 on the list A test set) + 0.7 × (F2 on the list B test set), where:
recall = TP / (TP + FN), the recall rate
precision = TP / (TP + FP), the precision
F2 = 5 * precision * recall / (4 * precision + recall), the F2 score
TP is the number of true positives, FP the number of false positives, and FN the number of false negatives; the F2 score of this class is obtained from the formula above.
4. Data description
The core task of this competition is to predict whether users will make purchases in October based on their historical records from July, August, and September. The competition provides many data tables, which are not described in detail here; for details, refer to the data description provided by the organizers, which documents the fields of each table in all datasets (extraction code: zhyo).
5. Baseline idea
This is a very typical structured data competition, and we adopt the traditional feature engineering + LightGBM scheme (XGBoost or CatBoost would also work). Because the data is strongly time-ordered, for offline validation we train on the July and August data and validate on the September data, while for the online submission we train on July, August, and September together. Note that the online metric is the F2 score: the submitted predictions must be integer values of 0 and 1, but the model outputs probabilities, so a threshold has to be chosen. It is recommended to search for this threshold manually during offline validation. The code is as follows:
```python
def search_best_thre(y_true, y_pre):
    best_f2 = 0
    best_th = 0
    for i in range(100):
        th = 0.03 + i / 1000
        y_pre_copy = y_pre.copy()
        y_pre_copy[y_pre_copy >= th] = 1
        y_pre_copy[y_pre_copy < th] = 0
        temp_f2 = f2_score(y_true, y_pre_copy)
        if temp_f2 > best_f2:
            best_f2 = temp_f2
            best_th = th
            print(f'thre: {best_th} f2 score: {best_f2}')
    return best_th
```
Here, f2_score is the evaluation function we implemented ourselves to compute the F2 score.
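The baseline does not show f2_score itself; a minimal sketch that follows the formula in Section 3 could look like this (sklearn's fbeta_score with beta=2 would be an equivalent alternative):

```python
import numpy as np

def f2_score(y_true, y_pred):
    """F2 = 5 * precision * recall / (4 * precision + recall), per Section 3."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    if precision + recall == 0:
        return 0.0
    return 5 * precision * recall / (4 * precision + recall)
```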
6. Feature engineering
6.1 user information table
Here, the user information table is simply merged with the main table to obtain user information. The code is as follows:
```python
# Customer information table
d = pd.read_csv(data_root2 + 'd.csv')
df = df.merge(d, on='core_cust_id', how='left')
del d
gc.collect()
```
6.2 user risk table
The risk table rates each user's risk in every period, so a user may correspond to several risk levels. Here we keep only the most recent rating: sort by user ID and evaluation date, then drop duplicates on the user ID while keeping the last record, so that only each user's latest risk assessment remains. The code is as follows:
```python
# Customer risk table
e = pd.read_csv(data_root2 + 'e.csv')
e = e.sort_values(['core_cust_id', 'e2'])
e = e.drop_duplicates(subset=['core_cust_id'], keep='last')
df = df.merge(e[['core_cust_id', 'e1']], on='core_cust_id', how='left')
del e
gc.collect()
```
6.3 user asset information table
A user's asset information also changes over time. Here we compute the monthly mean of each asset field per user and use these means as features. The code is as follows:
```python
# Asset information table
f = pd.read_csv(data_root2 + 'f.csv')
f.fillna(0, inplace=True)
# Align the month field f22 with the month coding a3 used in the main table
map_dict = dict(zip(sorted(f['f22'].unique()), sorted(df['a3'].unique())))
f['f22'] = f['f22'].map(map_dict)
asset_cols = [f'f{i}' for i in range(2, 22)]  # f2 ... f21
# Strip thousands separators and cast to float
for c in asset_cols:
    f[c] = f[c].apply(lambda x: str(x).replace(',', '')).astype('float')
# Monthly mean of each asset field per user
f_stat = f.groupby(['core_cust_id', 'f22']).agg(
    **{f'{c}_mean': (c, 'mean') for c in asset_cols}
).reset_index()
df = df.merge(f_stat, left_on=['core_cust_id', 'a3'],
              right_on=['core_cust_id', 'f22'], how='left')
del f, f_stat
gc.collect()
```
6.4 account transaction flow table
The user's transaction flow is also very important. The account transaction flow contains both the debit side and the credit side, and we compute statistics for each side separately: for every month we aggregate the amounts involved per customer on the debit side and on the credit side, then join the results back onto the main table as historical information. For example, when predicting whether a user will buy in October, we use their September account transactions as historical features; when predicting September, we use August. The code is as follows:
```python
# Table s: account transaction flow, where s3 and s6 are customer IDs that
# can be joined with core_cust_id in the other tables
s = pd.read_csv(data_root2 + 's.csv')
s['month'] = s['s7'].apply(lambda x: x.split('-')[1]).astype('int32')
s['s4'] = s['s4'].apply(lambda x: str(x).replace(',', '')).astype('float')
tmp_s3 = s.groupby(['s3', 'month']).agg(
    s3_s4_sum=('s4', 'sum'),
    s3_s4_mean=('s4', 'mean'),
    s3_s4_count=('s4', 'count')
).reset_index()
tmp_s6 = s.groupby(['s6', 'month']).agg(
    s6_s4_sum=('s4', 'sum'),
    s6_s4_mean=('s4', 'mean'),
    s6_s4_count=('s4', 'count')
).reset_index()
# Shift the statistics forward one month so that month m uses month m-1 as history
tmp_s3['month'] = tmp_s3['month'] + 1
tmp_s6['month'] = tmp_s6['month'] + 1
tmp_s3 = tmp_s3.rename(columns={'s3': 'core_cust_id'})
tmp_s6 = tmp_s6.rename(columns={'s6': 'core_cust_id'})
df = df.merge(tmp_s3, on=['core_cust_id', 'month'], how='left')
df = df.merge(tmp_s6, on=['core_cust_id', 'month'], how='left')
```
6.5 app click behavior table
This table records the interactions between users and products. Here we simply compute count statistics on these interactions. The code is as follows:
```python
# App click behavior table
r = pd.read_csv(data_root2 + 'r.csv')
r = cross_enc(r, 'core_cust_id', 'prod_code')
r = r.sort_values(['core_cust_id', 'r5']).reset_index(drop=True)
r = r.drop_duplicates(subset=['core_cust_id'], keep='last')
df = df.merge(r[['core_cust_id', 'prod_code',
                 'core_cust_id_count', 'prod_code_count',
                 'core_cust_id_prod_code_count',
                 'cross_core_cust_id_prod_code_count_div_core_cust_id_count',
                 'cross_core_cust_id_prod_code_count_div_prod_code_count']],
              on=['core_cust_id', 'prod_code'], how='left')
```
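The helper cross_enc is not shown in the baseline. Judging from the column names it produces, a plausible implementation (an assumption, not the author's original code) is a cross count encoding like the following:

```python
def cross_enc(data, col1, col2):
    """Hypothetical cross count encoding, inferred from the generated column names."""
    # Frequency of each value of col1 and of col2
    data[f'{col1}_count'] = data.groupby(col1)[col1].transform('count')
    data[f'{col2}_count'] = data.groupby(col2)[col2].transform('count')
    # Frequency of the (col1, col2) pair
    data[f'{col1}_{col2}_count'] = data.groupby([col1, col2])[col1].transform('count')
    # Share of the pair count within each single-column count
    data[f'cross_{col1}_{col2}_count_div_{col1}_count'] = (
        data[f'{col1}_{col2}_count'] / data[f'{col1}_count']
    )
    data[f'cross_{col1}_{col2}_count_div_{col2}_count'] = (
        data[f'{col1}_{col2}_count'] / data[f'{col2}_count']
    )
    return data
```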
6.6 target encoding
Here, the user and product IDs are target-encoded using 5-fold cross validation. The code is as follows:
```python
## 5-fold target encoding
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm

label = 'y'
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
enc_list = ['core_cust_id', 'prod_code']
for f in tqdm(enc_list):
    train_df[f + '_target_enc'] = 0
    test_df[f + '_target_enc'] = 0
    for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df[label])):
        trn_x = train_df[[f, label]].iloc[trn_idx].reset_index(drop=True)
        val_x = train_df[[f]].iloc[val_idx].reset_index(drop=True)
        # Mean target per category, computed only on the training folds
        enc_df = trn_x.groupby(f, as_index=False).agg(**{f + '_target_enc': (label, 'mean')})
        val_x = val_x.merge(enc_df, on=f, how='left')
        test_x = test_df[[f]].merge(enc_df, on=f, how='left')
        # Unseen categories fall back to the global target mean
        val_x[f + '_target_enc'] = val_x[f + '_target_enc'].fillna(train_df[label].mean())
        test_x[f + '_target_enc'] = test_x[f + '_target_enc'].fillna(train_df[label].mean())
        train_df.loc[val_idx, f + '_target_enc'] = val_x[f + '_target_enc'].values
        test_df[f + '_target_enc'] += test_x[f + '_target_enc'].values / skf.n_splits
```
7. Baseline results
After constructing the features, LightGBM is trained on the resulting data. The offline threshold search prints the following:
```
thre: 0.03 f2 score: 0.3942484810161383
thre: 0.031 f2 score: 0.39657528516049867
thre: 0.032 f2 score: 0.39895684622669514
thre: 0.033 f2 score: 0.40054636211507927
thre: 0.034 f2 score: 0.4026079162691184
thre: 0.034999999999999996 f2 score: 0.4044912629302788
thre: 0.036 f2 score: 0.40612958944151645
thre: 0.037 f2 score: 0.40753496138593204
thre: 0.038 f2 score: 0.40927755967690993
thre: 0.039 f2 score: 0.410503800118327
thre: 0.04 f2 score: 0.4126804328858547
thre: 0.040999999999999995 f2 score: 0.41439493397173527
thre: 0.041999999999999996 f2 score: 0.414933881594318
thre: 0.043 f2 score: 0.4164246353831421
thre: 0.044 f2 score: 0.417484643151162
thre: 0.045 f2 score: 0.4190938253084154
thre: 0.046 f2 score: 0.4201667149431947
thre: 0.047 f2 score: 0.4213196324089531
thre: 0.048 f2 score: 0.4225329108548656
thre: 0.049 f2 score: 0.42326606470288314
thre: 0.05 f2 score: 0.4239773348575973
thre: 0.051000000000000004 f2 score: 0.4246038365304421
thre: 0.052 f2 score: 0.42623418944115027
thre: 0.054 f2 score: 0.4272477986135949
thre: 0.055 f2 score: 0.4281746644036414
thre: 0.057999999999999996 f2 score: 0.428551417415209
thre: 0.059 f2 score: 0.4293245411428671
thre: 0.06 f2 score: 0.4303395127494851
thre: 0.061 f2 score: 0.43075245365321696
thre: 0.062 f2 score: 0.43146751165763464
thre: 0.063 f2 score: 0.43217145548751756
thre: 0.064 f2 score: 0.4324398249452955
thre: 0.065 f2 score: 0.433346787509632
thre: 0.066 f2 score: 0.4336518242520382
```
The online evaluation score is 0.29, so there is a considerable gap between the offline and online results.
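For completeness, the training step could look like the sketch below. The baseline does not specify its LightGBM parameters, so the settings here (and the train/validation split by month) are illustrative assumptions:

```python
import lightgbm as lgb

# Assumed offline split: July/August rows for training, September rows for validation
feats = [c for c in df.columns if c not in ['y', 'core_cust_id', 'prod_code']]  # assumes remaining columns are numeric
trn = df[df['month'].isin([7, 8])]
val = df[df['month'] == 9]

model = lgb.LGBMClassifier(
    n_estimators=500,          # illustrative settings, not the baseline's exact ones
    learning_rate=0.05,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=2020,
)
model.fit(trn[feats], trn['y'], eval_set=[(val[feats], val['y'])])

# Search the F2-optimal threshold on the validation month, then binarize the predictions
val_prob = model.predict_proba(val[feats])[:, 1]
best_th = search_best_thre(val['y'].values, val_prob)
```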
8. Outlook
- Not many tables are used in this baseline, and many product-related tables are left unused; features from those tables can be added.
- Try other boosting models, such as XGBoost and CatBoost.
- Try tuning LightGBM with an automatic hyperparameter tuning tool (see the sketch after this list).
- Model fusion (ensembling).