The 3rd Xiamen International Bank digital innovation finance cup financial marketing modeling competition BaseLine

1. Competition background

With the development of technology, banks have built rich and diverse customer touchpoints, both online and offline, to meet customers' needs for everyday business handling and channel transactions. Facing a large customer base, banks need more comprehensive and accurate insight into customers' financial needs. When developing financial products, we need to understand how attractive different products are to different customer segments, so as to identify target customer groups and carry out targeted marketing.

This competition provides customer behavior, asset information, and product transaction information from real business scenarios as the modeling data. On the one hand, it showcases each contestant's hands-on data mining skills; on the other hand, contestants are asked to propose corresponding marketing solutions based on their modeling results in the semi-finals, to fully demonstrate the value of data analysis.

2. Competition description

(1) Competition task

The task of this competition is to predict the probability that customers will purchase various financial products (certificates of deposit), and to use the prediction results as the basis for a marketing plan.

(2) Data usage rules

No external data may be used in this competition. The provided data has been desensitized, and some continuous fields (such as interest rates, prices, and amounts) have undergone a linear transformation, which does not affect modeling or model prediction results.

(3) A/B list rules

The preliminary round uses an A/B leaderboard format and lasts one and a half months in total. During the first month the leaderboard shows the results on list A (with public and private lists, split 6:4); in the last half month it switches to list B (also with public and private lists), and the leaderboard shows the results on list B, taking each contestant's highest submitted score. Final preliminary score = 0.3 × list A + 0.7 × list B.

3. Evaluation indicators

1. The preliminary round uses the A/B leaderboard system, and the final preliminary score = 0.3 × (F2 on the list A test set) + 0.7 × (F2 on the list B test set), where:

recall = TP / (TP + FN), the recall rate

precision = TP / (TP + FP), the precision rate

F2 = 5 * precision * recall / (4 * precision + recall), the F2 value

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives; the F2 value is obtained from the formula above.

4. Data description

The core task of this competition is to predict whether users will buy in October from their historical records in July, August, and September. The competition provides many data tables, which will not be detailed here; for details, see the data description provided by the organizers, which covers the field descriptions of every table in the dataset. (extraction code: zhyo)

5. Baseline idea

This is a very typical structured-data competition, so we adopt the traditional feature engineering + LightGBM scheme (XGBoost and CatBoost also work). Because the data has a strong temporal component, for offline validation we train on the July and August data and validate on September, while for the online submission we train on July, August, and September together. Note that the online metric is the F2 score, which is computed on integer 0/1 predictions, while the model outputs probabilities, so a threshold must be chosen. It is recommended to search for this threshold manually during offline validation; the code is as follows:

def search_best_thre(y_true, y_pre):
    """Grid-search the binarization threshold that maximizes the F2 score."""
    best_f2 = 0
    best_th = 0
    for i in range(100):
        th = 0.03 + i / 1000  # thresholds 0.030 .. 0.129
        y_pre_copy = y_pre.copy()
        y_pre_copy[y_pre_copy >= th] = 1
        y_pre_copy[y_pre_copy < th] = 0
        temp_f2 = f2_score(y_true, y_pre_copy)
        if temp_f2 > best_f2:
            best_f2 = temp_f2
            best_th = th
            print(f'thre: {best_th} f2 score: {best_f2}')
    return best_th

Here, f2_score is the evaluation function we implemented manually to compute the F2 score.
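The f2_score implementation is not shown in the original post; a minimal version consistent with the formula in Section 3 (and with the function name used in the snippet above) could look like this:

```python
import numpy as np

def f2_score(y_true, y_pred):
    """F2 score for binary 0/1 labels: 5*P*R / (4*P + R)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 5 * precision * recall / (4 * precision + recall)
```

Equivalently, `sklearn.metrics.fbeta_score(y_true, y_pred, beta=2)` computes the same quantity.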

6. Feature engineering

6.1 user information table

Here, the user information table is simply merged with the main table to obtain user information. The code is as follows:

# Customer information sheet
d = pd.read_csv(data_root2 + 'd.csv')
df = df.merge(d, on='core_cust_id', how='left')
del d

6.2 user risk table

Users' risk is re-rated each period, so a user may correspond to multiple risk levels. Here we keep only the most recent rating: sort by user and evaluation date, then deduplicate by user, keeping the last record, so that each user's latest risk assessment is retained. The code is as follows:

# Customer risk table
e = pd.read_csv(data_root2+'e.csv')
e = e.sort_values(['core_cust_id','e2'])
e = e.drop_duplicates(subset=['core_cust_id'],keep='last')
df = df.merge(e[['core_cust_id','e1']],on='core_cust_id',how='left')
del e

6.3 user asset information table

Users' asset information also changes over time. Here we compute the monthly average of each asset field per user as user features. The code is as follows:

# Asset information sheet
f = pd.read_csv(data_root2 + 'f.csv')
f.fillna(0, inplace=True)
map_dict = dict(zip(sorted(f['f22'].unique()), sorted(df['a3'].unique())))
f['f22'] = f['f22'].map(map_dict)

f_cols = ['f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9',
          'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16',
          'f17', 'f18', 'f19', 'f20', 'f21']
for c in f_cols:
    # strip thousands separators, then cast to float
    f[c] = f[c].apply(lambda x: str(x).replace(',', '')).astype('float')
# monthly mean of each asset column per user
f_stat = (f.groupby(['core_cust_id', 'f22'])[f_cols]
            .mean()
            .add_suffix('_mean')
            .reset_index())

df = df.merge(f_stat, left_on=['core_cust_id','a3'], right_on=['core_cust_id','f22'], how='left')
del f, f_stat

6.4 account transaction flow sheet

The user's transaction flow is also very important. The account transaction flow contains both debit-side and credit-side information, and we compute statistics for each side separately: monthly aggregates of the transaction amounts are computed per customer and joined back to the main table as historical information. For example, when predicting whether a user will buy in October, we use their September account transactions as historical features; when predicting September, we use August. The code is as follows:

# Table s: account transaction flow, where s3 and s6 are customer numbers
# that can be joined with core_cust_id in the other tables.
s = pd.read_csv(data_root2+'s.csv')
s['month'] = s['s7'].apply(lambda x: x.split('-')[1]).astype('int32')
s['s4'] = s['s4'].apply(lambda x: str(x).replace(',','')).astype('float')
tmp_s3 = s.groupby(['s3', 'month']).agg(
    s3_s4_sum=('s4', 'sum'),
    s3_s4_mean=('s4', 'mean'),
).reset_index()
tmp_s6 = s.groupby(['s6', 'month']).agg(
    s6_s4_sum=('s4', 'sum'),
    s6_s4_mean=('s4', 'mean'),
).reset_index()
# shift by one month so that month m carries month m-1's statistics (lag feature)
tmp_s3['month'] = tmp_s3['month'] + 1
tmp_s6['month'] = tmp_s6['month'] + 1
tmp_s3 = tmp_s3.rename(columns={'s3':'core_cust_id'})
tmp_s6 = tmp_s6.rename(columns={'s6':'core_cust_id'})
df = df.merge(tmp_s3,on=['core_cust_id','month'],how='left')
df = df.merge(tmp_s6,on=['core_cust_id','month'],how='left')

6.5 app click behavior table

This table records interactions between users and products. Here we simply compute count statistics of these interactions. The code is as follows:

#app click behavior table
r = pd.read_csv(data_root2+'r.csv')
r = cross_enc(r,'core_cust_id','prod_code')
r = r.sort_values(['core_cust_id','r5']).reset_index(drop=True)
r = r.drop_duplicates(subset=['core_cust_id'],keep='last')
df = df.merge(r[['core_cust_id', 'prod_code', 'core_cust_id_count',
                 'prod_code_count', 'core_cust_id_prod_code_count',
                 'cross_core_cust_id_prod_code_count_div_core_cust_id_count',
                 'cross_core_cust_id_prod_code_count_div_prod_code_count']],
              on=['core_cust_id', 'prod_code'], how='left')
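The `cross_enc` helper is not shown in the post. Judging from the column names merged above, it counts occurrences of each key and of the key pair, plus the pair count as a share of each single-key count; a sketch consistent with those names (the exact original may differ) is:

```python
import pandas as pd

def cross_enc(df, col1, col2):
    """Count occurrences of col1, col2, and the (col1, col2) pair,
    plus the pair count divided by each single-key count."""
    df = df.copy()
    df[f'{col1}_count'] = df.groupby(col1)[col1].transform('count')
    df[f'{col2}_count'] = df.groupby(col2)[col2].transform('count')
    df[f'{col1}_{col2}_count'] = df.groupby([col1, col2])[col1].transform('count')
    df[f'cross_{col1}_{col2}_count_div_{col1}_count'] = (
        df[f'{col1}_{col2}_count'] / df[f'{col1}_count'])
    df[f'cross_{col1}_{col2}_count_div_{col2}_count'] = (
        df[f'{col1}_{col2}_count'] / df[f'{col2}_count'])
    return df
```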

6.6 Target encoding

Here, the user and product IDs are target-encoded with 5-fold cross-validation. The code is as follows:

## 5-fold target encoding
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm
label = 'y'
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
enc_list = ['core_cust_id','prod_code']
for f in tqdm(enc_list):
    train_df[f + '_target_enc'] = 0
    test_df[f + '_target_enc'] = 0
    for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df[label])):
        trn_x = train_df[[f, label]].iloc[trn_idx].reset_index(drop=True)
        val_x = train_df[[f]].iloc[val_idx].reset_index(drop=True)
        enc_df = (trn_x.groupby(f, as_index=False)[label].mean()
                       .rename(columns={label: f + '_target_enc'}))
        val_x = val_x.merge(enc_df, on=f, how='left')
        test_x = test_df[[f]].merge(enc_df, on=f, how='left')
        val_x[f + '_target_enc'] = val_x[f + '_target_enc'].fillna(train_df[label].mean())
        test_x[f + '_target_enc'] = test_x[f + '_target_enc'].fillna(train_df[label].mean())
        train_df.loc[val_idx, f + '_target_enc'] = val_x[f + '_target_enc'].values
        test_df[f + '_target_enc'] += test_x[f + '_target_enc'].values / skf.n_splits

7. Baseline results

After constructing the features, LightGBM is trained on the resulting data; the offline threshold search output is as follows:

thre: 0.03 f2 score: 0.3942484810161383
thre: 0.031 f2 score: 0.39657528516049867
thre: 0.032 f2 score: 0.39895684622669514
thre: 0.033 f2 score: 0.40054636211507927
thre: 0.034 f2 score: 0.4026079162691184
thre: 0.034999999999999996 f2 score: 0.4044912629302788
thre: 0.036 f2 score: 0.40612958944151645
thre: 0.037 f2 score: 0.40753496138593204
thre: 0.038 f2 score: 0.40927755967690993
thre: 0.039 f2 score: 0.410503800118327
thre: 0.04 f2 score: 0.4126804328858547
thre: 0.040999999999999995 f2 score: 0.41439493397173527
thre: 0.041999999999999996 f2 score: 0.414933881594318
thre: 0.043 f2 score: 0.4164246353831421
thre: 0.044 f2 score: 0.417484643151162
thre: 0.045 f2 score: 0.4190938253084154
thre: 0.046 f2 score: 0.4201667149431947
thre: 0.047 f2 score: 0.4213196324089531
thre: 0.048 f2 score: 0.4225329108548656
thre: 0.049 f2 score: 0.42326606470288314
thre: 0.05 f2 score: 0.4239773348575973
thre: 0.051000000000000004 f2 score: 0.4246038365304421
thre: 0.052 f2 score: 0.42623418944115027
thre: 0.054 f2 score: 0.4272477986135949
thre: 0.055 f2 score: 0.4281746644036414
thre: 0.057999999999999996 f2 score: 0.428551417415209
thre: 0.059 f2 score: 0.4293245411428671
thre: 0.06 f2 score: 0.4303395127494851
thre: 0.061 f2 score: 0.43075245365321696
thre: 0.062 f2 score: 0.43146751165763464
thre: 0.063 f2 score: 0.43217145548751756
thre: 0.064 f2 score: 0.4324398249452955
thre: 0.065 f2 score: 0.433346787509632
thre: 0.066 f2 score: 0.4336518242520382

The online evaluation score is 0.29, so there is a sizeable gap between offline validation and the online test set.
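For the online submission, the threshold found offline is applied to the predicted probabilities to produce the required integer 0/1 labels. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def apply_threshold(prob, th):
    """Binarize predicted probabilities at the searched threshold."""
    return (np.asarray(prob) >= th).astype(int)

# e.g. with the best threshold found in the offline search above
labels = apply_threshold([0.01, 0.07, 0.50], 0.066)
```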

8. Outlook

  • This baseline does not use many of the tables; in particular, several product-related tables are unused and could be added
  • Try other boosting models, such as XGBoost and CatBoost
  • Try tuning LightGBM with automatic hyperparameter tuning tools
  • Model fusion
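As a minimal example of the last point, probabilities from different boosting models can be fused by a weighted average before thresholding; the weights here are illustrative:

```python
import numpy as np

def fuse_predictions(preds, weights=None):
    """Weighted average of per-model probability vectors.

    preds: list of equal-length probability arrays, one per model.
    weights: optional per-model weights; defaults to a simple average.
    """
    preds = np.asarray(preds, dtype=float)
    if weights is None:
        weights = np.ones(len(preds)) / len(preds)
    return np.average(preds, axis=0, weights=np.asarray(weights, dtype=float))
```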


Added by JTapp on Mon, 10 Jan 2022 07:57:16 +0200