Contents
- Import data
- Missing and outlier handling
- Feature visualization
- Feature selection
- Model training
- Model evaluation
- Converting model results to scores
- Calculating each user's total score
## 1. Import data
```python
# Import modules
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc("font", family="SimHei", size="12")  # so that Chinese labels display correctly

# Import data
train = pd.read_csv('F:\\python\\Give-me-some-credit-master\\data\\cs-training.csv')
```
A quick look at the data:

```python
# Quick overview of the data
train.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0                              150000 non-null int64
SeriousDlqin2yrs                        150000 non-null int64
RevolvingUtilizationOfUnsecuredLines    150000 non-null float64
age                                     150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null int64
DebtRatio                               150000 non-null float64
MonthlyIncome                           120269 non-null float64
NumberOfOpenCreditLinesAndLoans         150000 non-null int64
NumberOfTimes90DaysLate                 150000 non-null int64
NumberRealEstateLoansOrLines            150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null int64
NumberOfDependents                      146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB
'''
```
View the first three and the last three rows:

```python
# First three and last three rows
train.head(3).append(train.tail(3))
```
Check the shape:

```python
train.shape  # (150000, 12); it becomes (150000, 11) once the id column is set as the index below
```
Rename the raw field names to more readable column names:

```python
states = {'Unnamed: 0': 'id',
          'SeriousDlqin2yrs': 'Good and bad customers',
          'RevolvingUtilizationOfUnsecuredLines': 'Available limit ratio',
          'age': 'Age',
          'NumberOfTime30-59DaysPastDueNotWorse': 'Overdue 30-59 Tianbi number',
          'DebtRatio': 'Debt ratio',
          'MonthlyIncome': 'monthly income',
          'NumberOfOpenCreditLinesAndLoans': 'Credit quantity',
          'NumberOfTimes90DaysLate': 'Number of transactions overdue for 90 days',
          'NumberRealEstateLoansOrLines': 'Fixed asset loans',
          'NumberOfTime60-89DaysPastDueNotWorse': 'Overdue 60-89 Tianbi number',
          'NumberOfDependents': 'Number of family members'}
train.rename(columns=states, inplace=True)

# Use the id column as the index
train = train.set_index('id', drop=True)
```
Descriptive statistics:
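The corresponding call, as in the consolidated script at the end of the post:

```python
# Descriptive statistics: count, mean, std, min, quartiles and max for every column
train.describe()
```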
## 2. Missing and outlier handling

### 1. Missing value handling
View missing values
```python
# Number of missing values in each column
train.isnull().sum()
# Missing proportion per column
train.isnull().sum() / len(train)
# Visualize the missing values: keep only columns with missing data, sorted
missing = train.isnull().sum()
missing[missing > 0].sort_values().plot.bar()
```
From this we can see:

- monthly income has 29731 missing values, a missing proportion of 0.198207
- Number of family members has 3924 missing values, a missing proportion of 0.026160

First make a copy so that the original data is preserved, then handle the missing values:
```python
# Keep the original data
train_cp = train.copy()
# Fill missing monthly income with the mean
train_cp.fillna({'monthly income': train_cp['monthly income'].mean()}, inplace=True)
train_cp.isnull().sum()
# Drop the rows with a missing number of family members
train_cp = train_cp.dropna()
train_cp.shape  # (146076, 11)
```
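Monthly income is heavily right-skewed (see the histograms later on), so the median is often a more robust fill value than the mean. A sketch of that alternative, not what the original script does:

```python
# Alternative sketch: fill missing monthly income with the median instead of the mean,
# since the median is far less sensitive to the extreme incomes seen later.
train_cp_median = train.copy()
train_cp_median.fillna({'monthly income': train_cp_median['monthly income'].median()}, inplace=True)
train_cp_median = train_cp_median.dropna()
```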
### 2. Outlier handling
View outliers
```python
# View outliers: draw a box plot for each column
for col in train_cp.columns:
    plt.boxplot(train_cp[col])
    plt.title(col)
    plt.show()
```
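Box plots flag points lying more than 1.5 × IQR beyond the quartiles; a quick way to put numbers on what the plots show (a sketch, assuming the same `train_cp`):

```python
# Count, per column, how many values fall outside the usual 1.5 * IQR whiskers
for col in train_cp.columns:
    q1, q3 = train_cp[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = ((train_cp[col] < q1 - 1.5 * iqr) | (train_cp[col] > q3 + 1.5 * iqr)).sum()
    print(col, outside)
```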
Rows with an available limit ratio greater than 1 are abnormal. An age of 0 is also abnormal (strictly, anyone younger than 18 could be flagged), and the 30-59 days overdue count contains extreme outliers.

Outlier handling removes the illogical values and the extreme outliers: keep only available limit ratios below 1, drop ages of 0, and filter out overdue counts of 80 or more, fixed asset loan counts of 50 or more, and debt ratios of 5000 or more, keeping the rest:

```python
train_cp = train_cp[train_cp['Available limit ratio'] < 1]
train_cp = train_cp[train_cp['Age'] > 0]
train_cp = train_cp[train_cp['Overdue 30-59 Tianbi number'] < 80]
train_cp = train_cp[train_cp['Overdue 60-89 Tianbi number'] < 80]
train_cp = train_cp[train_cp['Number of transactions overdue for 90 days'] < 80]
train_cp = train_cp[train_cp['Fixed asset loans'] < 50]
train_cp = train_cp[train_cp['Debt ratio'] < 5000]
train_cp.shape  # (141180, 11)
```
## 3. Feature visualization

### 1. Univariate visualization

### Good and bad customers
```python
# Good and bad customers
train_cp.info()
train_cp['Good and bad customers'].value_counts()
train_cp['Good and bad customers'].value_counts() / len(train_cp)
train_cp['Good and bad customers'].value_counts().plot.bar()
'''
0    132787
1      8393
Name: Good and bad customers, dtype: int64

0    0.940551
1    0.059449
Name: Good and bad customers, dtype: float64
'''
```

![](https://img-blog.csdnimg.cn/img_convert/49c8e0fb7b04e6dd6483f2c736f80e13.png)

The target variable is heavily skewed: roughly 94% good customers versus 6% bad.

### Available limit ratio and debt ratio
```python
# Available limit ratio and debt ratio
train_cp['Available limit ratio'].plot.hist()
train_cp['Debt ratio'].plot.hist()
```
![](https://img-blog.csdnimg.cn/img_convert/85893463396bda9aed3a95f79e9fcbe9.png)![](https://img-blog.csdnimg.cn/img_convert/5f825a8610c950c2fe6e1f82508f36da.png)
```python
# Debt ratios greater than 1 distort the histogram too much, so restrict to <= 1
a = train_cp['Debt ratio']
a[a <= 1].plot.hist()
```
![](https://img-blog.csdnimg.cn/img_convert/0247238ac5d85d8ac6f1cc54ecc7e14e.png)

### 30-59, 90, and 60-89 days overdue counts
```python
# 30-59, 90, and 60-89 days overdue counts
for i, col in enumerate(['Overdue 30-59 Tianbi number', 'Number of transactions overdue for 90 days', 'Overdue 60-89 Tianbi number']):
    plt.subplot(1, 3, i + 1)
    train_cp[col].value_counts().plot.bar()
    plt.title(col)

train_cp['Overdue 30-59 Tianbi number'].value_counts().plot.bar()
train_cp['Number of transactions overdue for 90 days'].value_counts().plot.bar()
train_cp['Overdue 60-89 Tianbi number'].value_counts().plot.bar()
```
![](https://img-blog.csdnimg.cn/img_convert/9270d628d92c21cef0937a06251d84ff.png)
![](https://img-blog.csdnimg.cn/img_convert/35fe4315b9f2dc0a8f2ae925a93199d3.png)
![](https://img-blog.csdnimg.cn/img_convert/19af9f76b167dafe4e3d83d03084f40f.png)

### Age: basically in line with a normal distribution
```python
# Age
train_cp['Age'].plot.hist()
```
![](https://img-blog.csdnimg.cn/img_convert/de8b795133e01c5b8ee32d5c273a6a56.png)

### Monthly income
```python
# Monthly income
train_cp['monthly income'].plot.hist()
sns.distplot(train_cp['monthly income'])
# Extreme outliers dominate the plot, so first restrict to incomes below 50,000
a = train_cp['monthly income']
a[a <= 50000].plot.hist()
# Most incomes are far below 50,000, so narrow further to 20,000
a = train_cp['monthly income']
a[a <= 20000].plot.hist()
```
![](https://img-blog.csdnimg.cn/img_convert/f94ae42605a8673851fd362780b41dc5.png)
![](https://img-blog.csdnimg.cn/img_convert/b780d28a6cbd3eab9a18309d8a887f6a.png)

### Credit quantity
```python
# Credit quantity
train_cp['Credit quantity'].value_counts().plot.bar()
sns.distplot(train_cp['Credit quantity'])
```
![](https://img-blog.csdnimg.cn/img_convert/a9701bc0dc1bf7fe6f30779de0b52d94.png)

### Fixed asset loans
```python
# Fixed asset loans
train_cp['Fixed asset loans'].value_counts().plot.bar()
sns.distplot(train_cp['Fixed asset loans'])
```
![](https://img-blog.csdnimg.cn/img_convert/81fb9842311666ef0259711a59718f19.png)

### Number of family members
```python
# Number of family members
train_cp['Number of family members'].value_counts().plot.bar()
sns.distplot(train_cp['Number of family members'])
```
![](https://img-blog.csdnimg.cn/img_convert/6e4112eb58908b1503e20416467bf5f0.png)

### 2. Univariate and y-value visualization

### Available limit ratio
```python
# Univariate and y-value visualization
# Available limit ratio, debt ratio, age and monthly income need to be binned first

# Available limit ratio
train_cp['Available limit ratio_cut'] = pd.cut(train_cp['Available limit ratio'], 5)
pd.crosstab(train_cp['Available limit ratio_cut'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Available limit ratio_cut'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
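The same crosstab-plus-bad-rate pattern is repeated for every variable below; a small helper (hypothetical, not part of the original script) that wraps it:

```python
# Hypothetical helper: bar chart of good/bad counts plus the bad-rate curve for any column
def plot_bad_rate(col):
    ct = pd.crosstab(train_cp[col], train_cp['Good and bad customers'])
    ct.plot(kind="bar")
    bad_rate = ct[1] / (ct[0] + ct[1])
    bad_rate.plot(title='Proportion of bad users')
    return bad_rate

plot_bad_rate('Available limit ratio_cut')
```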
![](https://img-blog.csdnimg.cn/img_convert/020fcee2fe119ef169d2d1db664bef3e.png)
![](https://img-blog.csdnimg.cn/img_convert/511286c57c709fa52972d56fe51fe4f7.png)

The bad-customer rate differs by roughly a factor of six between the lowest and highest bins, so this feature discriminates reasonably well.

### Debt ratio
```python
# Debt ratio
cut = [-1, 0.2, 0.4, 0.6, 0.8, 1, 1.5, 2, 5, 10, 5000]
train_cp['Debt ratio_cut'] = pd.cut(train_cp['Debt ratio'], bins=cut)
pd.crosstab(train_cp['Debt ratio_cut'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Debt ratio_cut'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/5d712e1dd18e93cd10837947b5b8b56d.png)
![](https://img-blog.csdnimg.cn/img_convert/902b0a3108355a3a9bf278a138e94d74.png)

### Age
```python
# Age
cut = [0, 30, 40, 50, 60, 100]
train_cp['Age_cut'] = pd.cut(train_cp['Age'], bins=cut)
pd.crosstab(train_cp['Age_cut'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Age_cut'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/24592d98f67c48b206f76dd4923a5689.png)
![](https://img-blog.csdnimg.cn/img_convert/e90323f13c10b43f49daec880c116dd4.png)

Why are there so many elderly customers? That seems unrealistic; perhaps the product is mainly aimed at older users.

### Monthly income
```python
# Monthly income
cut = [0, 3000, 5000, 7000, 10000, 15000, 30000, 1000000]
train_cp['monthly income_cut'] = pd.cut(train_cp['monthly income'], bins=cut)
pd.crosstab(train_cp['monthly income_cut'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['monthly income_cut'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/3ae7f3a42a9cd2f902545e2c5d5afdb3.png)
![](https://img-blog.csdnimg.cn/img_convert/9b33632c3754eb0eea88e1181aed8f71.png)

The overdue 30-59, 90-day and 60-89 day counts, credit quantity, fixed asset loans and number of family members do not need to be binned for now:

### 30-59 days overdue
```python
# Overdue counts, credit quantity, fixed asset loans and number of family members (no binning needed)
# 30-59 days overdue
pd.crosstab(train_cp['Overdue 30-59 Tianbi number'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Overdue 30-59 Tianbi number'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/45118fcc577f1c2c8e33de23a7c46ccb.png)
![](https://img-blog.csdnimg.cn/img_convert/0e2b82a23b862209152bc8ab67d4a2db.png)

### Number of transactions overdue for 90 days
```python
# Number of transactions overdue for 90 days
pd.crosstab(train_cp['Number of transactions overdue for 90 days'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Number of transactions overdue for 90 days'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/619ba7aa94d726c2deb8f7ded4becd32.png)
![](https://img-blog.csdnimg.cn/img_convert/1a3cff30dd5ba5cc87db55437ac93ac1.png)

### 60-89 days overdue
```python
# 60-89 days overdue
pd.crosstab(train_cp['Overdue 60-89 Tianbi number'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Overdue 60-89 Tianbi number'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/71aef3943bab88d8110264d1fa91bbd2.png)
![](https://img-blog.csdnimg.cn/img_convert/f967211e55b3b4f81cb16a7d9f917cf6.png)

### Credit quantity
```python
# Credit quantity
cut = [-1, 0, 1, 2, 3, 4, 5, 10, 15, 100]
train_cp['Credit quantity_cut'] = pd.cut(train_cp['Credit quantity'], bins=cut)
pd.crosstab(train_cp['Credit quantity_cut'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Credit quantity_cut'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/43ba45c1bee369ba7f1909a895f62500.png)
![](https://img-blog.csdnimg.cn/img_convert/bc67976292ef5f92cbdb5f8fd140991b.png)

### Fixed asset loans
```python
# Fixed asset loans
pd.crosstab(train_cp['Fixed asset loans'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Fixed asset loans'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/a52fdc14c61ec3f30bec1284c12045b1.png)
![](https://img-blog.csdnimg.cn/img_convert/9b657b11bc0139b0ee101293ea3e6ee4.png)

### Number of family members
```python
# Number of family members
pd.crosstab(train_cp['Number of family members'], train_cp['Good and bad customers']).plot(kind="bar")
a = pd.crosstab(train_cp['Number of family members'], train_cp['Good and bad customers'])
a['Proportion of bad users'] = a[1] / (a[0] + a[1])
a['Proportion of bad users'].plot()
```
![](https://img-blog.csdnimg.cn/img_convert/5c524812cc1950387d735f78454a8b63.png)
![](https://img-blog.csdnimg.cn/img_convert/104d41848ba02b9ef8678a8ca1a9a8e4.png)

### 3. Correlation between variables
```python
# Correlation between variables
train_cp.corr()['Good and bad customers'].sort_values(ascending=False).plot(kind='bar')

plt.figure(figsize=(20, 16))
corr = train_cp.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            linewidths=0.2, cmap="YlGnBu", annot=True)
```
![](https://img-blog.csdnimg.cn/img_convert/b2a5d74d25bb25125829d5967b98a085.png)
![](https://img-blog.csdnimg.cn/img_convert/61279dea5973fdb8e3d7fb597cd320f0.png)

## 4. Feature selection

### 1. WOE binning
```python
# WOE binning
cut1 = pd.qcut(train_cp["Available limit ratio"], 4, labels=False)
cut2 = pd.qcut(train_cp["Age"], 8, labels=False)
bins3 = [-1, 0, 1, 3, 5, 13]
cut3 = pd.cut(train_cp["Overdue 30-59 Tianbi number"], bins3, labels=False)
cut4 = pd.qcut(train_cp["Debt ratio"], 3, labels=False)
cut5 = pd.qcut(train_cp["monthly income"], 4, labels=False)
cut6 = pd.qcut(train_cp["Credit quantity"], 4, labels=False)
bins7 = [-1, 0, 1, 3, 5, 20]
cut7 = pd.cut(train_cp["Number of transactions overdue for 90 days"], bins7, labels=False)
bins8 = [-1, 0, 1, 2, 3, 33]
cut8 = pd.cut(train_cp["Fixed asset loans"], bins8, labels=False)
bins9 = [-1, 0, 1, 3, 12]
cut9 = pd.cut(train_cp["Overdue 60-89 Tianbi number"], bins9, labels=False)
bins10 = [-1, 0, 1, 2, 3, 5, 21]
cut10 = pd.cut(train_cp["Number of family members"], bins10, labels=False)
```
### 2. WOE value calculation
WOE compares the bad-to-good odds within each bin with the bad-to-good odds over the whole sample, on a log scale.
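In formula form, writing $bad_i$, $good_i$ for the counts in bin $i$ and $bad_T$, $good_T$ for the totals, the quantity computed by `get_woe_data` below is

$$
\mathrm{WOE}_i = \ln\frac{bad_i/good_i}{bad_T/good_T} = \ln\frac{bad_i/bad_T}{good_i/good_T}
$$

where `rate` in the code is the overall odds $bad_T/good_T$.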
```python
# WOE calculation
# rate = total bad / total good (overall odds)
rate = train_cp["Good and bad customers"].sum() / (train_cp["Good and bad customers"].count() - train_cp["Good and bad customers"].sum())

def get_woe_data(cut):
    grouped = train_cp["Good and bad customers"].groupby(cut, as_index=True).value_counts()
    woe = np.log(grouped.unstack().iloc[:, 1] / grouped.unstack().iloc[:, 0] / rate)
    return woe

cut1_woe = get_woe_data(cut1)
cut2_woe = get_woe_data(cut2)
cut3_woe = get_woe_data(cut3)
cut4_woe = get_woe_data(cut4)
cut5_woe = get_woe_data(cut5)
cut6_woe = get_woe_data(cut6)
cut7_woe = get_woe_data(cut7)
cut8_woe = get_woe_data(cut8)
cut9_woe = get_woe_data(cut9)
cut10_woe = get_woe_data(cut10)
```
Visualize:
```python
l = [cut1_woe, cut2_woe, cut3_woe, cut4_woe, cut5_woe, cut6_woe, cut7_woe, cut8_woe, cut9_woe, cut10_woe]
for i, col in enumerate(l):
    col.plot()
```

### 3. IV value calculation
The IV of a variable is the sum over its bins of WOE multiplied by (the proportion of all bad customers falling in the bin minus the proportion of all good customers falling in the bin).
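In formula form, this is exactly the sum that `get_IV_data` below evaluates over the bins of each variable:

$$
\mathrm{IV} = \sum_i \left(\frac{bad_i}{bad_T} - \frac{good_i}{good_T}\right)\cdot \mathrm{WOE}_i
$$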
```python
# IV value calculation
def get_IV_data(cut, cut_woe):
    grouped = train_cp["Good and bad customers"].groupby(cut, as_index=True).value_counts()
    cut_IV = ((grouped.unstack().iloc[:, 1] / train_cp["Good and bad customers"].sum()
               - grouped.unstack().iloc[:, 0] / (train_cp["Good and bad customers"].count() - train_cp["Good and bad customers"].sum()))
              * cut_woe).sum()
    return cut_IV

# Calculate the IV value of each variable
cut1_IV = get_IV_data(cut1, cut1_woe)
cut2_IV = get_IV_data(cut2, cut2_woe)
cut3_IV = get_IV_data(cut3, cut3_woe)
cut4_IV = get_IV_data(cut4, cut4_woe)
cut5_IV = get_IV_data(cut5, cut5_woe)
cut6_IV = get_IV_data(cut6, cut6_woe)
cut7_IV = get_IV_data(cut7, cut7_woe)
cut8_IV = get_IV_data(cut8, cut8_woe)
cut9_IV = get_IV_data(cut9, cut9_woe)
cut10_IV = get_IV_data(cut10, cut10_woe)

IV = pd.DataFrame([cut1_IV, cut2_IV, cut3_IV, cut4_IV, cut5_IV, cut6_IV, cut7_IV, cut8_IV, cut9_IV, cut10_IV],
                  index=['Available limit ratio', 'Age', 'Overdue 30-59 Tianbi number', 'Debt ratio', 'monthly income',
                         'Credit quantity', 'Number of transactions overdue for 90 days', 'Fixed asset loans',
                         'Overdue 60-89 Tianbi number', 'Number of family members'],
                  columns=['IV'])
iv = IV.plot.bar(color='b', alpha=0.3, rot=30, figsize=(10, 5), fontsize=(10))
iv.set_title('Characteristic variables and IV value distribution', fontsize=(15))
iv.set_xlabel('Characteristic variable', fontsize=(15))
iv.set_ylabel('IV', fontsize=(15))
```
Generally, characteristic variables with an IV greater than 0.02 are kept for subsequent training. As the chart above shows, all variables meet this requirement, so all of them are selected (a programmatic version of the filter is sketched below).
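If some variables had fallen below the threshold, the filter could be applied programmatically; a sketch using the `IV` DataFrame built above:

```python
# Keep only the features whose information value exceeds the 0.02 rule of thumb
selected = IV[IV['IV'] > 0.02].index.tolist()
print(selected)
```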
### 4. WOE conversion
```python
df_new = pd.DataFrame()  # df_new stores the WOE-converted data

def replace_data(cut, cut_woe):
    a = []
    for i in cut.unique():
        a.append(i)
    a.sort()
    for m in range(len(a)):
        cut.replace(a[m], cut_woe.values[m], inplace=True)
    return cut

df_new["Good and bad customers"] = train_cp["Good and bad customers"]
df_new["Available limit ratio"] = replace_data(cut1, cut1_woe)
df_new["Age"] = replace_data(cut2, cut2_woe)
df_new["Overdue 30-59 Tianbi number"] = replace_data(cut3, cut3_woe)
df_new["Debt ratio"] = replace_data(cut4, cut4_woe)
df_new["monthly income"] = replace_data(cut5, cut5_woe)
df_new["Credit quantity"] = replace_data(cut6, cut6_woe)
df_new["Number of transactions overdue for 90 days"] = replace_data(cut7, cut7_woe)
df_new["Fixed asset loans"] = replace_data(cut8, cut8_woe)
df_new["Overdue 60-89 Tianbi number"] = replace_data(cut9, cut9_woe)
df_new["Number of family members"] = replace_data(cut10, cut10_woe)
df_new.head()
```
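`replace_data` mutates the cut series in place via `replace`; an equivalent non-mutating alternative (a sketch, not the original approach) maps each bin label directly to its WOE value:

```python
# Sketch of a non-mutating alternative: map the sorted bin labels (0, 1, ...) to their WOE values
def to_woe(cut, cut_woe):
    mapping = dict(zip(sorted(cut.unique()), cut_woe.values))
    return cut.map(mapping)

# e.g. df_new["Available limit ratio"] = to_woe(cut1, cut1_woe)
```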
## 5. Model training

The main algorithm used in credit scorecards is logistic regression. Compared with more complex models, a logistic model is less sensitive to shifts in the customer population, so it is more stable and robust. It is also intuitive: the coefficients are easy to explain and understand, and logistic regression yields a linear relationship between the variables and their weights, which makes it straightforward to convert the model into a one-to-one score later.
model training
```python
# Model training
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x = df_new.iloc[:, 1:]
y = df_new.iloc[:, :1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.6, random_state=0)
model = LogisticRegression()
clf = model.fit(x_train, y_train)
print('Test results:{}'.format(clf.score(x_test, y_test)))
```
The test accuracy is 0.9427, which looks very high, but that is mostly an artifact of the heavily skewed classes; the real assessment has to rely on the AUC.
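A quick sanity check on why 0.94 is unimpressive here: predicting "good" for every customer already matches the majority-class share seen earlier (a sketch, assuming the same `y_test`):

```python
# Majority-class baseline: the share of good customers in the test split is already about 0.94
baseline_accuracy = (y_test['Good and bad customers'] == 0).mean()
print(baseline_accuracy)
```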
Compute the feature weight coefficients `coe`; they will be needed later when the training results are converted into scores:

```python
coe = clf.coef_  # feature weight coefficients, used later for the scoring rules
coe
'''
array([[0.62805638, 0.46284749, 0.54319513, 1.14645109, 0.42744108,
        0.2503357 , 0.59564263, 0.81828033, 0.4433141 , 0.23788103]])
'''
```
## 6. Model evaluation
Model evaluation mainly depends on AUC and K-S values
```python
# Model evaluation
from sklearn.metrics import roc_curve, auc

y_pred = clf.predict(x_test)  # hard-label predictions for the test split
fpr, tpr, threshold = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC_curve')
plt.legend(loc="lower right")
plt.show()
roc_auc  # 0.5756615527156178
```
KS curve:

```python
# KS
fig, ax = plt.subplots()
# KS curves are drawn with predictions sorted in descending order, hence 1 - threshold on the x-axis
ax.plot(1 - threshold, tpr, label='tpr')
ax.plot(1 - threshold, fpr, label='fpr')
ax.plot(1 - threshold, tpr - fpr, label='KS')
plt.xlabel('score')
plt.title('KS Curve')
plt.ylim([0.0, 1.0])
plt.figure(figsize=(20, 20))
legend = ax.legend(loc='upper left')
plt.show()
max(tpr - fpr)  # 0.1513231054312355
```
The AUC is about 0.58 and the K-S value (the maximum of TPR minus FPR) is about 0.15, so the model performs only moderately.
Why is the accuracy so high while the AUC and KS are so low? This is caused by the imbalance of the samples.
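Part of the gap also comes from feeding `roc_curve` the hard 0/1 predictions instead of probabilities. A sketch of two common adjustments (assuming the same `clf` and splits): score the ROC on predicted probabilities, and re-fit with balanced class weights to counter the roughly 94/6 skew:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# 1. AUC on predicted probabilities rather than hard labels
proba = clf.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test['Good and bad customers'], proba))

# 2. Re-fit with balanced class weights to counter the class imbalance
clf_balanced = LogisticRegression(class_weight='balanced').fit(x_train, y_train.values.ravel())
print(roc_auc_score(y_test['Good and bad customers'], clf_balanced.predict_proba(x_test)[:, 1]))
```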
## 7. Converting model results to scores

Assume a score of 600 corresponds to good/bad odds of 20, and every additional 20 points doubles the good/bad odds.
Now compute the score contribution corresponding to each WOE value of every variable:
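In other words, with PDO = 20 points to double the odds and a base score of 600 at good/bad odds of 20:1, the scaling constants used below are

$$
factor = \frac{20}{\ln 2}, \qquad offset = 600 - factor \cdot \ln 20,
$$

so that $score = offset + factor \cdot \ln(odds)$ equals 600 at odds of 20 and rises by 20 points each time the odds double; each bin of variable $j$ then contributes $\mathrm{round}(coef_j \cdot \mathrm{WOE}_j \cdot factor)$ points. Note that the script computes `offset` but never uses it: the final total in section 8 simply adds a flat 600 to the summed bin scores.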
```python
# Converting model results to scores
factor = 20 / np.log(2)
offset = 600 - 20 * np.log(20) / np.log(2)

def get_score(coe, woe, factor):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    return scores

x1 = get_score(coe[0][0], cut1_woe, factor)
x2 = get_score(coe[0][1], cut2_woe, factor)
x3 = get_score(coe[0][2], cut3_woe, factor)
x4 = get_score(coe[0][3], cut4_woe, factor)
x5 = get_score(coe[0][4], cut5_woe, factor)
x6 = get_score(coe[0][5], cut6_woe, factor)
x7 = get_score(coe[0][6], cut7_woe, factor)
x8 = get_score(coe[0][7], cut8_woe, factor)
x9 = get_score(coe[0][8], cut9_woe, factor)
x10 = get_score(coe[0][9], cut10_woe, factor)
print("Score corresponding to available limit ratio:{}".format(x1))
print("Score corresponding to age:{}".format(x2))
print("Score corresponding to the number of transactions overdue 30-59 days:{}".format(x3))
print("Score corresponding to debt ratio:{}".format(x4))
print("Score corresponding to monthly income:{}".format(x5))
print("Score corresponding to credit quantity:{}".format(x6))
print("Score corresponding to the number of transactions overdue 90 days:{}".format(x7))
print("Score corresponding to fixed asset loans:{}".format(x8))
print("Score corresponding to the number of transactions overdue 60-89 days:{}".format(x9))
print("Score corresponding to the number of family members:{}".format(x10))
```
Score corresponding to available limit ratio: [-22.0, -21.0, -5.0, 19.0]
Score corresponding to age: [7.0, 5.0, 3.0, 2.0, -0.0, -5.0, -11.0, -14.0]
Score corresponding to the number of transactions overdue 30-59 days: [-7.0, 14.0, 27.0, 37.0, 41.0]
Score corresponding to debt ratio: [-5.0, -2.0, 6.0]
Score corresponding to monthly income: [4.0, 1.0, -2.0, -4.0]
Score corresponding to credit quantity: [2.0, -2.0, -1.0, 0.0]
Score corresponding to the number of transactions overdue 90 days: [-6.0, 34.0, 48.0, 56.0, 57.0]
Score corresponding to fixed asset loans: [5.0, -6.0, -3.0, 2.0, 16.0]
Score corresponding to the number of transactions overdue 60-89 days: [-3.0, 23.0, 35.0, 38.0]
Score corresponding to the number of family members: [-1.0, 1.0, 1.0, 2.0, 3.0, 5.0]
The higher the score, the more likely the customer is to be bad. For example, the older the age bin, the lower its bad-customer rate and hence the lower its score, while the available limit ratio and the overdue counts have the widest score spans and therefore the biggest influence on the final total score, which is consistent with the earlier exploratory analysis.
## 8. Calculating each user's total score

### 1. Retrieve the bin boundaries from the automatic (quantile) binning
```python
cu1 = pd.qcut(train_cp["Available limit ratio"], 4, labels=False, retbins=True)
bins1 = cu1[1]
cu2 = pd.qcut(train_cp["Age"], 8, labels=False, retbins=True)
bins2 = cu2[1]
# bins3 was set manually above: bins3 = [-1, 0, 1, 3, 5, 13]
# cut3 = pd.cut(train_cp["Overdue 30-59 Tianbi number"], bins3, labels=False)
cu4 = pd.qcut(train_cp["Debt ratio"], 3, labels=False, retbins=True)
bins4 = cu4[1]
cu5 = pd.qcut(train_cp["monthly income"], 4, labels=False, retbins=True)
bins5 = cu5[1]
cu6 = pd.qcut(train_cp["Credit quantity"], 4, labels=False, retbins=True)
bins6 = cu6[1]
```

### 2. Sum the scores for each variable to calculate each user's total score
```python
# Sum the scores corresponding to each variable to calculate the total score of each user
def compute_score(series, bins, score):
    list = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(bins) - 2
        m = len(bins) - 2
        while j >= 0:
            if value >= bins[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        list.append(score[m])
        i += 1
    return list

path2 = r'F:\\python\\Give-me-some-credit-master\\data\\cs-test.csv'
test1 = pd.read_csv(path2)
test1['x1'] = pd.Series(compute_score(test1['RevolvingUtilizationOfUnsecuredLines'], bins1, x1))
test1['x2'] = pd.Series(compute_score(test1['age'], bins2, x2))
test1['x3'] = pd.Series(compute_score(test1['NumberOfTime30-59DaysPastDueNotWorse'], bins3, x3))
test1['x4'] = pd.Series(compute_score(test1['DebtRatio'], bins4, x4))
test1['x5'] = pd.Series(compute_score(test1['MonthlyIncome'], bins5, x5))
test1['x6'] = pd.Series(compute_score(test1['NumberOfOpenCreditLinesAndLoans'], bins6, x6))
test1['x7'] = pd.Series(compute_score(test1['NumberOfTimes90DaysLate'], bins7, x7))
test1['x8'] = pd.Series(compute_score(test1['NumberRealEstateLoansOrLines'], bins8, x8))
test1['x9'] = pd.Series(compute_score(test1['NumberOfTime60-89DaysPastDueNotWorse'], bins9, x9))
test1['x10'] = pd.Series(compute_score(test1['NumberOfDependents'], bins10, x10))
test1['Score'] = (test1['x1'] + test1['x2'] + test1['x3'] + test1['x4'] + test1['x5']
                  + test1['x6'] + test1['x7'] + test1['x8'] + test1['x9'] + test1['x10'] + 600)
test1.to_csv(r'F:\\python\\Give-me-some-credit-master\\data\\ScoreData.csv', index=False)
```
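`compute_score` walks every value through the bin edges in a plain Python loop; a vectorized sketch of the same lookup using `pd.cut` (an alternative, not the original approach; edge handling differs slightly, since `pd.cut` uses right-closed bins and yields NaN outside the outermost edges):

```python
# Vectorized sketch: pd.cut assigns each value its bin index, which indexes the score list
def compute_score_vectorized(series, bins, score):
    idx = pd.cut(series, bins=bins, labels=False, include_lowest=True)
    return idx.map(lambda i: score[int(i)] if pd.notna(i) else np.nan)

# e.g. test1['x2'] = compute_score_vectorized(test1['age'], bins2, x2)
```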
Article reprint: https://www.cnblogs.com/daliner/p/10268350.html
Full code:

```python
# -*- coding: utf-8 -*-
"""
Created on Tue Aug 11 14:09:20 2020

@author: Admin
"""
#Import modules
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc("font",family="SimHei",size="12") #so that Chinese labels display correctly

#Import data
train=pd.read_csv('F:\\python\\Give-me-some-credit-master\\data\\cs-training.csv')

#Quick overview
train.info()

#First three and last three rows
b=train.head(3).append(train.tail(3))

#shape
train.shape

#Rename the raw fields to readable column names
states={'Unnamed: 0':'id',
        'SeriousDlqin2yrs':'Good and bad customers',
        'RevolvingUtilizationOfUnsecuredLines':'Available limit ratio',
        'age':'Age',
        'NumberOfTime30-59DaysPastDueNotWorse':'Overdue 30-59 Tianbi number',
        'DebtRatio':'Debt ratio',
        'MonthlyIncome':'monthly income',
        'NumberOfOpenCreditLinesAndLoans':'Credit quantity',
        'NumberOfTimes90DaysLate':'Number of transactions overdue for 90 days',
        'NumberRealEstateLoansOrLines':'Fixed asset loans',
        'NumberOfTime60-89DaysPastDueNotWorse':'Overdue 60-89 Tianbi number',
        'NumberOfDependents':'Number of family members'}
train.rename(columns=states,inplace=True)
#Use id as the index
train=train.set_index('id',drop=True)

#Descriptive statistics
train.describe()

#Missing values per column
train.isnull().sum()
#Missing proportion
train.isnull().sum()/len(train)
#Missing value visualization
missing=train.isnull().sum()
missing[missing>0].sort_values().plot.bar()

#Keep the original data
train_cp=train.copy()
#Fill missing monthly income with the mean
train_cp.fillna({'monthly income':train_cp['monthly income'].mean()},inplace=True)
train_cp.isnull().sum()
#Drop rows with a missing number of family members
train_cp=train_cp.dropna()
train_cp.shape #(146076, 11)

#View outliers: box plot for each column
for col in train_cp.columns:
    plt.boxplot(train_cp[col])
    plt.title(col)
    plt.show()

#Outlier handling
train_cp=train_cp[train_cp['Available limit ratio']<1]
train_cp=train_cp[train_cp['Age']>0]
train_cp=train_cp[train_cp['Overdue 30-59 Tianbi number']<80]
train_cp=train_cp[train_cp['Overdue 60-89 Tianbi number']<80]
train_cp=train_cp[train_cp['Number of transactions overdue for 90 days']<80]
train_cp=train_cp[train_cp['Fixed asset loans']<50]
train_cp=train_cp[train_cp['Debt ratio']<5000]
train_cp.shape #(141180, 11)

#Univariate analysis
#Good and bad customers
train_cp.info()
train_cp['Good and bad customers'].value_counts()
train_cp['Good and bad customers'].value_counts()/len(train_cp)
train_cp['Good and bad customers'].value_counts().plot.bar()

#Available limit ratio and debt ratio
train_cp['Available limit ratio'].plot.hist()
train_cp['Debt ratio'].plot.hist()
#Debt ratios greater than 1 distort the histogram too much
a=train_cp['Debt ratio']
a[a<=1].plot.hist()

#30-59, 90 and 60-89 days overdue counts
for i,col in enumerate(['Overdue 30-59 Tianbi number','Number of transactions overdue for 90 days','Overdue 60-89 Tianbi number']):
    plt.subplot(1,3,i+1)
    train_cp[col].value_counts().plot.bar()
    plt.title(col)

train_cp['Overdue 30-59 Tianbi number'].value_counts().plot.bar()
train_cp['Number of transactions overdue for 90 days'].value_counts().plot.bar()
train_cp['Overdue 60-89 Tianbi number'].value_counts().plot.bar()

#Age
train_cp['Age'].plot.hist()

#monthly income
train_cp['monthly income'].plot.hist()
sns.distplot(train_cp['monthly income'])
#Extreme outliers dominate, so restrict to incomes below 50,000
a=train_cp['monthly income']
a[a<=50000].plot.hist()
#Most incomes are far below 50,000, so narrow further to 20,000
a=train_cp['monthly income']
a[a<=20000].plot.hist()

#Credit quantity
train_cp['Credit quantity'].value_counts().plot.bar()
sns.distplot(train_cp['Credit quantity'])

#Fixed asset loans
train_cp['Fixed asset loans'].value_counts().plot.bar()
sns.distplot(train_cp['Fixed asset loans'])

#Number of family members
train_cp['Number of family members'].value_counts().plot.bar()
sns.distplot(train_cp['Number of family members'])

#Univariate and y-value visualization
#Available limit ratio, debt ratio, age and monthly income need to be binned first
#Available limit ratio
train_cp['Available limit ratio_cut']=pd.cut(train_cp['Available limit ratio'],5)
pd.crosstab(train_cp['Available limit ratio_cut'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Available limit ratio_cut'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Debt ratio
cut=[-1,0.2,0.4,0.6,0.8,1,1.5,2,5,10,5000]
train_cp['Debt ratio_cut']=pd.cut(train_cp['Debt ratio'],bins=cut)
pd.crosstab(train_cp['Debt ratio_cut'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Debt ratio_cut'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Age
cut=[0,30,40,50,60,100]
train_cp['Age_cut']=pd.cut(train_cp['Age'],bins=cut)
pd.crosstab(train_cp['Age_cut'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Age_cut'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#monthly income
cut=[0,3000,5000,7000,10000,15000,30000,1000000]
train_cp['monthly income_cut']=pd.cut(train_cp['monthly income'],bins=cut)
pd.crosstab(train_cp['monthly income_cut'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['monthly income_cut'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Overdue counts, credit quantity, fixed asset loans and number of family members (no binning needed)
#30-59 days overdue
pd.crosstab(train_cp['Overdue 30-59 Tianbi number'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Overdue 30-59 Tianbi number'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Number of transactions overdue for 90 days
pd.crosstab(train_cp['Number of transactions overdue for 90 days'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Number of transactions overdue for 90 days'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#60-89 days overdue
pd.crosstab(train_cp['Overdue 60-89 Tianbi number'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Overdue 60-89 Tianbi number'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Credit quantity
cut=[-1,0,1,2,3,4,5,10,15,100]
train_cp['Credit quantity_cut']=pd.cut(train_cp['Credit quantity'],bins=cut)
pd.crosstab(train_cp['Credit quantity_cut'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Credit quantity_cut'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Fixed asset loans
pd.crosstab(train_cp['Fixed asset loans'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Fixed asset loans'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Number of family members
pd.crosstab(train_cp['Number of family members'],train_cp['Good and bad customers']).plot(kind="bar")
a=pd.crosstab(train_cp['Number of family members'],train_cp['Good and bad customers'])
a['Proportion of bad users']=a[1]/(a[0]+a[1])
a['Proportion of bad users'].plot()

#Correlation between variables
train_cp.corr()['Good and bad customers'].sort_values(ascending=False).plot(kind='bar')

plt.figure(figsize=(20,16))
corr=train_cp.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            linewidths=0.2, cmap="YlGnBu", annot=True)

#WOE binning
cut1=pd.qcut(train_cp["Available limit ratio"],4,labels=False)
cut2=pd.qcut(train_cp["Age"],8,labels=False)
bins3=[-1,0,1,3,5,13]
cut3=pd.cut(train_cp["Overdue 30-59 Tianbi number"],bins3,labels=False)
cut4=pd.qcut(train_cp["Debt ratio"],3,labels=False)
cut5=pd.qcut(train_cp["monthly income"],4,labels=False)
cut6=pd.qcut(train_cp["Credit quantity"],4,labels=False)
bins7=[-1, 0, 1, 3, 5, 20]
cut7=pd.cut(train_cp["Number of transactions overdue for 90 days"],bins7,labels=False)
bins8=[-1, 0, 1, 2, 3, 33]
cut8=pd.cut(train_cp["Fixed asset loans"],bins8,labels=False)
bins9=[-1, 0, 1, 3, 12]
cut9=pd.cut(train_cp["Overdue 60-89 Tianbi number"],bins9,labels=False)
bins10=[-1, 0, 1, 2, 3, 5, 21]
cut10=pd.cut(train_cp["Number of family members"],bins10,labels=False)

#WOE calculation
rate=train_cp["Good and bad customers"].sum()/(train_cp["Good and bad customers"].count()-train_cp["Good and bad customers"].sum()) #rate = total bad / total good
def get_woe_data(cut):
    grouped=train_cp["Good and bad customers"].groupby(cut,as_index=True).value_counts()
    woe=np.log(grouped.unstack().iloc[:,1]/grouped.unstack().iloc[:,0]/rate)
    return woe
cut1_woe=get_woe_data(cut1)
cut2_woe=get_woe_data(cut2)
cut3_woe=get_woe_data(cut3)
cut4_woe=get_woe_data(cut4)
cut5_woe=get_woe_data(cut5)
cut6_woe=get_woe_data(cut6)
cut7_woe=get_woe_data(cut7)
cut8_woe=get_woe_data(cut8)
cut9_woe=get_woe_data(cut9)
cut10_woe=get_woe_data(cut10)

l=[cut1_woe,cut2_woe,cut3_woe,cut4_woe,cut5_woe,cut6_woe,cut7_woe,cut8_woe,cut9_woe,cut10_woe]
for i,col in enumerate(l):
    col.plot()

#IV value calculation
def get_IV_data(cut,cut_woe):
    grouped=train_cp["Good and bad customers"].groupby(cut,as_index=True).value_counts()
    cut_IV=((grouped.unstack().iloc[:,1]/train_cp["Good and bad customers"].sum()-grouped.unstack().iloc[:,0]/(train_cp["Good and bad customers"].count()-train_cp["Good and bad customers"].sum()))*cut_woe).sum()
    return cut_IV

#IV value of each variable
cut1_IV=get_IV_data(cut1,cut1_woe)
cut2_IV=get_IV_data(cut2,cut2_woe)
cut3_IV=get_IV_data(cut3,cut3_woe)
cut4_IV=get_IV_data(cut4,cut4_woe)
cut5_IV=get_IV_data(cut5,cut5_woe)
cut6_IV=get_IV_data(cut6,cut6_woe)
cut7_IV=get_IV_data(cut7,cut7_woe)
cut8_IV=get_IV_data(cut8,cut8_woe)
cut9_IV=get_IV_data(cut9,cut9_woe)
cut10_IV=get_IV_data(cut10,cut10_woe)

IV=pd.DataFrame([cut1_IV,cut2_IV,cut3_IV,cut4_IV,cut5_IV,cut6_IV,cut7_IV,cut8_IV,cut9_IV,cut10_IV],
                index=['Available limit ratio','Age','Overdue 30-59 Tianbi number','Debt ratio','monthly income','Credit quantity','Number of transactions overdue for 90 days','Fixed asset loans','Overdue 60-89 Tianbi number','Number of family members'],
                columns=['IV'])
iv=IV.plot.bar(color='b',alpha=0.3,rot=30,figsize=(10,5),fontsize=(10))
iv.set_title('Characteristic variables and IV value distribution',fontsize=(15))
iv.set_xlabel('Characteristic variable',fontsize=(15))
iv.set_ylabel('IV',fontsize=(15))

#WOE conversion
df_new=pd.DataFrame() #df_new stores the WOE-converted data
def replace_data(cut,cut_woe):
    a=[]
    for i in cut.unique():
        a.append(i)
    a.sort()
    for m in range(len(a)):
        cut.replace(a[m],cut_woe.values[m],inplace=True)
    return cut
df_new["Good and bad customers"]=train_cp["Good and bad customers"]
df_new["Available limit ratio"]=replace_data(cut1,cut1_woe)
df_new["Age"]=replace_data(cut2,cut2_woe)
df_new["Overdue 30-59 Tianbi number"]=replace_data(cut3,cut3_woe)
df_new["Debt ratio"]=replace_data(cut4,cut4_woe)
df_new["monthly income"]=replace_data(cut5,cut5_woe)
df_new["Credit quantity"]=replace_data(cut6,cut6_woe)
df_new["Number of transactions overdue for 90 days"]=replace_data(cut7,cut7_woe)
df_new["Fixed asset loans"]=replace_data(cut8,cut8_woe)
df_new["Overdue 60-89 Tianbi number"]=replace_data(cut9,cut9_woe)
df_new["Number of family members"]=replace_data(cut10,cut10_woe)
df_new.head()

#Model training
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
x=df_new.iloc[:,1:]
y=df_new.iloc[:,:1]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.6,random_state=0)
model=LogisticRegression()
clf=model.fit(x_train,y_train)
print('Test results:{}'.format(clf.score(x_test,y_test)))

#Coefficients
coe=clf.coef_ #feature weight coefficients, used later for the scoring rules
coe

#Predictions on the test set
y_pred=clf.predict(x_test)

#Model evaluation
from sklearn.metrics import roc_curve, auc
fpr, tpr, threshold = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC_curve')
plt.legend(loc="lower right")
plt.show()
roc_auc #0.5756615527156178

#KS
fig, ax = plt.subplots()
ax.plot(1 - threshold, tpr, label='tpr') #KS curves use predictions in descending order, hence 1 - threshold
ax.plot(1 - threshold, fpr, label='fpr')
ax.plot(1 - threshold, tpr-fpr, label='KS')
plt.xlabel('score')
plt.title('KS Curve')
plt.ylim([0.0, 1.0])
plt.figure(figsize=(20,20))
legend = ax.legend(loc='upper left')
plt.show()
max(tpr-fpr) # 0.1513231054312355

#Converting model results to scores
factor = 20 / np.log(2)
offset = 600 - 20 * np.log(20) / np.log(2)

def get_score(coe,woe,factor):
    scores=[]
    for w in woe:
        score=round(coe*w*factor,0)
        scores.append(score)
    return scores

x1 = get_score(coe[0][0], cut1_woe, factor)
x2 = get_score(coe[0][1], cut2_woe, factor)
x3 = get_score(coe[0][2], cut3_woe, factor)
x4 = get_score(coe[0][3], cut4_woe, factor)
x5 = get_score(coe[0][4], cut5_woe, factor)
x6 = get_score(coe[0][5], cut6_woe, factor)
x7 = get_score(coe[0][6], cut7_woe, factor)
x8 = get_score(coe[0][7], cut8_woe, factor)
x9 = get_score(coe[0][8], cut9_woe, factor)
x10 = get_score(coe[0][9], cut10_woe, factor)
print("Score corresponding to available limit ratio:{}".format(x1))
print("Score corresponding to age:{}".format(x2))
print("Score corresponding to the number of transactions overdue 30-59 days:{}".format(x3))
print("Score corresponding to debt ratio:{}".format(x4))
print("Score corresponding to monthly income:{}".format(x5))
print("Score corresponding to credit quantity:{}".format(x6))
print("Score corresponding to the number of transactions overdue 90 days:{}".format(x7))
print("Score corresponding to fixed asset loans:{}".format(x8))
print("Score corresponding to the number of transactions overdue 60-89 days:{}".format(x9))
print("Score corresponding to the number of family members:{}".format(x10))

#1. Retrieve the bin boundaries from the automatic binning
cu1=pd.qcut(train_cp["Available limit ratio"],4,labels=False,retbins=True)
bins1=cu1[1]
cu2=pd.qcut(train_cp["Age"],8,labels=False,retbins=True)
bins2=cu2[1]
# bins3=[-1,0,1,3,5,13]
# cut3=pd.cut(train_cp["Overdue 30-59 Tianbi number"],bins3,labels=False)
cu4=pd.qcut(train_cp["Debt ratio"],3,labels=False,retbins=True)
bins4=cu4[1]
cu5=pd.qcut(train_cp["monthly income"],4,labels=False,retbins=True)
bins5=cu5[1]
cu6=pd.qcut(train_cp["Credit quantity"],4,labels=False,retbins=True)
bins6=cu6[1]

#2. Sum the scores for each variable to calculate each user's total score
def compute_score(series,bins,score):
    list = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(bins) - 2
        m = len(bins) - 2
        while j >= 0:
            if value >= bins[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        list.append(score[m])
        i += 1
    return list

path2=r'F:\\python\\Give-me-some-credit-master\\data\\cs-test.csv'
test1 = pd.read_csv(path2)
test1['x1'] = pd.Series(compute_score(test1['RevolvingUtilizationOfUnsecuredLines'], bins1, x1))
test1['x2'] = pd.Series(compute_score(test1['age'], bins2, x2))
test1['x3'] = pd.Series(compute_score(test1['NumberOfTime30-59DaysPastDueNotWorse'], bins3, x3))
test1['x4'] = pd.Series(compute_score(test1['DebtRatio'], bins4, x4))
test1['x5'] = pd.Series(compute_score(test1['MonthlyIncome'], bins5, x5))
test1['x6'] = pd.Series(compute_score(test1['NumberOfOpenCreditLinesAndLoans'], bins6, x6))
test1['x7'] = pd.Series(compute_score(test1['NumberOfTimes90DaysLate'], bins7, x7))
test1['x8'] = pd.Series(compute_score(test1['NumberRealEstateLoansOrLines'], bins8, x8))
test1['x9'] = pd.Series(compute_score(test1['NumberOfTime60-89DaysPastDueNotWorse'], bins9, x9))
test1['x10'] = pd.Series(compute_score(test1['NumberOfDependents'], bins10, x10))
test1['Score'] = test1['x1']+test1['x2']+test1['x3']+test1['x4']+test1['x5']+test1['x6']+test1['x7']+test1['x8']+test1['x9']+test1['x10']+600
test1.to_csv(r'F:\\python\\Give-me-some-credit-master\\data\\ScoreData.csv', index=False)
```
Reprint https://www.cnblogs.com/cgmcoding/p/13491940.html
That's all for this Python credit-scorecard walkthrough.