# Code sharing of three continuous variable box dividing methods

Hello, everyone! In the previous article, we introduced three automatic optimal bin separation methods commonly used in the industry. 1) Optimal bin separation of continuous variables based on CART algorithm 2) Optimal bin division of continuous variables based on Chi square test 3) Continuous variable optimal bin division based on optimal KS Today's article will share the Python implementation of these three methods.

### 00 Index

01 preparation of test data and evaluation methods 02 implementation of optimal bin division code based on CART algorithm 03 implementation of optimal bin division code based on Chi square test 04 implementation of optimal bin division code based on optimal KS 05 test effect and section

### 01 preparation of test data and evaluation methods

In order to simulate the data sets we often encounter in risk modeling, I have simply created some data here, mainly including three columns:

Among them, target is our Y column, and the other two are X column, which is our feature. What we need to do is to import data into the data set, which can be returned to cut in the background of official account (SamShare).

```# Import related libraries
import pandas as pd
import numpy as np
import random
import math
from scipy.stats import chi2
import scipy

# Test data structure, where target is Y, 1 represents bad people and 0 represents good people.
print(len(df))
print(df.target.value_counts()/len(df))
```

In addition, we need a method to evaluate the effect of box division. In the previous article, we mentioned that the effect can be measured by IV value, so we also need to construct a calculation method of IV value.

```def iv_count(data, var, target):
''' calculation iv value
Args:
data: DataFrame，Data set to be operated
var: String，Quasi calculation IV Variable name of the value
target: String，Y Column name
Returns:
IV Value, float
'''
value_list = set(list(np.unique(data[var])))
iv = 0
data_good = pd.Series(data[data[target]==0][var].values, index=data[data[target]==0].index)
len_good = len(data_good)
for value in value_list:
# Judge whether a certain class is 0 to avoid infinitesimal value and infinity value
if sum(data_bad == value) == 0:
else:
if sum(data_good == value) == 0:
good_rate = 1 / len_good
else:
good_rate = sum(data_good == value) / len_good
# print(value,iv)
return iv
```

### 02 implementation of optimal bin division code based on CART algorithm

The implementation steps of continuous variable optimal bin division based on CART algorithm are as follows: 1. Given the continuous variable V, sort the values in V; 2. Successively calculate the Gini index of the median between adjacent elements as the binary division point; 3. Select the optimal division point (the Gini index decreases the most after division) as the division point of this iteration; 4. Recursively iterate steps 2-3 until the stop condition is met. (generally, the divided sample size is taken as the stop condition, for example, the sample size of leaf node > = 10% of the total sample size)

```def get_var_median(data, var):
""" Gets the median list of all elements of the specified continuous variable
Args:
data: DataFrame，Data set to be operated
var: String，Name of continuous variable to be divided into boxes
Returns:
For a median list of all elements of a continuous variable, List
"""
var_value_list = list(np.unique(data[var]))
var_median_list = []
for i in range(len(var_value_list)-1):
var_median = (var_value_list[i] + var_value_list[i+1]) / 2
var_median_list.append(var_median)
return var_median_list

def calculate_gini(y):
""" Calculate Gini index
Args:
y: Array，Data to be calculated target，That is, an array of 0 and 1
Returns:
Gini index, float
"""
# Convert array to list
y = y.tolist()
probs = [y.count(i)/len(y) for i in np.unique(y)]
gini = sum([p*(1-p) for p in probs])
return gini

def get_cart_split_point(data, var, target, min_sample):
""" Obtain the optimal binary partition point (i.e. the point where the Gini index decreases the most)
Args:
data: DataFrame，Quasi operational data set
var: String，Name of continuous variable to be divided into boxes
target: String，Y Column name
min_sample: int，The minimum data sample of the sub box, that is, the amount of data at least reaches, which is generally used at the sub box point at the beginning or end

Returns:
BestSplit_Point: Return the optimal division point of this iteration, float
BestSplit_Position: Returns the position of the optimal dividing point, with 0 on the leftmost and 1 on the rightmost, float
"""

# initialization
Gini = calculate_gini(data[target].values)
Best_Gini = 0.0
BestSplit_Point = -99999
BestSplit_Position = 0.0
median_list = get_var_median(data, var) # Gets a list of all median values for the specified element of the current dataset

for i in range(len(median_list)):
left = data[data[var] < median_list[i]]
right = data[data[var] > median_list[i]]

# If the amount of data after segmentation is less than the specified threshold, skip the box splitting calculation
if len(left) < min_sample or len(right) < min_sample:
continue

Left_Gini = calculate_gini(left[target].values)
Right_Gini = calculate_gini(right[target].values)
Left_Ratio = len(left) / len(data)
Right_Ratio = len(right) / len(data)

Temp_Gini = Gini - (Left_Gini * Left_Ratio + Right_Gini * Right_Ratio)
if Temp_Gini > Best_Gini:
Best_Gini = Temp_Gini
BestSplit_Point = median_list[i]
# Get the position of the syncopation point. The leftmost is 0 and the rightmost is 1
if len(median_list) > 1:
BestSplit_Position = i / (len(median_list) - 1)
else:
BestSplit_Position = i / len(len(median_list))
else:
continue
Gini = Gini - Best_Gini
# print("optimal tangent point:", BestSplit_Point)
return BestSplit_Point, BestSplit_Position

def get_cart_bincut(data, var, target, leaf_stop_percent=0.05):
""" Calculate the optimal dividing point
Args:
data: DataFrame，Data set to be operated
var: String，Name of continuous variable to be divided into boxes
target: String，Y Column name
leaf_stop_percent: The proportion of leaf nodes, as the stop condition, is 5 by default%

Returns:
best_bincut: The best list of syncopation points, List
"""
min_sample = len(data) * leaf_stop_percent
best_bincut = []

def cutting_data(data, var, target, min_sample, best_bincut):
split_point, position = get_cart_split_point(data, var, target, min_sample)

if split_point != -99999:
best_bincut.append(split_point)

# The data set is segmented according to the optimal segmentation point, and the segmentation point is calculated recursively for the segmented data set until the stop condition is met
# print("the value range of this sub box is {0} ~ {1}". Format (data [var]. Min(), data [var] max()))
left = data[data[var] < split_point]
right = data[data[var] > split_point]

# When the data set after segmentation is still larger than the minimum data sample requirements, continue segmentation
if len(left) >= min_sample and position not in [0.0, 1.0]:
cutting_data(left, var, target, min_sample, best_bincut)
else:
pass
if len(right) >= min_sample and position not in [0.0, 1.0]:
cutting_data(right, var, target, min_sample, best_bincut)
else:
pass
return best_bincut
best_bincut = cutting_data(data, var, target, min_sample, best_bincut)

# Fill in the head and tail of the syncopation point
best_bincut.append(data[var].min())
best_bincut.append(data[var].max())
best_bincut_set = set(best_bincut)
best_bincut = list(best_bincut_set)

best_bincut.remove(data[var].min())
best_bincut.append(data[var].min()-1)
# Sort cut point
best_bincut.sort()

return best_bincut
```

### 03 implementation of optimal bin division code based on Chi square test

The implementation steps of continuous variable optimal bin division based on Chi square test are as follows: 1. Given the continuous variable V, sort the values in V, and then set each element value separately to complete the initialization phase; 2. For adjacent groups, calculate the chi square value in pairs; 3. Merge the two groups with the lowest chi square value; 4. Recursively iterate steps 2-3 until the stop condition is met. (generally, the chi square value is higher than the set threshold, or reaches the maximum number of groups, etc.)

```def calculate_chi(freq_array):
""" Calculate chi square value
Args:
freq_array: Array，Two dimensional array of chi square value to be calculated, frequency statistics results
Returns:
Chi square value, float
"""
# Check whether it is a two-dimensional array
assert(freq_array.ndim==2)

# Calculate the sum of the frequencies of each column
col_nums = freq_array.sum(axis=0)
# Calculate the sum of the frequencies of each line
row_nums = freq_array.sum(axis=1)
# Calculate the total frequency
nums = freq_array.sum()
# Calculate the expected frequency
E_nums = np.ones(freq_array.shape) * col_nums / nums
E_nums = (E_nums.T * row_nums).T
# Calculate chi square value
tmp_v = (freq_array - E_nums)**2 / E_nums
# If the expected frequency is 0, the calculation result is recorded as 0
tmp_v[E_nums==0] = 0
chi_v = tmp_v.sum()
return chi_v

def get_chimerge_bincut(data, var, target, max_group=None, chi_threshold=None):
""" Calculate the optimal bin division point of chi square bin Division
Args:
data: DataFrame，Data set of the optimal cut-off point list of chi square sub box to be calculated
var: Name of continuous variable to be calculated
target: Target column to be calculated Y Name of
max_group: Maximum number of sub boxes (because chi square sub box is actually a process of merging boxes, the maximum number of sub boxes that can be retained needs to be limited)
chi_threshold: Chi square threshold, if not specified max_group，We select the number of categories by default-1，Confidence 95%To set the threshold
If you don't know how to get the chi square threshold, you can generate a chi square table to see the code as follows:
import pandas as pd
import numpy as np
from scipy.stats import chi2
p = [0.995, 0.99, 0.975, 0.95, 0.9, 0.5, 0.1, 0.05, 0.025, 0.01, 0.005]
pd.DataFrame(np.array([chi2.isf(p, df=i) for i in range(1,10)]), columns=p, index=list(range(1,10)))
Returns:
List of optimal tangent points, List
"""
freq_df = pd.crosstab(index=data[var], columns=data[target])
# Convert to 2D array
freq_array = freq_df.values

# Initialize the box, and each element has a separate group
best_bincut = freq_df.index.values

# Initialization threshold chi_threshold, if Chi is not specified_ Threshold, the number of target s is - 1 by default, and the confidence is 95%
if max_group is None:
if chi_threshold is None:
chi_threshold = chi2.isf(0.05, df = freq_array.shape[-1])

# Start iteration
while True:
min_chi = None
min_idx = None
for i in range(len(freq_array) - 1):
# Calculate the chi square values of two adjacent groups in pairs to obtain the two groups with the minimum chi square value
v = calculate_chi(freq_array[i: i+2])
if min_chi is None or min_chi > v:
min_chi = v
min_idx = i

# Whether to continue iterative condition judgment
# Condition 1: the current number of boxes is still greater than the threshold of the maximum number of boxes
# Condition 2: the current minimum chi square value is still less than the specified chi square threshold
if (max_group is not None and max_group < len(freq_array)) or (chi_threshold is not None and min_chi < chi_threshold):
tmp = freq_array[min_idx] + freq_array[min_idx+1]
freq_array[min_idx] = tmp
freq_array = np.delete(freq_array, min_idx+1, 0)
best_bincut = np.delete(best_bincut, min_idx+1, 0)
else:
break

# Fill in the head and tail of the syncopation point
best_bincut = best_bincut.tolist()
best_bincut.append(data[var].min())
best_bincut.append(data[var].max())
best_bincut_set = set(best_bincut)
best_bincut = list(best_bincut_set)

best_bincut.remove(data[var].min())
best_bincut.append(data[var].min()-1)
# Sort cut point
best_bincut.sort()

return best_bincut
```

### 04 implementation of optimal bin division code based on optimal KS

The implementation steps of continuous variable optimal bin division based on optimal KS are as follows: 1. Given the continuous variable V, sort the values in V; 2. Each element value is a calculation point, corresponding to bin0 ~ 9 in the above figure; 3. Calculate the element with the largest KS as the optimal division point, and divide the variable into two parts D1 and D2; 4. Recursively iterate step 3 and calculate the partition points of data set D1 and D2 generated in step 3 until the stop conditions are met. (generally, the number of boxes reaches a certain threshold, or the KS value is less than a certain threshold)

```def get_maxks_split_point(data, var, target, min_sample=0.05):
""" calculation KS value
Args:
data: DataFrame，Data set of the optimal cut-off point list of chi square sub box to be calculated
var: Name of continuous variable to be calculated
target: Target column to be calculated Y Name of
min_sample: int，The minimum data sample of the sub box, that is, the amount of data at least reaches, which is generally used at the sub box point at the beginning or end
Returns:
ks_v: KS Value, float
BestSplit_Point: Return the optimal division point of this iteration, float
BestSplit_Position: Returns the position of the optimal dividing point, with 0 on the leftmost and 1 on the rightmost, float
"""
if len(data) < min_sample:
ks_v, BestSplit_Point, BestSplit_Position = 0, -9999, 0.0
else:
freq_df = pd.crosstab(index=data[var], columns=data[target])
freq_array = freq_df.values
if freq_array.shape[1] == 1: # If a group has only one enumeration value, such as 0 or 1, the array shape will have problems and jump out of this calculation
# tt = np.zeros(freq_array.shape).T
# freq_array = np.insert(freq_array, 0, values=tt, axis=1)
ks_v, BestSplit_Point, BestSplit_Position = 0, -99999, 0.0
else:
bincut = freq_df.index.values
tmp = freq_array.cumsum(axis=0)/(np.ones(freq_array.shape) * freq_array.sum(axis=0).T)
tmp_abs = abs(tmp.T[0] - tmp.T[1])
ks_v = tmp_abs.max()
BestSplit_Point = bincut[tmp_abs.tolist().index(ks_v)]
BestSplit_Position = tmp_abs.tolist().index(ks_v)/max(len(bincut) - 1, 1)

return ks_v, BestSplit_Point, BestSplit_Position

def get_bestks_bincut(data, var, target, leaf_stop_percent=0.05):
""" Calculate the optimal dividing point
Args:
data: DataFrame，Data set to be operated
var: String，Name of continuous variable to be divided into boxes
target: String，Y Column name
leaf_stop_percent: The proportion of leaf nodes, as the stop condition, is 5 by default%

Returns:
best_bincut: The best list of syncopation points, List
"""
min_sample = len(data) * leaf_stop_percent
best_bincut = []

def cutting_data(data, var, target, min_sample, best_bincut):
ks, split_point, position = get_maxks_split_point(data, var, target, min_sample)

if split_point != -99999:
best_bincut.append(split_point)

# The data set is segmented according to the optimal segmentation point, and the segmentation point is calculated recursively for the segmented data set until the stop condition is met
# print("the value range of this sub box is {0} ~ {1}". Format (data [var]. Min(), data [var] max()))
left = data[data[var] < split_point]
right = data[data[var] > split_point]

# When the data set after segmentation is still larger than the minimum data sample requirements, continue segmentation
if len(left) >= min_sample and position not in [0.0, 1.0]:
cutting_data(left, var, target, min_sample, best_bincut)
else:
pass
if len(right) >= min_sample and position not in [0.0, 1.0]:
cutting_data(right, var, target, min_sample, best_bincut)
else:
pass
return best_bincut
best_bincut = cutting_data(data, var, target, min_sample, best_bincut)

# Fill in the head and tail of the syncopation point
best_bincut.append(data[var].min())
best_bincut.append(data[var].max())
best_bincut_set = set(best_bincut)
best_bincut = list(best_bincut_set)

best_bincut.remove(data[var].min())
best_bincut.append(data[var].min()-1)
# Sort cut point
best_bincut.sort()

return best_bincut
```

### 05 test effect and section

Well, we have also implemented the above three methods of dividing boxes of continuous variables in Python. Let's test the effect right away.

```df['age_bins1'] = pd.cut(df['age'], bins=get_cart_bincut(df, 'age', 'target'))
df['age_bins2'] = pd.cut(df['age'], bins=get_chimerge_bincut(df, 'age', 'target'))
df['age_bins3'] = pd.cut(df['age'], bins=get_bestks_bincut(df, 'age', 'target'))
print("variable age The results are as follows:")
print("age_cart_bins:", get_cart_bincut(df, 'age', 'target'))
print("age_chimerge_bins:", get_chimerge_bincut(df, 'age', 'target'))
print("age_bestks_bins:", get_bestks_bincut(df, 'age', 'target'))
print("IV The values are as follows:")
print("age:", iv_count(df, 'age', 'target'))
print("age_cart_bins:", iv_count(df, 'age_bins1', 'target'))
print("age_chimerge_bins:", iv_count(df, 'age_bins2', 'target'))
print("age_bestks_bins:", iv_count(df, 'age_bins3', 'target'))

df['income_bins1'] = pd.cut(df['income'], bins=get_cart_bincut(df, 'income', 'target'))
df['income_bins2'] = pd.cut(df['income'], bins=get_chimerge_bincut(df, 'income', 'target'))
df['income_bins3'] = pd.cut(df['income'], bins=get_bestks_bincut(df, 'income', 'target'))
print("variable income The results are as follows:")
print("income_cart_bins:", get_cart_bincut(df, 'income', 'target'))
print("income_chimerge_bins:", get_chimerge_bincut(df, 'income', 'target'))
print("income_bestks_bins:", get_bestks_bincut(df, 'income', 'target'))
print("IV The values are as follows:")
print("income:", iv_count(df, 'income', 'target'))
print("income_cart_bins:", iv_count(df, 'income_bins1', 'target'))
print("income_chimerge_bins:", iv_count(df, 'income_bins2', 'target'))
print("income_bestks_bins:", iv_count(df, 'income_bins3', 'target'))
```

We can see from it that the effects of the three different box splitting methods are still somewhat different, but one thing in common is that the IV value is smaller than that in front of the box. After all, it is reasonable to sacrifice some "IV" for efficiency. In the actual modeling, I usually use three methods directly to select the one with the best box separation effect. The above is a relatively simple implementation. You are also welcome to try. You can give random feedback on any problems