Chimerge algorithm and matlab code implementation (data discretization)
1. Algorithm analysis
(chi square test)
Is the deviation degree between the actual observation value and the theoretical inference value of the statistical sample. The deviation degree between the actual observation value and the theoretical inference value determines the chi square value. If the chi square value is large, the deviation is large, on the contrary, the deviation is small; If the square values are equal, the two values are equal.
χ
2
=
∑
i
=
1
m
∑
j
=
1
k
(
A
i
j
−
E
i
j
)
2
E
i
j
\chi^2=\sum_{i=1}^m\sum_{j=1}^k{(A_{ij}-E_{ij})^2\over E_{ij}}
χ2=i=1∑mj=1∑kEij(Aij−Eij)2
Suppose to compare the chi square of two attribute values in the first attribute in iris:
Attribute value | First class | Second class | Third class |
---|---|---|---|
4.3 | 16 | 0 | 0 |
4.9 | 4 | 1 | 1 |
Data in table 16 16 16 indicates the number of instances whose attribute value of the first attribute in iris dataset is 4.3 and whose instance label is 1
Because the chi square between two adjacent intervals is compared
m
m
The value of m is 2
k
k
k is the number of categories of each dataset (the category of iris dataset is 3)
A
i
j
A_{ij}
Aij is the actual frequency. The ith attribute value (4.3 or 4.9) label is the number of times of class j
R
i
R_{i}
Ri , is the number of occurrences of the ith attribute value / / take iris dataset as an example (16 + 0 + 0)
C
j
C_j
Cj is the total number of the j-th category / / take iris dataset as an example (16 + 4)
N
N
N is the amount of all the above data / / take iris data set as an example (16 + 0 + 0 + 4 + 1 + 1)
E
i
j
E_{ij}
Eij is
A
i
j
A_{ij}
Aij # expected value
R
i
∗
C
j
N
{R_i*C_j\over N}
NRi∗Cj
Chimerge algorithm mainly relies on Chi square test: the adjacent intervals with the minimum chi square value are combined until the stopping criterion is met.
Algorithm flow:
1) Initialization: sort the instances according to the discrete attribute values [first attribute, instance label] / [second attribute, instance label] /... [i attribute, instance label] to divide i group of matrices and sort the attribute values of this i group of matrices
2) Perform interval merging from bottom to top. When the conditions are met, stop merging
(there are two kinds of conditions to meet here: the first is that the chi square value is greater than a certain threshold; the second is that the merging will be stopped when the interval reaches a certain number)
. Calculate the chi square distribution of adjacent sections according to the above chi square formula
. Merge the pair of intervals with the lowest chi square value
2. Matlab code
function [new_data,cut_point]=Chimerge(data,data_block); [m,n]=size(data); class=length(unique(data(:,1))); cut_point=zeros(n-1,data_block); new_data=zeros(m,n); new_data(:,1)=data(:,1); att_map=cell(n,1); for i=2:n %traversal attributes pair_att_lable=[data(:,i),data(:,1)]; pair_att_lable=sortrows(pair_att_lable,1); tmp_map=zeros(1,class+1); inner=1; for j=1:m num=pair_att_lable(j,1); loc=find(tmp_map(:,1)==num); if isempty(loc) tmp_map(inner,1)=num; tmp_map(inner,pair_att_lable(j,2)+1)=tmp_map(inner,pair_att_lable(j,2)+1)+1; inner=inner+1; else tmp_map(loc,pair_att_lable(j,2)+1)=tmp_map(loc,pair_att_lable(j,2)+1)+1; end clear loc; %eliminate loc variable end att_map{i}=tmp_map; clear num pair_att_lable tmp_map inner loc end clear num pair_att_lable tmp_map inner loc for i=2:n temp_data=att_map{i}; corrent_interval=length(temp_data(:,1)); while corrent_interval>data_block chi2_mat=zeros(corrent_interval-1,1); for j=1:corrent_interval-1 chi2=chi2test(temp_data(j,2:end),temp_data(j+1,2:end)); chi2_mat(j)=chi2; end [minv,index]=min(chi2_mat); %Find the smallest chi square value merge_loc_1=index; merge_loc_2=index+1; temp_data(merge_loc_1,2:end)=temp_data(merge_loc_1,2:end)+temp_data(merge_loc_2,2:end); %Data merging temp_data(merge_loc_2,:)=[]; corrent_interval=corrent_interval-1; end %------------------------------------------Get split point cut_point(i-1,:)=temp_data(:,1); %------------------------------------------ %------------------------------------------ for j=1:m %Traverse all instances for k=1:data_block if k~=data_block if data(j,i)>=temp_data(k,1) && data(j,i)<temp_data(k+1,1) %temp_data Is a one-dimensional column matrix new_data(j,i)=k; break; end if data(j,i)<temp_data(1,1) new_data(j,i)=1; break; end else if data(j,i)>=temp_data(k,1) new_data(j,i)=k; end end end end %------------------------------------------- end end %Here is u v Finding chi square distribution from two row vectors function chi2=chi2test(u,v) len=length(u); %Get the number of categories N=sum(u)+sum(v); %Get the total number of instances chi2=0; %Initialize the passed parameters box=[u;v]; %Constructed as a two row matrix for i=1:2 for j=1:len Eij=sum(box(i,:))*sum(box(:,j))/N; if Eij==0 tmp_chi=0; else tmp_chi=(box(i,j)-Eij)^2/Eij; end chi2=chi2+tmp_chi; end end end
Link: https://download.csdn.net/download/weixin_44912943/19102941
The chimerge M download file contains the analysis of the above code
If the code is wrong, please leave a message and point it out!
If you have any questions, please leave a message to discuss!