# Chimerge algorithm and matlab code implementation (data discretization)

### 1. Algorithm analysis

#### (chi square test)

Is the deviation degree between the actual observation value and the theoretical inference value of the statistical sample. The deviation degree between the actual observation value and the theoretical inference value determines the chi square value. If the chi square value is large, the deviation is large, on the contrary, the deviation is small; If the square values are equal, the two values are equal.

χ 2 = ∑ i = 1 m ∑ j = 1 k ( A i j − E i j ) 2 E i j \chi^2=\sum_{i=1}^m\sum_{j=1}^k{(A_{ij}-E_{ij})^2\over E_{ij}} χ2=i=1∑m​j=1∑k​Eij​(Aij​−Eij​)2​
Suppose to compare the chi square of two attribute values in the first attribute in iris:

Attribute valueFirst classSecond classThird class
4.31600
4.9411

Data in table 16 16 16 indicates the number of instances whose attribute value of the first attribute in iris dataset is 4.3 and whose instance label is 1

Because the chi square between two adjacent intervals is compared
m m The value of m is 2
k k k is the number of categories of each dataset (the category of iris dataset is 3)
A i j A_{ij} Aij is the actual frequency. The ith attribute value (4.3 or 4.9) label is the number of times of class j
R i R_{i} Ri ， is the number of occurrences of the ith attribute value / / take iris dataset as an example (16 + 0 + 0)
C j C_j Cj is the total number of the j-th category / / take iris dataset as an example (16 + 4)
N N N is the amount of all the above data / / take iris data set as an example (16 + 0 + 0 + 4 + 1 + 1)
E i j E_{ij} Eij is A i j A_{ij} Aij # expected value R i ∗ C j N {R_i*C_j\over N} NRi​∗Cj​​

Chimerge algorithm mainly relies on Chi square test: the adjacent intervals with the minimum chi square value are combined until the stopping criterion is met.

Algorithm flow:
1) Initialization: sort the instances according to the discrete attribute values [first attribute, instance label] / [second attribute, instance label] /... [i attribute, instance label] to divide i group of matrices and sort the attribute values of this i group of matrices
2) Perform interval merging from bottom to top. When the conditions are met, stop merging
(there are two kinds of conditions to meet here: the first is that the chi square value is greater than a certain threshold; the second is that the merging will be stopped when the interval reaches a certain number)
. Calculate the chi square distribution of adjacent sections according to the above chi square formula
. Merge the pair of intervals with the lowest chi square value

### 2. Matlab code

function [new_data,cut_point]=Chimerge(data,data_block);

[m,n]=size(data);
class=length(unique(data(:,1)));
cut_point=zeros(n-1,data_block);
new_data=zeros(m,n);
new_data(:,1)=data(:,1);
att_map=cell(n,1);

for i=2:n  %traversal attributes
pair_att_lable=[data(:,i),data(:,1)];
pair_att_lable=sortrows(pair_att_lable,1);
tmp_map=zeros(1,class+1);

inner=1;
for j=1:m
num=pair_att_lable(j,1);
loc=find(tmp_map(:,1)==num);
if isempty(loc)
tmp_map(inner,1)=num;
tmp_map(inner,pair_att_lable(j,2)+1)=tmp_map(inner,pair_att_lable(j,2)+1)+1;
inner=inner+1;
else

tmp_map(loc,pair_att_lable(j,2)+1)=tmp_map(loc,pair_att_lable(j,2)+1)+1;
end
clear loc; %eliminate loc variable
end
att_map{i}=tmp_map;
clear num pair_att_lable tmp_map inner loc
end
clear num pair_att_lable tmp_map inner loc

for i=2:n
temp_data=att_map{i};
corrent_interval=length(temp_data(:,1));
while corrent_interval>data_block
chi2_mat=zeros(corrent_interval-1,1);
for j=1:corrent_interval-1
chi2=chi2test(temp_data(j,2:end),temp_data(j+1,2:end));
chi2_mat(j)=chi2;
end
[minv,index]=min(chi2_mat);           %Find the smallest chi square value
merge_loc_1=index;
merge_loc_2=index+1;
temp_data(merge_loc_1,2:end)=temp_data(merge_loc_1,2:end)+temp_data(merge_loc_2,2:end); %Data merging
temp_data(merge_loc_2,:)=[];
corrent_interval=corrent_interval-1;
end
%------------------------------------------Get split point

cut_point(i-1,:)=temp_data(:,1);
%------------------------------------------

%------------------------------------------
for j=1:m  %Traverse all instances
for k=1:data_block
if k~=data_block
if data(j,i)>=temp_data(k,1) && data(j,i)<temp_data(k+1,1) %temp_data Is a one-dimensional column matrix
new_data(j,i)=k;
break;
end
if data(j,i)<temp_data(1,1)
new_data(j,i)=1;
break;
end
else
if data(j,i)>=temp_data(k,1)
new_data(j,i)=k;
end
end
end
end
%-------------------------------------------

end

end

%Here is u v Finding chi square distribution from two row vectors
function chi2=chi2test(u,v)
len=length(u);   %Get the number of categories
N=sum(u)+sum(v); %Get the total number of instances
chi2=0;          %Initialize the passed parameters
box=[u;v];       %Constructed as a two row matrix
for i=1:2
for j=1:len
Eij=sum(box(i,:))*sum(box(:,j))/N;
if Eij==0
tmp_chi=0;
else
tmp_chi=(box(i,j)-Eij)^2/Eij;
end
chi2=chi2+tmp_chi;
end
end

end